1. Bibliographic references in BHL
Coordination and routes for
cooperation across organizations,
projects and e-infrastructures
23rd of May 2013
William Ulate R., Missouri Botanical Garden
2. Questions to Answer
1. Type of content we discuss (e.g., occurrences, genes, behaviour,
morphology, etc.)
2. Sources of content (from where)
3. Formats of content (formats, standards)
4. Methods of gathering information (e.g., harvesting, ftp uploads,
protocols)
5. Methods of delivery of information (e.g., free searches, API, web
services, automated exports, linking mechanisms, etc.; provide links
to API and web services documentation)
6. Identifiers used (type, persistence, dereferencing, resolvability)
7. Present or forthcoming interoperability features with other
platforms
8. Constraints, needs and expectations to:
a) Suppliers of content, and
b) Users of content
9. What is needed for Bibliographic References?
7. Open Data
• Downloads
– Simple tab-delimited exports of core data
– http://www.biodiversitylibrary.org/data/BHLExportSchema.pdf
• Data model
– DB schema as ERD
– http://bhl-bits.googlecode.com/files/20090930_BHLDataModel.pdf
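A minimal sketch of reading one of the tab-delimited exports in Python. The column names (TitleID, FullTitle, StartYear) and the sample rows are hypothetical placeholders; the export schema PDF linked above is the authority for the real field names.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded export file.
# Column names are illustrative only; check the export schema for the real ones.
SAMPLE_EXPORT = (
    "TitleID\tFullTitle\tStartYear\n"
    "1\tSpecies Plantarum\t1753\n"
    "2\tAn example journal\t1890\n"
)


def read_export(handle):
    """Parse a tab-delimited export into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps quote characters inside titles as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_export(io.StringIO(SAMPLE_EXPORT))
for row in rows:
    print(row["TitleID"], row["FullTitle"], row["StartYear"])
```

For a real file, the same function can be called with `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.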
8. Services
• Names Service
– Return all occurrences of a name throughout BHL digitized corpus
• Documentation: http://bit.ly/2e6sg9
– Access to 100+ million name strings using TaxonFinder & NetiNeti
• 1.5 million unique names
– Algorithm to detect nomenclatural & taxonomic acts
• OpenURL
– Facilitate links to citations: protologues, articles, references
• Documentation: http://www.biodiversitylibrary.org/openurlhelp.aspx
– Useful to Nomenclators, Reference Systems
• IPNI
• Tropicos
11. DOIs for Legacy Literature
• BHL member of CrossRef through Smithsonian
• Started assigning DOIs to BHL monographs
– Low hanging fruit: Easy, non-controversial
– 54,856 DOIs Approved to date
• Next, other publication types / articles?
– Process of automatically assigning CrossRef DOIs
to articles has a higher potential for collisions.
12. Article-level metadata
• Disambiguating and locating structural components
in the corpus
• Done by automated and crowdsourced means
– Thanks Rod Page! Welcome others!
• Greatly increases semantic value of the dataset
• Makes data addressable and thus linkable
Chapter-level metadata, Treatment-level metadata, Part-level metadata
13. Genesis: “BHL Article Repository”
• Idea first introduced at TDWG 2008, Fremantle
(by BHL; many have discussed it for years)
• YouTube for biodiversity articles
• Needed (need) a way to access articles in BHL
– “BHL has no articles.”
– BHL has hundreds of thousands of articles but you
can’t search for them by author or article title
– Can find via “article coordinates” using BHL’s UI &
OpenURL resolver: Journal / Volume / Start Page / Year
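The "article coordinates" lookup can be sketched as a query string built from title, volume, start page and year. The endpoint path and the parameter names (genre, title, volume, spage, date) are assumptions based on common OpenURL 0.1 conventions, not taken from these slides; the OpenURL documentation page listed under Services is the reference for what the resolver actually accepts.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed resolver endpoint; verify against the OpenURL documentation.
OPENURL_BASE = "http://www.biodiversitylibrary.org/openurl"


def build_openurl(title, volume, start_page, year, genre="book"):
    """Build a resolver URL from title / volume / start page / year."""
    params = {
        "genre": genre,       # assumed parameter name
        "title": title,       # journal or book title
        "volume": volume,
        "spage": start_page,  # start page
        "date": year,
    }
    return OPENURL_BASE + "?" + urlencode(params)


# Coordinates taken from the Species Plantarum example used later in the deck.
url = build_openurl("Species Plantarum", 2, 971, 1753)
print(url)
```

Keeping the coordinates as separate parameters, rather than one free-text citation string, is what lets nomenclators and reference systems generate such links automatically.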
14. CiteBank
• Objectives
– Create a repository for community-vetted
taxonomic bibliographies.
– Ability to ingest, display, download, and index
articles so that the BHL can operate as an article
repository.
– Provide links to content published online through
other repositories.
• Launched on December 6th 2010
• 185,609 bibliographic records to date
18. Lessons Learned
• Biblio/Drupal data model insufficient for mass of data
envisioned for all biodiversity, too flat and difficult to
expand in collaboration with Biblio development
community
• Data providers want their content findable and
managed in the Biodiversity Heritage Library, not a
system alongside BHL
• Maintaining two platforms for biodiversity literature
threatens sustainability of the literature resources over
the longer term
20. What have we done?
• Articles
– Extended BHL data model to store article metadata
– Built process to harvest data from BioStor
• Created user interfaces for adding article metadata
and associated files
– Defined functional requirements as improvements to
Drupal-based CiteBank
– Defined process flow for adding article metadata and
associated files
– Implemented UI changes
• Changed BHL UI to accommodate article search
• Changed BHL UI to accommodate article display (TOC)
25. Requirements for a citation repository?
Admin. Interface
– IMPORT AND MAPPING TOOL
• Preview/Accept/Reject/Undo/Report on Import
• No standard schema, MODS or BibTeX
• Drag & drop GUI or mapped source and target field config.
– USER MANAGEMENT
• Self-Registration
• Admin. Approval & Deletion
• User Roles Assignment
– GLOBAL UPDATES
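One way to picture the mapped source and target field configuration of the import tool is as a plain lookup table applied to each incoming record. This is only an illustrative sketch: the field names, the FIELD_MAP contents and the map_record helper are hypothetical, not part of CiteBank or BHL.

```python
# Hypothetical source-to-target field configuration.
FIELD_MAP = {
    "author": "creator",
    "title": "title",
    "journal": "container_title",
    "year": "date",
}


def map_record(source, field_map):
    """Split a source record into mapped target fields and unmapped leftovers.

    The leftovers can feed a preview or import report so that an admin
    can accept, reject or extend the mapping before committing.
    """
    mapped, unmapped = {}, {}
    for name, value in source.items():
        target = field_map.get(name.lower())
        if target is None:
            unmapped[name] = value
        else:
            mapped[target] = value
    return mapped, unmapped


record = {
    "Author": "Linné, Carl von",
    "Title": "Species Plantarum",
    "Year": "1753",
    "Shelfmark": "unknown-field-example",
}
mapped, unmapped = map_record(record, FIELD_MAP)
print(mapped)
print(unmapped)
```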
26. Requirements for a citation repository?
General User Interface
– IMPORT
• Upload/Preview/Accept/Reject/Undo/Report on Import
– CREATE CITATION
• By filling a Form, via BibTeX
– BROWSE
• Faceted: title, author, subject, year, contributor, my citations
27. Requirements for a citation repository?
• CITATION TYPES
– Journal Article, Book Chapter, Conference Proceedings,
Conference Paper, Thesis, Government Report, Note, etc.
• OAI HARVESTING
– Harvest and serve data through OAI-PMH
• SPECIFICATIONS FOR DATA PROVIDERS PAGE
• CONTRIBUTORS PAGE
– Recognize ALL contributions
• REPORTING
– Statistics Page by Citation and Publication type
– Recent/Latest Uploads
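The OAI harvesting requirement can be sketched with the standard OAI-PMH ListRecords verb. The base URL and the sample response are hypothetical; only the verb, the metadataPrefix and resumptionToken arguments, and the XML namespace come from the OAI-PMH 2.0 protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces metadataPrefix."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response page, used here instead of a live network call.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:item/1</identifier></header></record>
    <record><header><identifier>oai:example.org:item/2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://example.org/oai")  # hypothetical endpoint
ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
print(first_url)
print(ids, token)
print(next_url)
```

A harvester would repeat the request with each returned token until a page comes back without one.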
28. What are we doing?
• Integrate BHL’s Services with ZooBank, IPNI & IF
• Authoritative list of titles in common use for
nomenclatural acts (“TL3”)
• Harvest relevant content from Mendeley
• Integrate services and interfaces with the GNUB
data model
• Interoperate with citation parsing tools & services
29. Support citation reconciliation
L. Sp. Pl. 2: 971. 1753
Linneaus, C. Species Plantarum, vol. 2 p. 971. 1753
Linné, Carl von. Sp. Pl. Vol. 2 Page 971. 1753
Caroli Linnaei, Species Plantarum exhibentes plantas rite cognitas, ad genera
relatas, cum Differentis Specificis, Nominibus Trivialibus, Synonymis Selectis,
Locis Natalibus, secundum SYSTEMA SEXUALE digestas.. 2:971. 1753
Zea mays
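The four citation strings above all point at the same page. A rough sketch of how they might be reconciled is to reduce each to a (volume, page, year) key and compare keys. The regular expressions and the citation_key helper are illustrative assumptions tuned to these four examples, not BHL's reconciliation method; real citations would need a proper citation parser.

```python
import re

# Volume and page separated by ":", "p." or "page" (case-insensitive).
VOL_PAGE = re.compile(r"(\d+)\s*(?::|p\.|page)\s*(\d+)", re.IGNORECASE)
# A four-digit year at the very end of the string.
YEAR = re.compile(r"(\d{4})\s*$")


def citation_key(citation):
    """Return (volume, page, year) or None if the pattern is not found."""
    vol_page = VOL_PAGE.search(citation)
    year = YEAR.search(citation)
    if not (vol_page and year):
        return None
    return int(vol_page.group(1)), int(vol_page.group(2)), int(year.group(1))


CITATIONS = [
    "L. Sp. Pl. 2: 971. 1753",
    "Linneaus, C. Species Plantarum, vol. 2 p. 971. 1753",
    "Linné, Carl von. Sp. Pl. Vol. 2 Page 971. 1753",
    "Caroli Linnaei, Species Plantarum exhibentes plantas rite cognitas, "
    "ad genera relatas, cum Differentis Specificis, Nominibus Trivialibus, "
    "Synonymis Selectis, Locis Natalibus, secundum SYSTEMA SEXUALE "
    "digestas.. 2:971. 1753",
]

keys = [citation_key(c) for c in CITATIONS]
for citation, key in zip(CITATIONS, keys):
    print(key, "<-", citation[:40])
```

The key deliberately ignores author and title spelling, which is where the variants differ most; matching on those as well would need an authority list of titles and abbreviations.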
30. Questions to Answer
1. Type of content - Literature, Images, OCR Text
and Bibliographic Citations
2. Sources of content - BHL, CB & other Repositories
3. Formats of content - BibTeX, MODS, DC
4. Methods of gathering info - Harvesting, FTP Uploads
5. Methods of delivery of info - Free Searches, API, web services,
exports, linking mechanisms
6. Identifiers used - CrossRef DOIs for Monographs
7. Interoperability with other platforms - ZooBank, IPNI, IF
8. Constraints, needs and expectations to suppliers of content
and users of content
31. Thank you
pro-iBiosphere Meeting 3
Coordination and routes for cooperation across organizations, projects and e-infrastructures
Berlin, Germany
May 23rd, 2013
William.Ulate@mobot.org
Global BHL Project Manager
BHL Technical Director
Senior Project Manager
Missouri Botanical Garden
Editor's Notes
Guidelines for speakers giving presentations
Presentations are limited to 15 minutes for each speaker plus 5 minutes for discussion.
Presentations should clearly answer the following questions (7-8 slides), definitely focusing on the interoperability problem:
1. Type of content we discuss (e.g., occurrences, genes, behaviour, morphology, etc.)
2. Sources of content (from where)
3. Formats of content (formats, standards)
4. Methods of gathering information (e.g., harvesting, ftp uploads, protocols)
5. Methods of delivery of information (e.g., free searches, API, web services, automated exports, linking mechanisms, etc.; provide links to API and web services documentation)
6. Identifiers used (type, persistence, dereferencing, resolvability)
7. Present or forthcoming interoperability features with other platforms
8. Constraints, needs and expectations to: a) Suppliers of content, and b) Users of content
9. Overall picture of what is needed within a certain domain (e.g., for names, references, genes, images, etc.) (2-3 slides)
The final outputs of presentations and discussions should be two-fold:
• Summary table encompassing the answers to the above questions, that will be a basis for the whitepaper and future work
• MoU draft discussed
• Proposing an Advisory Board of key stakeholders that will form the ground for a consortium to develop and launch the future BKMS
Tasks involved:
• Task 2.1. Coordination and routes for cooperation across organizations, projects and e-infrastructures (lead: Plazi). Encompassing the information gathered at Workshop 1 (Leiden, February 2013) and through the online questionnaire.
• Task 4.1 Improve technical cooperation and interoperability at the e-infrastructure level (lead: FUB-BGBM).
• Task 4.2 Promote and monitor the development and adoption of common mark-up standards and interoperability between schemas by identifying technical and societal constraints and needs to increase collaboration and interoperability between e-platforms and projects, and by envisioning practical solutions towards the Biodiversity Knowledge Management System (lead: Plazi).
=============
Concrete examples of ideas for potential points in a draft MoU
A primary purpose of the “Routes towards cooperation” meeting is to increase our reciprocal understanding and progress towards a multi-institutional Memorandum of Understanding (MoU). The following points are potential points in a draft MoU. Comments on them, or further points, are welcome here on the wiki before the meeting takes place. The results would then have to be further discussed by the appropriate levels.
• Establishment of a multi-institutional focus group to coordinate software development to improve the efficiency of resource use by means of common Open Source based development projects using Open Source methodology.
• Agreements on specialization, e.g., one institution specializes in geographical analysis and visualization, providing services to other institutions or projects.
• Agreement on long-term management procedures to provide stable identifiers. This agreement may be technology neutral (except that some way to use the identifiers in the human readable as well as semantic web should be specified). Both stable http-URIs (preferred in semantic web) and DOI technology (publishing industry) are possible implementations.
• Agreement on following the Linked Open Data example. (Note: Edinburgh may be a best practices example?)
• Agreement to communicate the data policies according to the Linked Open Data five star scoring.
• Policy agreements on Open Access.
• Agreement to register all services that are provided to other Biodiversity institutions in the Biodiversity Catalogue (Univ. Manchester, myExperiment).
• Agreement to communicate the expected and planned stability of services by means of a standard vocabulary (e.g.: undecided, experimental, long-term service without fixed API, long-term service with stable and versioned API).
• Agreement to collaborate on the development of shared term definitions (glossary-style) with the understanding that new terms can be freely added, but an effort will be made to re-use or improve existing term definitions.
• Agreement on crowdsourcing activities to clean up data, e.g. bibliographic references, or markup content in legacy literature, e.g. scientific names, treatments, material citations.
• Paul Kirk: Centrally 'cached' data should have a clear mechanism for providing usage statistics back to sources.
[Portal User Interface]
[Book Viewer Interface]
We ask the user to provide metadata if they’re generating a chapter or book title
On legacy literature: what are your plans with BHL, and especially your move into content?
• Growth
• More Global Content
• Taxon Names
• Article Metadata
• Microcitations and COinS
• API
• ZooBank
• OCR improvements through Gaming
• Crowdsource Markup
• WFO?
[Citebank homepage]
[Citebank stats]
[World in which CiteBank lives]
[Citations in BHL and Sustainability Considerations]
[GNA Diagram]
[Define functional requirements]
[Where are we going?]
[Diagram of citations reconciliation]