What has already been published?
What may still be needed?
How to do it?
This presentation is a part of the 3rd Session of the 1st International e-Conference on Germplasm Data Interoperability https://sites.google.com/site/germplasminteroperability/
1. Publishing germplasm vocabularies as Linked Data
Valeria Pesce (GFAR)
Guntram Geser (Salzburg Research)
Caterina Caracciolo (FAO)
Vassilis Protonotarios (AgroKnow)
3. Ingredients for describing things
• Metadata elements to describe individual pieces of information in the data sets
– Metadata sets, metadata element sets, vocabularies
• Sets of values for (some of) the metadata elements
– Controlled vocabularies, authority data, value vocabularies, KOS (Knowledge Organization Systems)
• They are often both called “vocabularies”
4-9. Various flavors of vocabularies
[Diagram, built up over six slides: a bibliographic resource as the entity to be described, with each of its descriptive elements connected to the kind of vocabulary that supplies it.]
Entity to be described (type: bibliographic resource):
• Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…
“Description vocabularies”
• Metadata vocabulary for describing bibliographic resources
• Metadata vocabulary for describing people
• Ontology for describing geographic places
“Value vocabularies”
• Authority data: data of type Person (for Author(s))
• KOS: concepts suitable for organizing by Topic (for Subject(s))
• Authority data: data of type Geographic location (for Publication place)
• Controlled list: concepts suitable for organizing by Type (for Type of document)
10. Vocabularies in RDF LOD
• Resource Description Framework (RDF) approach:
– Formalize vocabularies by assigning a Uniform Resource Identifier (URI) to each metadata element and to each concept
– RDF vocabularies have published URIs and published machine-readable semantics. Things described and indexed with RDF vocabularies can be “understood” by machines and automatically discovered
• Linking classes or concepts across vocabularies makes them Linked Open Data (LOD) vocabularies and allows machines to follow semantic links across vocabularies and discover more data.
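As a rough illustration of the RDF approach, the sketch below gives a metadata element and a concept their own URIs, states them as triples, and adds one cross-vocabulary link. It uses only plain Python; every `http://example.org/...` URI and the names `acquisitionType`, `donation` and `gift` are hypothetical placeholders, not terms from any published vocabulary.

```python
# Minimal sketch: vocabulary terms identified by URIs and stated as RDF triples.
# All http://example.org/... URIs below are hypothetical placeholders.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
SKOS = "http://www.w3.org/2004/02/skos/core#"

LOCAL = "http://example.org/germplasm-vocab/"  # hypothetical local vocabulary
OTHER = "http://example.org/other-vocab/"      # hypothetical second vocabulary

# Each triple is (subject URI, predicate URI, object URI).
triples = [
    # A metadata element (a property) gets its own URI.
    (LOCAL + "acquisitionType", RDF_TYPE, RDF_PROPERTY),
    # A concept (a value in a value vocabulary) gets its own URI.
    (LOCAL + "donation", RDF_TYPE, SKOS + "Concept"),
    # A link to a concept in another vocabulary.
    (LOCAL + "donation", SKOS + "closeMatch", OTHER + "gift"),
]


def to_ntriples(rows):
    """Serialize URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in rows)


def cross_vocabulary_links(rows, namespace):
    """Return (local URI, external URI) pairs connected by skos:closeMatch."""
    return [
        (s, o)
        for s, p, o in rows
        if p == SKOS + "closeMatch"
        and s.startswith(namespace)
        and not o.startswith(namespace)
    ]


if __name__ == "__main__":
    print(to_ntriples(triples))
    print(cross_vocabulary_links(triples, LOCAL))
```

The third triple is the kind of statement a machine can follow from one vocabulary into another.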
11. The importance of LOD vocabularies
• Data exposed using a LOD vocabulary can, for this reason alone, be considered “Linked Data”. The first step in publishing Linked Data is therefore to identify, or publish, the suitable LOD vocabularies.
• Data mash-ups rely on common, semantically defined classes, properties and concepts identifiable by URIs.
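A small sketch of why shared URIs matter for mash-ups: two independent data sets that index their records with the same concept URI can be combined without any string matching. The records, field names and the `http://example.org/crops/` namespace are invented for illustration only.

```python
from collections import defaultdict

CROP = "http://example.org/crops/"  # hypothetical concept namespace

# Hypothetical records from two unrelated sources that share concept URIs.
genebank_records = [
    {"accession": "ACC-001", "crop": CROP + "wheat"},
    {"accession": "ACC-002", "crop": CROP + "barley"},
]
trial_records = [
    {"trial": "T-17", "crop": CROP + "wheat"},
]


def mash_up(left, right, key):
    """Group records from both sources under the shared URI stored in `key`."""
    merged = defaultdict(lambda: {"left": [], "right": []})
    for record in left:
        merged[record[key]]["left"].append(record)
    for record in right:
        merged[record[key]]["right"].append(record)
    return dict(merged)


if __name__ == "__main__":
    for uri, group in mash_up(genebank_records, trial_records, "crop").items():
        print(uri, len(group["left"]), len(group["right"]))
```

If each source used its own local code for the crop, the same join would first need a mapping between the two code lists.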
13. Metadata (1)
Reference standards:
• Multi-crop Passport Descriptors (MCPD) (FAO/Bioversity)
– V.1 2006, V.2 2012
– Data to EURISCO catalogue
• Darwin Core (Biodiversity Information Standards Working Group, TDWG)
http://rs.tdwg.org/dwc/
Includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.
14. Metadata (2)
Standard extensions
• The MCPD do not include descriptors for Characterization and Evaluation (C&E) measurements of plant traits/scores, e.g. morphological and agronomic traits, reactions to biotic and abiotic stresses, resistance to specific pathotypes, grain yield, and protein content.
• An initial set of C&E descriptors for the utilization of 22 crops has been developed by Bioversity International together with CGIAR and other research centers.
The DarwinCore Germplasm Extension (Biodiversity TDWG)
– Additional terms to describe germplasm samples maintained by genebanks worldwide
– Modelled starting from the Multi-Crop Passport standard (MCPD, 2001)
– Includes the new terms for crop trait experiments developed as part of the European EPGRIS3 project
– Includes a few additional terms for new international crop treaty regulations
15. RDF vocabularies for germplasm
• TaxonConcept OWL Ontology, written by Peter J. DeVries from 2009 through 2012 and based on the earlier GeoSpecies from 2007:
http://www.taxonconcept.org/
• Biodiversity Information Standards (TDWG)
– Metadata: Darwin Core “SW” ontology in RDF OWL. Semantic web terms for biodiversity data, based on Darwin Core:
http://rs.tdwg.org/dwc/terms/
– DwC-germplasm: already represented in RDF SKOS
http://purl.org/germplasm/
• Much activity around semantic technologies to express major plant / trait / gene ontologies (this overlaps with KOSs):
– Plant Ontology (explicitly referenced in the DwC-germplasm)
– Gene Ontology
– Trait Ontology
– Phenotypic Quality Ontology
16. Metadata: Darwin “SW” Core RDF classes
Semantic web terms for biodiversity data, based on Darwin Core
From: http://code.google.com/p/tdwg-rdf/wiki/BiodiversityOntologies
19. KOSs
Authoritative plant names and taxonomies
– Plant Ontology (OBO format)
(explicitly referenced in the DwC-germplasm)
http://www.plantontology.org
– Gene Ontology (RDF and OWL/RDF)
http://www.geneontology.org/
– Trait Ontology (OBO format)
http://www.gramene.org/db/ontology/search?id=TO:0000387
– Phenotypic Quality Ontology (OBO and OWL)
http://obofoundry.org/cgi-bin/detail.cgi?quality
Some of them are already inter-linked
20. KOSs: value lists
• The DwC-germplasm is mainly a KOS: it defines concepts.
http://purl.org/germplasm/
For example, http://purl.org/germplasm/germplasmType# is a “List of controlled values for some of the germplasm terms”.
21. KOSs: value lists
• When it comes to ranges and controlled sets of values, there are two typical scenarios:
– Ranges of values (numeric or not) that represent a continuum of values (e.g. “From 1 to 10”, “From 10 to 20” etc., or percentages; see table 2)
– Sets of controlled values (e.g. for “acquisition type”, “measurement type”, color and other observed properties)
• The second case can be split further into two cases:
– the values come from a dedicated controlled list
– the values come from an established taxonomy, of which only a subset of values is valid for that property
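The three situations above can be sketched as three small checks. This is only an illustration: the berry colours are taken from the value-list example in this presentation, while the plant-part taxonomy, its allowed subset and the function names are hypothetical.

```python
# Sketch of the three kinds of value constraint described above.

def in_range(value, low, high):
    """Scenario 1: the value lies on a continuum between two bounds."""
    return low <= value <= high


def in_controlled_list(value, allowed):
    """Scenario 2a: the value comes from a dedicated controlled list."""
    return value in allowed


def in_taxonomy_subset(value, taxonomy, subset):
    """Scenario 2b: the value is a term of an established taxonomy AND
    belongs to the subset that is valid for this particular property."""
    return value in taxonomy and value in subset


# Berry skin colours as listed in the value-list example of this deck.
BERRY_COLORS = {
    "green", "green-grey", "green-rose", "green-red", "green-black",
    "grey", "grey-rose", "rose", "red", "red-violet",
    "black", "black-red", "black-grey",
}

# Hypothetical taxonomy and the hypothetical subset valid for one property.
PLANT_PARTS = {"leaf", "root", "stem", "seed", "pollen"}
STORABLE_PARTS = {"seed", "pollen"}

if __name__ == "__main__":
    print(in_range(15, 10, 20))                                   # percentage-like range
    print(in_controlled_list("green-rose", BERRY_COLORS))         # dedicated list
    print(in_taxonomy_subset("leaf", PLANT_PARTS, STORABLE_PARTS))  # valid term, wrong subset
```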
22. KOSs: value lists
Value lists: examples of allowed values for some C&E properties
• Young shoot: aperture of tip
– 1=closed, 3=half open, 5=fully open
• Young shoot: intensity of anthocyanin coloration on prostrate hairs of tip
– 1=none or very low, 3=low, 5=medium, 7=high, 9=very high
• B. Berry color (color of the berry skin)
– green, green-grey, green-rose, green-red, green-black, grey, grey-rose, rose, red, red-violet, black, black-red, black-grey
– Example: green-rose
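One possible way to express such a coded list in SKOS is sketched below, using the “aperture of tip” codes from this slide. The base URI is a hypothetical placeholder, and the choice of `skos:notation` for the numeric code and `skos:prefLabel` for the label is an assumption about modelling, not something prescribed by the descriptor lists. Labels are written out without escaping, which is only safe because these labels contain no quotes.

```python
# Sketch: turn a coded value list into a SKOS concept scheme in Turtle syntax.

BASE = "http://example.org/ce-values/apertureOfTip/"  # hypothetical namespace

# Codes and labels as given for "Young shoot: aperture of tip".
APERTURE_OF_TIP = {1: "closed", 3: "half open", 5: "fully open"}


def to_skos_turtle(base, scheme_label, coded_values):
    """Return a Turtle document with one skos:Concept per coded value."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{base}> a skos:ConceptScheme ;",
        f'    skos:prefLabel "{scheme_label}"@en .',
    ]
    for code, label in sorted(coded_values.items()):
        lines += [
            "",
            f"<{base}{code}> a skos:Concept ;",
            f'    skos:notation "{code}" ;',
            f'    skos:prefLabel "{label}"@en ;',
            f"    skos:inScheme <{base}> .",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_skos_turtle(BASE, "Young shoot: aperture of tip", APERTURE_OF_TIP))
```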
23. KOSs: value lists
• An interesting task would be the publication
of most of these lists as Linked Data, following
the example of the Dublin Core Types list.
http://dublincore.org/documents/dcmi-type-vocab
• Darwin Core Types:
http://rs.tdwg.org/dwc/terms/type-vocabulary/ind
24. KOSs: subsets of published KOSs
• Special case: values for which reference to a published thesaurus is recommended but only a specific subset of terms is valid for a specific property.
• Thesauri are rarely structured around “facets” (or the various properties of entities that can be described by the terms in the thesaurus): they usually have an internal logic that reflects the domain they represent.
Example from the DwC Germplasm extension: values can come from an existing ontology
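A subset of an existing KOS can be described without copying its concepts, for example as a SKOS collection whose members point at the original concept URIs. The sketch below assumes this modelling; the KOS, the collection URI and the label are hypothetical, and the function refuses any URI that is not part of the source KOS.

```python
# Sketch: describe the subset of a published KOS that is valid for one
# property as a skos:Collection that points at the original concept URIs.

KOS = {  # hypothetical published KOS: concept URI -> label
    "http://example.org/kos/leaf": "leaf",
    "http://example.org/kos/root": "root",
    "http://example.org/kos/seed": "seed",
    "http://example.org/kos/pollen": "pollen",
}


def subset_collection(collection_uri, label, kos_concepts, selected):
    """Build Turtle for a collection whose members are existing KOS concepts."""
    if not selected:
        raise ValueError("a subset needs at least one concept")
    unknown = [uri for uri in selected if uri not in kos_concepts]
    if unknown:
        raise ValueError(f"not in the source KOS: {unknown}")
    members = " ,\n        ".join(f"<{uri}>" for uri in selected)
    return (
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"
        f"<{collection_uri}> a skos:Collection ;\n"
        f'    skos:prefLabel "{label}"@en ;\n'
        f"    skos:member {members} ."
    )


if __name__ == "__main__":
    print(subset_collection(
        "http://example.org/subsets/storedMaterial",  # hypothetical URI
        "Values valid for stored material",
        KOS,
        ["http://example.org/kos/seed", "http://example.org/kos/pollen"],
    ))
```

Because the members are links rather than copies, the subset stays tied to the definitions maintained by the owners of the original KOS.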
26. How to decide if and what to publish
1. Data set already uses some standard vocabularies published as LOD
– No need to publish new vocabularies
2. Data set uses some local vocabularies
– If they have the same intended meaning as some standard vocabulary and the data owners agree…
– Then replace the local vocabularies with standard vocabularies (back to case 1)
3. Data set uses some local vocabularies
– If they have the same intended meaning as some standard vocabulary, but the data owners need to keep the local ones…
– Then publish the local vocabulary and map it to standard vocabularies
4. Data set uses some local vocabularies
– If there is no match or overlap with any standard vocabulary…
– Then publish the local vocabulary for others to re-use
4b. No existing vocabulary contains properties or concepts that are deemed useful by the community
– The community works on a new vocabulary to extend the existing ones
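The cases above can be read as a small decision procedure. The sketch below is one possible reading, not a prescribed algorithm: the flag names are hypothetical, and treating case 4b as a variant of case 4 (checked only when nothing matches a standard vocabulary) is an assumption.

```python
from enum import Enum


class Action(Enum):
    NOTHING_TO_PUBLISH = "no need to publish new vocabularies"
    REPLACE_WITH_STANDARD = "replace local vocabulary with standard vocabularies"
    PUBLISH_AND_MAP = "publish local vocabulary and map it to standard vocabularies"
    PUBLISH_FOR_REUSE = "publish local vocabulary for others to re-use"
    EXTEND_WITH_COMMUNITY = "work with the community on a new vocabulary"


def decide(uses_standard_lod, matches_standard=False,
           owners_agree_to_replace=False, community_needs_new_terms=False):
    """Map the situation of a data set to one of the publishing cases."""
    if uses_standard_lod:                      # case 1
        return Action.NOTHING_TO_PUBLISH
    if matches_standard:                       # local vocabulary, same meaning
        if owners_agree_to_replace:            # case 2
            return Action.REPLACE_WITH_STANDARD
        return Action.PUBLISH_AND_MAP          # case 3
    if community_needs_new_terms:              # case 4b (assumed variant of 4)
        return Action.EXTEND_WITH_COMMUNITY
    return Action.PUBLISH_FOR_REUSE            # case 4


if __name__ == "__main__":
    print(decide(uses_standard_lod=False, matches_standard=True).value)
```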
27. What vocabularies to publish for germplasm data?
Good RDF metadata vocabularies / ontologies exist
• Need to further extend Darwin Core classes and properties?
– Publish an extension to Darwin Core as an RDF or OWL vocabulary (see how later)
Good domain KOSs exist
• Need to indicate subsets in domain KOSs to be used for specific properties?
a) Work with classification owners to identify subsets
b) Re-publish subsets as SKOS collections linking to concepts in the original KOS, or as Application Profiles
Only a few value lists have been published (e.g. in DwC-Germplasm or in DwC Types)
• Publish value lists as SKOS
28. Publishing value lists
• Identify the most relevant controlled lists that
need to be published
• Check if anything similar has already been
published or if some existing lists of values can
be extended
• Publish them as LOD, linking to any similar
concepts already published in other
vocabularies.
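The check for similar, already published concepts could start with something as simple as comparing normalized labels. The sketch below does only that; the two vocabularies and their URIs are invented, and a label match is merely a candidate for a link such as `skos:closeMatch` that a person still has to confirm.

```python
# Sketch: propose candidate links between a local value list and a published
# one by comparing normalized labels. Both vocabularies are hypothetical.

def normalize(label):
    """Lower-case a label, treat hyphens as spaces, collapse whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


def propose_close_matches(local, published):
    """Return (local URI, published URI) pairs whose labels normalize equally.

    `local` and `published` map concept URIs to their preferred labels.
    """
    index = {}
    for uri, label in published.items():
        index.setdefault(normalize(label), []).append(uri)
    return [
        (local_uri, published_uri)
        for local_uri, label in local.items()
        for published_uri in index.get(normalize(label), [])
    ]


local_list = {
    "http://example.org/local/greenRose": "green-rose",
    "http://example.org/local/black": "Black",
}
published_list = {
    "http://example.org/published/c1": "Green rose",
    "http://example.org/published/c2": "red",
}

if __name__ == "__main__":
    for pair in propose_close_matches(local_list, published_list):
        print(pair)
```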
30. LOD guidelines
The methodologies comply with the Linked Data rules (Berners-Lee, 2006):
• “Use URIs as names for things”
– Concepts / values in value vocabularies and classes and properties in description vocabularies, as well as the vocabularies themselves, have to be identified by URIs.
• “Use HTTP URIs so that people can look up those names”
– The URIs for concepts / values, classes and properties, as well as vocabularies, have to resolve as HTTP URLs.
• “When someone looks up a URI, provide useful information”
– The URLs for concepts, classes and properties, as well as vocabularies, have to return an HTML page with useful information when requested by browsers, or RDF when requested by RDF software. In addition, vocabularies should be available for querying through a SPARQL endpoint.
• “Include links to other URIs, so that more things can be discovered”
– The URIs of concepts, classes and properties should whenever possible be linked to URIs in other vocabularies, for instance as a close match of another concept or a subclass of another class.
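Returning HTML to browsers and RDF to RDF software is usually done through HTTP content negotiation on the `Accept` header. The sketch below shows the selection step only, under simplifying assumptions: the list of supported media types is hypothetical, partial wildcards such as `text/*` are ignored, and a real server would rely on its web framework for this.

```python
# Sketch: pick a representation (HTML or RDF) from an HTTP Accept header.

SUPPORTED = ("text/html", "text/turtle", "application/rdf+xml")  # assumed set


def parse_accept(header):
    """Parse an Accept header into (media type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        entries.append((media, quality))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_representation(accept_header, supported=SUPPORTED, default="text/html"):
    """Return the media type to serve for a vocabulary or concept URI."""
    for media, quality in parse_accept(accept_header or ""):
        if quality <= 0:
            continue
        if media in supported:
            return media
        if media == "*/*":
            return default
    return default


if __name__ == "__main__":
    print(choose_representation("text/turtle"))
    print(choose_representation("text/html,application/xml;q=0.9,*/*;q=0.8"))
```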
31. Metadata vocabularies
• As indicated by the W3C Library Linked Data Incubator Group, metadata element sets are expressed as RDFS (RDF Schema) or OWL (Web Ontology Language) ontologies.
• They define classes and properties used to describe something.
Tools (listed in http://linkeddatabook.com/editions/1.0/):
• The Neologism Drupal distribution (open source, easy to use, deployable online and dedicated to the building and online publication of simple RDF vocabularies): http://neologism.deri.ie/
• TopBraid Composer (a powerful commercial modeling environment): http://www.topquadrant.com/products/TB_Composer.html
• Protégé (open-source ontology editor): http://protege.stanford.edu/
• The NeOn Toolkit (open-source ontology engineering environment for networked ontologies): http://neon-toolkit.org/
Heath, Tom and Bizer, Christian (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool. http://linkeddatabook.com/editions/1.0/
32. KOSs
• In RDF, KOSs are normally expressed using the SKOS vocabulary.
• They define concepts.
Tools:
• VocBench: a multilingual editing and workflow tool developed by FAO for the management of various types of KOS. It provides functionalities that facilitate both collaborative editing and multilingual terminology. http://aims.fao.org/tools/vocbench-2
• MoKi: a MediaWiki-based ontology editing tool where concepts can be added, revised, translated and deleted. https://moki.fbk.eu/website/index.php
• SKOSJS: https://github.com/tkurz/skosjs
• Protégé: http://protege.stanford.edu
• TemaTres Controlled Vocabulary server: http://www.vocabularyserver.com
• Commercial tools like PoolParty (http://poolparty.punkt.at/) or TopBraid Enterprise Vocabulary Net (http://www.topquadrant.com/solutions/ent_vocab_net.html)
iPlant: the program has implemented the SSWAP service, based on the SSWAP protocol. Three major information resources (Gramene, SoyBase and the Legume Information System) use SSWAP to semantically describe selected data and web services. Moreover, the Gene Ontology and Plant Ontology will soon be incorporated into SoyBase.
The methodology adopted by agINFRA for the publication of vocabularies as LOD aims at reusing existing resources as much as possible. According to the methodology agreed in the project, the first step consists in analyzing the datasets available and the metadata sets and KOS used (presented in this paper). The table below summarizes the germplasm and soil data sets considered so far in agINFRA, together with the metadata sets and KOS used.