Bio2RDF is an open-source project that offers a large and connected knowledge graph of Life Science Linked Data. Each dataset is expressed using its own vocabulary, which hinders integrating, searching, querying, and browsing data across similar or identical types of data. As source data grow and their content changes, a manual approach to maintaining mappings has proven untenable. The aim of this work is to develop a (semi-)automated procedure to generate high-quality mappings between Bio2RDF and SIO using BioPortal ontologies. Our preliminary results demonstrate that our approach is promising in that it can find new mappings using a transitive closure between ontology mappings. Further development of the methodology, coupled with improvements in the ontology, will offer a better-integrated view of the Life Science Linked Data.
1. ONTOLOGY MAPPING
FOR LIFE SCIENCE LINKED DATA
ISWC2016:::BMDID::Dumontier1
Amrapali Zaveri and Michel Dumontier
Stanford Center for Biomedical Informatics Research
Stanford University
2. Large and growing network of Linked Data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
3. Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the
representation and interlinking of biological data using RDF.
chemicals/drugs/formulations,
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• 11B+ interlinked statements from 35 biomedical
datasets and 400+ ontologies
• dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI,
NCBI, DBCLS, NCBO, OpenPHACTS, and
commercial tool providers
5. The lack of coordination around a global schema
makes Linked Data chaotic and unwieldy
6. Federated queries require intimate
knowledge of each dataset schema
Get all protein catabolic processes (and more specific GO terms) in biomodels
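PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Count the distinct BioModels reactions annotated with each GO term
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?reactions)
WHERE {
  # GO terms below any class whose label starts with "protein catabolic process"
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  # Biochemical reactions in BioModels that point to those GO terms
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label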
7. Previous work involved manual mappings between
Bio2RDF types and relations and the Semanticscience
Integrated Ontology (SIO)
[Figure: diagram with three layers labelled Knowledge Base, dataset and ontology. Instances (uniprot:P05067, pharmgkb:PA30917, omim:189931) are connected by "is a" links to dataset types (uniprot:Protein, refseq:Protein, omim:Gene, pharmgkb:Gene), which are in turn connected by "is a" links to SIO (sio:gene).]
Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and
Michel Dumontier. Bio-ontologies 2012.
12. Existing limitations
with Bio2RDF mappings
• New datasets have been added
• Existing datasets have changed
• The target ontology (SIO) has changed
• The target ontology (SIO) is incomplete and there
may be better ontologies to use
• These ontologies are evolving; today’s mappings
may be invalid or imprecise tomorrow
• Manual process -> not easy and not reproducible
-> must automate
13. Goal
Develop a semi-automated procedure to
generate high quality mappings between
Bio2RDF and SIO.
22. Mappings often occurred
to more than one class
sider:Drug-Indication-Association
sio:010038 (drug)
sio:010299 (disease)
sio:000897 (association)
23. Manual validation of mappings
Bio2RDF Class                  SIO Class                  Annotation
drugbank:Biotech               no match
clinicaltrials:Organization    sio:00012 (organization)   exact
drugbank:toxicity              sio:001008 (toxicity)      exact
sgd:GlycineCount               sio:000794 (count)         partial – is-a
wormbase:Genetic-Interaction   sio:010035 (gene)          partial – part-of
clinicaltrials:Serious-Event   sio:000614 (attribute)     incorrect
drugbank:Source                sio:000510 (model)         incorrect
All results available at https://goo.gl/eiijmQ
24. Conclusion
• Developed a semi-automated
methodology to map Bio2RDF classes to
SIO via BioPortal ontologies
• 245 of 319 Bio2RDF classes matched to
SIO
25. Limitations
• Unmatched classes: neither SIO nor other
ontologies have complete coverage
• Overly general concepts: Semantically
incompatible classes
• Incorrect mappings: Matches to part of the
class
• Mappings are insufficient to precisely
retrieve data across different datasets
26. Future Work
• Extend SIO to include classes that are
ultimately not found
• Explore mid-level portion of SIO to eliminate
root level mappings
• Scalable validation via crowdsourcing
• Pursue query rewriting
The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
Bio2RDF - 11 billion triples, 35 datasets with 6093 classes across all the datasets
Pruning - removing blank nodes, general resources, OWL vocabulary & other ontologies.
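The pruning step can be pictured as a simple filter over the type IRIs collected from the datasets. The sketch below is only an illustration under stated assumptions: the function name `prune_types`, the excluded namespace list, the `:Resource` suffix rule and the `http://bio2rdf.org/` prefix test are hypothetical choices, not the filter actually used in this work.

```python
from typing import Iterable, List

# Assumption: vocabulary namespaces treated as "OWL vocabulary" and dropped.
EXCLUDED_PREFIXES = (
    "http://www.w3.org/2002/07/owl#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
)

# Assumption: "general resources" are recognised by a ":Resource" suffix.
GENERAL_SUFFIXES = (":Resource",)


def prune_types(type_iris: Iterable[str],
                keep_prefix: str = "http://bio2rdf.org/") -> List[str]:
    """Return only dataset-specific types, preserving input order."""
    kept = []
    for iri in type_iris:
        if iri.startswith("_:"):                # blank node label
            continue
        if iri.startswith(EXCLUDED_PREFIXES):   # OWL/RDFS/RDF vocabulary
            continue
        if not iri.startswith(keep_prefix):     # classes from other ontologies
            continue
        if iri.endswith(GENERAL_SUFFIXES):      # general resource types
            continue
        kept.append(iri)
    return kept


sample = [
    "_:b0",
    "http://www.w3.org/2002/07/owl#Class",
    "http://bio2rdf.org/drugbank_vocabulary:Resource",
    "http://bio2rdf.org/drugbank_vocabulary:Drug",
    "http://purl.obolibrary.org/obo/GO_0030163",
]
print(prune_types(sample))
```

Only the dataset-specific type survives in this sample; the other four entries are removed by one rule each.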
SIO - 1500 classes, 208 properties
LogMap - large-scale ontology mapping
Some ontologies, such as CPO and FAO, had no mappings, while others (e.g. GAZ, COGPO) were inconsistent and could not be used by LogMap.
SIO-BioPortal & Bio2RDF-BioPortal
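One way to read the two mapping sets is as links that can be chained through a shared BioPortal class: if a Bio2RDF class and a SIO class both map to the same BioPortal class, the pair becomes a candidate mapping. The sketch below assumes each mapping set is a list of (source class, BioPortal class) pairs; the function name `compose_mappings` and the `bp:` identifiers are hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Mapping = Tuple[str, str]  # (source class, BioPortal class)


def compose_mappings(bio2rdf_to_bp: Iterable[Mapping],
                     sio_to_bp: Iterable[Mapping]) -> Dict[str, Set[str]]:
    """Chain Bio2RDF->BioPortal and SIO->BioPortal pairs via the shared class."""
    # Index the SIO side by BioPortal class for constant-time lookup.
    bp_to_sio = defaultdict(set)
    for sio_cls, bp_cls in sio_to_bp:
        bp_to_sio[bp_cls].add(sio_cls)

    candidates = defaultdict(set)
    for bio2rdf_cls, bp_cls in bio2rdf_to_bp:
        candidates[bio2rdf_cls] |= bp_to_sio.get(bp_cls, set())

    # Drop Bio2RDF classes that found no SIO partner.
    return {cls: sio for cls, sio in candidates.items() if sio}


# Hypothetical BioPortal identifiers; the Bio2RDF and SIO names come from the slides.
bio2rdf_pairs = [("drugbank:toxicity", "bp:Toxicity"),
                 ("drugbank:Biotech", "bp:Biotech")]
sio_pairs = [("sio:001008", "bp:Toxicity")]
print(compose_mappings(bio2rdf_pairs, sio_pairs))
```

A Bio2RDF class can end up with several SIO candidates this way, which is consistent with the observation that mappings often occurred to more than one class.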
When the mapped BioPortal class had no direct SIO mapping, we traversed its ancestors up to the first superclass that is mapped to a SIO class.
In this way, the Bio2RDF type becomes a candidate subclass of the SIO class.
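The ancestor traversal can be sketched as a breadth-first walk up the class hierarchy that stops at the first superclass with a SIO mapping. This is a minimal sketch, assuming the hierarchy is available as a child-to-parents dictionary; the function name `first_mapped_ancestor` and all `bp:` class names are hypothetical, and the tooling actually used may differ.

```python
from collections import deque
from typing import Dict, List, Optional


def first_mapped_ancestor(start: str,
                          parents: Dict[str, List[str]],
                          bp_to_sio: Dict[str, str]) -> Optional[str]:
    """Return the SIO class of the nearest ancestor of `start` that has one."""
    queue = deque(parents.get(start, []))
    seen = set(queue)
    while queue:
        cls = queue.popleft()
        if cls in bp_to_sio:            # nearest mapped superclass found
            return bp_to_sio[cls]
        for parent in parents.get(cls, []):
            if parent not in seen:      # guard against cycles and repeats
                seen.add(parent)
                queue.append(parent)
    return None                         # no ancestor is mapped to SIO


# Hypothetical hierarchy; sio:000794 (count) is taken from the validation table.
parents = {
    "bp:GlycineCount": ["bp:AminoAcidCount"],
    "bp:AminoAcidCount": ["bp:Count"],
    "bp:Count": ["bp:Thing"],
}
bp_to_sio = {"bp:Count": "sio:000794"}
print(first_mapped_ancestor("bp:GlycineCount", parents, bp_to_sio))
```

Breadth-first order is a design choice here: it returns the closest mapped ancestor, which keeps the candidate SIO superclass as specific as the hierarchy allows.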
sider:Drug-Indication-Association mapped to three SIO classes: sio:010038 (drug), sio:010299 (disease) and sio:000897 (association).
We evaluated the mappings manually.
drugbank:Source -> SNOMED CT Model Component -> model