1) The document discusses EBI's efforts to facilitate semantic alignment of its resources through building ontologies and annotating data with ontologies.
2) It describes EBI's work developing ontologies such as the Experimental Factor Ontology and using ontologies to enhance search, data visualization, and data integration.
3) The challenges of representing EBI data in RDF are discussed, and future directions are outlined that could make RDF deployment simpler and enable more interesting queries over EBI data.
1. 12th June, 2016
BioHackathon 2016 Symposium, Japan
Facilitating Semantic Alignment of EBI Resources
Simon Jupp
Ontology Project Lead
Samples, Phenotypes and Ontologies Team
www.ebi.ac.uk
2. SPOT team - Adding value with ontologies
Data exploration and cleanup
Data structuring
Ontology annotation
Data cleaning and mapping
Ontology building
Structured data
3. Data Enrichment Services
• Building an interoperability toolkit for Europe (Elixir)
• Micro-service architecture
• Technology-agnostic
• Pushing boundaries of ontology “embedding”
New ontology lookup service!
4. Building an ontology toolkit
Data exploration and cleanup
Data structuring
Ontology annotation
Data cleaning and mapping
Ontology building
Webulous
OxO mapping service
5. Building metadata rich resources
• Ontology markup of experimental variables/samples
• Focus on Phenotype/Disease annotation
• Linking common to rare disease
ArrayExpress
Gene Expression atlas
[Bar chart: EFO mapped coverage (%), axis 0 to 100; bar values 89, 77, 78, 100, 99]
6. OpenTargets Data Mapping Process
Source | Data type | Ontology / vocabulary
Reactome | Metabolic pathways | DOID
GWAS catalog | Common Disease (GWAS) | EFO
Atlas | Expression | EFO
Uniprot | Rare Disease (Expert-reviewed OMIM) | OMIM + own controlled vocab
European Variation Archive | Rare Disease | OMIM + Orphanet + SNOMED + Genetic Alliance + HPO
ChEMBL | Bioactivity data | ATC classification (14 terms)
EuropePMC | Literature Mining | UMLS
IMPC | Mouse Models | MPO + HPO
Cancer Gene Census | Somatic Mutations | own controlled vocab + NCIT
Process: Acquire → Clean → Map to Ontology → Curate → Add new terms → Iterate
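The acquire, clean, map, curate and iterate loop on this slide can be illustrated with a minimal sketch. This is not the Open Targets pipeline itself: the vocabulary, the `EX:` term IDs, the function names and the exact-match strategy are all assumptions chosen to keep the example small.

```python
import re

# Hypothetical mini-vocabulary: label or synonym -> term ID.
# The EX: identifiers are placeholders, not real ontology accessions.
TOY_ONTOLOGY = {
    "asthma": "EX:0001",
    "bronchial asthma": "EX:0001",
    "type 2 diabetes": "EX:0002",
    "t2d": "EX:0002",
}

def clean(label: str) -> str:
    """Clean step: lowercase, turn punctuation into spaces, collapse whitespace."""
    label = re.sub(r"[^a-z0-9 ]+", " ", label.lower())
    return " ".join(label.split())

def map_labels(labels, vocabulary):
    """Map step: exact match of the cleaned label against the vocabulary."""
    mapped, unmapped = {}, []
    for raw in labels:
        key = clean(raw)
        if key in vocabulary:
            mapped[raw] = vocabulary[key]
        else:
            unmapped.append(raw)
    return mapped, unmapped

def coverage(mapped, unmapped):
    """Percentage of input labels that received an ontology term."""
    total = len(mapped) + len(unmapped)
    return 100.0 * len(mapped) / total if total else 0.0

# Acquire: free-text labels as they might arrive from a data source.
acquired = ["Asthma", "Bronchial  asthma.", "T2D", "Crohn's disease"]
mapped, unmapped = map_labels(acquired, TOY_ONTOLOGY)

# Curate / add new terms: a curator reviews `unmapped` and extends the vocabulary.
vocabulary = dict(TOY_ONTOLOGY)
vocabulary[clean("Crohn's disease")] = "EX:0003"

# Iterate: rerun the mapping with the extended vocabulary.
remapped, still_unmapped = map_labels(acquired, vocabulary)
```

In this toy run, one of the four labels is left for curation on the first pass, and none after the vocabulary is extended. A real pipeline would likely add fuzzy matching and human review between passes.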
7. Experimental Factor Ontology – Data Driven Application Ontology
• EFO is an application ontology, written in OWL and built for use in production services
• Imports from ~10 ontologies, which isolates us from external churn
• Cross-referenced to 25 additional ontologies
• Continuous integration build process, reasoning, manual error checking, multi-editor environment
Chemical Entities of Biological Interest (ChEBI)
Gene Ontology
Cell Type
Anatomy
Phenotype
Disease
8. Ontologies Data
Managing data evolution in production
Ontology
Annotation
Provenance: who, when, context
Disease
Anatomy
Cell types
Gene function
(GO, HP, MP, UBERON, DO, ORDO)
Phenotype
…
10. Open Targets
Which other diseases are associated with PDE4D?
View diseases grouped in therapeutic areas or organised in a tree
View more information about PDE4D
Filter by therapeutic area
11. BioSolr
“BioSolr aims to significantly advance the state of the art with regards to indexing and querying biomedical data with freely available open source software”
flaxsearch/BioSolr
Solr documents with ontology annotation
Enriched Solr with ontology content (synonyms, structure, relations)
Solr/Elastic plugin
Query expansion and hierarchical faceting
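The idea of ontology-driven query expansion can be sketched as follows. This is not the BioSolr plugin's API: the toy ontology, the `EX:` identifiers and the function names are assumptions, and the only Solr-specific part is the standard `field:"phrase" OR ...` query syntax.

```python
# Hypothetical toy ontology: term ID -> label, synonyms and direct children.
ONTOLOGY = {
    "EX:1": {"label": "heart disease", "synonyms": ["cardiac disease"], "children": ["EX:2"]},
    "EX:2": {"label": "cardiomyopathy", "synonyms": ["myocardial disease"], "children": []},
    "EX:3": {"label": "asthma", "synonyms": [], "children": []},
}

def descendants(term_id, ontology):
    """Return term_id followed by all of its descendants, without repeats."""
    seen, stack = [], [term_id]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(ontology[current]["children"])
    return seen

def expand_query(text, ontology):
    """Expand a user query with synonyms and subclass names from the ontology."""
    wanted = text.strip().lower()
    terms = []
    for term_id, term in ontology.items():
        names = [term["label"]] + term["synonyms"]
        if wanted in (name.lower() for name in names):
            for descendant in descendants(term_id, ontology):
                entry = ontology[descendant]
                for name in [entry["label"]] + entry["synonyms"]:
                    if name not in terms:
                        terms.append(name)
    # Fall back to the raw text when nothing in the ontology matches.
    return terms or [text]

def to_solr_query(field, terms):
    """Build a Lucene/Solr OR query over quoted phrases."""
    return " OR ".join(f'{field}:"{term}"' for term in terms)

expanded = expand_query("Cardiac disease", ONTOLOGY)
query = to_solr_query("disease", expanded)
```

A search for a synonym of a parent term then also retrieves documents annotated with its child terms, which is the behaviour the slide describes as query expansion.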
13. Data resources at EMBL-EBI
Genes, genomes & variation
RNA Central
ArrayExpress
Expression Atlas
Metabolights
PRIDE
InterPro Pfam UniProt
ChEMBL SureChEMBL ChEBI
Molecular structures
Protein Data Bank in Europe
Electron Microscopy Data Bank
European Nucleotide Archive
European Variation Archive
European Genome-phenome Archive
Gene, protein & metabolite expression
Protein sequences, families & motifs
Chemical biology
Reactions, interactions & pathways
IntAct Reactome MetaboLights
Systems
BioModels Enzyme Portal BioSamples
Ensembl
Ensembl Genomes
GWAS Catalog
Metagenomics portal
Europe PubMed Central
BioStudies
Gene Ontology
Experimental Factor Ontology
Literature & ontologies
Product of previous biohackathons
14. EBI RDF Platform
Successes
• Novel queries possible over EBI datasets
• Production quality RDF releases
• Community of users
• Highly available public SPARQL endpoints
• 500+ users (10-50 million hits per month)
• A lot of interest from industry
• Catalyst for new RDF efforts
Lessons
• Public SPARQL endpoints problematic
• Query federation not performant
• Inference support limited
• Not scalable for all EBI data, e.g. Variation, ENA
• Lack of expertise in service teams
• Too much overhead to get started quickly in this space
15. Challenges for RDF at EMBL-EBI
• Most EBI resources publish data in forms that support common use cases (pre-integrated)
• Individual teams do the hard work so you don’t have to
• RDF representation not optimised for performance
• Barrier to building real (killer) applications
• Technology not mature enough / developer frameworks lacking
• Doing RDF shouldn’t mandate a technology choice anyway
• RDF not yet a “core” activity for EMBL-EBI
16. Where we are going next with RDF
• Virtualised infrastructure for RDF
• Simpler cloud deployment
• Building a single EBI RDF cache
• Simpler to manage
• More interesting queries
• Exploring cheaper paths to RDF
• RDF from REST + JSON-LD
• Via Wikidata
• RDFa and schema.org (bioschemas)
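One reading of the "RDF from REST + JSON-LD" path is that an existing REST payload gains an `@context` and can then be interpreted as RDF. The sketch below assumes a hypothetical payload and `example.org` IRIs, and hand-rolls a converter for a very small JSON-LD subset (flat documents, string values) rather than using a full JSON-LD processor.

```python
import json

# Hypothetical REST payload; field names and values are illustrative only.
rest_record = {
    "id": "sample-1",
    "organism": "Homo sapiens",
    "disease": "http://example.org/term/EX_0001",
}

# JSON-LD context mapping plain JSON keys to (assumed) property IRIs.
CONTEXT = {
    "organism": "http://example.org/vocab/organism",
    "disease": {"@id": "http://example.org/vocab/disease", "@type": "@id"},
}

def to_jsonld(record, context, base):
    """Wrap a REST record as JSON-LD by adding @context and an absolute @id."""
    doc = {"@context": context, "@id": base + record["id"]}
    doc.update({key: value for key, value in record.items() if key != "id"})
    return doc

def to_ntriples(doc):
    """Turn a flat JSON-LD document into N-Triples lines (tiny subset only)."""
    context, subject = doc["@context"], doc["@id"]
    lines = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue
        definition = context[key]
        if isinstance(definition, dict):
            predicate = definition["@id"]
            is_iri = definition.get("@type") == "@id"
        else:
            predicate, is_iri = definition, False
        # json.dumps gives a quoted, escaped string that approximates an RDF literal.
        obj = f"<{value}>" if is_iri else json.dumps(value)
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return lines

doc = to_jsonld(rest_record, CONTEXT, "http://example.org/sample/")
triples = to_ntriples(doc)
```

The appeal of this route is that the service keeps returning ordinary JSON, and only the context needs to be agreed on for the same payload to be read as RDF.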
17. Acknowledgements
• Samples, Phenotypes and Ontologies Team
• Olga Vrousgou, Thomas Liener, Dani Welter, Catherine Leroy, Sira Sarntivijai, Ilinca Tudose, Tony Burdett, Helen Parkinson
• Funding
• European Molecular Biology Laboratory (EMBL)
• European Union projects: DIACHRON, BioMedBridges and CORBEL, Excelerate
19. Topics of interest for the hackathon
• Ontology Mapping
• Disease (rare, common, phenotypes)
• Data annotation (automated, machine learning, text mining)
• Virtualised RDF data deployment
• RDF on the fly
• RDF over Mongo, Neo4j, Solr, Elastic
• REST + JSON-LD
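"RDF on the fly" could mean generating triples lazily from records held in a non-RDF store, and answering simple triple patterns without materialising a triple store. The sketch below uses an in-memory list as a stand-in for a document-store cursor; the documents, IRIs and function names are all assumptions, and no real Mongo, Neo4j, Solr or Elastic client is used.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Stand-in for documents returned by a document store query.
DOCUMENTS = [
    {"_id": "s1", "organism": "Homo sapiens", "tissue": "liver"},
    {"_id": "s2", "organism": "Mus musculus", "tissue": "liver"},
]

def triples(documents: Iterable[Dict[str, str]],
            base: str = "http://example.org/") -> Iterator[Triple]:
    """Lazily yield one (subject, predicate, object) triple per document field."""
    for doc in documents:
        subject = f"{base}sample/{doc['_id']}"
        for field, value in doc.items():
            if field != "_id":
                yield subject, f"{base}vocab/{field}", str(value)

def match(documents: Iterable[Dict[str, str]],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Answer a single triple pattern; None acts as a wildcard."""
    for triple in triples(documents):
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple

liver_samples = [
    subject
    for subject, _, _ in match(DOCUMENTS, p="http://example.org/vocab/tissue", o="liver")
]
```

Because both functions are generators, triples exist only while a pattern is being evaluated. A production version would presumably push the pattern down into the store's own query language instead of scanning every document.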