The Open PHACTS project delivers an online platform integrating a wide variety of data from across chemistry and the life sciences and an ecosystem of tools and services to query this data in support of pharmacological research, turning the semantic web from a research project into something that can be used by practising medicinal chemists in both academia and industry. In the summer of 2015 it was the first winner of the European Linked Data Award. At the Royal Society of Chemistry we have provided the chemical underpinnings to this system and in this talk we review its development over the past five years. We cover both our early work on semantic modelling of chemistry data for the Open PHACTS triplestore and more recent work building an all-purpose data platform, for which the Open PHACTS data has been an important test case. We also discuss what has worked well, what is missing and where this is likely to go in future.
1. Implementing chemistry platform for
OpenPHACTS: Lessons learned
Colin Batchelor, Alexey Pshenichnov, Jon Steele,
Valery Tkachenko
Royal Society of Chemistry
ACS Spring 2016
San Diego, CA
March 17th 2016
3. info@openphactsfoundation.org @Open_PHACTS
Open PHACTS Practical Semantics
Acknowledgements
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for
Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
SciBite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität
Bonn
AstraZeneca
Pfizer
4. Why is it so hard to….
Competitors?
What’s the
structure?
Are they in our
file?
What’s
similar?
What’s the
target?
Pharmacology
data?
Known
Pathways?
Working On
Now?
Connections to
disease?
Expressed in right
cell type?
IP?
7. Patent annotations in Open PHACTS
• Huge amount of knowledge hidden in patent corpus
• Most of which will never be published elsewhere
• Substantial lag between patent and scientific literature
• SureChEMBL system already extracts chemical entities from full-text
patent documents
• Text (title, abstract, description, claims), images, molfiles
• Complemented with gene and disease entity annotations
• Using the Termite text-mining tool by SciBite
• Relevance scoring to reduce noise
• Tested for recall
• Patent, compound, gene, disease info available via API
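The relevance-scoring step above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Annotation` record, the integer score scale, the threshold, and the placeholder identifiers are hypothetical and do not reflect the actual Termite output or the Open PHACTS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """One text-mined entity found in a patent (hypothetical record shape)."""
    patent_id: str    # placeholder patent identifier
    entity_type: str  # "compound", "gene" or "disease"
    entity_id: str    # placeholder entity identifier
    relevance: int    # assumed score: higher means more relevant


def filter_relevant(annotations: List[Annotation], min_relevance: int) -> List[Annotation]:
    """Keep only annotations at or above the threshold, to reduce noise."""
    return [a for a in annotations if a.relevance >= min_relevance]


# Placeholder data, invented purely to exercise the filter.
annotations = [
    Annotation("PATENT-1", "compound", "COMPOUND-1", 3),
    Annotation("PATENT-1", "gene", "GENE-1", 2),
    Annotation("PATENT-1", "disease", "DISEASE-1", 0),
]

kept = filter_relevant(annotations, min_relevance=2)
print([a.entity_id for a in kept])
```

Raising the threshold trades recall for precision, which is why a scoring step like this would need to be tested for recall.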
9. • VM install of Open PHACTS
– Docker Image is now available
• Updating to ver 2.0 Open PHACTS
• Allows you to customise and load your own data into the
environment
Want to load your data into
Open PHACTS?
Want to run Open PHACTS
within your environment?
18. [Architecture diagram. Labels shown: Apps; Linked Data API (RDF/XML, TTL, JSON); Core Platform; Semantic Workflow Engine; Domain Specific Services; Identity Resolution Service; Chemistry Registration, Normalisation & Q/C; Identifier Management Service; Indexing; Data Cache (Virtuoso Triple Store); VoID and Nanopub Db descriptions attached to each data source; Public Content; Commercial; Public Ontologies; User Annotations. Example identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
19. We integrate, standardize and host the chemical
compound collection underpinning Open
PHACTS.
We have developed a structure validation and
normalization platform (CVSP) to ensure chemical
structures are normalized to rules derived from the
FDA structure normalization guidelines and
modified based on input from members of EFPIA.
http://cvsp.chemspider.com/
The Royal Society of Chemistry’s
role in Open PHACTS
20. Freely-available (requires logging in)
chemical validation system for:
• Structure validation: warning on query
atoms, pseudoatoms, nonsensical or
unclear stereo
• Standardization workflows.
CVSP and the Open
Pharmacological Space Chemical
Registration System (OPS CRS)
21. Chemical data sources
Data source                        Number of records in source
DrugBank                           6828
PDB ligands                        18681
MeSH (extracted by text mining)    24381
ChEBI                              40503
HMDB                               41494
ChEMBL 20                          1456020
SureChEMBL 1.0                     14228299
22. We generate RDF that:
1. Describes synonyms and identifiers
2. Provides linksets between our data
sources and the OPS identifiers
3. Describes molecule–molecule relations of
interest to the pharma industry
4. Delivers calculated physicochemical
properties of compounds
5. Lists the validation and standardization
issues found by CVSP.
Royal Society of Chemistry data
provided to Open PHACTS
23. • Use standard ontologies where possible
(CHEMINF for cheminformatics
properties, QUDT for units, OBO
ontologies elsewhere)
• Use an event-based pattern for
cheminformatics outputs. This enables
us to add arbitrary provenance
information.
Principles
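A minimal sketch of what an event-based pattern might look like follows. The calculation is modelled as its own node, so provenance can be attached to the event without touching the result. All URIs and predicate names here live under a hypothetical `http://example.org/` namespace; in the real data they would be CHEMINF, QUDT and OBO terms, which are deliberately not reproduced.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace used only for this sketch.
EX = "http://example.org/"


def calculation_event(event_id: str, compound: str, prop: str,
                      value: float, unit: str, software: str) -> List[Triple]:
    """Describe one property calculation as an event with an input and an output."""
    event = f"<{EX}event/{event_id}>"
    result = f"<{EX}result/{event_id}>"
    return [
        # The event links the compound (input) to the calculated value (output).
        (event, f"<{EX}hasInput>", f"<{EX}compound/{compound}>"),
        (event, f"<{EX}hasOutput>", result),
        # Provenance hangs off the event, so more can be added later.
        (event, f"<{EX}usedSoftware>", f'"{software}"'),
        # The result carries the property type, the value and its unit.
        (result, f"<{EX}propertyType>", f"<{EX}property/{prop}>"),
        (result, f"<{EX}value>", f'"{value}"'),
        (result, f"<{EX}unit>", f"<{EX}unit/{unit}>"),
    ]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialise the triples one per line, N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = calculation_event("e1", "c1", "logP", 2.5, "dimensionless", "some-calculator")
print(to_ntriples(triples))
```

Adding a further provenance statement, such as a date or an operator, means appending one more triple about the event node. The triples describing the result stay unchanged.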
24. Use the CHEMINF ontology:
https://github.com/semanticchemistry
Validated ChemSpider synonyms,
Unvalidated ChemSpider synonyms,
Validated database identifiers, Unvalidated
database identifiers, InChI, InChIKey,
SMILES, preferred ChemSpider name
1. Synonyms and identifiers
25. Metadata describing the RDF:
• Can be used to build a directory of the
RDF available
• Find what’s there without having to
download all of it first
• Describes how Datasets are linked by the
Linksets using SKOS.
Recommendations here:
http://www.openphacts.org/specs/2013/WD-
datadesc-20130912/
2. Linksets:
Vocabulary of Interlinked
Datasets
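As a rough illustration of such metadata, the sketch below builds a VoID linkset description in Turtle. The VoID terms (`void:Linkset`, `void:subjectsTarget`, `void:objectsTarget`, `void:linkPredicate`, `void:triples`) and the SKOS mapping properties are standard vocabulary. The `ex:` namespace, the dataset names and the triple count are hypothetical, and the Open PHACTS recommendations require further metadata that is not shown.

```python
# SKOS mapping properties that can serve as a link predicate.
SKOS_MAPPINGS = {"exactMatch", "closeMatch", "relatedMatch", "broadMatch", "narrowMatch"}


def linkset_turtle(name: str, subjects_target: str, objects_target: str,
                   skos_mapping: str, triples: int) -> str:
    """Return a Turtle description of a linkset between two datasets."""
    if skos_mapping not in SKOS_MAPPINGS:
        raise ValueError(f"not a SKOS mapping property: {skos_mapping}")
    return (
        "@prefix void: <http://rdfs.org/ns/void#> .\n"
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
        "@prefix ex: <http://example.org/void/> .\n"  # hypothetical namespace
        "\n"
        f"ex:{name} a void:Linkset ;\n"
        f"    void:subjectsTarget ex:{subjects_target} ;\n"
        f"    void:objectsTarget ex:{objects_target} ;\n"
        f"    void:linkPredicate skos:{skos_mapping} ;\n"
        f"    void:triples {triples} .\n"
    )


# Hypothetical dataset names and count, for illustration only.
description = linkset_turtle("sourceToOps", "sourceDataset", "opsDataset", "exactMatch", 1000)
print(description)
```

Because the description is small and separate from the data, a client can read it to decide whether the linkset is worth downloading.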
26. We relate molecules to “parent” forms,
variously, those which are:
• uncharged
• not isotopically-specified
• not stereochemically-specified
• the preferred tautomer
• the largest fragment
• the “superparent” (all of the above)
3. Molecule–molecule relations in
CHEMINF
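To make one of these relations concrete, here is a deliberately crude sketch of finding the "largest fragment" parent of a multi-component SMILES string. It is not how CVSP does it. It assumes "largest" means the most atoms written explicitly in the SMILES, counts only the organic-subset and bracket atoms, and ignores implicit hydrogens. A real pipeline would use a cheminformatics toolkit.

```python
import re

# Matches bracket atoms such as [Na+], two-letter halogens, and the
# organic-subset atoms (upper case aliphatic, lower case aromatic).
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def atom_count(fragment: str) -> int:
    """Count the atoms written explicitly in a SMILES fragment."""
    return len(ATOM.findall(fragment))


def largest_fragment(smiles: str) -> str:
    """Return the dot-separated component with the most atoms.

    Ties are resolved in favour of the component that appears first.
    """
    return max(smiles.split("."), key=atom_count)


# Sodium acetate: the acetate anion is the larger component.
print(largest_fragment("CC(=O)[O-].[Na+]"))
```

The "superparent" in the list above would chain this step with charge neutralisation, removal of isotope and stereo labels, and tautomer selection, none of which this sketch attempts.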
27. log P, log D (at pH 5.5 and 7.4),
bioconcentration factor, KOC (at pH 5.5 and
7.4), index of refraction, polar surface
area, molar refractivity, molar volume,
polarizability, surface tension, density at
STP, flash point at 1 atm, enthalpy of
vaporization at STP, vapour pressure at
STP.
4. Calculated physicochemical
properties
28. 5. Issues from validation and
standardization
We use the CHEMINF ontology again.
We distinguish between information,
warnings and errors. Only serious failures
to process, such as a structure having an
invalid atom, count as errors.
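The three-level distinction can be sketched as follows. The `Issue` record, its codes and its messages are hypothetical and are not CVSP's actual output. Only the rule that errors alone mark a failure to process comes from the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    code: str        # hypothetical issue code
    message: str
    severity: Severity


def is_processable(issues: List[Issue]) -> bool:
    """A record can be processed unless at least one issue is an error."""
    return all(issue.severity is not Severity.ERROR for issue in issues)


def worst(issues: List[Issue]) -> Optional[Severity]:
    """Return the most serious severity present, or None if there are no issues."""
    return max((i.severity for i in issues), key=lambda s: s.value, default=None)


# Hypothetical issues, loosely based on the checks named in the slides.
issues = [
    Issue("stereo", "unclear stereo", Severity.WARNING),
    Issue("query-atom", "query atom present", Severity.WARNING),
]
print(is_processable(issues), worst(issues))
```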
30. Data quality issues and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Remember this: some of these questions are easier to answer than others
Using available public data is critical to drug discovery
Can go get everything
Open PHACTS not a repo of the world, specific sources
Open PHACTS was developed to support the key questions of drug discovery
Business questions have been at the heart of Open PHACTS and have driven the development of the platform
Mx/psa, how calculated who did it?
Mash up. With your data too,
- top layer join together but need them all
commercial
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 5 billion triples – 14 datasets & growing
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
4 million full-text patent documents annotated
• USPTO, WIPO, EPO – English language
• Life-sciences relevant
• Patents mapped to SureChEMBL IDs (e.g. EP-1339685-A2)
• Title, publication date, classification codes
• Compounds mapped to SCHEMBL IDs (e.g. SCHEMBL15064)
• Genes mapped to HGNC symbols (e.g. FDFT1)
• Diseases mapped to MeSH terms (e.g. D009765)
Stds: which ones (later)
Access (API). Driven by the API. Accelerate building of apps
The Open PHACTS Discovery Platform is now supported by a Foundation
We have seen growing usage of the platform, both in volume and in registered applications
API remains the cornerstone of the delivery
Connected to different consumer groups
Once users get connected to an API, they tend to stick with it.
We are listening to this and will have a version-independent URL
Here we see the RSC dataset in the Open PHACTS data repository, which is running the open source Artifactory.
The repository understands Maven metadata, and also maintains and verifies checksums of data artifacts.
We see here it includes the suggested <dependency> setting for using the dataset from a different Maven project. While this would perhaps be a bit exotic, doing so would put the dataset directly on the classpath without any need to worry about downloads or file paths.
We can see the hierarchy of the dataset on the left – the repository has expanded the archive for us. The .ro folder contains the Research Object manifest, the void file is the Dataset description – the rest is the “actual data”.
One power of Maven is the ease of setting up mirroring – the dataset above is actually from a mirror of the Maven repository of the build server in Manchester.
You have probably all seen this slide.
Want to pick out some of the key changes; tomorrow you will hear more