Implementing chemistry platform for OpenPHACTS

Implementing chemistry platform for
OpenPHACTS: Lessons learned
Colin Batchelor, Alexey Pshenichnov, Jon Steele,
Valery Tkachenko
Royal Society of Chemistry
ACS Spring 2016
San Diego, CA
March 17th 2016

Open PHACTS Mission:
Integrate Multiple Research
Biomedical Data Resources
Into A Single Open & Sustainable
Access Point

info@openphactsfoundation.org @Open_PHACTS
Open PHACTS Practical Semantics
Acknowledgements
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
SciBite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer

Why is it so hard to…
• Competitors?
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• IP?

How do R&D companies use public data?
[Figure labels: Literature, PubChem, Genbank, Patents, Databases, Downloads; Data Analysis, Data Integration, Firewalled Databases]


Patent annotations in Open PHACTS
• Huge amount of knowledge hidden in patent corpus
• Most of which will never be published elsewhere
• Substantial lag between patent and scientific literature
• SureChEMBL system already extracts chemical entities from full-text
patent documents
• Text (title, abstract, description, claims), images, molfiles
• Complemented with gene and disease entity annotations
• Using the Termite text-mining tool by SciBite
• Relevance scoring to reduce noise
• Tested for recall
• Patent, compound, gene, disease info available via API

Open PHACTS Expanding EcoSystem
Further Apps: Data Warrior

Want to load your data into Open PHACTS?
Want to run Open PHACTS within your environment?
• VM install of Open PHACTS
– Docker image is now available
• Updating to version 2.0 of Open PHACTS
• Allows you to customise and load your own data into the environment

Upgrading
Challenge of migrating between versions of the API

Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna

openphactsfoundation.org/apps.html
Explorer.openphacts.org

http://data.openphacts.org/artifactory/

[Architecture diagram: Open PHACTS platform. Component labels:]
• Apps
• Linked Data API (RDF/XML, TTL, JSON)
• Semantic Workflow Engine
• Data Cache (Virtuoso Triple Store)
• Core Platform
• Domain Specific Services
• Identity Resolution Service
• Identifier Management Service
• Chemistry Registration, Normalisation & Q/C
• Indexing
• Data inputs: Public Content, Commercial, Public Ontologies, User Annotations (labelled Db, Nanopub and VoID)
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”

The Royal Society of Chemistry’s role in Open PHACTS
We integrate, standardize and host the chemical compound collection underpinning Open PHACTS.
We have developed a structure validation and normalization platform (CVSP) to ensure chemical structures are normalized to rules derived from the FDA structure normalization guidelines and modified based on input from members of EFPIA.
http://cvsp.chemspider.com/

CVSP and the Open Pharmacological Space Chemical Registration System (OPS CRS)
Freely-available (requires logging in) chemical validation system for:
• Structure validation: warning on query atoms, pseudoatoms, nonsensical or unclear stereo
• Standardization workflows.

Chemical data sources
Data source                        Number of records in source
DrugBank                           6828
PDB ligands                        18681
MeSH (extracted by text mining)    24381
ChEBI                              40503
HMDB                               41494
ChEMBL 20                          1456020
SureChEMBL 1.0                     14228299

Royal Society of Chemistry data provided to Open PHACTS
We generate RDF that:
1. Describes synonyms and identifiers
2. Provides linksets between our data sources and the OPS identifiers
3. Describes molecule–molecule relations of interest to the pharma industry
4. Delivers calculated physicochemical properties of compounds
5. Lists the validation and standardization issues found by CVSP.

Principles
• Use standard ontologies where possible (CHEMINF for cheminformatics properties, QUDT for units, OBO ontologies elsewhere)
• Use an event-based pattern for cheminformatics outputs. This enables us to add arbitrary provenance information.
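To illustrate the event-based pattern, here is a minimal sketch in Python. It is not the RSC implementation: the `http://example.org/` namespace, every predicate name and the sample values are hypothetical placeholders, where the real RDF would use CHEMINF and QUDT terms. The point it shows is that the calculation itself becomes a node, so provenance hangs off the event and not off the compound.

```python
from dataclasses import dataclass

EX = "http://example.org/"  # hypothetical namespace, NOT a real Open PHACTS or CHEMINF IRI


@dataclass(frozen=True)
class CalculationEvent:
    """One run of a property calculation on one compound."""
    event_id: str
    compound_id: str
    property_name: str
    value: float
    unit: str
    software: str   # provenance: which tool produced the value
    run_date: str   # provenance: when it was produced


def event_to_triples(ev):
    """Return (subject, predicate, object) tuples for one event.

    All predicate names are illustrative placeholders.
    """
    event = EX + "event/" + ev.event_id
    output = event + "/output"
    return [
        (event, EX + "hasInput", EX + "compound/" + ev.compound_id),
        (event, EX + "hasOutput", output),
        # Provenance is attached to the event node, so more can be added freely.
        (event, EX + "usedSoftware", ev.software),
        (event, EX + "runOn", ev.run_date),
        (output, EX + "describesProperty", ev.property_name),
        (output, EX + "hasValue", ev.value),
        (output, EX + "hasUnit", ev.unit),
    ]


# Hypothetical example values, for illustration only.
example = CalculationEvent("e1", "c42", "polar surface area", 63.6,
                           "square angstrom", "some-predictor 1.0", "2016-03-17")
triples = event_to_triples(example)
```

Adding a further provenance statement (an operator, a parameter set) means appending one more triple about the event, with no change to the compound or the output value.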

1. Synonyms and identifiers
Use the CHEMINF ontology: https://github.com/semanticchemistry
Validated ChemSpider synonyms, Unvalidated ChemSpider synonyms, Validated database identifiers, Unvalidated database identifiers, InChI, InChIKey, SMILES, preferred ChemSpider name

2. Linksets: Vocabulary of Interlinked Datasets
Metadata describing the RDF:
• Can be used to build a directory of the RDF available
• Find what’s there without having to download all of it first
• Describes how Datasets are linked by the Linksets using SKOS.
Recommendations here:
http://www.openphacts.org/specs/2013/WD-datadesc-20130912/
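As a rough illustration of what such linkset metadata can look like, the sketch below builds a small VoID description in Turtle using only string formatting. The VoID and SKOS terms are taken from those vocabularies; the dataset IRIs and the triple count are hypothetical placeholders, and the Open PHACTS dataset description recommendations require more properties than are shown here.

```python
VOID = "http://rdfs.org/ns/void#"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def linkset_description(linkset_iri, subjects_dataset, objects_dataset,
                        link_predicate, triple_count):
    """Return a Turtle string describing one linkset between two datasets."""
    lines = [
        f"@prefix void: <{VOID}> .",
        f"@prefix skos: <{SKOS}> .",
        "",
        f"<{linkset_iri}> a void:Linkset ;",
        f"    void:subjectsTarget <{subjects_dataset}> ;",
        f"    void:objectsTarget <{objects_dataset}> ;",
        f"    void:linkPredicate {link_predicate} ;",
        f"    void:triples {triple_count} .",
    ]
    return "\n".join(lines)


# All IRIs and the count below are made up for illustration.
turtle = linkset_description(
    "http://example.org/linkset/ops-to-source",
    "http://example.org/dataset/ops",
    "http://example.org/dataset/source",
    "skos:exactMatch",
    3,
)
print(turtle)
```

A consumer can read this description to learn which two datasets are linked, by which SKOS predicate, and how many links exist, without downloading the links themselves.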

3. Molecule–molecule relations in CHEMINF
We relate molecules to “parent” forms, variously, those which are:
• uncharged
• not isotopically-specified
• not stereochemically-specified
• the preferred tautomer
• the largest fragment
• the “superparent” (all of the above)
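Two of the parent forms above can be approximated with plain string handling on SMILES, which gives a feel for what the relations mean. This is a toy sketch, not how CVSP or the OPS CRS derives parents: it assumes "largest" means most heavy atoms, it drops stereo marks without checking the chemistry, and a real pipeline would use a cheminformatics toolkit. The function names are hypothetical.

```python
import re

# Bracket atoms, two-letter organic-subset atoms, then one-letter atoms.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment):
    """Count non-hydrogen atoms in one SMILES fragment (approximate)."""
    return sum(1 for token in _ATOM.findall(fragment)
               if token not in ("[H]", "[H+]", "[H-]"))


def largest_fragment(smiles):
    """Pick the dot-separated fragment with the most heavy atoms.

    Ties go to the first such fragment.
    """
    return max(smiles.split("."), key=heavy_atom_count)


def strip_stereo(smiles):
    """Remove tetrahedral (@) and double-bond (/ and \\) stereo marks."""
    return re.sub(r"[@/\\]", "", smiles)


print(largest_fragment("CC(=O)[O-].[Na+]"))
print(strip_stereo("N[C@@H](C)C(=O)O"))
```

Applied to sodium acetate, `largest_fragment` keeps the acetate anion and discards the sodium counter-ion; the uncharged, isotope-free and preferred-tautomer parents need real structure perception and are not attempted here.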

4. Calculated physicochemical properties
log P, log D (at pH 5.5 and 7.4), bioconcentration factor, KOC (at pH 5.5 and 7.4), index of refraction, polar surface area, molar refractivity, molar volume, polarizability, surface tension, density at STP, flash point at 1 atm, enthalpy of vaporization at STP, vapour pressure at STP.

5. Issues from validation and
standardization
We use the CHEMINF ontology again.
We distinguish between information,
warnings and errors. Only serious failures
to process, such as a structure having an
invalid atom, count as errors.
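The three-level distinction can be sketched as follows. This is an illustration under stated assumptions, not CVSP's rule set: the element list is a deliberately small subset, the query-atom symbols (`A`, `Q`, `*`) follow common molfile usage, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    severity: Severity
    message: str


# Deliberately small subset of element symbols, for illustration only.
KNOWN_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Na", "P", "S", "Cl", "K", "Br", "I"}
# Assumed query-atom symbols, following common molfile usage.
QUERY_ATOMS = {"A", "Q", "*"}


def check_atoms(symbols):
    """Return the issues raised by a list of atom symbols."""
    issues = []
    for index, symbol in enumerate(symbols, start=1):
        if symbol in QUERY_ATOMS:
            # The structure can still be processed, so this is only a warning.
            issues.append(Issue(Severity.WARNING,
                                f"atom {index}: query atom '{symbol}'"))
        elif symbol not in KNOWN_ELEMENTS:
            # An invalid atom is a serious failure to process.
            issues.append(Issue(Severity.ERROR,
                                f"atom {index}: invalid atom '{symbol}'"))
    return issues


def failed_to_process(issues):
    """A record fails only if at least one issue is an error."""
    return any(issue.severity is Severity.ERROR for issue in issues)


issues = check_atoms(["C", "Q", "Xx"])
```

Keeping severity as data on each issue lets a consumer decide what to do with warnings, while only errors stop a record from being processed.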

Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system

Chemistry Validation and Standardization Platform

Thank you
Email: tkachenkov@rsc.org
Slides:
http://www.slideshare.net/valerytkachenko16
