We provide an overview of how we use ontologies at the Royal Society of Chemistry. Our engagement with the ontology community began in 2006 with preparations for Project Prospect, which used ChEBI and other Open Biomedical Ontologies to mark up journal articles. Project Prospect has since evolved into DERA (Digitally Enhancing the RSC Archive), and we have developed further ontologies for text markup, covering analytical methods and name reactions. Most recently we have been contributing to CHEMINF, an open-source cheminformatics ontology, as part of our work on disseminating calculated physicochemical properties of molecules via Open PHACTS. We show how we represent these properties and how this representation can serve as a template for disseminating other sorts of chemical information.
1. Ontology work at the Royal Society of Chemistry
Antony J. Williams, Colin Batchelor, Peter Corbett, Jon Steele and Valery Tkachenko
ACS Dallas
March 16th 2014
2. Royal Society of Chemistry
• You know us as a publisher and society but
• We are a host of chemistry databases
• We are a charity and a source of community support
• We are a provider of grant-based services
• We are an innovator in cheminformatics
3. We have data to manage…
• Compounds
• Reactions
• Spectra
• Crystals
• Materials
• Assays
• Algorithms
• …
7. Physicochemical properties
LONG LIST: log P, log D (at pH 5.5, at pH
7.4), bioconcentration factor, KOC (at pH
5.5, at pH 7.4), index of refraction, polar
surface area, molar refractivity, molar
volume, polarizability, surface tension,
density at STP, flash point at 1 atm, boiling
point at 1 atm, enthalpy of vaporization at
STP, vapour pressure at STP…
8. All are amenable to ontologies
and should blend standards
• Compounds and properties are handled
(InChIs are important)
• Reactions are covered (and RInChIs help)
• Spectra (JCAMP, AnIML, NetCDF, mzML)
• Crystals (CIFs)
• Materials (MatML)
• Assays (MIAME)
• Algorithms
• …
11. ChemSpider is 7 years old
• When ChemSpider was developed ontologies
were not directly implemented
• The ontologies and technologies have developed
and become more accepted in those seven years
• Some efforts have been made to include
ontologies – layer on MeSH. We support a lot
of standards – InChI, RInChI, JCAMP, CIF
• The ChemSpider architecture is being rebuilt,
and we are considering new standards and
ontologies for it
12. Some available ontologies…
• RSC has built and opened in-house ontologies:
• Chemical methods (CHMO)
• Name reactions (RXNO)
• Molecular processes (MOP), largely auto-generated
from the corresponding ChEBI classes
• We have contributed to external ontologies:
• Small molecules (ChEBI)
• Cheminformatics (CHEMINF)
13. Chemistry ontologies 1
ChEBI (molecules, families of molecules,
parts of molecules, 32128 fully annotated
classes) (http://www.ebi.ac.uk/chebi/)
perylene (CHEBI:29861) a perylene (CHEBI:60201)
perylene skeleton
(CHEBI:60200)
16. Chemistry ontologies 2
Chemical Methods Ontology (http://rsc-cmo.googlecode.com)
2745 classes describing methods used to:
•collect data in chemical experiments, such as MS and NMR
•prepare and separate material for further analysis, such as
sample ionisation, chromatography, and electrophoresis
•synthesise materials, such as continuous vapour deposition
•also describes the instruments used in these experiments,
such as mass spectrometers and chromatography columns and
their outputs
•Should be of value to chemical hazards and safety data
19. Limits of ontologies
Chemical space is very big:
“The ‘small molecule universe’ (SMU), the set of
all synthetically feasible organic molecules of 500
Daltons molecular weight or less, is estimated to
contain over 10^60 structures, making exhaustive
searches for structures of interest impractical.”
Virshup et al., J. Am. Chem. Soc.,
doi:10.1021/ja401184g
20. Why a named reaction ontology?
• Despite attempts to introduce systematic
nomenclature for organic reactions, lots of
chemists still prefer to attach human
names.
21. A big challenge
• Classification is based on what the experimenter
intends
• Build the ontology around intended product
molecules rather than what might be by-products
• (Carbon dioxide, water, hydrolysed protecting
groups, protons, etc. etc.)
24. Limits of reaction classification
• Much of RXNO is still classified by hand
• Example: we can’t just define a cyclization as
a reaction where a cyclic compound is formed.
The Friedel–Crafts acylation produces a cyclic
compound but is not a cyclization!
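The point above can be made concrete with a small sketch. Molecules are written by hand as heavy-atom graphs (atom count plus bond list), and rings are counted as bonds minus atoms plus connected components. The rule "ring count must increase" is only one possible refinement, shown for contrast; it is an assumption of this sketch, not how RXNO defines a cyclization, and by-products such as HCl are left out.

```python
from typing import List, Tuple

Bond = Tuple[int, int]
Molecule = Tuple[int, List[Bond]]  # (number of heavy atoms, bonds)


def ring_count(n_atoms: int, bonds: List[Bond]) -> int:
    """Independent rings = bonds - atoms + connected components."""
    parent = list(range(n_atoms))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = n_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - n_atoms + components


def total_rings(molecules: List[Molecule]) -> int:
    return sum(ring_count(n, bonds) for n, bonds in molecules)


def naive_is_cyclization(products: List[Molecule]) -> bool:
    """The tempting definition: some product contains a ring."""
    return total_rings(products) > 0


def forms_new_ring(reactants: List[Molecule], products: List[Molecule]) -> bool:
    """Hypothetical refinement: the reaction must create a ring."""
    return total_rings(products) > total_rings(reactants)


# Hand-built heavy-atom graphs (hydrogens omitted, bond orders ignored).
RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
BENZENE: Molecule = (6, RING)
ACETYL_CHLORIDE: Molecule = (4, [(0, 1), (1, 2), (1, 3)])
ACETOPHENONE: Molecule = (9, RING + [(0, 6), (6, 7), (6, 8)])
HEXATRIENE: Molecule = (6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])
CYCLOHEXADIENE: Molecule = (6, HEXATRIENE[1] + [(5, 0)])

if __name__ == "__main__":
    # Friedel-Crafts acylation: benzene + acetyl chloride -> acetophenone
    print(naive_is_cyclization([ACETOPHENONE]))                        # ring present
    print(forms_new_ring([BENZENE, ACETYL_CHLORIDE], [ACETOPHENONE]))  # no new ring
    # Electrocyclization: hexa-1,3,5-triene -> cyclohexa-1,3-diene
    print(forms_new_ring([HEXATRIENE], [CYCLOHEXADIENE]))
```

The naive test accepts the Friedel–Crafts acylation because the benzene ring survives into the product, while the ring-count comparison rejects it. Real reactions offer plenty of further edge cases, which is one reason hand curation is still needed.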
25. RXNO in the wild
510 classes in the RXNO namespace
… and RXNO is built into NextMove
Software’s reaction identification tool.
26. RXNO: next steps
• More reactions!
• More cross-references!
• More example reactions!
• Links to graphical versions! (All drawn, just
awaiting uploading.)
• More SMIRKS strings!
27. Using ontologies in text mining
• To provide a controlled vocabulary of terms
found in text and a common identifier.
• Ideally this identifier is a resolvable HTTP
URI (for example, for chemical compounds,
http://purl.obolibrary.org/obo/CHEBI_36063),
and likewise for methods terminology
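As a minimal sketch of how a compact identifier maps to such a URI, the helper below converts between the `PREFIX:digits` form and the OBO PURL pattern seen in the ChEBI example. The function names are made up for illustration, and the sketch only handles ontologies that follow this PURL pattern.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"
_CURIE = re.compile(r"^([A-Za-z]+):(\d+)$")


def curie_to_purl(curie: str) -> str:
    """Turn e.g. 'CHEBI:36063' into its OBO-style HTTP URI."""
    match = _CURIE.match(curie)
    if match is None:
        raise ValueError(f"not a PREFIX:digits identifier: {curie!r}")
    prefix, local_id = match.groups()
    return f"{OBO_BASE}{prefix}_{local_id}"


def purl_to_curie(purl: str) -> str:
    """Inverse of curie_to_purl for URIs under the OBO base."""
    if not purl.startswith(OBO_BASE):
        raise ValueError(f"not an OBO PURL: {purl!r}")
    prefix, _, local_id = purl[len(OBO_BASE):].partition("_")
    return f"{prefix}:{local_id}"


if __name__ == "__main__":
    print(curie_to_purl("CHEBI:36063"))
```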
29. Ontologies as synonym sets for
text-mining
• We have text-mined the whole 21st century
RSC archive with a myriad of ontologies.
Results are on the publishing platform
• We have looked for correlations between
molecules and ontology terms.
• Two examples follow…
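Separately from those two examples, the basic idea of using an ontology as a synonym set can be sketched in a few lines. This is a toy dictionary matcher, not the RSC text-mining pipeline; the `EX:` identifiers and the synonym lists are placeholders invented for the illustration.

```python
import re
from typing import Dict, List, Pattern, Tuple

# HYPOTHETICAL synonym sets: term ID -> labels and synonyms.
SYNONYMS: Dict[str, List[str]] = {
    "EX:0001": ["nuclear magnetic resonance spectroscopy",
                "NMR spectroscopy", "NMR"],
    "EX:0002": ["mass spectrometry", "MS"],
}


def build_matcher(synonyms: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one regex over all synonyms, longest alternatives first."""
    lookup = {name.lower(): term_id
              for term_id, names in synonyms.items()
              for name in names}
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def annotate(text: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched text, term ID) for each hit."""
    pattern, lookup = build_matcher(synonyms)
    return [(m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
            for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "The product was characterised by NMR spectroscopy and mass spectrometry."
    for hit in annotate(sentence):
        print(hit)
```

Sorting the alternatives longest-first makes "NMR spectroscopy" win over the bare "NMR". Case-insensitive matching is a simplification: short abbreviations such as "MS" would need extra disambiguation in real text.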
32. Projects and Ontologies
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using
semantic web technologies
• Open source code, open data and open
standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…
35. Our RDF schema
Two dozen calculated properties for >10^6
molecules
•CHEMINF ontology for cheminformatics
•QUDT for units and numeric values
•ChemSpider IDs for molecules
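Diagram (benzene example):
• calculation has_input connection table
• connection table is_about benzene
• calculation has_output calculated log P
• calculated log P has_value 2.177
• calculated log P has_unit dimensionless
• calculated log P has standard uncertainty 0.234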
36. RSC data in Open PHACTS
1. Molecule synonyms and identifiers
2. Linksets between ChEBI, ChEMBL, DrugBank
and OPS identifiers
3. Molecule–molecule relations (“parent–child”) of
interest for drug discovery
4. Calculated physicochemical properties for
compounds (both molecular and macroscopic)
37. Synonyms and identifiers
Newly added to the CHEMINF ontology:
•Validated ChemSpider synonyms
•Unvalidated ChemSpider synonyms
•Validated database identifiers
•Unvalidated database identifiers
•InChI, InChIKey, SMILES
•Preferred ChemSpider name
38. Physicochemical properties
log P, log D (at pH 5.5, at pH 7.4),
bioconcentration factor, KOC (at pH 5.5, at
pH 7.4), index of refraction, polar surface
area, molar refractivity, molar volume,
polarizability, surface tension, density at
STP, flash point at 1 atm, boiling point at 1
atm, enthalpy of vaporization at STP,
vapour pressure at STP
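39. It is actually more complicated…
Diagram (source ontology of each term in brackets):
• benzene’s connection table, rdf:type, connection table [CHEMINF]
• benzene’s connection table, is about [IAO], benzene [OPS]
• calculation process, rdf:type, execution of ACD/Labs PhysChem software library version 12.01 [CHEMINF]
• calculation process, has specified input [OBI], benzene’s connection table
• calculation process, has specified output [OBI], calculation result
• calculation result, rdf:type, calculated log P [CHEMINF]
• calculation result, has value [QUDT], “2.17”^^xsd:float
• calculation result, has standard uncertainty [QUDT], “0.234”^^xsd:float
• calculation result, has unit [QUDT], dimensionless quantity [QUDT]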
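The graph on this slide can be written out as triples. The sketch below builds them in plain Python and serialises them in N-Triples style. Every IRI under `http://example.org/` is a placeholder standing in for the real CHEMINF, OBI, IAO, QUDT and OPS terms, whose actual identifiers are not given on the slide; only the `rdf:type` and `xsd:float` IRIs are the standard W3C ones.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

EX = "http://example.org/"  # PLACEHOLDER namespace (hypothetical)
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
XSD_FLOAT = "<http://www.w3.org/2001/XMLSchema#float>"


def iri(local_name: str) -> str:
    """Placeholder IRI for a term or individual."""
    return f"<{EX}{local_name}>"


def float_literal(value: str) -> str:
    return f'"{value}"^^{XSD_FLOAT}'


def calculated_property_triples(molecule: str, property_class: str,
                                value: str, uncertainty: str,
                                unit: str, software_class: str) -> List[Triple]:
    """One calculated property for one molecule, following the slide's shape."""
    table = iri(f"{molecule}_connection_table")
    process = iri(f"{molecule}_{property_class}_calculation")
    result = iri(f"{molecule}_{property_class}_result")
    return [
        (table, RDF_TYPE, iri("connection_table")),                # CHEMINF stand-in
        (table, iri("is_about"), iri(molecule)),                   # IAO stand-in
        (process, RDF_TYPE, iri(software_class)),                  # CHEMINF stand-in
        (process, iri("has_specified_input"), table),              # OBI stand-in
        (process, iri("has_specified_output"), result),            # OBI stand-in
        (result, RDF_TYPE, iri(property_class)),                   # CHEMINF stand-in
        (result, iri("has_value"), float_literal(value)),          # QUDT stand-in
        (result, iri("has_standard_uncertainty"), float_literal(uncertainty)),
        (result, iri("has_unit"), iri(unit)),                      # QUDT stand-in
    ]


def to_ntriples(triples: List[Triple]) -> str:
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    triples = calculated_property_triples(
        molecule="benzene",
        property_class="calculated_log_P",
        value="2.17",
        uncertainty="0.234",
        unit="dimensionless_quantity",
        software_class="execution_of_ACD_Labs_PhysChem_12_01",
    )
    print(to_ntriples(triples))
```

Because the molecule, property class, unit and software are all parameters, the same nine-triple shape can be reused for the other calculated properties, which is the sense in which the schema acts as a template.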
41. Chemistry Data to manage…
• Compounds
• Reactions
• Spectra
• Crystals (in development)
• Materials
• Assays
• Algorithms
• …
42. Future Work
• Extending use of ontologies across all of our
work on databases and as an underpinning to
the Chemical Data Repository
• Adding ontologies to other grant-based projects
such as PharmaSea
• Continued collaborations with University of
Southampton on Labtrove for Chemistry
• RSC collaboration with Dr Stuart Chalk (UNF)
on data standards and ontologies
• Working with CHAS on hazard/safety data