The Royal Society of Chemistry provides access to a number of databases hosting chemical data, reactions, spectroscopy data and prediction services. These databases and services can be accessed via web services, with queries expressed in standard data formats such as InChI and molfiles. Data can then be downloaded in standard structure and spectral formats, allowing for reuse and repurposing. The ChemSpider database is integrated into a number of projects external to RSC, including Open PHACTS, which integrates chemical and biological data and utilizes semantic web data standards including RDF. This presentation will provide an overview of how structure and spectral data standards have been critical in allowing us to integrate many open source tools, in easing integration with a myriad of services, and in underpinning many of our future developments.
The importance of standards for data exchange and interchange on the Royal Society of Chemistry eScience platforms
1. The importance of standards for
data exchange and interchange
on the Royal Society of
Chemistry eScience platforms
Valery Tkachenko, Colin Batchelor,
Jon Steele and Antony Williams*
ACS Indianapolis
September 12th
2013
2. RSC Projects in Action
• Many RSC projects underway, underpinned by
ChemSpider, and very dependent on standards
• ChemSpider
• ChemSpider Reactions
• Open PHACTS
• PharmaSea
• Chemical Database Service
• Open Source Drug Discovery
3. Open PHACTS
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using
semantic web technologies
• Open source code, open data and open
standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…
7. Compound Data
• The standards of chemical structure handling
are primarily molfile, SDfile, SMILES, InChI
• We primarily depend on molfiles and SDF
files for data deposition and interchange
• We use InChI a lot – especially for integrated
searching across the web
• There ARE data interchange problems
associated with structures….
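To make the role of the molfile in deposition and interchange concrete, here is a minimal sketch of writing and reading a V2000 molfile in Python. It assumes the fixed-width V2000 layout (three header lines, a counts line, then atom and bond blocks) and handles only element symbols and bond orders. The names `Mol`, `write_molfile_v2000` and `parse_molfile_v2000` are hypothetical and are not part of any ChemSpider code.

```python
from dataclasses import dataclass


@dataclass
class Mol:
    """Bare-bones molecule: a name, element symbols, and bonds."""
    name: str
    atoms: list   # element symbols in file order, e.g. ["C", "C", "O"]
    bonds: list   # (begin_atom, end_atom, order) with 1-based atom indices


def write_molfile_v2000(mol: Mol) -> str:
    """Serialize to a V2000 molfile; coordinates are written as zeros."""
    lines = [mol.name, "  sketch", ""]  # three header lines
    lines.append(
        f"{len(mol.atoms):3d}{len(mol.bonds):3d}  0  0  0  0  0  0  0  0999 V2000"
    )
    for symbol in mol.atoms:
        # x, y, z in 10-character fields, one space, then a 3-character symbol
        lines.append(
            f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3}"
            " 0  0  0  0  0  0  0  0  0  0  0  0"
        )
    for begin, end, order in mol.bonds:
        lines.append(f"{begin:3d}{end:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"


def parse_molfile_v2000(text: str) -> Mol:
    """Read atom symbols and bonds back from a V2000 molfile."""
    lines = text.splitlines()
    if len(lines) < 4 or "V2000" not in lines[3]:
        raise ValueError("not a V2000 molfile")
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    if len(atom_lines) != n_atoms or len(bond_lines) != n_bonds:
        raise ValueError("atom/bond block shorter than the counts line claims")
    atoms = [line[31:34].strip() for line in atom_lines]
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9])) for l in bond_lines]
    return Mol(lines[0].strip(), atoms, bonds)


if __name__ == "__main__":
    ethanol = Mol("ethanol", ["C", "C", "O"], [(1, 2, 1), (2, 3, 1)])
    print(write_molfile_v2000(ethanol))
```

A round trip through both functions is a cheap way to spot the kind of interchange problem the slide mentions: anything the sketch does not model (charges, stereo flags, coordinates) is silently lost.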
13. CVSP : chemical validation
Free chemistry validation platform that performs:
• Structure validation
• Atoms
• Bonds
• Valence
• Stereo
• If aromatic – check that the structure can be uniquely dearomatized
• Flag partially ionized systems where the strongest acid is not ionized first
• Cross-matching of SDF fields
• Synonyms
• InChIs
• SMILES
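As an illustration of what a valence check in the list above might look like, here is a minimal Python sketch. It is not CVSP's actual rule set: the `MAX_VALENCE` table is a simplified assumption covering a few neutral elements, and the function name `find_valence_problems` is hypothetical.

```python
# Simplified, illustrative limits for neutral atoms (an assumption,
# not the rules CVSP applies).
MAX_VALENCE = {
    "H": 1, "C": 4, "N": 3, "O": 2,
    "F": 1, "Cl": 1, "Br": 1, "I": 1,
    "P": 5, "S": 6,
}


def find_valence_problems(atoms, bonds):
    """Return (atom_index, symbol, message) for each suspicious atom.

    atoms: list of element symbols.
    bonds: list of (begin_atom, end_atom, order), 1-based atom indices.
    Only explicit bonds are counted, so implicit hydrogens never
    trigger a warning; only over-valent atoms do.
    """
    totals = [0] * len(atoms)
    for begin, end, order in bonds:
        totals[begin - 1] += order
        totals[end - 1] += order

    problems = []
    for index, (symbol, total) in enumerate(zip(atoms, totals), start=1):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            problems.append((index, symbol, "unknown element"))
        elif total > limit:
            problems.append((index, symbol, f"valence {total} exceeds {limit}"))
    return problems


if __name__ == "__main__":
    # A carbon drawn with five single bonds to hydrogen is flagged.
    atoms = ["C", "H", "H", "H", "H", "H"]
    bonds = [(1, n, 1) for n in range(2, 7)]
    for problem in find_valence_problems(atoms, bonds):
        print(problem)
```

A production validator would also need charges, radicals and element-specific valence lists; the sketch only shows the bookkeeping.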
17. Reaction Data
• ChemSpider is built for compounds – but
how are they made???
• ChemSpider Reactions is our attempt to
answer the question
• Integrating both commercial and open data
• RSC Databases, data extracted from our
publications on the DERA project and Open
Data sources of reactions
• Molfiles, CDX files, RXN files
20. RSC Journal Content
• Many tens to hundreds of thousands of reactions
contained in our journals
• Electronic Supplementary Information (ESI)
contains many more
25. Spectral Data
• ChemSpider requires spectral data to be
deposited in standard formats – JCAMP or
images
• All spectra available at:
http://www.chemspider.com/spectra.aspx
• Data are deposited on a regular basis
• Students
• Chemical vendors
• Growing collection now
30. JCAMP file downloads
• When NMR spectra are stored as JCAMP
then downloads into offline packages are
feasible – Mestrelab, ACD/Labs etc
• Open Data – download versus view
• Store spectra locally and reuse
• Java is increasingly a pain!
• Need to move to HTML5 viewing on
ChemSpider, especially for Mobile Viewing
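Part of what makes a downloaded JCAMP file reusable offline is that its metadata is plain text. The sketch below reads the labelled data records (lines starting with `##`) of a JCAMP-DX file into a dictionary. It assumes the usual conventions that `$$` starts a comment and that labels are compared after removing spaces, hyphens, slashes and underscores and upper-casing. It does not decode compressed data tables, and the function name `parse_jcamp_records` and the sample spectrum are made up for illustration.

```python
import re


def parse_jcamp_records(text: str) -> dict:
    """Collect JCAMP-DX labelled data records as {LABEL: value}.

    Lines after a record that do not start with '##' are treated as a
    continuation of that record (for example the rows of ##XYDATA=).
    Repeated labels overwrite earlier ones in this sketch.
    """
    records = {}
    label = None
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].rstrip()  # drop trailing comments
        if line.startswith("##"):
            name, sep, value = line[2:].partition("=")
            if not sep:
                label = None  # malformed record: ignore it
                continue
            # Normalize the label: "DATA TYPE" and "DATATYPE" are equal.
            label = re.sub(r"[\s\-/_]", "", name).upper()
            records[label] = value.strip()
        elif label is not None and line:
            records[label] += "\n" + line
    return records


SAMPLE = """##TITLE=made-up example spectrum
##JCAMP-DX=5.01 $$ version
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##FIRSTX=100.0
##LASTX=97.0
##NPOINTS=4
##XYDATA=(X++(Y..Y))
100.0 1 2
98.0 3 4
##END=
"""

if __name__ == "__main__":
    for key, value in parse_jcamp_records(SAMPLE).items():
        print(key, "=", value.splitlines()[0] if value else "")
```

Keeping the raw table rows under the `XYDATA` key leaves the numeric decoding to a dedicated package, which is where tools such as those named on the slide come in.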
32. Challenges with Spectra
• JCAMP is good for a lot of spectral data – IR,
Raman, 1D NMR
• MS data is rarely made available in JCAMP
• We would love a ratified JCAMP 6.0 for 2D
data exchange – it would allow third parties
to build support for download
• ASSIGNED JCAMP spectra can be
supported but there are no real standards here
34. DERA to digitize documents?
• We want to get data out of our historical archive
• What could we do?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and deposit
• Find figures and database them
• Find spectra (and link to structures)
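As a small illustration of the "find data and deposit" idea, the following Python sketch pulls melting points out of experimental text with a regular expression. The pattern is an assumption about how melting points are commonly written ("mp 134-136 °C", "m.p. 78 °C", "melting point: 210.5–212 °C") and is not the DERA project's actual method; `extract_melting_points` is a hypothetical name.

```python
import re

# "mp", "m.p." or "melting point", an optional colon, a number, an
# optional "-number" range (hyphen or en dash), then a degree sign and C.
MP_RE = re.compile(
    r"\b(?:m\.?\s?p\.?|melting\s+point)\s*:?\s*"
    r"(\d+(?:\.\d+)?)"
    r"(?:\s*[-\u2013]\s*(\d+(?:\.\d+)?))?"
    r"\s*\u00b0\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text: str) -> list:
    """Return a list of (low, high) melting points in degrees Celsius.

    A single value is reported as a range with low == high.
    """
    results = []
    for match in MP_RE.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


if __name__ == "__main__":
    paragraph = (
        "The product was obtained as a white solid, mp 134-136 \u00b0C. "
        "The intermediate had m.p. 78 \u00b0C."
    )
    print(extract_melting_points(paragraph))
```

The leading word boundary keeps the pattern from firing inside words such as "temp"; real archive text would need many more variants (Kelvin, "dec.", literature values) than this sketch covers.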
46. It’s exactly the WRONG WAY!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be in digital spectral formats,
not images
• ESI should be RICH and interactive
47. APIs and Standards
• We follow the standard expectations in terms
of how people would want to access our
APIs: RESTful services, JSON handling etc.
• We allow people to pass in queries using
molfiles, SMILES, InChIs/InChIKeys etc
• Future will include JCAMP searching
• APIs in use by MANY organizations and of
value to our Open PHACTS, PharmaSea,
Chemical Database Service etc. Also Mobile
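Accepting queries as molfiles, SMILES, InChIs or InChIKeys means a service first has to work out which one it has been given. Below is a minimal Python sketch of that step and of wrapping the result in a JSON body. The JSON field names and both function names are hypothetical; this is not the ChemSpider API, and no endpoint is called.

```python
import json
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classify_structure_query(query: str) -> str:
    """Guess the structure format of a query string.

    Returns "molfile", "inchi", "inchikey" or "smiles". Anything that
    is not recognized as one of the first three is assumed to be
    SMILES; no chemical validation is attempted.
    """
    text = query.strip()
    if "M  END" in query:
        return "molfile"
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(text):
        return "inchikey"
    return "smiles"


def build_search_request(query: str) -> str:
    """Build a JSON request body (hypothetical field names)."""
    body = {"format": classify_structure_query(query), "query": query}
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    for q in (
        "CC(=O)Oc1ccccc1C(=O)O",
        "InChI=1S/CH4/h1H4",
        "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    ):
        print(build_search_request(q))
```

Treating SMILES as the fallback keeps the sketch short; a real service would parse the string with a cheminformatics toolkit before accepting it.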
48. Conclusions
• Data Interchange standards are all over our
projects!
• We are grateful to companies, organizations,
contributors who have helped define:
• Structure – Mol, SDF, InChI etc
• Spectra – JCAMP, SPC, NetCDF etc
• W3C standards
49. For the Next ACS hopefully…
• Build out our ChemSpider Reaction collection
• Grab spectral data out of our ESI!
• Get more submissions in STANDARD formats
• Integrate with spectroscopy handling systems
for deposition in JCAMP
• Push molfiles directly into ChemSpider with
an improved deposition platform
• Build out the chemical data repository…