Data enhancing the Royal
Society of Chemistry
publication archive
Antony Williams, Colin Batchelor,
Peter Corbett, Ken Karapetyan and
Valery Tkachenko
ACS Dallas
March 2014
Data Enhancing the RSC
Archive
• Publications summarise
data acquisition, analysis
and conclusions.
• Much detail in the data
• Improved navigation
includes data access
• Reanalysis of data is
limited in PDFs
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5
ml ) and benzene ( 50 ml ) were charged into a glass reaction
vessel equipped with a mechanical stirrer , thermometer and
reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were
stripped from the reaction mixture under reduced pressure to
yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-
trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue .
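As a rough illustration of the simplest layer of text mining on a paragraph like the one above, the sketch below pulls bracketed amounts such as "( 5 ml )" out of tokenised experimental text. The regular expression, the unit list and the function name extract_amounts are assumptions made for this example; they are not taken from the RSC pipeline, which relies on dictionaries and dedicated chemical entity extraction rather than a single pattern.

```python
import re

# Hypothetical pattern: a number followed by a unit, inside parentheses.
# The unit list is an assumption and deliberately short.
AMOUNT = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*(ml|mL|g|mg|mmol|mol)\s*\)")


def extract_amounts(text):
    """Return (value, unit) pairs for bracketed amounts found in text."""
    return [(float(value), unit) for value, unit in AMOUNT.findall(text)]


sample = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged "
    "into a glass reaction vessel"
)
amounts = extract_amounts(sample)
print(amounts)
```

Linking each amount back to the reagent name that precedes it is the harder step and is not attempted here.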
How is DERA going? TEXT
• We have text-mined all 21st century articles:
>100k articles from 2000-2013
• Mostly marked up with XML: more structured and
easier to handle. The markup is mostly published in
the HTML versions of the articles
• Required multiple iterations based on
dictionaries, markup, OSCAR extraction
• New visualization approaches in development
Chemical Validation and
Standardization
The RSC Data Repository
[Architecture diagram; labels recovered from the figure]
• Inputs: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; LabTrove and other templated data; Documents; API, FTP, etc.
• Deposition Gateway: raw data, staging databases, validated data
• Modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module
• Databases: Compounds, Reactions, Spectra, Materials, Articles / CSSP
• All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
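The security model described for the repository, where each data slice is private, public or embargoed, could be represented along the following lines. This is only a sketch: the class names, the owner field, the embargo_until date and the rule that an embargoed slice becomes readable once that date has passed are all assumptions for illustration, since the slides state only the three visibility states.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    source: str                            # data source / collection name
    visibility: Visibility
    owner: str                             # assumed: one owning depositor
    embargo_until: Optional[date] = None   # assumed: embargo end date


def can_read(data_slice, user, today):
    """Decide whether user may read data_slice on the given day."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if user == data_slice.owner:
        return True
    if (data_slice.visibility is Visibility.EMBARGOED
            and data_slice.embargo_until is not None):
        # Assumed rule: an embargo lifts on its end date.
        return today >= data_slice.embargo_until
    return False


embargoed = DataSlice("LabTrove", Visibility.EMBARGOED, "alice", date(2014, 6, 1))
print(can_read(embargoed, "bob", date(2014, 3, 1)))
print(can_read(embargoed, "bob", date(2014, 6, 1)))
```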
Text-Mining
ChemSpider Reactions
Reactions
• We will put reactions from our databases into
the Reactions Repository
• We will use “Reaction Validation” procedures
to clean up Daniel Lowe’s USPTO patent set
of over a million extracted reactions
• We will move ChemSpider SyntheticPages
content to the Reactions Repository
• We will use the RXNO Ontology to classify
the reactions
Reaction Deposition/Validation
ESI – Text Spectra
Lots of “Textual Spectra”
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35
(t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8
Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
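To make the idea of a "textual spectrum" concrete, here is a minimal sketch that parses the 1H NMR string above into a list of peaks. The regular expression and the function name parse_1h_peaks are assumptions for this example, not the extraction algorithm used in DERA; the sketch reads only the shift (or shift range), multiplicity and proton count, and ignores coupling constants and assignments.

```python
import re

# Assumed pattern: "<shift>[–<shift>] (<multiplicity>, <n>H"
PEAK = re.compile(
    r"(\d+\.\d+)(?:[–-](\d+\.\d+))?"   # shift, or a shift range
    r"\s*\(\s*([a-z]+)\s*,\s*(\d+)H"   # multiplicity and proton count
)


def parse_1h_peaks(text):
    """Return one dict per reported 1H signal."""
    peaks = []
    for low, high, multiplicity, protons in PEAK.findall(text):
        peaks.append({
            "shift_ppm": float(low),
            "shift_max_ppm": float(high) if high else None,
            "multiplicity": multiplicity,
            "protons": int(protons),
        })
    return peaks


text = (
    "δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), "
    "4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), "
    "4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
    "7.18–7.94 (m, 11H, ArH)"
)
peaks = parse_1h_peaks(text)
total_protons = sum(p["protons"] for p in peaks)
print(len(peaks), total_protons)
```

Summing the proton counts gives a quick consistency check against the molecular formula of the associated compound.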
How is DERA going? Text Spectra
• Overall progress is good
• Improved algorithms for extraction of spectra
• Extraction of associated compound name
with spectrum – name to structure
conversion now
• MestreLabs have provided us with a batch
conversion tool
• Work in progress – manual and automated
validation. In theory auto-assignment also
Visualization of Spectra
• For spectra associated with compounds we
would like to view “interactive spectra”
Javascript viewer with JMol
Figure Spectra into “Real
Spectra”?
• We are turning text into structures
• We are turning text into spectra
• And we are turning figures into spectra
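One way to picture "turning text into spectra" is to place a line shape at each reported shift and sum them on a ppm axis. The sketch below does this with Lorentzian lines in plain Python. It is a simplification under stated assumptions: the line width is arbitrary, peak height (rather than area) is set to the proton count, multiplet splitting is ignored, and the function names are invented for this example. It does not describe the batch conversion tool mentioned in the slides.

```python
def lorentzian(x, x0, width):
    """Lorentzian line of unit height centred at x0 (width is the FWHM)."""
    half = width / 2.0
    return half * half / ((x - x0) ** 2 + half * half)


def synthesize(peaks, start, stop, points, width=0.02):
    """Return (ppm_axis, intensities) for a list of (shift, height) peaks."""
    step = (stop - start) / (points - 1)
    axis = [start + i * step for i in range(points)]
    signal = [
        sum(height * lorentzian(x, shift, width) for shift, height in peaks)
        for x in axis
    ]
    return axis, signal


# Three of the 1H signals from the example, with proton counts as heights.
peaks = [(2.57, 4), (4.24, 1), (6.95, 1)]
axis, signal = synthesize(peaks, 0.0, 10.0, 1001)
tallest = signal.index(max(signal))
print(round(axis[tallest], 2))
```

The resulting axis and intensity lists can be handed to any plotting or spectral viewer component.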
Turn “Figures” Into Data
[Figure: published FIGURE shown alongside the EXTRACTED DATA]
How is DERA going? Figures
• Validation tests performed with William
Brouwer. Good enough to proceed with
larger test set
• Ready to run process across larger collection
• Focus on 21st century articles only for now
Early Test Experiments
• Input: 74 supplementary data documents / 3444 pages
• Output: p2t extracted content in 1069 page instances
  − 578 molecules
    • ~10% false positives, e.g. classifies Bruker logo as chemical object
    • ~20% false negatives, e.g. missing some symbols from structure
  − 1151 spectra
    • >80% of peaks extracted to within 1-2 decimal places (ppm)
Validating Spectra
• How will we check data consistency?
• How do we know the structure and the
spectra match? Comparing image to
spectrum is NOT enough!!!
• Predict spectra, use spectral verification, use
algorithmic checking.
• Flag “dodgy data” and use crowdsourcing for
data checking
• MULTIPLE prediction technologies now
available – VERIFICATION is tougher
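The "predict, then verify" step can be sketched as a comparison between extracted shifts and predicted shifts, flagging anything that has no counterpart within a tolerance. Everything specific here is an assumption for illustration: the greedy nearest-neighbour matching, the 2 ppm tolerance, the function name flag_mismatches and the predicted values, which are made up and do not come from any prediction tool. Real spectral verification is considerably more involved.

```python
def flag_mismatches(extracted, predicted, tolerance_ppm=2.0):
    """Greedily pair each extracted shift with the nearest unused predicted shift.

    Returns (flagged, unexplained): extracted shifts with no predicted
    partner within tolerance, and predicted shifts left unmatched.
    """
    remaining = list(predicted)
    flagged = []
    for shift in sorted(extracted):
        if not remaining:
            flagged.append(shift)
            continue
        nearest = min(remaining, key=lambda p: abs(p - shift))
        if abs(nearest - shift) <= tolerance_ppm:
            remaining.remove(nearest)
        else:
            flagged.append(shift)
    return flagged, remaining


# Four 13C shifts from the textual spectrum; predicted values are invented.
extracted = [14.12, 30.11, 66.12, 99.42]
predicted = [14.5, 29.8, 65.0, 128.3]
flagged, unexplained = flag_mismatches(extracted, predicted)
print(flagged, unexplained)
```

Records with flagged or unexplained shifts are the natural candidates for the "dodgy data" queue and for crowdsourced checking.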
What are we extracting?
• Compounds from compound names
• Reactions from the text
• Spectral extraction – from figures and text
• Extraction of data from “tables” – not only
CSV files but literal tables in the publication –
specifically data from MedChemComm as
proof of concept
Building out the technology
• We are presently Open-Sourcing a chemical
registration system developed for OpenPHACTS
• We will then Open Source the Chemical
Validation and Standardization Platform
• We are working with Bob Hanson and Bob
Lancashire on Jmol/JSpecView Open Source
• We will deliver a set of Open Source widgets for
structure handling/visualization
Javascript viewer NMR, MS, IR
Grand Target
• Fingers crossed to get 21st century spectra
converted
• Spectra associated with compounds will go
into ChemSpider
• Spectra converted from Figures but without
compound association will be captured with
Figures into the Data Repository
• Focus on IR, Raman, UV-Vis & 1D NMR
DERA is FINE for an archive
The WRONG WAY otherwise!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with meta
data and provenance
We can solve for Authors here
Will it be used though???
Advanced ESI
Conclusions
• Great progress in mining the archive; 21st
century articles are being enhanced iteratively on the
publishing platform
• Spectral Data is the next focus – directly
connected to our work on the data repository
• Reaction extraction, processing and
validation from articles is progressing more
slowly
• Results are content, software components
and Open Source contributions
Acknowledgments
• Bill Brouwer – Plot2Txt Development
• Carlos Cobas and Santi Dominguez
• Bob Hanson and Bob Lancashire for
Jmol/JSpecView Javascript version
• Leah McEwan and Will Dichtel
• ACD/Labs – Provider of spectroscopy tools
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
