The Importance of the InChI Identifier
as a Foundation Technology for
eScience Platforms at RSC
Antony Williams
Bio-IT, Boston, April 27th, 2014
Without the InChI…
• ChemSpider is unlikely to have been built
• It would not have grown into one of the
domain's primary online chemistry resources
• The Royal Society of Chemistry would not
have it as an online database, would not
have a large cheminformatics team and
would not be involved in a number of large
scale funded projects around chemistry data
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our journals
and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
ChemSpider
Experimental/Predicted Properties
Literature references
Patent references
So what is Yohimbine?
Of course it is out there…
Drugbox: 3001/5080 with InChIs
Chembox: 5436/7690 with InChIs
Tell me more…
• Where can I find the molfile for Yohimbine?
• Papers/Patents about Yohimbine?
• What are the side effects of Yohimbine?
• Where can I order Yohimbine?
• What are the physicochemical properties?
• Metabolic pathways?
• Different synonyms of Yohimbine?
• Synthesis of Yohimbine?
• Etc.
Quantity!
Yohimbine on ChemSpider
Downsides of Overall Approach
• Meshing data together based on InChIs
worked for simple molecules
• 2D layout errors inherited or limited by
algorithm
• Complex molecules that are meant to be the
same thing were NOT deduplicated.
Compounds differing by one stereocenter,
named the same, meant to be the same, are
not the same
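One way to surface such near-duplicates can be sketched as follows, assuming each record already carries a standard InChIKey. The first 14-character block of the key hashes the core constitution, so records that share it but differ in the second block differ in layers such as stereochemistry or isotopes. `find_stereo_variants` is a hypothetical helper name, and the alanine and aspirin keys are illustrative values that should be checked against an InChI generator before use.

```python
from collections import defaultdict

def find_stereo_variants(records):
    """Group (name, inchikey) pairs by the first InChIKey block.

    Returns only the groups whose members share a first block but
    carry more than one distinct full InChIKey.
    """
    by_skeleton = defaultdict(list)
    for name, key in records:
        skeleton = key.split("-")[0]  # 14-character first block
        by_skeleton[skeleton].append((name, key))
    return {
        skeleton: members
        for skeleton, members in by_skeleton.items()
        if len({key for _, key in members}) > 1
    }

# Illustrative records (assumed values, not taken from the slides).
records = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHSYRHSA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("alanine (no stereo drawn)", "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"),
    ("aspirin", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
]

variants = find_stereo_variants(records)
for skeleton, members in variants.items():
    print(skeleton, [name for name, _ in members])
```

A curator could review each reported group and decide whether the entries are genuinely different compounds or the same compound drawn with missing or conflicting stereo information.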
Yohimbine on ChemSpider… Quality?
So where can we travel???
InChI String Search via Google
Give me InChIKeys…
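A minimal sketch of turning an InChIKey into a web search, assuming a generic search engine that accepts a `q` query parameter. The function names and the default base URL are illustrative choices; the pattern reflects the standard InChIKey layout of 14, 10 and 1 uppercase letters separated by hyphens.

```python
import re
from urllib.parse import urlencode

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def is_inchikey(text):
    """Return True if text has the shape of a standard InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None

def search_url(inchikey, base="https://www.google.com/search"):
    """Build a search URL that looks for the key as one exact token."""
    if not is_inchikey(inchikey):
        raise ValueError("not an InChIKey: " + repr(inchikey))
    query = '"' + inchikey + '"'  # quote so the key is matched exactly
    return base + "?" + urlencode({"q": query})

print(search_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

The fixed-length key is used here rather than the full InChI string because it has no characters that a search engine is likely to split on.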
And where can we travel???
• ChemSpider: Aggregator
• BRENDA: Enzymes
• Wikipedia: Encyclopedia
• ChEMBL: Pharmacology
• ChEBI: Curated Chemicals
• DrugBank: Drug-Drug Target
How do we build it?
• We deal in Molfiles or SDF files – with coordinates
• Deposit anything that has an InChI – we support
what InChI can handle, good and bad
• Standardization based on “InChI standardization”
• InChIs aggregate (certain) tautomers
• We link out to external sites using their IDs
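The linking idea in the bullets above can be sketched as a small in-memory hub, assuming every deposit arrives with an InChIKey and the depositor's own identifier. This is not the ChemSpider implementation; `build_hub` is a hypothetical name and the identifiers are illustrative values that should be verified before relying on them.

```python
from collections import defaultdict

def build_hub(deposits):
    """Merge per-source records into one entry per InChIKey.

    Each deposit is (source, external_id, inchikey). The result maps
    an InChIKey to the external IDs that can be used for link-outs.
    """
    hub = defaultdict(dict)
    for source, external_id, inchikey in deposits:
        hub[inchikey][source] = external_id
    return dict(hub)

# Illustrative deposits (assumed values for aspirin and caffeine).
deposits = [
    ("ChEMBL", "CHEMBL25", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("DrugBank", "DB00945", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("ChEBI", "CHEBI:15365", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("ChEMBL", "CHEMBL113", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"),
]

hub = build_hub(deposits)
for key, links in hub.items():
    print(key, links)
```

Because the key is computed from the structure, sources never need to agree on names or on each other's identifiers to end up on the same record.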
Downsides of InChI
• InChI was a moving target (multiple versions) but
overall worked as planned.
• Good for small molecules – but no polymers, issues
with inorganics, organometallics, imperfect
stereochemistry. ChemSpider is “small molecules”
• InChI used as the “deduplicator” – FIRST version of
a compound into the database becomes THE
structure to deduplicate against…
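The "first in wins" behaviour can be illustrated with a toy deposit routine, assuming a plain dictionary keyed by InChI stands in for the database. The `deposit` function and the placeholder molfile strings are hypothetical and only show why a poor first depiction persists.

```python
def deposit(database, inchi, molfile, source):
    """Insert a record unless its InChI is already present.

    The first molfile stored for an InChI is kept as the reference
    structure; later deposits only add their source to the record.
    Returns True when a new record was created.
    """
    record = database.get(inchi)
    if record is None:
        database[inchi] = {"molfile": molfile, "sources": [source]}
        return True
    record["sources"].append(source)
    return False

db = {}
aspirin = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"

# Placeholder molfile strings; real deposits would carry coordinates.
created_first = deposit(db, aspirin, "<molfile with poor 2D layout>", "source A")
created_second = deposit(db, aspirin, "<molfile with clean 2D layout>", "source B")

print(db[aspirin]["molfile"], db[aspirin]["sources"])
```

In this sketch the cleaner drawing from the second source is discarded, which is the kind of inherited layout problem that later curation has to correct.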
Side Effects of InChI Usage
SMILES by comparison…
Standardization Issues
Depiction based on molfile
Standardize
Use the SRS as a guidance document for
standardization
Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
Ammonium salts
CVSP
NPC Browser Set
Checks include InChI
• Many SDF files contain InChIs and SMILES
– comparing the structure contained within
the file with the associated InChI is useful –
it turned up a number of errors when checking
online databases
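A sketch of such a check, assuming SD file text with "\n" line endings and an InChI stored in a data field named `InChI`. The InChI generation itself is left to a caller-supplied function, which in practice might wrap a cheminformatics toolkit; the stub used below is only a stand-in, and all function names are hypothetical.

```python
def parse_sdf(text):
    """Return a list of (molblock, fields) for each record in SD file text."""
    records = []
    for chunk in text.split("$$$$\n"):
        if "M  END" not in chunk:
            continue
        molblock, _, tail = chunk.partition("M  END")
        fields, name = {}, None
        for line in tail.splitlines():
            if line.startswith(">") and "<" in line:
                # Header line such as "> <InChI>"
                name = line[line.index("<") + 1 : line.rindex(">")]
                fields[name] = []
            elif line.strip() and name is not None:
                fields[name].append(line.strip())
            else:
                name = None  # a blank line closes the data item
        joined = {key: "\n".join(value) for key, value in fields.items()}
        records.append((molblock + "M  END", joined))
    return records

def find_mismatches(sdf_text, structure_to_inchi, field="InChI"):
    """Compare each declared InChI with one recomputed from the molblock."""
    mismatches = []
    for index, (molblock, fields) in enumerate(parse_sdf(sdf_text)):
        declared = fields.get(field)
        if declared is None:
            continue
        computed = structure_to_inchi(molblock)
        if computed != declared:
            mismatches.append((index, declared, computed))
    return mismatches

# Demo with empty placeholder molblocks and a stub generator that
# looks up the InChI by the title line (an assumption for illustration).
sample = (
    "water\n\n\nM  END\n> <InChI>\nInChI=1S/H2O/h1H2\n\n$$$$\n"
    "methane\n\n\nM  END\n> <InChI>\nInChI=1S/H2O/h1H2\n\n$$$$\n"
)
known = {"water": "InChI=1S/H2O/h1H2", "methane": "InChI=1S/CH4/h1H4"}

def stub_generator(molblock):
    return known[molblock.splitlines()[0]]

mismatches = find_mismatches(sample, stub_generator)
print(mismatches)
```

Here the second record declares the InChI of water next to a methane structure, so it is the only one reported.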
So, I’m writing an article…
With these… I will lose data
But linking with InChI …
Structure Searching the Web
Data in Publications
• This is not new, you know the story…
• So much data of value is contained within a
publication and delivered in a PDF form
• PDF files, and unclear licensing/copyright, limit
access to data that I could otherwise rework, reuse,
repurpose, text mine etc.
• “I specialize in XXXX. I want a database of
YYYY extracted from publications and made
available, for free, with the capabilities I need,
and the publishers should just do it”
“Data enable” publications?
• We would LOVE to bring data out of our archive
• What could we do?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions – and make a database!
• Find data (MP, BP, LogP) and host. Build
models!
• Find figures and database them
• Find spectra (and link to structures)
• Validate the data algorithmically
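As a rough illustration of "find data and validate it algorithmically", the sketch below pulls melting points out of free text with a regular expression and applies simple sanity checks. The pattern, the thresholds and the sample sentence are all assumptions made for this example and do not come from the RSC archive.

```python
import re

# Matches e.g. "mp 135-136 °C", "m.p. 98 °C" or "mp 210–212°C".
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*(\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°\s*C"
)

def extract_melting_points(text):
    """Return a list of (low, high) melting points in degrees Celsius."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results

def is_plausible(low, high, max_width=10.0):
    """Sanity checks: ordered range, above absolute zero, not too wide."""
    return -273.15 < low <= high and (high - low) <= max_width

# Invented sample text for illustration only.
sample = (
    "White solid, mp 135-136 °C. A second batch gave m.p. 98 °C. "
    "An earlier report quotes mp 140-120 °C."
)

points = extract_melting_points(sample)
flags = [is_plausible(low, high) for low, high in points]
print(list(zip(points, flags)))
```

The third value is flagged because its range is reversed, the kind of transcription error that a rule-based check can catch before the data is hosted or used for model building.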
RSC Archive – since 1841
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5
ml ) and benzene ( 50 ml ) were charged into a glass reaction
vessel equipped with a mechanical stirrer , thermometer and
reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were
stripped from the reaction mixture under reduced pressure to
yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-
trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
But names = structures
• Systematic names can be generated FROM
chemical structures algorithmically
• …and structures from systematic names
But what of trivial names?
• What about trivial names, trade names, CAS
numbers, multilingual names etc.?
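Names that cannot be converted algorithmically need a lookup dictionary. The sketch below assumes a simple in-memory index from normalized names to an InChIKey, and adds the standard CAS Registry Number check-digit test as one cheap validation step. The class and function names are hypothetical, and the aspirin synonyms are illustrative entries.

```python
def cas_is_valid(cas):
    """Check the shape and check digit of a CAS Registry Number."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    if len(parts[1]) != 2 or len(parts[2]) != 1:
        return False
    digits = parts[0] + parts[1]
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(parts[2])

def normalize(name):
    """Case-fold and collapse whitespace so variants map to one entry."""
    return " ".join(name.casefold().split())

class NameDictionary:
    """Maps trivial names, trade names and registry numbers to a key."""

    def __init__(self):
        self._index = {}

    def add(self, inchikey, names):
        for name in names:
            self._index[normalize(name)] = inchikey

    def lookup(self, name):
        return self._index.get(normalize(name))

names = NameDictionary()
names.add(
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    ["Aspirin", "acetylsalicylic acid", "Acetylsalicylsäure", "50-78-2"],
)

print(names.lookup("ASPIRIN"), cas_is_valid("50-78-2"))
```

Once a name resolves to a key, everything else attached to that structure becomes reachable, which is what makes the dictionary useful for text mining.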
Searching that lipid in patents
Aspirin on ChemSpider
Work in Progress
But Context Gives Reactions
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5
ml ) and benzene ( 50 ml ) were charged into a glass reaction
vessel equipped with a mechanical stirrer , thermometer and
reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were
stripped from the reaction mixture under reduced pressure to
yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-
trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
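How context can turn a passage like this into reaction data is sketched below, assuming a small list of already recognized reagent names and two hand-written cue phrases ("prepared in Example" and "desired product"). Real text mining relies on much richer grammars and name-to-structure conversion; the helper names here are hypothetical, and the passage is quoted as it appears above.

```python
import re

# Quantity in parentheses, e.g. "( 5 ml )".
QUANTITY = r"\(\s*(\d+(?:\.\d+)?)\s*(ml|g|mg)\s*\)"

def find_amounts(text, known_names):
    """Return (name, amount, unit) for each known name followed by a quantity."""
    found = []
    for name in known_names:
        pattern = re.compile(re.escape(name) + r"\s*" + QUANTITY)
        for match in pattern.finditer(text):
            found.append((name, float(match.group(1)), match.group(2)))
    return found

def find_roles(text):
    """Use cue phrases to label the starting material and the product."""
    roles = {}
    start = re.search(r"The (\S+) prepared in Example", text)
    if start:
        roles["starting material"] = start.group(1)
    product = re.search(r"desired product (\S+)", text)
    if product:
        roles["product"] = product.group(1)
    return roles

passage = (
    "The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-"
    "thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 "
    "ml ) and benzene ( 50 ml ) were charged into a glass reaction "
    "vessel equipped with a mechanical stirrer , thermometer and "
    "reflux condenser . After this time the benzene and unreacted "
    "thionyl chloride were stripped from the reaction mixture under "
    "reduced pressure to yield the desired product N-(β-chloroethyl)-"
    "N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a "
    "solid residue"
)

amounts = find_amounts(passage, ["thionyl chloride", "benzene"])
roles = find_roles(passage)
print(amounts)
print(roles)
```

Combining the labelled names with a name-to-structure step would give a reactant, reagents with amounts, and a product, which is the shape of a record in a reaction database.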
ChemSpider Reactions
ChemSpider as a Foundation
• >30 million chemicals (and growing)
• ChemSpider is free to access for everyone –
and the API means people program against it
• Which projects can we support?
Support grant-based services
• Multiple European consortium-based grants
• PharmaSea (FP7 funded)
• Open PHACTS (IMI funded)
• UK National Chemical Database Service
(http://cds.rsc.org) – developing a data repository
for lab data, integrating with Electronic Lab Notebooks
• Open Drug Discovery projects
PharmaSea
Open PHACTS
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using
semantic web technologies
• Open code, open data, open standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…
All Databases We Generate…
• All databases and systems we build now
include generated InChIs
• InChIs are facilitating discoverability via
searching on Google (see Chris’ talk) but
also for querying and linking
But we are still VERY LIMITED
• RSC deals with way more than organics,
inorganics, organometallics – we are building a
data repository to include materials, polymers,
ambiguous materials etc.
• There are many plans for InChI moving forward
– Markush, polymers, organometallics etc
The great promise should be
obvious
• InChIs are here to stay
• They will evolve, they will encompass, we will
adopt and adapt
• Public and private databases will federate &
build a linked environment of validated data!
• Data validation and standardization are
needed
• Open Data will continue to proliferate
• InChIs are in the “Semantic Web” already
If InChI never existed …
• ChemSpider would never have been built
• Database linking would suffer dramatically
• The web would not be “structure searchable”
• Cheminformatics tools would likely not be
linking to public domain databases in the
same way
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

The importance of the InChI identifier as a foundation technology for eScience platforms

  • 1. The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at RSC Antony Williams Bio-IT, Boston, April 27th 2014
  • 2. Without the InChI… • ChemSpider is unlikely to have been built • It would not have grown into one of the domain's primary online chemistry resources • The Royal Society of Chemistry would not have it as an online database, would not have a large cheminformatics team and would not be involved in a number of large-scale funded projects around chemistry data
  • 3. • ~30 million chemicals and growing • Data sourced from >500 different sources • Crowdsourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure-centric hub for web-searching • …and a really big dictionary!!!
  • 9. So what is Yohimbine?
  • 10. Of course it is out there… Drugbox: 3001/5080 with InChIs Chembox: 5436/7690 with InChIs
  • 11. Tell me more… • Where can I find the molfile for Yohimbine? • Papers/Patents about Yohimbine? • What are the side effects of Yohimbine? • Where can I order Yohimbine? • What are the physicochemical properties? • Metabolic pathways? • Different synonyms of Yohimbine? • Synthesis of Yohimbine? • Etc….
  • 14. Downsides of Overall Approach • Meshing data together based on InChIs worked for simple molecules • 2D layout errors inherited or limited by algorithm • Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
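One way to surface the stereochemistry problem described on this slide is to compare records on their InChI with the stereo layers removed. The sketch below uses only the Python standard library and assumes that InChI strings are already available for each record. `strip_stereo` and `stereo_variants` are hypothetical helper names chosen for illustration; they are not part of ChemSpider or of the InChI software.

```python
from collections import defaultdict

# InChI layer prefixes that carry stereochemistry:
# /b (double bond), /t (tetrahedral), /m (inversion flag), /s (stereo type).
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi):
    """Return the InChI with its stereo layers dropped (illustrative helper)."""
    parts = inchi.split("/")
    # parts[0] is the version prefix and parts[1] the formula; keep both as-is.
    kept = parts[:2] + [p for p in parts[2:] if not p.startswith(STEREO_PREFIXES)]
    return "/".join(kept)


def stereo_variants(inchis):
    """Group InChIs that differ only in their stereo layers."""
    groups = defaultdict(set)
    for inchi in inchis:
        groups[strip_stereo(inchi)].add(inchi)
    return {
        skeleton: sorted(members)
        for skeleton, members in groups.items()
        if len(members) > 1
    }


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

variants = stereo_variants([L_ALANINE, D_ALANINE, ETHANOL])
```

Here `variants` holds a single group containing the two alanine records, which a curator could then review to decide whether they were meant to be the same compound. Comparing the first block of the InChIKey is a common shortcut for the same kind of check.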
  • 16. So where can we travel???
  • 17. So where can we travel???
  • 19. InChI String Search via Google Give me InChIKeys…
  • 20. And where can we travel???
  • 23. How do we build it? • We deal in Molfiles or SDF files – with coordinates • Deposit anything that has an InChI – we support what InChI can handle, good and bad • Standardization based on “InChI standardization” • InChIs aggregate (certain) tautomers • We link out to external sites using their IDs
  • 24. Downsides of InChI • InChI was a moving target (multiple versions) but overall worked as planned. • Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules” • InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
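The "first version in wins" behaviour described on this slide can be illustrated with a toy registry keyed on the InChI string. This is a minimal sketch under stated assumptions, not ChemSpider's actual implementation: the `Record` and `Registry` names, the `deposit` method and the placeholder molfile strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    record_id: int
    inchi: str
    molfile: str  # the layout of the FIRST deposition is the one that is kept
    sources: list = field(default_factory=list)


class Registry:
    """Toy first-in-wins registry keyed on the InChI string (hypothetical)."""

    def __init__(self):
        self._by_inchi = {}

    def deposit(self, inchi, molfile, source):
        record = self._by_inchi.get(inchi)
        if record is None:
            # First time this InChI is seen: its molfile becomes THE structure.
            record = Record(len(self._by_inchi) + 1, inchi, molfile)
            self._by_inchi[inchi] = record
        # Later depositions only add a source; their molfile is discarded.
        record.sources.append(source)
        return record


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
registry = Registry()
first = registry.deposit(ETHANOL, "<molfile from source A>", "source A")
second = registry.deposit(ETHANOL, "<molfile from source B>", "source B")
```

Both calls return the same record, and the stored molfile is the one from source A. If that first drawing carries a poor 2D layout or imperfect stereochemistry, every later deposition inherits it, which is the downside the slide points out.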
  • 25. Side Effects of InChI Usage
  • 27. Side Effects of InChI Usage
  • 29. Standardize • Use the SRS as a guidance document for standardization • Adjust as necessary to our needs
  • 31. Salt and Ionic Bonds
  • 33. CVSP
  • 35. Checking included InChIs • Many SDF files contain InChIs and SMILES – comparing the structure contained within the file with the associated InChI is useful – it turned up a number of errors when checking online databases
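The check described on this slide can be sketched as follows: read each SDF record, regenerate the InChI from the connection table, and compare it with the InChI stated in the data fields. The parsing below uses only the Python standard library. The InChI generation itself is left as a caller-supplied function, `compute_inchi`, which is an assumption of this sketch; in practice it could wrap a cheminformatics toolkit (for example RDKit's `Chem.MolFromMolBlock` followed by `Chem.MolToInchi`). The function names and the default field name `InChI` are hypothetical.

```python
import re

# Matches SDF data headers such as "> <InChI>" or ">  <InChI>".
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def _parse_record(lines):
    """Split one SDF record into its molblock and a dict of data fields."""
    end = next((i for i, l in enumerate(lines) if l.strip() == "M  END"), -1)
    molblock = "\n".join(lines[: end + 1])
    fields, name = {}, None
    for line in lines[end + 1:]:
        match = FIELD_HEADER.match(line)
        if match:
            name = match.group(1)
            fields[name] = []
        elif not line.strip():
            name = None  # a blank line closes the current data item
        elif name is not None:
            fields[name].append(line.strip())
    return molblock, {key: "\n".join(value) for key, value in fields.items()}


def iter_sdf_records(text):
    """Yield (molblock, fields) for each record terminated by '$$$$'."""
    lines = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield _parse_record(lines)
            lines = []
        else:
            lines.append(line)


def find_inchi_mismatches(sdf_text, compute_inchi, field="InChI"):
    """Return (record index, stated InChI, computed InChI) for disagreements."""
    mismatches = []
    for index, (molblock, fields) in enumerate(iter_sdf_records(sdf_text)):
        stated = fields.get(field)
        if stated is None:
            continue  # nothing to check for this record
        computed = compute_inchi(molblock)
        if computed != stated:
            mismatches.append((index, stated, computed))
    return mismatches
```

A mismatch does not say which side is wrong, only that the drawn structure and the stated identifier disagree and the record needs a closer look.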
  • 36. So, I’m writing an article…
  • 37. With these…I will lose data ☹
  • 38. But linking with InChI …
  • 40. Data in Publications • This is not new, you know the story… • So much data of value is contained within a publication and delivered in a PDF form • PDF files, and unclear licensing/copyright, limit access to the data, so I cannot rework, reuse, repurpose, text mine etc. • “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
  • 41. “Data enable” publications? • We would LOVE to bring data out of our archive • What could we do? • Find chemical names and generate structures • Find chemical images and generate structures • Find reactions – and make a database! • Find data (MP, BP, LogP) and host. Build models! • Find figures and database them • Find spectra (and link to structures) • Validate the data algorithmically
  • 42. RSC Archive – since 1841
  • 43. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 44. Text Mining The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 45. But names = structures • Systematic names can be generated FROM chemical structures algorithmically
  • 46. But names = structures • …and structures from systematic names
  • 47. But what of trivial names? • What about trivial names, trade names, CAS numbers, multilingual names etc.?
  • 48. Searching that lipid in patents
  • 54. But Context Gives Reactions The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 56. ChemSpider as a Foundation • >30 million chemicals (and growing) • ChemSpider is free to access for everyone – and the API means people program against it • What projects can we benefit?
  • 57. Support grant-based services • Multiple European consortium-based grants • PharmaSea (FP7 funded) • Open PHACTS (IMI funded) • UK National Chemical Database Service (http://cds.rsc.org) – developing a data repository for lab data and integrating Electronic Lab Notebooks • Open Drug Discovery projects
  • 60. • 3-year Innovative Medicines Initiative project • Integrating chemistry and biology data using semantic web technologies • Open code, open data, open standards • Academics, Pharmas, Publishers… • To put medicines in the pipeline…
  • 62. All Databases We Generate… • All databases and systems we build now include generated InChIs • InChIs are facilitating discoverability via searching on Google (see Chris’ talk) but also for querying and linking
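The linking role mentioned on this slide amounts to a join on a shared structure identifier. The sketch below builds a cross-reference table from several sources keyed on InChIKey. It is an illustration only: `build_link_table` is a hypothetical helper, the source names are placeholders, and the local record IDs are made up for the example. Only the InChIKey shown (ethanol) is a real identifier.

```python
from collections import defaultdict


def build_link_table(sources):
    """Map each InChIKey to the local IDs that different sources use for it.

    `sources` has the shape {source_name: {local_id: inchikey}}.
    """
    table = defaultdict(dict)
    for source_name, entries in sources.items():
        for local_id, inchikey in entries.items():
            table[inchikey][source_name] = local_id
    return dict(table)


ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

# Local IDs below are invented placeholders, not real database identifiers.
links = build_link_table({
    "database_a": {"A-001": ETHANOL_KEY},
    "database_b": {"B-42": ETHANOL_KEY},
})
```

Looking up `links[ETHANOL_KEY]` then gives the record ID to use when linking out to each source, without the sources needing to know about each other's identifiers.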
  • 63. But we are still VERY LIMITED • RSC deals with way more than organics, inorganics, organometallics – we are building a data repository to include materials, polymers, ambiguous materials etc. • There are many plans for InChI moving forward – Markush, polymers, organometallics etc
  • 64. The great promise should be obvious • InChIs are here to stay • They will evolve, they will encompass, we will adopt and adapt • Public and private databases will federate & build a linked environment of validated data! • Data validation and standardization is needed • Open Data will continue to proliferate • InChIs are in the “Semantic Web” already
  • 65. If InChI never existed … • ChemSpider would never have been built • Database linking would suffer dramatically • The web would not be “structure searchable” • Cheminformatics tools would likely not be linking to public domain databases in the same way
  • 66. Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams