InChI for Large Molecules:
The NextMove Software Perspective
Roger Sayle & Noel O’Boyle
NextMove Software, Cambridge, UK
InChI for Large Molecules meeting, NCBI, Bethesda, MD Monday 27th October 2014
“This house believes…”
• The most important distinction in life science 
informatics is between molecular and non-molecular 
(bio)chemistry, not between chemistry and biology. 
• Fuzzy distinctions such as “small molecules”, lipids,
proteins, nucleic acids, peptides, oligosaccharides, or
terpenes are like asking how many colors there are in
a rainbow (cf. the Sapir-Whorf hypothesis).
• Schemes that encode these distinctions (such as
HELM, ISO 11238 and even RasMol) break down
when (poorly defined) categories overlap.
Peptide or not? 
cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val] 
valinomycin 
Saccharide or not?
• D-Glucopyranose / D-gluco-hexopyranose
• (2S)-2-methyloxane / (2S)-2-methyl-tetrahydropyran
Saccharide or not?
• D-Glucopyranose / D-gluco-hexopyranose
• D-Quinovopyranose / 6-deoxy-Glucopyranose / 6-deoxy-D-gluco-hexopyranose
• D-Paratopyranose / 3,6-dideoxy-Glucopyranose / 3,6-dideoxy-D-ribo-hexopyranose
• D-Amicetopyranose / 2,3,6-trideoxy-Glucopyranose / 2,3,6-trideoxy-D-erythro-hexopyranose
• (2S)-2-methyloxane / (2S)-2-methyl-tetrahydropyran
The cutting edge of biosimilarity 
• The high prevalence of potentially life-threatening 
hypersensitivity reactions to the antibody cetuximab 
(Erbitux) in some US states has been traced to its 
glycosylation [containing a Gal(a1-3)Gal epitope]. 
Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for 
galactose-alpha-1,3-galactose”, New England Journal of Medicine, 
Vol. 358, No. 11, pp. 1109-1117, 13th March 2008. 
• Similarly, human erythropoietin (EPO) alpha, beta,
delta and omega share the same primary sequence
but differ in their glycosylation patterns.
Destructive suggestion… 
• Systems based upon monomer dictionaries (such as 
HELM and PDB) are notoriously difficult to maintain. 
• The limited number of monomers in proteinogenic
peptides and natural nucleic acid sequences leads to
a false sense of security: that the set of monomers is finite.
• In practice, the number of monomers, post-translational
modifications and chemical modifications is infinite.
• Even more difficult than standardizing monomer 
definitions via a central repository, like PDB, is 
allowing local custom definitions. 
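The maintenance problem can be illustrated with a toy lookup. This is only a sketch: the three-entry `MONOMERS` table and the `expand` helper are hypothetical and are not taken from HELM, the PDB, or any NextMove tool. It shows that a dictionary-based parser stalls on the first residue it has never seen, such as the hydroxy acids in valinomycin.

```python
# Hypothetical, deliberately tiny monomer dictionary (symbol -> full name).
MONOMERS = {
    "Ala": "alanine",
    "Val": "valine",
    "Gly": "glycine",
}


def expand(residues, dictionary=MONOMERS):
    """Split residues into those the dictionary knows and those it does not."""
    known, unknown = [], []
    for symbol in residues:
        # Strip an optional stereo prefix such as "D-" before the lookup.
        base = symbol[2:] if symbol[:2] in ("D-", "L-") else symbol
        (known if base in dictionary else unknown).append(symbol)
    return known, unknown


# One repeat unit of valinomycin as written on the earlier slide.
unit = ["OAla", "Val", "D-OVal", "D-Val"]
known, unknown = expand(unit)
print("known:", known)      # residues covered by the dictionary
print("unknown:", unknown)  # residues that would need a new dictionary entry
```

Every entry in `unknown` is a request to extend the dictionary, either centrally or locally, which is the burden the slide describes.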
48 hexopyranoses
264 deoxy-hexopyranoses
9540 substituted hexopyranoses (4 most common substituents)
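The first of these counts can be reproduced by simple stereocenter arithmetic. The sketch below assumes, which the slides do not state, that the 48 covers aldohexopyranoses (five stereocenters, C1 to C5) plus 2-ketohexopyranoses (four stereocenters, C2 to C5), counting both anomers and both the D and L series. The 264 and 9540 figures depend on enumeration rules not given here and are not reproduced.

```python
# Assumed stereocenter counts per ring type (an interpretation, not from the slides).
STEREOCENTERS = {
    "aldohexopyranose": 5,    # C1 (anomeric), C2, C3, C4, C5
    "2-ketohexopyranose": 4,  # C2 (anomeric), C3, C4, C5
}

# Each stereocenter has two configurations, so each class has 2**n members.
counts = {name: 2 ** n for name, n in STEREOCENTERS.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name}: {count}")
print("total:", total)
```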
Constructive suggestion… 
• Ideally, a chemical identifier should be independent 
of the input representation or file format. 
• Duplicates between small molecules, peptides and
proteins are best determined by a single identifier,
preferably the existing InChI.
• This is possible as increases in computer power and 
storage mean that cheminformatics toolkits can 
handle huge biopolymers on modern hardware. 
Proof-of-concept 
• I’ve previously reported on Tanimoto chemical search
of PDB (80K) represented as canonical SMILES (1 GB).
• To test for duplicates and InChI key hash collisions,
we attempted to generate InChI keys for UniProt.
• The Open Babel source tree already contains patches to
the InChI library to raise the official 1024-atom limit.
• A few additional source changes also helped.
• Ultimately, InChI keys could be generated for ~99.4%
of the ~450K unique sequences in the Swiss-Prot division.
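The duplicate and collision test described above can be sketched as follows, assuming each sequence has already been converted to a (name, InChI, InChI key) record by some toolkit. The `classify` helper and the sample records are hypothetical placeholders, not real UniProt data. Records sharing an InChI are duplicates, while records sharing a key but not an InChI are hash collisions.

```python
from collections import defaultdict


def classify(records):
    """Return (duplicates, collisions) from (name, inchi, key) tuples.

    duplicates: InChI -> names, where more than one name shares the InChI.
    collisions: key -> distinct InChIs, where one key covers several InChIs.
    """
    names_by_inchi = defaultdict(list)
    inchis_by_key = defaultdict(set)
    for name, inchi, key in records:
        names_by_inchi[inchi].append(name)
        inchis_by_key[key].add(inchi)
    duplicates = {i: n for i, n in names_by_inchi.items() if len(n) > 1}
    collisions = {k: sorted(s) for k, s in inchis_by_key.items() if len(s) > 1}
    return duplicates, collisions


# Placeholder strings only; these are not valid InChIs or InChI keys.
records = [
    ("SEQ_A", "inchi-1", "KEY-1"),
    ("SEQ_B", "inchi-1", "KEY-1"),  # same structure as SEQ_A: a duplicate
    ("SEQ_C", "inchi-2", "KEY-2"),
    ("SEQ_D", "inchi-3", "KEY-2"),  # different structure, same key: a collision
]
duplicates, collisions = classify(records)
print(duplicates)
print(collisions)
```

Keeping the full InChI alongside the key is what makes the distinction possible; with keys alone, a collision would be indistinguishable from a duplicate.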
Record-breaking InChI key
• Sequence Identifier: UTP10_KLULA 
• Sequence Length: 1774 amino acids 
• Molecule size: 28509 atoms 
• InChI Length: 119699 characters 
• InChI key: PHBRSEQMAKHFGD-ZBXWIJJNSA-N 
• InChI Canonicalization Time: 73.2s 
• Canonical SMILES Length: 35408 chars 
• SMILES Canonicalization Time: 0.4s 
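For scale, the figures above can be put side by side. The sketch below checks that the reported key has the usual 14-10-1 letter layout of an InChI key and derives the time and length ratios. The regular expression is a simple layout check written for this example, not an official validator.

```python
import re

# Figures copied from the slide for UTP10_KLULA.
atoms = 28509
inchi_chars, inchi_seconds = 119699, 73.2
smiles_chars, smiles_seconds = 35408, 0.4
key = "PHBRSEQMAKHFGD-ZBXWIJJNSA-N"

# Layout check only: 14 letters, 10 letters, 1 letter, separated by hyphens.
KEY_LAYOUT = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
layout_ok = KEY_LAYOUT.fullmatch(key) is not None

slowdown = inchi_seconds / smiles_seconds  # InChI time relative to SMILES
inchi_per_atom = inchi_chars / atoms       # InChI characters per atom
smiles_per_atom = smiles_chars / atoms     # SMILES characters per atom

print("key layout ok:", layout_ok)
print(f"InChI canonicalization time / SMILES time: {slowdown:.0f}")
print(f"characters per atom: InChI {inchi_per_atom:.2f}, SMILES {smiles_per_atom:.2f}")
```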
Protein canonicalization time (three plot slides; plots not reproduced)
Conclusions
• “InChI for large molecules” simply requires 
fixing the bugs in standard InChI. 
Acknowledgements
• Lisa Sach-Peltason, Hoffmann-La Roche, Basel.
• Joann Prescott-Roy, Novartis, Boston, MA.
• Greg Landrum, Novartis, Basel, Switzerland.
• Evan Bolton, NCBI PubChem project, Bethesda, MD. 
Competing interests statement
Sugar & Splice
• PDB
• Depictions
• Common name: [5-L-aspartic acid]oxytocin
• IUPAC name: L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-alpha-aspartyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulfide
• IUPAC condensed: L-Cys(1)-L-Tyr-L-Ile-L-Gln-L-Asp-L-Cys(1)-L-Pro-L-Leu-Gly-NH2
• SMILES: [C@H]1(CCCN1C(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@@H](Cc2ccc(cc2)O)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=O)O)C(=O)N1)N)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N
• PLN: H-C(1)YIQDC(1)PLG-[NH2]
• HELM: PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
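The PLN and IUPAC condensed strings on this slide describe the same peptide, and the mapping between them is mechanical for simple cases. The sketch below is a hypothetical `pln_to_condensed` helper, not Sugar & Splice's implementation. It assumes only the subset of PLN seen here: an `H-` N-terminus, a `-[NH2]` C-terminal amide, L-residues written as one-letter codes, and `(n)` bridge labels.

```python
import re

# Standard one-letter to three-letter amino acid codes (subset needed here).
THREE_LETTER = {
    "C": "Cys", "Y": "Tyr", "I": "Ile", "Q": "Gln", "D": "Asp",
    "P": "Pro", "L": "Leu", "G": "Gly", "N": "Asn",
}


def pln_to_condensed(pln):
    """Convert a simple PLN string to IUPAC condensed form."""
    body = pln
    if body.startswith("H-"):
        body = body[2:]  # the free N-terminus is implicit in condensed form
    suffix = ""
    if body.endswith("-[NH2]"):
        body = body[:-len("-[NH2]")]
        suffix = "-NH2"
    parts = []
    # Each token is one residue letter plus an optional bridge label like "(1)".
    for letter, bridge in re.findall(r"([A-Z])(\(\d+\))?", body):
        code = THREE_LETTER[letter]
        prefix = "" if code == "Gly" else "L-"  # glycine is achiral
        parts.append(prefix + code + bridge)
    return "-".join(parts) + suffix


print(pln_to_condensed("H-C(1)YIQDC(1)PLG-[NH2]"))
```

Non-standard residues, D-amino acids and other terminal groups would each need extra rules, which is where a structure-based identifier avoids the bookkeeping.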
Peptide names imply architecture 
• Named peptides imply not only sequence but also 
N-terminal acetylation, C-terminal amidation and 
disulfide bridge topology. 
• Example named derivatives: 
– gastrin (14-17) 
– motilin amide 
– oxytocin free-acid 
– acetyl-oxytocin 
– deacetyl-abarelix 
– oxytocin reduced 
– endothelin-1 (1→3),(11→15)-bis(disulfide)
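Because such names carry structure, a parser has to recover it. As a minimal sketch, the hypothetical `disulfide_bridges` helper below pulls bridge locants out of a name. It assumes the arrow notation used on these slides (`→` or `->`) and handles nothing else.

```python
import re

# Matches "(1→3)" or "(1->6)", allowing spaces around the arrow.
BRIDGE = re.compile(r"\((\d+)\s*(?:→|->)\s*(\d+)\)")


def disulfide_bridges(name):
    """Return the (i, j) residue pairs named as bridges in a peptide name."""
    return [(int(i), int(j)) for i, j in BRIDGE.findall(name)]


print(disulfide_bridges("endothelin-1 (1→3),(11 → 15)-bis(disulfide)"))
print(disulfide_bridges("L-prolyl-L-leucyl-glycinamide (1->6)-disulfide"))
print(disulfide_bridges("gastrin (14-17)"))  # a residue range, not a bridge
```

The last call returns an empty list: `(14-17)` denotes a fragment range, so the locant syntax alone distinguishes truncation from bridging.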
InChI for Large Molecules meeting, NCBI, Bethesda, MD Monday 27th October 2014

Mais conteúdo relacionado

Destaque

Revising the Topliss Decision Tree
Revising the Topliss Decision TreeRevising the Topliss Decision Tree
Revising the Topliss Decision TreeNextMove Software
 
Chemistry and reactions from non-US patents
Chemistry and reactions from non-US patentsChemistry and reactions from non-US patents
Chemistry and reactions from non-US patentsNextMove Software
 
Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...NextMove Software
 
Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...NextMove Software
 
Functional Foods Infographic
Functional Foods InfographicFunctional Foods Infographic
Functional Foods InfographicFood Insight
 
Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...
Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...
Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...NextMove Software
 
Peptide line notations for biologics registration and patent filings
Peptide line notations for biologics registration and patent filingsPeptide line notations for biologics registration and patent filings
Peptide line notations for biologics registration and patent filingsNextMove Software
 
Using Matched Series to decide what compound to make next
Using Matched Series to decide what compound to make nextUsing Matched Series to decide what compound to make next
Using Matched Series to decide what compound to make nextNextMove Software
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...NextMove Software
 
Javaエンジニアのためのアーキテクト講座-JJUG CCC 2014 Fall
Javaエンジニアのためのアーキテクト講座-JJUG CCC 2014 FallJavaエンジニアのためのアーキテクト講座-JJUG CCC 2014 Fall
Javaエンジニアのためのアーキテクト講座-JJUG CCC 2014 FallYusuke Suzuki
 

Destaque (11)

Revising the Topliss Decision Tree
Revising the Topliss Decision TreeRevising the Topliss Decision Tree
Revising the Topliss Decision Tree
 
Chemistry and reactions from non-US patents
Chemistry and reactions from non-US patentsChemistry and reactions from non-US patents
Chemistry and reactions from non-US patents
 
Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...Representation and display of non-standard peptides using semi-systematic ami...
Representation and display of non-standard peptides using semi-systematic ami...
 
Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...Peptide Informatics - Bridging the gap between small-molecule and large-molec...
Peptide Informatics - Bridging the gap between small-molecule and large-molec...
 
Functional Foods Infographic
Functional Foods InfographicFunctional Foods Infographic
Functional Foods Infographic
 
Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...
Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...
Using Matched Molecular Series as a Predictive Tool To Optimize Biological Ac...
 
Peptide line notations for biologics registration and patent filings
Peptide line notations for biologics registration and patent filingsPeptide line notations for biologics registration and patent filings
Peptide line notations for biologics registration and patent filings
 
Using Matched Series to decide what compound to make next
Using Matched Series to decide what compound to make nextUsing Matched Series to decide what compound to make next
Using Matched Series to decide what compound to make next
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...
 
Javaエンジニアのためのアーキテクト講座-JJUG CCC 2014 Fall
Javaエンジニアのためのアーキテクト講座-JJUG CCC 2014 FallJavaエンジニアのためのアーキテクト講座-JJUG CCC 2014 Fall
Javaエンジニアのためのアーキテクト講座-JJUG CCC 2014 Fall
 
Data modeling for the business
Data modeling for the businessData modeling for the business
Data modeling for the business
 

Semelhante a InChI for Large Molecules

CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...NextMove Software
 
Reverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymesReverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymesLeighton Pritchard
 
Pasteur Institute User Story - Cheminfo Stories 2020 Day 5
Pasteur Institute User Story - Cheminfo Stories 2020 Day 5Pasteur Institute User Story - Cheminfo Stories 2020 Day 5
Pasteur Institute User Story - Cheminfo Stories 2020 Day 5ChemAxon
 
The Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in BiologyThe Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in Biologyrobertstevens65
 
Connecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics InstituteConnecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics InstituteConnected Data World
 
Ontology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesOntology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesConnected Data World
 
Hu cal platnimm alis adds
Hu cal platnimm alis addsHu cal platnimm alis adds
Hu cal platnimm alis addsBrandon Chackel
 
Primary Bioinformatics Database.pptx
Primary Bioinformatics Database.pptxPrimary Bioinformatics Database.pptx
Primary Bioinformatics Database.pptxVandana Yadav03
 
Intro to in silico drug discovery 2014
Intro to in silico drug discovery 2014Intro to in silico drug discovery 2014
Intro to in silico drug discovery 2014Lee Larcombe
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchEuropean Bioinformatics Institute
 
Pistoia Alliance webinar on Antibody structures in the PDB
Pistoia Alliance webinar on Antibody structures in the PDBPistoia Alliance webinar on Antibody structures in the PDB
Pistoia Alliance webinar on Antibody structures in the PDBPistoia Alliance
 
Immunotherapy: Novel Immunomodulatory Targets
Immunotherapy: Novel Immunomodulatory TargetsImmunotherapy: Novel Immunomodulatory Targets
Immunotherapy: Novel Immunomodulatory TargetsPaul D. Rennert
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
Types of biological databases-protein database
Types of biological databases-protein databaseTypes of biological databases-protein database
Types of biological databases-protein databasechinmayeec
 
Collaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of LifeCollaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of LifeChris Mungall
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...Jan Aerts
 
JulieKlein_Bosc2012
JulieKlein_Bosc2012JulieKlein_Bosc2012
JulieKlein_Bosc2012KUPKB_Team
 
Primary and secondary database
Primary and secondary databasePrimary and secondary database
Primary and secondary databaseKAUSHAL SAHU
 

Semelhante a InChI for Large Molecules (20)

CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
 
Reverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymesReverse-and forward-engineering specificity of carbohydrate-processing enzymes
Reverse-and forward-engineering specificity of carbohydrate-processing enzymes
 
Pasteur Institute User Story - Cheminfo Stories 2020 Day 5
Pasteur Institute User Story - Cheminfo Stories 2020 Day 5Pasteur Institute User Story - Cheminfo Stories 2020 Day 5
Pasteur Institute User Story - Cheminfo Stories 2020 Day 5
 
The Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in BiologyThe Past, Present and Future of Knowledge in Biology
The Past, Present and Future of Knowledge in Biology
 
Connecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics InstituteConnecting life sciences data at the European Bioinformatics Institute
Connecting life sciences data at the European Bioinformatics Institute
 
Ontology Services for the Biomedical Sciences
Ontology Services for the Biomedical SciencesOntology Services for the Biomedical Sciences
Ontology Services for the Biomedical Sciences
 
Hu cal platnimm alis adds
Hu cal platnimm alis addsHu cal platnimm alis adds
Hu cal platnimm alis adds
 
Primary Bioinformatics Database.pptx
Primary Bioinformatics Database.pptxPrimary Bioinformatics Database.pptx
Primary Bioinformatics Database.pptx
 
Intro to in silico drug discovery 2014
Intro to in silico drug discovery 2014Intro to in silico drug discovery 2014
Intro to in silico drug discovery 2014
 
Protein database
Protein databaseProtein database
Protein database
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven Research
 
Pistoia Alliance webinar on Antibody structures in the PDB
Pistoia Alliance webinar on Antibody structures in the PDBPistoia Alliance webinar on Antibody structures in the PDB
Pistoia Alliance webinar on Antibody structures in the PDB
 
Immunotherapy: Novel Immunomodulatory Targets
Immunotherapy: Novel Immunomodulatory TargetsImmunotherapy: Novel Immunomodulatory Targets
Immunotherapy: Novel Immunomodulatory Targets
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
SLAS Screen Design and Assay Technology SIG: SLAS2013 Presentation
SLAS Screen Design and Assay Technology SIG: SLAS2013 PresentationSLAS Screen Design and Assay Technology SIG: SLAS2013 Presentation
SLAS Screen Design and Assay Technology SIG: SLAS2013 Presentation
 
Types of biological databases-protein database
Types of biological databases-protein databaseTypes of biological databases-protein database
Types of biological databases-protein database
 
Collaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of LifeCollaboratively Creating the Knowledge Graph of Life
Collaboratively Creating the Knowledge Graph of Life
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
 
JulieKlein_Bosc2012
JulieKlein_Bosc2012JulieKlein_Bosc2012
JulieKlein_Bosc2012
 
Primary and secondary database
Primary and secondary databasePrimary and secondary database
Primary and secondary database
 

Mais de NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsNextMove Software
 

Mais de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 

InChI for Large Molecules

  • 1. InChI for Large Molecules: The NextMove Software Perspective. Roger Sayle & Noel O’Boyle, NextMove Software, Cambridge, UK. InChI for Large Molecules meeting, NCBI, Bethesda, MD, Monday 27th October 2014
  • 2. “This house believes…” • The most important distinction in life science informatics is between molecular and non-molecular (bio)chemistry, not between chemistry and biology. • Fuzzy distinctions such as “small molecules”, lipids, proteins, nucleic acids, peptides, oligosaccharides, or terpenes are like asking how many colors there are in a rainbow (cf. the Sapir-Whorf hypothesis). • Schemes that encode these distinctions (such as HELM, ISO 11238, and even RasMol) break down when (poorly defined) categories overlap.
  • 3. Peptide or not? • cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val] (valinomycin)
  • 4. Saccharide or not? • D-Glucopyranose / D-gluco-hexopyranose • (2S)-2-methyloxane / (2S)-2-methyl-tetrahydropyran
  • 5. Saccharide or not? • D-Glucopyranose / D-gluco-hexopyranose • D-Quinovopyranose / 6-deoxy-Glucopyranose / 6-deoxy-D-gluco-hexopyranose • D-Paratopyranose / 3,6-dideoxy-Glucopyranose / 3,6-dideoxy-D-ribo-hexopyranose • D-Amicetopyranose / 2,3,6-trideoxy-Glucopyranose / 2,3,6-trideoxy-D-erythro-hexopyranose • (2S)-2-methyloxane / (2S)-2-methyl-tetrahydropyran
  • 6. The cutting edge of biosimilarity • The high prevalence of potentially life-threatening hypersensitivity reactions to the antibody cetuximab (Erbitux) in some US states has been traced to its glycosylation [containing a Gal(α1-3)Gal epitope]. Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose”, New England Journal of Medicine, Vol. 358, No. 11, pp. 1109-1117, 13th March 2008. • Similarly, human erythropoietin (EPO) alpha, beta, delta and omega share the same primary sequence, but differ in their glycosylation patterns.
  • 7. Destructive suggestion… • Systems based upon monomer dictionaries (such as HELM and PDB) are notoriously difficult to maintain. • The limited number of monomers in proteinogenic peptides and natural nucleic acid sequences leads to a false sense of security: the belief that the set of monomers is finite. • In practice, the number of monomers, post-translational modifications and chemical modifications is infinite. • Even more difficult than standardizing monomer definitions via a central repository, like PDB, is allowing local custom definitions.
  • 8. 48 hexopyranoses
  • 10. 9540 substituted hexopyranoses (4 most common substituents)
  • 11. Constructive suggestion… • Ideally, a chemical identifier should be independent of the input representation or file format. • Duplicates between small molecules, peptides and proteins are best determined by a single identifier, preferably the existing InChI. • This is possible because increases in computer power and storage mean that cheminformatics toolkits can handle huge biopolymers on modern hardware.
  • 12. Proof-of-concept • I’ve previously reported on Tanimoto chemical search of PDB (80K) represented as canonical SMILES (1Gb). • To test for duplicates and InChI key hash collisions, we attempted to generate InChI keys for UniProt. • The Open Babel source tree already contains patches to the InChI library that raise the official 1024-atom limit. • A few additional source changes also helped. • Ultimately, InChI keys could be generated for ~99.4% of the ~450K unique sequences in the Swiss-Prot division.
  • 13. Record-breaking InChI key • Sequence identifier: UTP10_KLULA • Sequence length: 1774 amino acids • Molecule size: 28509 atoms • InChI length: 119699 characters • InChI key: PHBRSEQMAKHFGD-ZBXWIJJNSA-N • InChI canonicalization time: 73.2s • Canonical SMILES length: 35408 characters • SMILES canonicalization time: 0.4s
  • 14. Protein canonicalization time
  • 15. Protein canonicalization time
  • 16. Protein canonicalization time
  • 17. Conclusions • “InChI for large molecules” simply requires fixing the bugs in standard InChI.
  • 18. Acknowledgements • Lisa Sach-Peltason, Hoffmann-La Roche, Basel. • Joann Prescott-Roy, Novartis, Boston, MA. • Greg Landrum, Novartis, Basel, Switzerland. • Evan Bolton, NCBI PubChem project, Bethesda, MD.
  • 19. Competing interests statement • Sugar & Splice (diagram of interconverted representations, with depictions) • PDB • IUPAC condensed: L-Cys(1)-L-Tyr-L-Ile-L-Gln-L-Asp-L-Cys(1)-L-Pro-L-Leu-Gly-NH2 • SMILES: [C@H]1(CCCN1C(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@@H](Cc2ccc(cc2)O)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=O)O)C(=O)N1)N)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N • IUPAC name: L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-alpha-aspartyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulfide • Common name: [5-L-aspartic acid]oxytocin • PLN: H-C(1)YIQDC(1)PLG-[NH2] • HELM: PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$
  • 20. Peptide names imply architecture • Named peptides imply not only sequence but also N-terminal acetylation, C-terminal amidation and disulfide bridge topology. • Example named derivatives: gastrin (14-17); motilin amide; oxytocin free-acid; acetyl-oxytocin; deacetyl-abarelix; oxytocin reduced; endothelin-1 (1→3),(11→15)-bis(disulfide)
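
The point of slide 20, that a peptide name fixes the termini and the disulfide topology as well as the sequence, can be illustrated with a minimal Python sketch. It is not how Sugar & Splice or any other tool works. The `Peptide` class, its field names and the `iupac_condensed` method are hypothetical, the residue table covers only the letters used here, and the "Ac-" and "-OH" spellings for the termini are an assumed convention. The output style follows the IUPAC condensed string shown on slide 19.

```python
from dataclasses import dataclass, replace

# Three-letter codes for the residues used in this example only (not a full table).
THREE_LETTER = {
    "C": "Cys", "Y": "Tyr", "I": "Ile", "Q": "Gln", "N": "Asn",
    "D": "Asp", "P": "Pro", "L": "Leu", "G": "Gly",
}


@dataclass(frozen=True)
class Peptide:
    """Hypothetical record: sequence plus the 'architecture' a name implies."""
    sequence: str            # one-letter codes, L-amino acids assumed
    n_acetyl: bool = False   # N-terminal acetylation
    c_amide: bool = False    # C-terminal amidation
    disulfides: tuple = ()   # pairs of 1-based cysteine positions

    def iupac_condensed(self) -> str:
        # Give both cysteines of each bridge the same numeric label.
        labels = {}
        for number, (i, j) in enumerate(self.disulfides, start=1):
            labels[i] = labels[j] = number
        parts = []
        for pos, code in enumerate(self.sequence, start=1):
            name = THREE_LETTER[code]
            # Glycine is achiral, so it carries no L- prefix.
            residue = name if code == "G" else "L-" + name
            if pos in labels:
                residue += f"({labels[pos]})"
            parts.append(residue)
        text = "-".join(parts)
        if self.n_acetyl:
            text = "Ac-" + text  # assumed spelling for the acetyl cap
        return text + ("-NH2" if self.c_amide else "-OH")


# Oxytocin: amidated C-terminus and a Cys1-Cys6 disulfide bridge.
oxytocin = Peptide("CYIQNCPLG", c_amide=True, disulfides=((1, 6),))
variants = {
    "oxytocin": oxytocin,
    "[5-L-aspartic acid]oxytocin": replace(oxytocin, sequence="CYIQDCPLG"),
    "oxytocin free-acid": replace(oxytocin, c_amide=False),
    "acetyl-oxytocin": replace(oxytocin, n_acetyl=True),
    "oxytocin reduced": replace(oxytocin, disulfides=()),
}
for name, peptide in variants.items():
    print(f"{name}: {peptide.iupac_condensed()}")
```

Each named derivative differs from oxytocin in exactly one field, which is why the name alone is enough to recover the full structure.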