InChI for Large Molecules
1. InChI for large molecules:
The NextMove Software perspective
Roger Sayle & Noel O’Boyle
NextMove Software, Cambridge, UK
InChI for Large Molecules meeting, NCBI, Bethesda, MD Monday 27th October 2014
2. “This house believes…”
• The most important distinction in life science
informatics is between molecular and non-molecular
(bio)chemistry, not between chemistry and biology.
• Drawing fuzzy distinctions between “small molecules”, lipids,
proteins, nucleic acids, peptides, oligosaccharides, and
terpenes is like asking how many colors there are in
a rainbow (cf. the Sapir-Whorf hypothesis).
• Schemes that encode these distinctions (such as
HELM, ISO 11238, and even RasMol) break down
when (poorly defined) categories overlap.
3. Peptide or not?
cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val]
valinomycin
4. Saccharide or not?
D-Glucopyranose
D-gluco-hexopyranose
(2S)-2-methyloxane
(2S)-2-methyl-tetrahydropyran
6. The cutting edge of biosimilarity
• The high prevalence of potentially life-threatening
hypersensitivity reactions to the antibody cetuximab
(Erbitux) in some US states has been traced to its
glycosylation [containing a Gal(α1-3)Gal epitope].
Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for
galactose-alpha-1,3-galactose”, New England Journal of Medicine,
Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.
• Similarly, human erythropoietin (EPO) alpha, beta,
delta and omega share the same primary sequence,
but differ in their glycosylation patterns.
7. Destructive suggestion…
• Systems based upon monomer dictionaries (such as
HELM and PDB) are notoriously difficult to maintain.
• The limited number of monomers in proteinogenic
peptides and natural nucleic acid sequences leads to
a false sense of security: that the set of monomers is finite.
• In practice, the number of monomers and of post-translational
and chemical modifications is infinite.
• Even more difficult than standardizing monomer
definitions via a central repository, like PDB, is
allowing local custom definitions.
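The maintenance problem can be illustrated with a minimal Python sketch. It checks the repeating unit of valinomycin (in the notation of slide 3) against a toy dictionary that holds only the 20 proteinogenic amino acids. The names `PROTEINOGENIC`, `split_monomers` and `unknown_monomers` are hypothetical, and the tokenizer is an assumption about that notation, not the parser used by HELM or PDB.

```python
# Toy monomer dictionary: three-letter codes of the 20 proteinogenic amino acids.
PROTEINOGENIC = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}

def split_monomers(sequence):
    """Split a hyphen-separated sequence, keeping D-/L- prefixes attached."""
    tokens, prefix = [], ""
    for part in sequence.split("-"):
        if part in ("D", "L"):
            prefix = part + "-"  # configuration prefix belongs to the next monomer
        else:
            tokens.append(prefix + part)
            prefix = ""
    return tokens

def unknown_monomers(sequence, dictionary):
    """Return monomer names (configuration prefix stripped) missing from the dictionary."""
    missing = []
    for token in split_monomers(sequence):
        name = token[2:] if token[:2] in ("D-", "L-") else token
        if name not in dictionary and name not in missing:
            missing.append(name)
    return missing

repeat_unit = "OAla-Val-D-OVal-D-Val"  # one third of the valinomycin ring
print(split_monomers(repeat_unit))                   # ['OAla', 'Val', 'D-OVal', 'D-Val']
print(unknown_monomers(repeat_unit, PROTEINOGENIC))  # ['OAla', 'OVal']
```

The two hydroxy-acid residues fall outside the dictionary, so every such molecule forces either a new central entry or a local custom definition.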
8. 48 hexopyranoses
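The slide does not say how this figure is reached. One count that gives 48, assuming it covers both aldohexopyranoses and 2-ketohexopyranoses and includes both anomers and both the D and L series, is sketched below in Python. The function name is hypothetical.

```python
from itertools import product

def count_stereoisomers(stereocenters):
    """Count configurations by enumerating R/S at each listed stereocenter."""
    return sum(1 for _ in product("RS", repeat=len(stereocenters)))

# Aldohexopyranose ring: stereocenters at C1 (anomeric) and C2 to C5.
aldo = count_stereoisomers(["C1", "C2", "C3", "C4", "C5"])
# 2-Ketohexopyranose ring: stereocenters at C2 (anomeric) and C3 to C5.
keto = count_stereoisomers(["C2", "C3", "C4", "C5"])

print(aldo, keto, aldo + keto)  # 32 16 48
```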
11. Constructive suggestion…
• Ideally, a chemical identifier should be independent
of the input representation or file format.
• Duplicates between small molecules, peptides and
proteins are best determined by a single identifier,
preferably the existing InChI.
• This is possible as increases in computer power and
storage mean that cheminformatics toolkits can
handle huge biopolymers on modern hardware.
12. Proof-of-concept
• I’ve previously reported on Tanimoto chemical search
of PDB (80K) represented as canonical SMILES (1Gb).
• To test for duplicates and InChIKey hash collisions,
we attempted to generate InChIKeys for UniProt.
• The Open Babel source tree already contains patches to
the InChI library to increase the official 1024 atom limit.
• A few additional source changes also helped.
• Ultimately, InChIKeys could be generated for ~99.4%
of the ~450K unique sequences in the Swiss-Prot division.
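The duplicate and collision test can be sketched as follows, assuming each record already carries its full InChI string and the key derived from it. A duplicate is the same InChI appearing under more than one record; a collision is one key shared by different InChI strings. The function name, record IDs and placeholder strings are hypothetical; they are not real InChIs, keys, or UniProt accessions, and this is not the code used for the experiment above.

```python
from collections import defaultdict

def audit_inchikeys(records):
    """Group (record_id, inchi, inchikey) tuples; report duplicates and collisions."""
    by_key = defaultdict(lambda: defaultdict(list))
    for record_id, inchi, key in records:
        by_key[key][inchi].append(record_id)

    duplicates = {}  # InChI -> record IDs sharing that exact structure
    collisions = {}  # key -> distinct InChIs that hash to the same key
    for key, by_inchi in by_key.items():
        if len(by_inchi) > 1:
            collisions[key] = sorted(by_inchi)
        for inchi, ids in by_inchi.items():
            if len(ids) > 1:
                duplicates[inchi] = ids
    return duplicates, collisions

# Placeholder data (hypothetical): not real InChIs, keys, or accessions.
records = [
    ("rec1", "INCHI-A", "KEY-1"),
    ("rec2", "INCHI-A", "KEY-1"),  # duplicate of rec1
    ("rec3", "INCHI-B", "KEY-2"),
    ("rec4", "INCHI-C", "KEY-2"),  # collides with rec3
]
duplicates, collisions = audit_inchikeys(records)
print(duplicates)  # {'INCHI-A': ['rec1', 'rec2']}
print(collisions)  # {'KEY-2': ['INCHI-B', 'INCHI-C']}
```

Keeping the full InChI next to its key is what lets a true duplicate be told apart from a hash collision.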
17. Conclusions
• “InChI for large molecules” simply requires
fixing the bugs in standard InChI.
18. Acknowledgements
• Lisa Sach-Peltason, Hoffmann-La Roche, Basel.
• Joann Prescott-Roy, Novartis, Boston, MA.
• Greg Landrum, Novartis, Basel, Switzerland.
• Evan Bolton, NCBI PubChem project, Bethesda, MD.