Presented to David Gloriam's Group, Copenhagen, Feb 2020
The theme will be presented from the perspective of both past involvement in peptide curation in the Guide to Pharmacology (GtoPdb) and current searching for bioactive peptides in the wider ecosystem that includes ChEMBL and PubChem. The core problem is that peptides hang in limbo land between bioinformatics (BLAST) and cheminformatics (Tanimoto), neither of which provides optimal searching. Curating peptides in GtoPdb presents many challenges, including mapping endogenous peptides to Swiss-Prot cleavage annotations. For synthetic peptides, equivocal specification of modifications and of the exact positions of radiolabels is also problematic. However, target-mapped, citation-supported quantitative binding parameters are curated where possible. For those peptides falling below the PubChem CID SMILES limit of approximately 70 residues, GtoPdb has been using Sugar and Splice from NextMove Software to convert into CIDs. Specific problems associated with finding bioactive peptides in databases will be outlined.
1. Trials and tribulations of curating and searching bioactive peptides in databases
Christopher Southan
University of Copenhagen, Feb 2020
Host: David Gloriam
3. Outline
• Peptide tribulations
• Introducing GtoPdb
• GtoPdb peptide content and stats
• PubChem peptidic pros and cons
• Getting more peptides > SMILES
4. Bad news: neither GtoPdb nor ChEMBL nor PubChem search-index their peptides
5. Tribulations with peptides
• Difficult to define structurally
• Endogenous peptide activities can be complex many-to-many systems
• Author specifications often insufficient for complete molecular definition
• Structural equivocalities slip through the editor/referee net
• Correct IUPAC peptide nomenclature use for modifications is rare
• Exact location of radiolabel often not specified
• Absence of purity verification and/or in vivo stability against proteolytic clipping
• Noisy peptide name-to-structure (n2s) mappings
• SMILES only adequate for ~ 70 residues
• Image rendering not standardised
• Searching patents for peptide prior art more difficult than for small molecules
• Literature extraction > databases proportionally lower than for small molecules
• Author database submissions for bioactive peptides non-existent
• Species “zoo” for venom peptides and their names
• Conjugates (e.g. peptide + linker + protein) even more difficult
• The PIR RESID Database of Protein Modifications is no longer maintained
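As a small illustration of how little a bare sequence string specifies, the sketch below expands a one-letter sequence into three-letter codes with explicit termini. The function name `to_three_letter` and the `H-`/`-OH` terminus defaults are assumptions made for this example, not a GtoPdb convention, and Leu-enkephalin (YGGFL) is used only as a toy input. Disulfide bridges, D-residues, radiolabels and other modifications are deliberately not handled, which is exactly the information that tends to be under-specified.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence, n_term="H", c_term="OH"):
    """Expand a one-letter sequence, making both termini explicit.

    n_term and c_term are free-text labels (hypothetical defaults), so an
    amidated C-terminus can be written with c_term="NH2".
    """
    residues = []
    for code in sequence.upper():
        if code not in THREE_LETTER:
            # Refuse to guess: non-standard residues need explicit curation.
            raise ValueError(f"non-standard residue code: {code!r}")
        residues.append(THREE_LETTER[code])
    return "-".join([n_term, *residues, c_term])


print(to_three_letter("YGGFL"))
print(to_three_letter("YGGFL", c_term="NH2"))
```

Raising on any unknown code, rather than silently passing it through, mirrors the curation problem: an unrecognised residue is a gap to resolve, not something to paper over.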
7. Introducing the IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb)
• IUPHAR = International Union of Basic and Clinical Pharmacology, BPS = British Pharmacological Society
• Molecular mechanism of action (mmoa) mapping primary & secondary targets
• Release cycle time (with PubChem refreshes) ~ 2 months
• Seven NAR Annual Database issues, latest as PMID: 31691834 (2020)
• Every 2 years distilled into the British Journal of Pharmacology “Concise Guide to PHARMACOLOGY” as a nine-paper series (see PMID 29055037) with outlinks
• Curates selected quality compounds for pharmacology research in silico, in vitro, in cellulo, in vivo, in clinico
• An ELIXIR UK Node resource since 2016
8. Expert-curated, citation-provenanced, quantitative binding data
Document > assay > result > compound > location > protein target
D-A-R-C-L-P
Where “C” is not a small molecule, GtoPdb has ~ 2000 peptides included in the ~ 9000 substances we submit to PubChem
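The D-A-R-C-L-P chain can be pictured as one record with six linked fields. The sketch below is only an illustration of that shape: the class name `BindingRecord`, the field names, and the free-text field types are assumptions, not the GtoPdb schema, and every value other than the endothelin-1 CID quoted later in these slides is an explicit placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingRecord:
    """One curated data point following Document > assay > result >
    compound > location > protein target (hypothetical field names)."""
    document: str        # citation providing provenance, e.g. a PMID
    assay: str           # how the measurement was made
    result: str          # quantitative binding parameter as reported
    compound: str        # the ligand; may be a peptide rather than a small molecule
    location: str        # kept as free text; meaning as used on the slide
    protein_target: str  # the mapped target

    def chain(self) -> str:
        # Render the record in the same left-to-right order as the slide.
        return " > ".join([
            self.document, self.assay, self.result,
            self.compound, self.location, self.protein_target,
        ])


# Placeholder values only: nothing here is a curated GtoPdb data point.
record = BindingRecord(
    document="PMID:<placeholder>",
    assay="<assay description>",
    result="<parameter and value>",
    compound="CID 91928636",
    location="<placeholder>",
    protein_target="<target identifier>",
)
print(record.chain())
```

Keeping the record immutable (`frozen=True`) reflects the idea that a curated data point is tied to its citation and should be replaced, not edited in place.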
14. The peptidic triple-whammy
Endothelin-1, CID 91928636, 1470 “Similar Compounds” and top-100 BLAST hits
1. Too big to search or cluster by SMILES
2. Too small to BLAST cleanly (and sans PTMs)
3. Too many species splits for precursors
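For readers less familiar with the cheminformatics side, the Tanimoto coefficient is simply shared features divided by total distinct features. The sketch below applies it to overlapping residue 3-mers instead of chemical fingerprints; that substitution is made purely for illustration and is not how PubChem computes its similar compounds. The human endothelin-1 sequence is not given on the slides and is supplied here as an assumption, and the M7L variant is hypothetical.

```python
def kmers(sequence, k=3):
    """Return the set of overlapping k-residue windows in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    union = features_a | features_b
    if not union:
        return 0.0  # two empty feature sets: define similarity as zero
    return len(features_a & features_b) / len(union)


# Assumption: human endothelin-1 sequence (21 residues), not from the slides.
ET1 = "CSCSSLMDKECVYFCHLDIIW"
# Hypothetical variant with a single Met7 -> Leu substitution.
ET1_M7L = "CSCSSLLDKECVYFCHLDIIW"

print(tanimoto(kmers(ET1), kmers(ET1)))      # identical sequences
print(tanimoto(kmers(ET1), kmers(ET1_M7L)))  # one substitution
```

With so few windows in a 21-residue peptide, a single substitution changes three of the nineteen 3-mers, so the score drops well below 1.0 even though the two sequences are nearly identical.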
15. Swiss-Prot precursor annotation
• Evidence support for endogenous processing curated from the primary literature
• PTMs are indicated but text-only
• Very little mass-spec verification of existence in vivo
• No standardised accession identifiers
• Difficult to query across (mixed feature keys)
• No secondary bioactivity annotation (e.g. from most of PubMed)
• No cross-pointers (e.g. to PubChem or RefSeq)
16. Will the real Endothelin please stand up?
• Submissions mixed between SMILES (CIDs) and sequence strings (SIDs)
• "endothelin 1"[CompleteSynonym] > 6 CIDs > 36 SIDs (10 SID-only)
• “MW 2491.9140 NOT endothelin 1” > 16 CIDs > 23 SIDs (some unnamed)
• BioAssay splitting is problematic
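The molecular-weight query works because the mass is a property of the structure, not of whatever name a submitter attached. As a rough check, the sketch below estimates the average molecular weight of endothelin-1 from its sequence. The sequence and the count of two disulfide bonds are assumptions supplied for this example (they are not on the slides), the residue masses are standard average values, and the function name `average_mw` is hypothetical.

```python
# Average residue masses (Da) for the 20 standard amino acids,
# i.e. each amino acid minus one water.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528    # added once for the free N- and C-termini
HYDROGEN = 1.00794  # two hydrogens are lost per disulfide bond


def average_mw(sequence, disulfides=0):
    """Average MW of an unmodified linear peptide, minus 2 H per disulfide."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER
    return mass - 2 * HYDROGEN * disulfides


# Assumptions: human endothelin-1 sequence with two disulfide bonds.
ET1 = "CSCSSLMDKECVYFCHLDIIW"
print(round(average_mw(ET1, disulfides=2), 2))
```

The estimate lands close to the 2491.9 used in the query above; small differences in the last decimals depend on which atomic weights a given database uses.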
19. The Next Move move (Noel O'Boyle)
https://www.nextmovesoftware.com/talks/OBoyle_PubChemBiologics_ACS_201708.pdf
20. NextMove Biologics 8699 SIDs > 4969 CIDs
Low bioactivity annotation (e.g. 259 in ChEMBL from 1.9 million CIDs, 36 in GtoPdb from 7674)
21. Acknowledgments and info
• Past and present GtoPdb curators working on peptide entries
• The NextMove team for Sugar & Splice support and their peptide processing in PubChem
• Lin Yikai, M.Sc. project: “Developing bio/cheminformatics methods for converting bioactive peptide structures into machine-readable formats”
• Anna Gaulton for ChEMBL FASTA sequences
• Paul Thiessen at PubChem for peptide CIDs