Peptide tribulations

Trials and tribulations of curating and
searching bioactive peptides in
databases
Christopher Southan
University of Copenhagen, Feb 2020
Host: David Gloriam

Abstract
The theme will be presented from the perspective of both past
involvement in peptide curation in the Guide to Pharmacology
(GtoPdb) and in current searching for bioactive peptides in the wider
ecosystem that includes ChEMBL and PubChem. The core problem
is that peptides hang in limbo between bioinformatics (BLAST)
and cheminformatics (Tanimoto), neither of which provides optimal
searching. Curating peptides in GtoPdb presents many challenges,
including mapping endogenous peptides to Swiss-Prot cleavage
annotations. For synthetic peptides, equivocal specification of
modifications and exact positions of radiolabels are also problematic.
However, target-mapped citation-supported quantitative binding
parameters are curated where possible. For those peptides falling
below the PubChem CID SMILES limit of approximately 70 residues,
GtoPdb has been using Sugar and Splice from NextMove Software to
convert into CIDs. Specific problems associated with finding
bioactive peptides in databases will be outlined.

Outline
• Peptide tribulations
• Introducing GtoPdb
• GtoPdb peptide content and stats
• PubChem peptidic pros and cons
• Getting more peptides > SMILES

Bad news: neither GtoPdb nor ChEMBL nor PubChem
search-index their peptides

Tribulations with peptides
• Difficult to define structurally
• Endogenous peptide activities can be complex many-to-many systems
• Author specifications often insufficient for complete molecular definition
• Structural equivocalities slip through the editor/referee net
• Correct IUPAC peptide nomenclature use for modifications is rare
• Exact location of radiolabel often not specified
• Absence of purity verification and/or in vivo stability against proteolytic clipping
• Noisy peptide name-to-structure (n2s) mappings
• SMILES only adequate for ~ 70 residues
• Image rendering not standardised
• Searching patents for peptide prior art more difficult than for small molecules
• Literature extraction > databases proportionally lower than for small molecules
• Author database submissions for bioactive peptides non-existent
• Species "zoo" for venom peptides and their names
• Conjugates (e.g. peptide + linker + protein) even more difficult
• The PIR RESID Database of Protein Modifications is no longer maintained

GtoPdb > NCBI Entrez PubMed < > PubChem

Introducing the IUPHAR/BPS Guide to
PHARMACOLOGY (GtoPdb)
• IUPHAR = International Union of Basic and Clinical Pharmacology, BPS = British
Pharmacological Society
• Molecular mechanism of action (mmoa) mapping primary & secondary targets
• Release cycle time (with PubChem refreshes) ~ 2 months
• Seven NAR Annual Database issues, latest as PMID: 31691834 (2020)
• Every 2 years distilled into the British Journal of Pharmacology “Concise Guide
to PHARMACOLOGY” as a nine-paper series (see PMID 29055037) with outlinks
• Curates selected quality compounds for pharmacology research in silico, in
vitro, in cellulo, in vivo, in clinico
• An ELIXIR UK Node resource since 2016

Expert-curated, citation provenanced,
quantitative binding data
Document > assay > result > compound > location > protein target
D-A-R-C-L-P
Where “C” is not a small molecule, GtoPdb has ~ 2000 peptides included in
the ~ 9000 substances we submit to PubChem
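
The D-A-R-C-L-P chain above can be pictured as one record per curated binding result. The following is only a minimal sketch of such a record, not the GtoPdb schema: the class name, field names and example values are all hypothetical placeholders, and the slide does not define what "location" covers, so it is kept as free text here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingRecord:
    """One curated result; all field names are illustrative assumptions."""
    document: str        # D: citation for the source paper (placeholder format)
    assay: str           # A: short description of the assay
    result: str          # R: quantitative value with its type, kept as text here
    compound: str        # C: ligand name or identifier (may be a peptide)
    location: str        # L: not defined on the slide; free text in this sketch
    protein_target: str  # P: target name or accession


def as_chain(record: BindingRecord) -> str:
    """Render the record in the D > A > R > C > L > P order used on the slide."""
    parts = [record.document, record.assay, record.result,
             record.compound, record.location, record.protein_target]
    return " > ".join(parts)


# Placeholder values only; none of these are real curated data.
example = BindingRecord(
    document="example-citation",
    assay="example-binding-assay",
    result="example-affinity-value",
    compound="example-peptide",
    location="example-location",
    protein_target="example-target",
)
print(as_chain(example))
```

Keeping every link of the chain in one immutable record makes it harder to store a result without its citation or its target.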

Endogenous peptides (786)
http://www.guidetopharmacology.org/GRAC/LigandListForward?type=Endogenous-peptide&database=all

Non-endogenous peptides (1310)
http://www.guidetopharmacology.org/GRAC/LigandListForward?type=Peptide&database=all

GtoPdb peptide stats (release 2019.4)
• Peptide ligands/all ligands = 22%
• Ligands with quantitative binding data/all ligands = 75%
• Peptides with quantitative binding data/all peptides = 63%
• CID peptides with quantitative binding data/all CID peptides = 89%
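
As a rough cross-check, the two list counts quoted earlier (786 endogenous and 1310 non-endogenous peptides) can be combined with the percentages above. This is only a back-of-envelope sketch: it assumes the list counts and the release 2019.4 percentages describe the same data, which the slides do not confirm, and the variable names are illustrative.

```python
# Counts taken from the two GtoPdb peptide list slides.
endogenous_peptides = 786
non_endogenous_peptides = 1310
total_peptides = endogenous_peptides + non_endogenous_peptides  # 2096

# Fractions taken from the stats slide (release 2019.4).
peptide_fraction_of_ligands = 0.22
peptides_with_binding_data_fraction = 0.63

# Assumption: both sets of figures refer to the same release.
implied_all_ligands = total_peptides / peptide_fraction_of_ligands
implied_peptides_with_data = total_peptides * peptides_with_binding_data_fraction

print(total_peptides)                     # 2096
print(round(implied_all_ligands))         # 9527
print(round(implied_peptides_with_data))  # 1320
```

The implied ligand total is of the same order as the ~ 9000 substances submitted to PubChem, so the figures are at least mutually consistent.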

Endothelin-1 in GtoPdb (before the SMILES backfill)

GtoPdb Entrez linkage (after the 2019 backfill)

The peptidic triple-whammy
Endothelin-1, CID 91928636, 1470 "Similar Compounds" and top-100 BLAST hits
1. Too big to search or cluster by SMILES
2. Too small to BLAST cleanly (and sans PTMs)
3. Too many species splits for precursors
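
To make the two search paradigms concrete, the sketch below contrasts a Tanimoto coefficient with a positional sequence identity on two toy peptides. It is an illustration under stated assumptions only: real cheminformatics searches compute Tanimoto over chemical fingerprints derived from the structure, whereas here overlapping 3-residue fragments stand in for fingerprint features, and both sequences are invented.

```python
from typing import Set


def kmers(sequence: str, k: int = 3) -> Set[str]:
    """Overlapping k-residue fragments; a toy stand-in for fingerprint features."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) coefficient: shared features / all features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def identity(seq1: str, seq2: str) -> float:
    """Fraction of identical positions for two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(x == y for x, y in zip(seq1, seq2)) / len(seq1)


# Two invented 9-residue peptides differing by a single substitution (I > L).
peptide_a = "ACDEFGHIK"
peptide_b = "ACDEFGHLK"

print(round(identity(peptide_a, peptide_b), 3))                # 0.889
print(round(tanimoto(kmers(peptide_a), kmers(peptide_b)), 3))  # 0.556
```

A single substitution leaves the identity high but removes two of the seven fragments on each side, which hints at why the two kinds of score rank the same pair of peptides differently.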

Swiss-Prot precursor annotation
• Evidence support for endogenous processing curated from the primary literature
• PTMs are indicated but text-only
• Very little mass-spec verification of existence in vivo
• No standardised accession identifiers
• Difficult to query across (mixed feature keys)
• No secondary bioactivity annotation (e.g. from most of PubMed)
• No cross-pointers (e.g. to PubChem or RefSeq)

Will the real Endothelin please stand up?
• Submissions mixed between SMILES (CIDs) and sequence strings (SIDs)
• "endothelin 1"[CompleteSynonym] > 6 CIDs > 36 SIDs (10 SID-only)
• "MW 2491.9140 NOT endothelin 1" > 16 CIDs > 23 SIDs (some unnamed)
• BioAssay splitting is problematic
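
The synonym query above can also be sent through the NCBI E-utilities rather than the web interface. The sketch below only builds the request URL and makes no network call; it assumes the `pccompound` Entrez database and the `CompleteSynonym` field behave as in the web query, and the helper name `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 100) -> str:
    """Build an ESearch URL for an Entrez database (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# The exact-synonym query from the slide, against PubChem Compound.
cid_query_url = esearch_url("pccompound", '"endothelin 1"[CompleteSynonym]')
print(cid_query_url)
```

Counts returned today will differ from the 6 CIDs on the slide, since PubChem content changes with every submission cycle.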

Hierarchical Editing Language for Macromolecules (HELM)

GtoPdb push:
Peptides > S&S > SMILES > SIDs > CIDs
http://www.guidetopharmacology.org/GRAC/LigandDisplayForward?ligandId=3854
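
The first step of this push is deciding which peptides can go down the SMILES route at all. The following is a minimal sketch of that triage, not the GtoPdb or Sugar & Splice workflow: the 70-residue threshold is the approximate figure quoted in this talk, and the function name, the route labels and the rule of flagging non-standard letters for manual curation are assumptions made for illustration.

```python
# Approximate PubChem CID SMILES limit quoted in the talk; the boundary is fuzzy.
SMILES_RESIDUE_LIMIT = 70

# One-letter codes of the 20 standard amino acids.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def triage_peptide(sequence: str) -> str:
    """Return a hypothetical route label for a one-letter peptide sequence."""
    residues = sequence.strip().upper()
    if not residues or not set(residues) <= STANDARD_RESIDUES:
        # Modified or ambiguous residues need a curator, not a script.
        return "manual-curation"
    if len(residues) <= SMILES_RESIDUE_LIMIT:
        return "convert-to-smiles"
    return "sequence-only"


print(triage_peptide("A" * 21))  # convert-to-smiles
print(triage_peptide("A" * 71))  # sequence-only
print(triage_peptide("ACXZ"))    # manual-curation
```

Only peptides in the first group would be handed to a sequence-to-SMILES converter before submission as SIDs.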

The Next Move move (Noel O'Boyle)
https://www.nextmovesoftware.com/talks/OBoyle_PubChemBiologics_ACS_201708.pdf

NextMove Biologics 8699 SIDs > 4969 CIDs
Low bioactivity annotation (e.g. 259 in ChEMBL
from 1.9 million CIDs, 36 in GtoPdb from 7674)

Acknowledgments and info
• Past and present GtoPdb curators working on peptide entries
• The NextMove team for Sugar & Splice support and their peptide processing in PubChem
• Lin Yikai, M.Sc. project: "Developing bio/cheminformatics methods for converting
bioactive peptide structures into machine-readable formats"
• Anna Gaulton for ChEMBL FASTA sequences
• Paul Thiessen of PubChem for peptide CIDs
