Protein Sequence, Structure, and Functional Databases: UniProtKB, Swiss-Prot, TrEMBL, PIR, MIPS, PROSITE, PRINTS, BLOCKS, Pfam, NRDB, OWL, PDB, SCOP, CATH, NDB, PQS, SYSTERS, and Motif. Presented at UGC Sponsored National Workshop on Bioinformatics and Sequence Analysis conducted by Nesamony Memorial Christian College, Marthandam on 9th and 10th October, 2017 by Prof. T. Ashok Kumar
2. Computational Terms & Definitions
Protein Sequence – 20 AA characters [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y] in sequence
Protein Structure – 3D of atomic co-ordinates [x-axis, y-axis, z-axis]
Types of Biological Databases – [Flat-file Database = Plain text, Relational Database = Tables of records
linked by keys, Object-oriented Database = Objects with attributes]
3D Atom Model – [Sphere = Atom, Cylinder = Bond, Dotted Line = Bond Interaction]
Sequence Alignment – [Match = Identical Character, Mismatch = Different Character, Gap = No Corresponding
Character, Word = Sub-string, Sequence = Super-string, Score = Rating, Identity = Proportion of aligned
positions with identical characters]
Motif – Short, conserved sequence associated with a distinct function
Domain – Evolutionarily conserved sequence region that corresponds to a structurally independent 3D
unit associated with a particular functional role. It is usually much larger than a motif
Pattern – Sequence with symbol representation for an expression. Example: N-{P}-[ST]-{P}-A(2,3).
Regular Expression – Representation format for a sequence motif, which includes positional information
for conserved and partly conserved residues. Similar to Pattern, but applies to MSA
Profile – Scoring matrix that represents a multiple sequence alignment. It contains probability or
frequency values of residues for each aligned position in the alignment including gaps
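The Identity and Profile definitions above can be made concrete with a minimal sketch. The function names, the toy alignment, and the choice to count identity over all columns except gap-gap columns are assumptions made here for illustration; other identity conventions exist.

```python
from collections import Counter


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns holding the same residue in both sequences.

    Columns where both sequences have a gap are ignored (an assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns) if columns else 0.0


def frequency_profile(msa: list[str]) -> list[dict[str, float]]:
    """One dict per alignment column: symbol (gap included) -> frequency."""
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({symbol: n / len(column) for symbol, n in counts.items()})
    return profile


# Toy multiple sequence alignment (hypothetical data).
msa = ["ACD-F", "ACDEF", "AGD-F"]
print(percent_identity(msa[0], msa[1]))
for position, column in enumerate(frequency_profile(msa), start=1):
    print(position, column)
```

A real profile would usually convert these frequencies into log-odds scores against background residue frequencies; that step is omitted here.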
3. UniProtKB/Swiss-Prot/TrEMBL
Universal Protein Resource (UniProt) is a
comprehensive and non-redundant resource for
protein sequence and annotation data
The UniProt databases are the UniProt
Knowledgebase (UniProtKB), the UniProt
Reference Clusters (UniRef), and the UniProt
Archive (UniParc)
UniProt Metagenomic and Environmental
Sequences (UniMES) database is a repository
specifically developed for metagenomic and
environmental data
http://www.uniprot.org/
4. Background of UniProtKB
• UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI),
the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)
• EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR produced
the Protein Sequence Database (PIR-PSD)
• Translated EMBL Nucleotide Sequence Data Library (TrEMBL) was originally created
because sequence data was being generated at a pace that exceeded Swiss-Prot's ability
to keep up
• PIR maintained the PIR-PSD and related databases, including iProClass, a database of
protein sequences and curated families
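The Swiss-Prot/TrEMBL split is visible in UniProtKB FASTA headers, where the prefix `sp` marks a Swiss-Prot entry and `tr` a TrEMBL entry. The sketch below parses such a header; the function name and returned keys are illustrative choices, not part of any UniProt tool.

```python
import re

# >db|accession|entry_name description OS=organism ...
HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")


def parse_uniprot_header(line: str) -> dict[str, str]:
    """Split a UniProtKB FASTA header into section, accession, entry name and description."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("not a UniProtKB FASTA header: " + line)
    db, accession, entry_name, rest = match.groups()
    return {
        "section": "Swiss-Prot" if db == "sp" else "TrEMBL",
        "accession": accession,
        "entry_name": entry_name,
        # The free-text protein name ends where the OS= (organism) field starts.
        "description": rest.split(" OS=")[0],
    }


header = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
print(parse_uniprot_header(header))
```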
7. NBRF/PIR
The Protein Information Resource (PIR) is an integrated bioinformatics resource for
genomic, proteomic and systems biology research and scientific studies, established by
the National Biomedical Research Foundation (NBRF). PIR offers a wide variety of
resources mainly oriented to assist the propagation and standardization of protein
annotation:
PRO – Protein related ontology
iProClass – Integrated protein knowledgebase
iProLINK – Literature information and knowledgebase
iPTMnet – Integrated protein post-translational modification resource
iProXpress – Integrated protein expression analysis system
RESID Database – Comprehensive collection of annotations and structures for protein
modifications
http://pir.georgetown.edu/
10. MIPS
• Munich Information Center for Protein Sequences (MIPS) is a research center hosted by
Institute of Bioinformatics and Systems Biology (IBIS) and it is part of the Helmholtz
Research Center for Environmental Health, Germany
• MIPS focuses on the systematic analysis of genome information, including the
development and application of bioinformatics methods in genome annotation, gene
expression analysis and proteomics
• MIPS supports and maintains a set of generic databases as well as the systematic
comparative analysis of microbial, fungal, and plant genomes
• MIPS offers databases, web services, and platforms for genomics, proteins,
metabolomics and multi-omics integration, chemical screening, and disease annotation
HOME PAGE: https://www.helmholtz-muenchen.de/ibis/
PPI: http://mips.helmholtz-muenchen.de/proj/ppi/
14. PROSITE
• PROSITE is a protein domain database for functional characterization and annotation.
• PROSITE consists of entries describing the protein families, domains and functional sites as
well as amino acid patterns and profiles in them.
• PROSITE is manually curated by a team of the Swiss Institute of Bioinformatics and tightly
integrated into Swiss-Prot protein annotation.
• PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns.
• The rules contain information about biologically meaningful residues, like active sites,
substrate- or co-factor-binding sites, posttranslational modification sites or disulfide bonds,
to help function determination.
http://prosite.expasy.org/
18. PRINTS
• PRINTS database is a collection of protein motif fingerprints
• Fingerprint is a group of conserved motifs used to characterize a protein family
• Motifs do not overlap, but are separated along a sequence, though they may be
contiguous in 3D-space to define molecular binding sites or interaction surfaces
• Fingerprints can encode protein folds and functionalities more flexibly and powerfully
than can single motifs
• PRINTS provides detailed annotation resource for protein families, and a diagnostic
tool for newly determined sequences
• PRINTS is a founding partner of the integrated resource, InterPro, a widely used
database of protein families, domains and functional sites
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/
http://130.88.97.239/PRINTS/
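The fingerprint idea (several non-overlapping motifs that must all occur, in order, along one sequence) can be sketched as follows. This is a simplification under stated assumptions: motifs are written as fixed-length regular expressions, whereas PRINTS itself scores motifs derived from alignments. The function name, motifs, and sequence are hypothetical.

```python
import re
from typing import Optional


def match_fingerprint(sequence: str, motifs: list[str]) -> Optional[list[tuple[int, int]]]:
    """Return the (start, end) span of each motif if all occur in order without overlap.

    Each search starts where the previous hit ended, so hits cannot overlap.
    Taking the leftmost hit is sufficient for fixed-length motifs (assumed here).
    """
    spans = []
    position = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, position)
        if hit is None:
            return None
        spans.append((hit.start(), hit.end()))
        position = hit.end()
    return spans


sequence = "MKGAGKSTLLAAADEADRLLDG"   # toy sequence
fingerprint = ["G.GKST", "DEAD"]      # toy two-motif fingerprint
print(match_fingerprint(sequence, fingerprint))
print(match_fingerprint(sequence, list(reversed(fingerprint))))
```

The second call fails because the motifs are present but not in the required order, which is what lets a fingerprint discriminate better than any single motif.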
23. BLOCKS
• BLOCKS Database is based on InterPro entries with sequences from Swiss-Prot and
TrEMBL
• Blocks are multiple aligned ungapped segments corresponding to the most highly
conserved regions of proteins
• BLOCKS cross-references to PROSITE and/or PRINTS and/or SMART, and/or Pfam
and/or ProDom entries.
• BLOCKS Database was constructed by the PROTOMAT system using the MOTIF
algorithm
http://blocks.fhcrc.org/
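To illustrate what "multiple aligned ungapped segments" means, the sketch below scans a toy alignment for runs of columns that contain no gap in any sequence. It is not the PROTOMAT or MOTIF procedure, which also assess conservation; the function name and the `min_width` parameter are assumptions made for this example.

```python
def ungapped_blocks(msa: list[str], min_width: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) column ranges (end exclusive) that are gap-free in every sequence."""
    blocks = []
    start = None
    length = len(msa[0])
    # Iterate one position past the end so a block reaching the last column is closed.
    for i in range(length + 1):
        gap_free = i < length and all(seq[i] != "-" for seq in msa)
        if gap_free and start is None:
            start = i
        elif not gap_free and start is not None:
            if i - start >= min_width:
                blocks.append((start, i))
            start = None
    return blocks


msa = ["ACDE-FGHIK", "ACDEQFG-IK", "ACD--FGHIK"]  # toy alignment
for start, end in ungapped_blocks(msa, min_width=2):
    print(start, end, [seq[start:end] for seq in msa])
```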
27. Pfam
• The Pfam database is a large collection of protein families, each represented by multiple
sequence alignments and hidden Markov models (HMMs).
• Pfam version 31.0 was produced at the EBI using a sequence database called Pfamseq,
which is based on UniProtKB.
• Pfam 31.0 has 16,712 families
• The descriptions of Pfam families are managed by the general public using Wikipedia.
• The Pfam database contains information about protein domains and families.
• Pfam-A is the manually curated portion of the database
• Pfam-B contains a large number of small families generated automatically from clusters
produced by an algorithm called ADDA.
• Pfam-B families are of lower quality, but can be useful when no Pfam-A families are found.
http://pfam.xfam.org/
28. Classification of Pfam Entries
• Family - A collection of related protein regions
• Domain - A structural unit
• Repeat - A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
• Motifs - A short unit found outside globular domains
• Coiled-Coil - Regions that predominantly contain coiled-coil motifs, which typically
consist of alpha-helices coiled together in bundles of 2-7.
• Disordered - Regions that are conserved, yet are either shown or predicted to contain
biased sequence composition and/or are intrinsically disordered (non-globular).
• Clans - A collection of families that have arisen from a single evolutionary origin
• Related Pfam entries are grouped together into clans; the relationship may be defined
by similarity of sequence, structure or profile-HMM.
32. NRDB/NRDB90
• NRDB (Non-Redundant DataBase) is a so-called non-redundant composite of the following
sources: PDB, RefSeq, UniProtKB/Swiss-Prot, DDBJ, EMBL, GenBank, and PIR
• NRDB is similar in content to OWL, but contains non-redundant and more up-to-date
information
• NRDB is not strictly non-redundant, but non-identical - i.e., only identical sequence copies are
removed from the database
• The NRDB algorithm was written by Warren Gish at Washington University and is used in constructing
the database called NRDB90
• NRDB90 contains only sequences which do not have homologues with sequence identity of 90% or
more
• NRDB is currently maintained by NCBI
http://www.ebi.ac.uk/~holm/nrdb90/ [MOVED]
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
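The difference between non-identical filtering and a 90% identity cut-off can be sketched as below. This is an illustration only: identity is computed position by position without alignment, a stand-in for the real sequence comparison, and the function names and records are hypothetical.

```python
def remove_identical(records: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first record for each distinct sequence string (non-identical set)."""
    seen = set()
    kept = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept


def identity(a: str, b: str) -> float:
    """Fraction of identical positions over the shorter length (no alignment, assumed)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def filter_by_identity(records: list[tuple[str, str]], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Greedy filter: keep a sequence only if it is below the threshold to every kept one."""
    kept = []
    # Longest first, so the longer member of a redundant group is the one retained.
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if all(identity(seq, other) < threshold for _, other in kept):
            kept.append((name, seq))
    return kept


records = [
    ("a", "ACDEFGHIKL"),
    ("b", "ACDEFGHIKL"),  # exact copy of a
    ("c", "ACDEFGHIKV"),  # 90% identical to a
    ("d", "WWWWWYYYYY"),
]
non_identical = remove_identical(records)
print([name for name, _ in non_identical])
print([name for name, _ in filter_by_identity(non_identical)])
```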
35. OWL
• OWL is a non-redundant composite of 4 publicly available primary sources:
Swiss-Prot, PIR, GenBank (translation) and NRL-3D
• Swiss-Prot is the highest priority source, all others being compared against it to
eliminate identical and trivially-different sequences
• The strict redundancy criteria render OWL relatively “small” and hence efficient in
similarity searches
http://www.bioinf.man.ac.uk/dbbrowser/OWL
http://130.88.97.239/OWL/
39. PDB
• The Protein Data Bank (PDB) archive is the single worldwide repository of information
about the 3D structures of large biological molecules, including proteins and nucleic
acids.
• The PDB was established in 1971 at Brookhaven National Laboratory (BNL) under the
leadership of Walter Hamilton and originally contained 7 structures.
• In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became
responsible for the management of the PDB.
• In 2003, the wwPDB was formed to maintain a single PDB archive of macromolecular
structural data that is freely and publicly available to the global community.
• The RCSB PDB supports a website where visitors can perform simple and complex queries
on the data, analyze, and visualize the results.
• Members of the wwPDB are RCSB PDB (USA), PDBe (Europe), PDBj (Japan), and the
Biological Magnetic Resonance Data Bank, BMRB (USA).
http://rcsb.org/pdb/
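Each atom in a legacy PDB-format file is stored as a fixed-width ATOM (or HETATM) record, with the x, y, z coordinates in columns 31-38, 39-46 and 47-54. A minimal sketch of reading one such record follows; the function name and dictionary keys are illustrative choices, and the example line is a hand-made record rather than one copied from a specific entry.

```python
def parse_atom_line(line: str) -> dict:
    """Extract identifiers and x, y, z coordinates from one ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    # Python slices are 0-based and end-exclusive, so columns 31-38 become [30:38].
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


example = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
atom = parse_atom_line(example)
print(atom["atom_name"], atom["residue_name"], atom["x"], atom["y"], atom["z"])
```

For production work a maintained parser is the safer choice, since real files also contain alternate locations, insertion codes, and multiple models.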
42. SCOP2
• The SCOP (Structural Classification of Proteins) database is a large manual classification of protein
structural domains based on similarities of their structures and amino acid sequences.
• A motivation for this classification is to determine the evolutionary relationship between proteins.
• Proteins with the same shapes but having little sequence or functional similarity are placed in
different “superfamilies”, and are assumed to have only a very distant common ancestor.
• Proteins having the same shape and some similarity of sequence and/or function are placed in
“families”, and are assumed to have a closer common ancestor.
• The original SCOP has been discontinued; its last official version is 1.75. SCOP2 is its
successor.
• SCOP2 offers two different ways for accessing data: SCOP2-browser, and SCOP2-graph.
• SCOP2-browser allows navigation in a traditional way by browsing pages displaying the node
information.
• SCOP2-graph is a graph-based web tool for display and navigation.
• The source of protein structures is the Protein Data Bank.
http://scop2.mrc-lmb.cam.ac.uk/
43. Classification of SCOP Entries
• The unit of classification of structure in SCOP is the protein domain.
• The levels of SCOP are as follows.
1. Class: Types of folds, e.g., all α, all β, α/β, α+β, α&β, etc.
2. Fold: The different shapes of domains within a class, e.g., 2 helices; antiparallel hairpin,
left-handed twist, etc.
3. Superfamily: The domains in a fold are grouped into superfamilies, which have at least
a distant common ancestor.
4. Family: The domains in a superfamily are grouped into families, which have a more recent
common ancestor.
5. Protein domain: The domains in families are grouped into protein domains, which are
essentially the same protein.
6. Species: The domains in “protein domains” are grouped according to species.
7. Domain: It is part of a protein. For simple proteins, it can be the entire protein.
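The top four levels listed above are encoded in the concise classification string (sccs) used by classic SCOP releases, for example `a.1.1.2` for class a, fold 1, superfamily 1, family 2. The sketch below splits such a string into its levels; the function name is hypothetical and the sccs convention is taken from SCOP 1.x, not from this slide.

```python
SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs: str) -> dict[str, str]:
    """Map each SCOP level to the prefix of the sccs string that identifies it."""
    parts = sccs.split(".")
    if len(parts) != len(SCOP_LEVELS):
        raise ValueError("expected class.fold.superfamily.family, got " + sccs)
    return {
        level: ".".join(parts[: depth + 1])
        for depth, level in enumerate(SCOP_LEVELS)
    }


print(parse_sccs("a.1.1.2"))
```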
47. CATH
• CATH (Class, Architecture, Topology, and Homologous superfamily) is a
semi-automatic, hierarchical classification of protein domains.
• CATH shares many broad features with its principal rival, SCOP.
• The four main levels of the CATH hierarchy are as follows:
1. Class: the overall secondary-structure content of the domain, e.g., mainly α, mainly β,
mixed α-β, or few secondary structures.
2. Architecture: the overall shape of the domain, determined by the orientation of its
secondary structures but ignoring their connectivity.
3. Topology: domains that share both the overall shape and the connectivity of their
secondary structures. Equivalent to a fold in SCOP.
4. Homologous superfamily: indicative of a demonstrable evolutionary relationship.
Equivalent to the superfamily level of SCOP.
http://www.cathdb.info/
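CATH identifiers are dotted numbers with one field per level (Class.Architecture.Topology.Homologous superfamily), so comparing two identifiers field by field shows how far two domains travel together down the hierarchy. A minimal sketch, with a hypothetical function name and illustrative identifiers:

```python
from typing import Optional

CATH_LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def deepest_shared_level(id_a: str, id_b: str) -> Optional[str]:
    """Return the deepest CATH level at which two identifiers still agree, or None."""
    shared = None
    for level, a, b in zip(CATH_LEVELS, id_a.split("."), id_b.split(".")):
        if a != b:
            break
        shared = level
    return shared


print(deepest_shared_level("3.40.50.300", "3.40.50.720"))
print(deepest_shared_level("3.40.50.300", "3.30.70.100"))
print(deepest_shared_level("3.40.50.300", "1.10.490.10"))
```

Two domains that agree down to Topology but not Homologous superfamily share a fold without demonstrated common ancestry, which is the distinction the hierarchy is built to express.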
52. NDB
Nucleic Acid Database (NDB) is a repository of 3D nucleic acid structures and their complexes
Structures available in the NDB include RNA and DNA oligonucleotides with two or more bases
either alone or complexed with proteins or small molecule ligands
NDB contains both primary and derived information about the structures
• Primary information includes X-ray crystallography or NMR coordinate data
• Derived information includes valence geometry, torsion angles and intermolecular contacts
data
NDB offers a variety of online and offline tools for analyzing nucleic acid structures. The featured
tools include
• RNA 3D Motif Atlas, a representative collection of RNA 3D internal and hairpin loop motifs
• Non-redundant Lists of RNA-containing 3D structures
• RNA Base Triple Atlas, a collection of motifs consisting of two RNA basepairs
• WebFR3D, a webserver for symbolic and geometric searching of RNA 3D structures
• R3D Align, an application for detailed nucleotide to nucleotide alignments of RNA 3D
structures
http://ndbserver.rutgers.edu
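Torsion angles are an example of derived information: each one is computed from the coordinates of four consecutive atoms. The sketch below uses the standard atan2 formulation in plain Python; it illustrates the definition and is not NDB's own software. Function names and test points are assumptions.

```python
import math

Point = tuple[float, float, float]


def _sub(a: Point, b: Point) -> Point:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Point, b: Point) -> Point:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _dot(a: Point, b: Point) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def torsion_angle(p1: Point, p2: Point, p3: Point, p4: Point) -> float:
    """Torsion angle in degrees about the p2-p3 bond, in the range -180 to 180."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal to the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal to the plane through p2, p3, p4
    b2_length = math.sqrt(_dot(b2, b2))
    b2_unit = (b2[0] / b2_length, b2[1] / b2_length, b2[2] / b2_length)
    x = _dot(n1, n2)
    y = _dot(b2_unit, _cross(n1, n2))
    return math.degrees(math.atan2(y, x))


# Four points forming a right-angle twist, then a fully extended (trans) arrangement.
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
```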
55. PQS/PDBePISA/PISA
PISA (Proteins, Interfaces, Structures, and Assemblies), formerly known as PQS (Protein
Quaternary Structure) database, was constructed by EMBL-EBI
PISA is an interactive tool for the exploration of macromolecular interfaces
PISA presents results calculated by certain physico-chemical models for PDB and/or uploaded
macromolecular structures
PISA provides probable quaternary structures (assemblies), their structural and chemical
properties and probable dissociation pattern
http://www.ebi.ac.uk/pdbe/pisa/
60. SYSTERS
SYSTERS (SYSTEmatic Re-Searching) is a collection of graph-based algorithms to hierarchically
partition a large set of protein sequences into homologous families and super-families
SYSTERS is based on an all-against-all database search (using Smith-Waterman comparisons
on a GeneMatcher machine)
The resulting set of protein families contains four different types of clusters, distinguished by the
connectivity within their family distance graph and listed in order of decreasing reliability:
• Perfect Clusters (P): all sequences are connected to all other sequences in the cluster
• Single Sequence Cluster (S): a special case of perfect cluster
• Nested Clusters (N): at least one sequence is connected to all other sequences in the cluster
• Overlapping Clusters (O): no sequence is connected to all other sequences in the cluster
http://systers.molgen.mpg.de/ [DISCONTINUED]
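The four cluster types follow directly from the connectivity rules stated above, which can be sketched as a small classifier. The function name and the representation of a cluster as a member list plus undirected edges are assumptions for illustration; this is not the SYSTERS implementation.

```python
def classify_cluster(members: list[str], edges: list[tuple[str, str]]) -> str:
    """Return 'S', 'P', 'N' or 'O' from the connectivity of a cluster's graph.

    Edges are undirected pairs and are assumed to join members of this cluster only.
    """
    if len(members) == 1:
        return "S"  # single sequence: special case of a perfect cluster
    neighbours = {m: set() for m in members}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    others = len(members) - 1
    fully_connected = [m for m in members if len(neighbours[m] - {m}) == others]
    if len(fully_connected) == len(members):
        return "P"  # every sequence is connected to every other
    if fully_connected:
        return "N"  # at least one sequence reaches all others
    return "O"      # no sequence reaches all others


print(classify_cluster(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]))
print(classify_cluster(["a", "b", "c"], [("a", "b"), ("a", "c")]))
print(classify_cluster(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]))
```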
64. Motif
• Motif is a search service provided by GenomeNet to search with a protein query sequence
against Motif Libraries
• Supports several motif databases such as Prosite, BLOCKS, ProDom, Pfam, and PRINTS
• Allows you to search protein sequence libraries with your patterns
• Each residue must be separated with - (minus sign)
• x represents any amino acid
• [DE] means either D or E
• {FWY} means any amino acid except F, W and Y
• A(2,3) means that A appears 2 to 3 times consecutively
• The pattern string must be terminated with . (period)
For example, C-x-{C}-[DN]-x(2)-C-x(5)-C-C.
• Generates a profile from a set of multiple aligned sequences using PFMake or HMMBuild
http://www.genome.jp/tools/motif/
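The pattern rules listed above map almost one-to-one onto regular-expression syntax. The sketch below translates a pattern into a Python regular expression and searches a toy sequence with it. It covers only the elements named on this slide (residues, x, [..], {..}, repeat counts, the final period); the function name and test sequence are hypothetical.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden}, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into an equivalent Python regular expression."""
    pattern = pattern.strip()
    if pattern.endswith("."):
        pattern = pattern[:-1]  # drop the terminating period
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = match.groups()
        if residue == "x":
            regex = "."                            # any amino acid
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"     # any amino acid except these
        else:
            regex = residue                        # single residue or [..] class
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


regex = pattern_to_regex("C-x-{C}-[DN]-x(2)-C-x(5)-C-C.")
print(regex)
hit = re.search(regex, "AACAGDKKCAAAAACCA")
print(hit.start(), hit.group())
```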