COMSATS UNIVERSITY Islamabad
SAHIWAL Campus, Punjab, Pakistan
• Sections of Database
• Importance of GenBank
• Protein databases have become a crucial part of modern biology. Huge
amounts of data for protein structures, functions, and particularly
sequences are being generated.
• These data cannot be handled without using computer databases.
Searching databases is often the first step in the study of a new protein.
• Without the prior knowledge obtained from such searches, known
information about the protein could be missed, or an experiment could be
• Comparison between proteins and protein classification provide
information about the relationship between proteins within a genome or
across different species, and hence offer much more information than can
be obtained by studying only an isolated protein.
• Thanks to the Human Genome Project and other sequencing efforts, new
sequences have been generated at a prodigious rate.
• These sequences provide a rich information source and are the core of the
revolutionary movement toward “large-scale biology.”
• The protein sequences can be computationally annotated from these genomic
sequences. Various databases contain protein sequences with different focuses.
• Most protein databases have interactive search engines so that users can specify
their needs and obtain the related information interactively.
• Many protein databases also allow submitters to deposit data, and database
servers can check the format of the data and provide immediate feedback.
Protein Sequence Databases
• Protein bioinformatics databases can be primarily classified as sequence
databases, 2D gel databases, 3D structure databases, chemistry
databases, enzyme and pathway databases, family and domain databases,
gene expression databases, genome annotation databases, organism
specific databases, phylogenomic databases, polymorphism and mutation
database , protein-protein interaction databases, proteomic databases,
PTM databases, ontologies, specialized protein databases, and other
Protein Sequence Databases
• Among all protein sequence databases, UniProt is the most widely used
one. It provides more annotations than any other sequence database with
a minimal level of redundancy through human input or integration with
other databases. UniProtKB has three components:
1. Protein knowledgebase, including Swiss-Prot (manually annotated and
reviewed) and TrEMBL (automatically annotated).
2. UniRef (sequence clusters for fast sequence similarity searches).
3. UniParc (sequence archive for keeping track of sequences and their
• In addition to Swiss-Prot and TrEMBL, UniProtKB includes information from Protein
Sequence Database (PSD) in the Protein Identification Resource, which builds a
complete and non-redundant database from a number of protein and nucleic acid
sequence databases together with bibliographic and annotated information.
• The National Center for Biotechnology Information (NCBI;
http://www.ncbi.nlm.nih.gov) also provides rich information and a number of
useful tools for protein sequences.
• It includes entries from the non-redundant GenBank translations, UniProt, PIR,
Protein Research Foundation (PRF) in Japan, and the Protein Data Bank (PDB).
• UniProt, as a curated protein sequence database, offers a portal to
a wide range of annotations, covering areas such as function,
family, domain parsing, post-translational modifications, and
variants. UniProt can be accessed at http://www.uniprot.org.
• Human vitronectin is used here as an example for searching protein
sequence databases. To locate the UniProt entry for this protein,
one can search either the entry name (VTNC_HUMAN) or the
accession number (P04004) obtained from a BLAST search.
• Each entry contains the following items shown in table format in the NiceProt View layout:
1. Name and origin
2. Protein attributes
3. General annotation
4. Ontologies (gene functions)
5. Binary protein-protein interactions
6. Sequence annotation (features)
8. References (literature citation)
9. Web resources
10. Cross-references (links to other databases)
11. Entry information, and
12. Relevant documents.
• The National Center for Biotechnology Information Reference Sequence (NCBI
RefSeq) database provides curated non-redundant sequences of genomic regions,
transcripts and proteins for taxonomically diverse organisms including Archaea,
Bacteria, Eukaryotes, and Viruses.
• RefSeq database is derived from the sequence data available in the redundant
archival database GenBank. RefSeq sequences include coding regions, conserved
domains, variations etc. and enhanced annotations such as publications, names,
symbols, aliases, Gene IDs, and database cross-references.
• The sequences and annotations are generated using a combined approach of
collaboration, automated prediction, and manual curation.
• The RefSeq records can be directly accessed from NCBI web sites by
search of the Nucleotide or Protein databases, BLAST searches
against selected databases and FTP downloads.
• RefSeq records are also available through indirect links from other
NCBI resources such as Gene, Genome, BioProject, dbSNP, ClinVar
and Map Viewer etc.
• In addition, RefSeq supports programmatic access through Entrez
PROTEIN STRUCTURAL DATABASES
• Searching structure databases is becoming more and more popular in molecular
• The three-dimensional structures of proteins not only define their biological
functions, but also hold a key in rational drug design.
• Traditionally, protein structures were solved at a low throughput mode.
• However, advances in new technologies, such as synchrotron radiation sources and
high-resolution nuclear magnetic resonance (NMR), accelerate the rate of protein
structure determination substantially.
• The only international repository for the processing and distribution of protein
structures is the PDB.
• The worldwide PDB (wwPDB, http://www.wwpdb.org) was established in 2003
as an international collaboration to maintain a single and publicly available
Protein Data Bank Archive (PDB Archive) of macro-molecular structural data.
• The wwPDB member includes Protein Data Bank in Europe (PDBe), Protein
Data Bank Japan (PDBj), Research Collaboratory for Structural Bioinformatics
Protein Data Bank (RCSB PDB), and Biological Magnetic Resonance Bank
• The “PDB Archive” is a collection of flat files in three different formats: the
legacy PDB format; the PDBx/mmCIF (http:// deposit.pdb.org/mmcif/) format;
and the Protein Data Bank Markup Language (PDBML) format.
• Each member site serves as a deposition, data processing and distribution site
for the PDB Archive, and each provides its own view of the primary data and a
variety of tools and resources.
Protein Family Databases
• Proteins can be classified according to their sequence, evolutionary,
structural, or functional relationships.
• A protein in the context of its family is much more informative than the
single protein itself.
• For example, residues conserved across the family often indicate special
• Two proteins classified in the same functional family may suggest that
they share similar structures, even when their sequences do not have
• There is no unique way to classify proteins into families. Boundaries between
different families may be subjective.
• The choice of classification system depends in part on the problem; in general, the
author suggests looking into classification systems from different databases and
• Three types of classification methods are widely adopted based upon the similarity
of sequence, structure, or function.
• Sequence-based methods are applicable to any proteins whose sequences are
known, while structure-based methods are limited to the proteins of known
structures, and function-based methods depend on the functions of proteins being
• Sequence- and structure-based classifications can be automated and are scalable
to high-throughput data, whereas function-based classification is typically carried
• Structure- and function-based methods are more reliable, while sequence-based
methods may result in a false positive result when sequence similarity is weak (i.e.,
two proteins are classified into one family by chance rather than by any biological
• In addition, since protein structure and function are better conserved than
sequence, two proteins having similar structures or similar functions may not be
identified through sequence-based methods
• InterPro is an integrated resource of predictive models or ‘signatures’ representing
protein domains, families, regions, repeats and sites from major protein signature
databases including CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom ,
PROSITE, SMART, SUPERFAMILY and TIGRFAMs.
• Each entry in the InterPro database is annotated with a descriptive abstract name and
cross references to the original data sources, as well as to specialized functional
• The search by sequence or domain architecture is provided by InterPro web site. The
InterPro signatures in XML format are available via anonymous FTP download.
• InterPro also provides a software package InterProScan that can be used locally to scan
protein sequences against InterPro’s signatures.
• Pfam is a database of protein families represented as multiple sequence alignments and
Hidden Markov Models (HMMs).
• Pfam entries can be classified as Family (related protein regions), Domain (protein
structural unit), Repeat (multiple short protein structural units), Motifs (short protein
structural unit outside global domains).
• Related Pfam entries are grouped into clans based on sequence, structure or profile-
• The Pfam database web site provides search interface for querying by sequence,
keyword, domain architecture, taxonomy, and browse interfaces for analyzing protein
sequences for Pfam matches and viewing Pfam annotations in domain architectures,
sequence alignments, interactions, species and protein structures in PDB.
• The PIRSF classification system provides comprehensive and non overlapping clustering
of UniProtKB sequences into a hierarchical order to reflect their evolutionary
relationships based on whole proteins rather than on the component domains.
• The PIRSF system classifies the protein sequences into families, whose members are
both homologous (evolved from a common ancestor) and homeomorphic (sharing full-
length sequence similarity and a common domain architecture).
• The PIRSF family classification results are expert-curated based on literature review and
integrative sequence and functional analysis.
• The current release of PIRSF contains 11,800 families, which cover 5,407,000 UniProtKB
• PROSITE is a database of documentation entries describing protein domains, families
and functional sites as well as associated patterns and profiles to identify them.
• The entries are derived from multiple alignments of homologous sequences and have
the advantage of identifying distant relationships between sequences.
• PROSITE includes a collection of ProRules based on profiles and patterns of functionally
and/or structurally critical amino acids that can be used to increase PROSITE’s
• The PROSITE web site provides keyword-based search and allows browsing by
documentation entry, ProRule description, taxonomic scope and number of positive
• The PRoteomics IDEntifications database (PRIDE) is a repository for mass-
spectrometry based proteomics data including identifications of proteins, peptides
and post-translational modifications that have been described in the scientific
literature, together with supporting mass spectra and related technical and
• PRIDE supports tandem MS (MS/MS) and Peptide Fingerprinting datasets with
search/analysis workflows originally analyzed by the submitters.
• PIRDE provides several services such as the Protein Identifier Cross-Reference
(PICR), the Ontology Lookup Service (OLS) and Database on Demand.
MEROPS (Metalloprotease) Database
• MEROPS is an integrated database of information about peptidases (also termed proteases,
proteinases and proteolytic enzymes) and the proteins that inhibit them.
• A homologous set of peptidases and protein inhibitors are grouped into peptidase and
• Protein species are grouped into family that contains statistically significant similarities in
amino acid sequence. Families are grouped into clans that contain related structures.
• Both family (sub-family) and clan can be browsed by index page with links to their summary
page. Each peptidase has a summary page that can be browsed by Name, Identifier, Gene
name, Organism and Substrates.
• The peptidase summary page includes information of Gene Structure, Alignment, Tree,
Sequences and their features, Distribution, Structure, Literature, Human EST, Mouse EST,
Substrates, Inhibitors and Pharma.
Importance of Databases
• Protein databases have become a crucial part of modern biology. Huge amounts of data for
protein structures, functions, and particularly sequences are being generated.
• Searching databases is often the first step in the study of a new protein. Comparison
between proteins or between protein families provides information about the relationship
between proteins within a genome or across different species, and hence offers much more
information than can be obtained by studying only an isolated protein.
• In addition, secondary databases derived from experimental databases are also
widely available. These databases reorganize and annotate the data or provide predictions.
The use of multiple databases often helps researchers understand the structure and function
of a protein.
• Although some protein databases are widely known, they are far from being fully utilized in
the protein science community.
Parece que tem um bloqueador de anúncios ativo. Ao listar o SlideShare no seu bloqueador de anúncios, está a apoiar a nossa comunidade de criadores de conteúdo.
Atualizámos a nossa política de privacidade.
Atualizámos a nossa política de privacidade de modo a estarmos em conformidade com os regulamentos de privacidade em constante mutação a nível mundial e para lhe fornecer uma visão sobre as formas limitadas de utilização dos seus dados.
Pode ler os detalhes abaixo. Ao aceitar, está a concordar com a política de privacidade atualizada.