LECTURE TOPIC: PROTEIN DATABASES
TOPICS COVERED: UniProtKB/Swiss-Prot/TrEMBL, PIR,
MIPS, PROSITE, PRINTS, BLOCKS,
Pfam, NRDB, OWL, PDB, SCOP,
CATH, NDB, PQS, SYSTERS, Motif
LECTURE BY: Ashok Kumar T
ashok@biogem.org
Computational Terms & Definitions
• Protein Sequence – A string over the 20 amino acid (AA) characters [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y]
• Protein Structure – 3D atomic coordinates [x, y, z]
• Types of Biological Databases – [Raw Database = Plain text, Object-oriented Database = Table (Records),
Relational Database = Table of tables]
• 3D Atom Model – [Sphere = Atom, Cylinder = Bond, Dotted Line = Bond Interaction]
• Sequence Alignment – [Match = Identical Character, Mismatch = Different Character, Gap = No Corresponding
Character, Word = Sub-string, Sequence = Super-string, Score = Rating, Identity = Proportion of aligned
positions with identical characters]
• Motif – Short, conserved sequence associated with a distinct function
• Domain – Evolutionarily conserved sequence region that corresponds to a structurally independent 3D
unit associated with a particular functional role. It is usually much larger than a motif
• Pattern – Sequence motif written as a symbolic expression. Example: N-{P}-[ST]-{P}-A(2,3).
• Regular Expression – Representation format for a sequence motif, which includes positional information
for conserved and partly conserved residues. Similar to Pattern, but derived from a multiple sequence
alignment (MSA)
• Profile – Scoring matrix that represents a multiple sequence alignment. It contains probability or
frequency values of residues for each aligned position in the alignment, including gaps
UniProtKB/Swiss-Prot/TrEMBL
• The Universal Protein Resource (UniProt) is a
comprehensive and non-redundant resource for
protein sequence and annotation data
• The UniProt databases are the UniProt
Knowledgebase (UniProtKB), the UniProt
Reference Clusters (UniRef), and the UniProt
Archive (UniParc)
• The UniProt Metagenomic and Environmental
Sequences (UniMES) database is a repository
specifically developed for metagenomic and
environmental data
http://www.uniprot.org/
Background of UniProtKB
• UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI),
the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)
• EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR produced
the Protein Sequence Database (PIR-PSD)
• Translated EMBL Nucleotide Sequence Data Library (TrEMBL) was originally created
because sequence data was being generated at a pace that exceeded Swiss-Prot's ability
to keep up
• PIR maintained the PIR-PSD and related databases, including iProClass, a database of
protein sequences and curated families
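The split between Swiss-Prot and TrEMBL is visible in UniProtKB FASTA headers, where the first field is `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below parses such a header; the class and function names are hypothetical, and only the leading `db|accession|entry_name` fields are interpreted.

```python
import re
from typing import NamedTuple

class UniProtHeader(NamedTuple):
    section: str      # "Swiss-Prot" or "TrEMBL"
    accession: str
    entry_name: str
    description: str  # remainder of the header line, left unparsed

_SECTIONS = {"sp": "Swiss-Prot", "tr": "TrEMBL"}
_HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")

def parse_header(line: str) -> UniProtHeader:
    """Split a UniProtKB FASTA header line into its leading fields."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    db, accession, entry_name, description = match.groups()
    return UniProtHeader(_SECTIONS[db], accession, entry_name, description)
```

For example, a header starting with `>sp|P69905|HBA_HUMAN` would be reported as a Swiss-Prot entry with accession `P69905`.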
UniProtKB Search Result
NBRF/PIR
The Protein Information Resource (PIR) is an integrated bioinformatics resource for
genomic, proteomic and systems biology research, established by
the National Biomedical Research Foundation (NBRF). PIR offers a wide variety of
resources, mainly intended to support the propagation and standardization of protein
annotation:
• PRO – Protein Ontology
• iProClass – Integrated protein knowledgebase
• iProLINK – Literature information and knowledgebase
• iPTMnet – Integrated protein post-translational modification resource
• iProXpress – Integrated protein expression analysis system
• RESID Database – Comprehensive collection of annotations and structures for protein
modifications
http://pir.georgetown.edu/
MIPS
• The Munich Information Center for Protein Sequences (MIPS) is a research center hosted by
the Institute of Bioinformatics and Systems Biology (IBIS), which is part of the Helmholtz
Research Center for Environmental Health, Germany
• MIPS focuses on the systematic analysis of genome information, including the
development and application of bioinformatics methods in genome annotation, gene
expression analysis and proteomics
• MIPS supports and maintains a set of generic databases as well as the systematic
comparative analysis of microbial, fungal, and plant genomes
• MIPS offers databases, web services, and platforms for genomics, proteins,
metabolomics and multi-omics integration, chemical screening, and disease annotation
HOME PAGE: https://www.helmholtz-muenchen.de/ibis/
PPI: http://mips.helmholtz-muenchen.de/proj/ppi/
PROSITE
• PROSITE is a protein domain database for functional characterization and annotation.
• PROSITE consists of entries describing protein families, domains and functional sites, as
well as the amino acid patterns and profiles that identify them.
• PROSITE is manually curated by a team of the Swiss Institute of Bioinformatics and tightly
integrated into Swiss-Prot protein annotation.
• PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns.
• The rules contain information about biologically meaningful residues, like active sites,
substrate- or co-factor-binding sites, posttranslational modification sites or disulfide bonds,
to help function determination.
http://prosite.expasy.org/
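As a rough illustration of how pattern hits can be located in a sequence, the sketch below scans for the PROSITE-style pattern N-{P}-[ST]-{P}, hand-translated into a Python regular expression. The function name and the reporting format (1-based inclusive positions) are assumptions made for this example, not PROSITE's own output format.

```python
import re

# Hand-translated regex for the PROSITE-style pattern N-{P}-[ST]-{P}.
# Wrapping it in a lookahead lets overlapping hits be reported as well.
PATTERN = re.compile(r"(?=(N[^P][ST][^P]))")

def find_hits(sequence: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for every hit, using 1-based inclusive positions."""
    hits = []
    for match in PATTERN.finditer(sequence):
        text = match.group(1)
        start = match.start() + 1
        hits.append((start, start + len(text) - 1, text))
    return hits
```

Short patterns like this one match by chance quite often, so a hit is a candidate site rather than evidence of function.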
Result of PROSITE for Matching Pattern Hits
PRINTS
• The PRINTS database is a collection of protein motif fingerprints
• A fingerprint is a group of conserved motifs used to characterize a protein family
• The motifs do not overlap, but are separated along a sequence, though they may be
contiguous in 3D-space to define molecular binding sites or interaction surfaces
• Fingerprints can encode protein folds and functionalities more flexibly and powerfully
than single motifs can
• PRINTS provides a detailed annotation resource for protein families, and a diagnostic
tool for newly determined sequences
• PRINTS is a founding partner of the integrated resource, InterPro, a widely used
database of protein families, domains and functional sites
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/
http://130.88.97.239/PRINTS/
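The fingerprint idea (several non-overlapping motifs that occur in order along a sequence) can be sketched as follows. The motifs are given as regular expressions and are made-up examples, not PRINTS entries; the left-to-right search is a simplification and not the scoring method PRINTS itself uses.

```python
import re
from typing import Optional

def match_fingerprint(sequence: str, motifs: list[str]) -> list[Optional[tuple[int, int]]]:
    """Locate each motif in order, never overlapping an earlier hit.

    Returns one entry per motif: a half-open (start, end) span, or None when
    the motif is not found after the previous hit (a partial fingerprint).
    """
    spans: list[Optional[tuple[int, int]]] = []
    position = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, position)
        if hit is None:
            spans.append(None)
        else:
            spans.append((hit.start(), hit.end()))
            position = hit.end()  # the next motif must start after this one
    return spans
```

Counting the non-`None` entries gives a crude measure of how much of the fingerprint a sequence carries, which is why a fingerprint can still flag a family member when one motif is missing.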
BLOCKS
• The BLOCKS database is based on InterPro entries with sequences from Swiss-Prot and
TrEMBL
• Blocks are multiply aligned ungapped segments corresponding to the most highly
conserved regions of proteins
• BLOCKS entries cross-reference PROSITE, PRINTS, SMART, Pfam
and/or ProDom entries.
• The BLOCKS database was constructed by the PROTOMAT system using the MOTIF
algorithm
http://blocks.fhcrc.org/
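To make "ungapped segments" concrete, the sketch below finds runs of alignment columns in which no sequence has a gap. It is only an illustration of the ungapped property: it is not the PROTOMAT system or the MOTIF algorithm, it ignores conservation, and the function name and `min_width` parameter are assumptions.

```python
def ungapped_blocks(alignment: list[str], min_width: int = 3) -> list[tuple[int, int]]:
    """Return half-open column ranges (start, end) in which no row contains a gap."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    n_columns = len(alignment[0])
    blocks = []
    start = None
    # Iterate one column past the end so a run touching the last column is closed.
    for column in range(n_columns + 1):
        gap_free = column < n_columns and all(row[column] != "-" for row in alignment)
        if gap_free and start is None:
            start = column
        elif not gap_free and start is not None:
            if column - start >= min_width:
                blocks.append((start, column))
            start = None
    return blocks
```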
Pfam
• The Pfam database is a large collection of protein families, each represented by multiple
sequence alignments and hidden Markov models (HMMs).
• Pfam version 31.0 was produced at the EBI using a sequence database called Pfamseq,
which is based on UniProtKB.
• Pfam 31.0 has 16,712 families
• The descriptions of Pfam families are managed by the general public using Wikipedia.
• The Pfam database contains information about protein domains and families.
• Pfam-A is the manually curated portion of the database
• Pfam-B contains a large number of small families derived automatically from clusters produced by an
algorithm called ADDA (Automatic Domain Decomposition Algorithm).
• Pfam-B families are of lower quality, but can be useful when no Pfam-A families are found.
http://pfam.xfam.org/
Classification of Pfam Entries
• Family - A collection of related protein regions
• Domain - A structural unit
• Repeat - A short unit which is unstable in isolation but forms a stable structure when
multiple copies are present
• Motif - A short unit found outside globular domains
• Coiled-Coil - Regions that predominantly contain coiled-coil motifs, regions that typically
contain alpha-helices that are coiled together in bundles of 2-7.
• Disordered - Regions that are conserved, yet are either shown or predicted to contain
biased sequence composition and/or to be intrinsically disordered (non-globular).
• Clans - A collection of families that have arisen from a single evolutionary origin
• Related Pfam entries are grouped together into clans; the relationship may be defined
by similarity of sequence, structure or profile-HMM.
NRDB/NRDB90
• NRDB (Non-Redundant DataBase) is a composite of the following
sources: PDB, RefSeq, UniProtKB/Swiss-Prot, DDBJ, EMBL, GenBank, and PIR
• NRDB is similar in content to OWL, but contains more up-to-date
information
• Strictly speaking, NRDB is non-identical rather than non-redundant, i.e., only identical
sequence copies are removed from the database
• The NRDB algorithm was written by Warren Gish at Washington University and is used to construct
the database called NRDB90
• NRDB90 contains only sequences that have no homologues with sequence identity of 90% or
more
• NRDB is currently maintained by NCBI
http://www.ebi.ac.uk/~holm/nrdb90/ [MOVED]
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
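The "non-identical" behaviour can be sketched as a merge of records that share exactly the same sequence. This is an illustration only, not Warren Gish's program; the function name and the record identifiers used with it are hypothetical.

```python
def merge_identical(records: list[tuple[str, str]]) -> list[tuple[list[str], str]]:
    """Collapse records with identical sequences, keeping every source identifier.

    records holds (identifier, sequence) pairs; the result keeps first-seen order.
    """
    merged: dict[str, list[str]] = {}
    for identifier, sequence in records:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return [(identifiers, sequence) for sequence, identifiers in merged.items()]
```

Two sequences that differ by a single residue both survive this step, which is the sense in which such a database is non-identical rather than non-redundant.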
OWL
• OWL is a non-redundant composite of 4 publicly-available primary sources: Swiss-Prot,
PIR, GenBank (translation) and NRL-3D
• Swiss-Prot is the highest priority source, all others being compared against it to
eliminate identical and trivially-different sequences
• The strict redundancy criteria render OWL relatively “small” and hence efficient in
similarity searches
http://www.bioinf.man.ac.uk/dbbrowser/OWL
http://130.88.97.239/OWL/
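The priority scheme can be sketched as a merge in which sources are visited from highest to lowest priority and a sequence is dropped if something identical or trivially different has already been kept. The definition of "trivially different" used here (same length, at most `max_diff` mismatches) is an assumption made for the example, not OWL's actual criterion.

```python
def build_composite(
    sources: list[tuple[str, list[str]]], max_diff: int = 1
) -> list[tuple[str, str]]:
    """Merge sources given in priority order; return the (source, sequence) pairs kept."""
    kept: list[tuple[str, str]] = []
    for source_name, sequences in sources:
        for sequence in sequences:
            redundant = any(
                len(sequence) == len(other)
                and sum(a != b for a, b in zip(sequence, other)) <= max_diff
                for _, other in kept
            )
            if not redundant:
                kept.append((source_name, sequence))
    return kept
```

Because the highest-priority source is processed first, its version of a sequence is always the one retained.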
PDB
• The Protein Data Bank (PDB) archive is the single worldwide repository of information
about the 3D structures of large biological molecules, including proteins and nucleic
acids.
• The PDB was established in 1971 at Brookhaven National Laboratory (BNL) under the
leadership of Walter Hamilton and originally contained 7 structures.
• In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became
responsible for the management of the PDB.
• In 2003, the wwPDB was formed to maintain a single PDB archive of macromolecular
structural data that is freely and publicly available to the global community.
• The RCSB PDB supports a website where visitors can perform simple and complex queries
on the data, analyze, and visualize the results.
• The members of wwPDB are RCSB PDB (USA), PDBe (Europe), PDBj (Japan), and the
Biological Magnetic Resonance Data Bank, BMRB (USA).
http://rcsb.org/pdb/
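In the legacy PDB text format, each atom is stored on a fixed-column `ATOM` or `HETATM` line, with the x, y and z coordinates in columns 31-38, 39-46 and 47-54. The sketch below reads those columns and measures the distance between two atoms; the class and function names are hypothetical, and the newer mmCIF format is not handled.

```python
import math
from typing import Iterable, NamedTuple

class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float

def parse_atoms(lines: Iterable[str]) -> list[Atom]:
    """Extract atoms from fixed-column ATOM/HETATM records (0-based slices below)."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append(Atom(
                name=line[12:16].strip(),        # columns 13-16
                residue=line[17:20].strip(),     # columns 18-20
                chain=line[21],                  # column 22
                residue_number=int(line[22:26]), # columns 23-26
                x=float(line[30:38]),            # columns 31-38
                y=float(line[38:46]),            # columns 39-46
                z=float(line[46:54]),            # columns 47-54
            ))
    return atoms

def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the file (angstroms)."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in these records can run together without a separating space.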
SCOP2
• The SCOP (Structural Classification of Proteins) database is a large manual classification of protein
structural domains based on similarities of their structures and amino acid sequences.
• A motivation for this classification is to determine the evolutionary relationship between proteins.
• Proteins with the same shapes but having little sequence or functional similarity are placed in
different “superfamilies”, and are assumed to have only a very distant common ancestor.
• Proteins having the same shape and some similarity of sequence and/or function are placed in
“families”, and are assumed to have a closer common ancestor.
• The original SCOP has been discontinued; its last official version is 1.75. SCOP2 is its
successor.
• SCOP2 offers two different ways for accessing data: SCOP2-browser, and SCOP2-graph.
• SCOP2-browser allows navigation in a traditional way by browsing pages displaying the node
information.
• SCOP2-graph is a graph-based web tool for display and navigation.
• The source of protein structures is the Protein Data Bank.
http://scop2.mrc-lmb.cam.ac.uk/
Classification of SCOP Entries
• The unit of classification of structure in SCOP is the protein domain.
• The levels of SCOP are as follows.
1. Class: Types of folds, e.g., all α, all β, α/β, α+β, α&β, etc.
2. Fold: The different shapes of domains within a class, e.g., 2 helices; antiparallel hairpin,
left-handed twist, etc.
3. Superfamily: The domains in a fold are grouped into superfamilies, which have at least
a distant common ancestor.
4. Family: The domains in a superfamily are grouped into families, which have a recent
common ancestor.
5. Protein domain: The domains in families are grouped into protein domains, which are
essentially the same protein.
6. Species: The domains in “protein domains” are grouped according to species.
7. Domain: It is part of a protein. For simple proteins, it can be the entire protein.
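SCOP summarizes the first four levels of this hierarchy in a concise classification string (sccs) such as `a.1.1.2`, read as class, fold, superfamily and family. The sketch below parses such strings and reports the deepest level two domains share; the function names are hypothetical and the strings used with it are illustrative.

```python
from typing import Optional

LEVELS = ("class", "fold", "superfamily", "family")

def parse_sccs(sccs: str) -> dict[str, str]:
    """Map each level name to the sccs prefix that identifies it."""
    parts = sccs.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    return {level: ".".join(parts[: i + 1]) for i, level in enumerate(LEVELS)}

def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Name of the most specific level two domains have in common, or None."""
    levels_a, levels_b = parse_sccs(a), parse_sccs(b)
    shared = None
    for level in LEVELS:
        if levels_a[level] != levels_b[level]:
            break
        shared = level
    return shared
```

Two domains that share only a fold are placed in different superfamilies, so by the definitions above a common ancestor is not assumed for them.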
Hierarchical structure of SCOP
Output of SCOP
CATH
• CATH (Class, Architecture, Topology, and Homologous superfamily) is a semi-automatic,
hierarchical classification of protein domains.
• CATH shares many broad features with its principal rival, SCOP.
• The four main levels of the CATH hierarchy are as follows:
1. Class: the overall secondary-structure content of the domain, e.g., mainly α, mainly β,
mixed α–β, or few secondary structures.
2. Architecture: a large-scale grouping of topologies which share particular structural
features
3. Topology: high structural similarity but no evidence of homology. Equivalent to
a fold in SCOP.
4. Homologous superfamily: indicative of a demonstrable evolutionary relationship.
Equivalent to the superfamily level of SCOP.
http://www.cathdb.info/
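CATH identifies a superfamily with a four-part numeric code, one number per level, for example `3.40.50.300`. The sketch below splits such codes into their levels and groups domains at a chosen level; the function names and the domain identifiers used with it are hypothetical.

```python
from collections import defaultdict

LEVEL_NAMES = ("class", "architecture", "topology", "homologous_superfamily")

def cath_levels(code: str) -> dict[str, str]:
    """Map each CATH level name to the code prefix that identifies it."""
    parts = code.split(".")
    if len(parts) != len(LEVEL_NAMES) or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return {name: ".".join(parts[: i + 1]) for i, name in enumerate(LEVEL_NAMES)}

def group_by_level(domains: dict[str, str], level: str) -> dict[str, list[str]]:
    """Group domain identifiers by their classification at the given level."""
    groups: dict[str, list[str]] = defaultdict(list)
    for domain_id, code in domains.items():
        groups[cath_levels(code)[level]].append(domain_id)
    return dict(groups)
```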
NDB
• The Nucleic Acid Database (NDB) is a repository of 3D nucleic acid structures and their complexes
• Structures available in the NDB include RNA and DNA oligonucleotides with two or more bases,
either alone or complexed with proteins or small molecule ligands
• NDB contains both primary and derived information about the structures
  • Primary information includes X-ray crystallography or NMR coordinate data
  • Derived information includes valence geometry, torsion angles and intermolecular contacts
  data
• NDB offers a variety of online and offline tools for analyzing nucleic acid structures. The featured
tools include
  • RNA 3D Motif Atlas, a representative collection of RNA 3D internal and hairpin loop motifs
  • Non-redundant Lists of RNA-containing 3D structures
  • RNA Base Triple Atlas, a collection of motifs consisting of two RNA basepairs
  • WebFR3D, a webserver for symbolic and geometric searching of RNA 3D structures
  • R3D Align, an application for detailed nucleotide to nucleotide alignments of RNA 3D
  structures
http://ndbserver.rutgers.edu
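Torsion angles are an example of derived information: they are not stored as coordinates but computed from the positions of four consecutive atoms. The following is a minimal sketch of that calculation in plain Python; the function name is an assumption, and the choice of atoms for a particular backbone torsion is left to the caller.

```python
import math

Vec = tuple[float, float, float]

def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Torsion (dihedral) angle in degrees about the p1-p2 bond, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    s0 = _dot(b0, axis)
    s2 = _dot(b2, axis)
    v = _sub(b0, (axis[0] * s0, axis[1] * s0, axis[2] * s0))
    w = _sub(b2, (axis[0] * s2, axis[1] * s2, axis[2] * s2))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))
```

Using `atan2` rather than `acos` keeps the sign of the angle, which distinguishes clockwise from anticlockwise rotation about the central bond.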
PQS/PDBePISA/PISA
• PISA (Proteins, Interfaces, Structures, and Assemblies) was constructed by EMBL-EBI as the
successor to the PQS (Protein Quaternary Structure) database
• PISA is an interactive tool for the exploration of macromolecular interfaces
• PISA presents results calculated by physico-chemical models for PDB entries and/or uploaded
macromolecular structures
• PISA provides probable quaternary structures (assemblies), their structural and chemical
properties and their probable dissociation patterns
http://www.ebi.ac.uk/pdbe/pisa/
SYSTERS
• SYSTERS (SYSTEmatic Re-Searching) is a collection of graph-based algorithms to hierarchically
partition a large set of protein sequences into homologous families and super-families
• SYSTERS is based on an all-against-all database search (using Smith-Waterman comparisons
on a GeneMatcher machine)
• The resulting set of protein families contains four different types of clusters, based on the
connectivity within their family distance graph and listed in order of decreasing reliability:
  • Perfect Clusters (P): all sequences are connected to all other sequences in the cluster
  • Single Sequence Cluster (S): a special case of perfect cluster
  • Nested Clusters (N): at least one sequence is connected to all other sequences in the cluster
  • Overlapping Clusters (O): no sequence is connected to all other sequences in the cluster
http://systers.molgen.mpg.de/ [DISCONTINUED]
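The four cluster types can be told apart by counting how many members are connected to every other member. The sketch below applies the definitions as stated above to an undirected graph; the function name and the edge representation (a set of two-element frozensets) are assumptions, and it is not SYSTERS' own code.

```python
def classify_cluster(members: list[str], edges: set[frozenset[str]]) -> str:
    """Return "S", "P", "N" or "O" for a cluster given its undirected edges."""
    if len(members) == 1:
        return "S"  # single-sequence cluster, a special case of a perfect cluster

    def connected_to_all(node: str) -> bool:
        return all(frozenset((node, other)) in edges for other in members if other != node)

    fully_connected = sum(1 for member in members if connected_to_all(member))
    if fully_connected == len(members):
        return "P"  # every sequence is connected to every other sequence
    if fully_connected >= 1:
        return "N"  # at least one sequence reaches all the others
    return "O"      # no sequence reaches all the others
```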
Motif
• Motif is a search service provided by GenomeNet to search with a protein query sequence
against Motif Libraries
• Supports several motif databases such as Prosite, BLOCKS, ProDom, Pfam, and PRINTS
• Allows you to search protein sequence libraries with your patterns
• Each residue must be separated with - (minus sign)
• x represents any amino acid
• [DE] means either D or E
• {FWY} means any amino acids except for F, W and Y
• A(2,3) means that A appears 2 to 3 times consecutively
• The pattern string must be terminated with . (period)
For example, C-x-{C}-[DN]-x(2)-C-x(5)-C-C.
• Generates a profile from a set of multiple aligned sequences using PFMake or HMMBuild
http://www.genome.jp/tools/motif/
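The pattern rules listed above map closely onto ordinary regular expressions. The sketch below converts a pattern written with those rules into a Python regex; the function name is hypothetical, and only the elements described on this slide are supported (no `<` or `>` terminus anchors).

```python
import re

# One pattern element: x, a residue, [allowed] or {excluded}, with optional (n) or (n,m).
_ELEMENT = re.compile(
    r"^(?P<core>x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<min>\d+)(?:,(?P<max>\d+))?\))?$"
)

def pattern_to_regex(pattern: str) -> str:
    """Translate a '-'-separated, '.'-terminated pattern into a Python regex string."""
    body = pattern.strip()
    if body.endswith("."):
        body = body[:-1]
    parts = []
    for element in body.split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = match.group("core")
        if core == "x":
            regex = "."                        # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any amino acid except those listed
        else:
            regex = core                       # a single residue or an [..] choice
        if match.group("min"):
            low, high = match.group("min"), match.group("max")
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)
```

The resulting string can be passed to `re.search` to test a sequence against the pattern.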
Protein Databases

  • 1. LECTURE TOPIC: PROTEIN DATABASES • TOPICS COVERED: UniProtKB/Swiss-Prot/TrEMBL, PIR, MIPS, PROSITE, PRINTS, BLOCKS, Pfam, NRDB, OWL, PDB, SCOP, CATH, NDB, PQS, SYSTERS, Motif • LECTURE BY: Ashok Kumar T (ashok@biogem.org)
  • 2. Computational Terms & Definitions • Protein Sequence – A string over the 20 amino acid characters [A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y] • Protein Structure – 3D atomic coordinates [x-axis, y-axis, z-axis] • Types of Biological Databases – [Raw Database = Plain text, Object-oriented Database = Table (Records), Relational Database = Table of tables] • 3D Atom Model – [Sphere = Atom, Cylinder = Bond, Dotted Line = Bond Interaction] • Sequence Alignment – [Match = Identical Character, Mismatch = Different Character, Gap = No Substitute Character, Word = Sub-string, Sequence = Super-string, Score = Rating, Identity = Proportion of identical characters at aligned positions] • Motif – Short, conserved sequence associated with a distinct function • Domain – Evolutionarily conserved sequence region that corresponds to a structurally independent 3D unit associated with a particular functional role. It is usually much larger than a motif • Pattern – Sequence written in a symbolic notation. Example: N-{P}-[ST]-{P}-A(2,3). • Regular Expression – Representation format for a sequence motif, which includes positional information for conserved and partly conserved residues. Similar to Pattern, but applies to MSA • Profile – Scoring matrix that represents a multiple sequence alignment. It contains probability or frequency values of residues for each aligned position in the alignment, including gaps
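The Profile definition can be made concrete with a minimal sketch: count the residues (and gap characters) in each column of an alignment and divide by the number of sequences. The function name `build_profile` and the toy alignment are hypothetical and serve only as illustration; real profile tools also apply sequence weighting and pseudocounts, which are left out here.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dict per alignment column.

    `alignment` is a list of equal-length strings; '-' marks a gap and is
    counted like any other symbol, as in the definition above.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for column in zip(*alignment):  # iterate column by column
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


# Toy alignment of four sequences (made up for this example).
toy_msa = ["ACD-", "ACE-", "GCDK", "ACDK"]
profile = build_profile(toy_msa)
for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```

Each row of the printed result corresponds to one aligned position, so a fully conserved column shows a single residue with frequency 1.0.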
  • 3. UniProtKB/Swiss-Prot/TrEMBL • Universal Protein Resource (UniProt) is a comprehensive and non-redundant resource for protein sequence and annotation data • The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc) • UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data http://www.uniprot.org/
  • 4. Background of UniProtKB • UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) • EMBL-EBI and SIB together used to produce Swiss-Prot and TrEMBL, while PIR produced the Protein Sequence Database (PIR-PSD) • Translated EMBL Nucleotide Sequence Data Library (TrEMBL) was originally created because sequence data was being generated at a pace that exceeded Swiss-Prot's ability to keep up • PIR maintained the PIR-PSD and related databases, including iProClass, a database of protein sequences and curated families
  • 7. NBRF/PIR The Protein Information Resource (PIR) is an integrated bioinformatics resource for genomic, proteomic and systems biology research, established by the National Biomedical Research Foundation (NBRF). PIR offers a wide variety of resources mainly intended to support the propagation and standardization of protein annotation: • PRO – Protein Ontology • iProClass – Integrated protein knowledgebase • iProLINK – Literature information and knowledgebase • iPTMnet – Integrated protein post-translational modification resource • iProXpress – Integrated protein expression analysis system • RESID Database – Comprehensive collection of annotations and structures for protein modifications http://pir.georgetown.edu/
  • 10. MIPS • Munich Information Center for Protein Sequences (MIPS) is a research center hosted by the Institute of Bioinformatics and Systems Biology (IBIS); it is part of the Helmholtz Research Center for Environmental Health, Germany • MIPS focuses on the systematic analysis of genome information, including the development and application of bioinformatics methods in genome annotation, gene expression analysis and proteomics • MIPS supports and maintains a set of generic databases as well as the systematic comparative analysis of microbial, fungal, and plant genomes • MIPS offers databases, web services, and platforms for genomics, proteins, metabolomics and multi-omics integration, chemical screening, and disease annotation HOME PAGE: https://www.helmholtz-muenchen.de/ibis/ PPI: http://mips.helmholtz-muenchen.de/proj/ppi/
  • 14. PROSITE • PROSITE is a protein domain database for functional characterization and annotation. • PROSITE consists of entries describing protein families, domains and functional sites, as well as the amino acid patterns and profiles that identify them. • PROSITE is manually curated by a team at the Swiss Institute of Bioinformatics and is tightly integrated into Swiss-Prot protein annotation. • PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns. • The rules contain information about biologically meaningful residues, such as active sites, substrate- or co-factor-binding sites, post-translational modification sites or disulfide bonds, to help function determination. http://prosite.expasy.org/
  • 17. Result of PROSITE for Matching Pattern Hits
  • 18. PRINTS • PRINTS database is a collection of protein motif fingerprints • Fingerprint is a group of conserved motifs used to characterize a protein family • Motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space to define molecular binding sites or interaction surfaces • Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs • PRINTS provides detailed annotation resource for protein families, and a diagnostic tool for newly determined sequences • PRINTS is a founding partner of the integrated resource, InterPro, a widely used database of protein families, domains and functional sites http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/ http://130.88.97.239/PRINTS/
  • 23. BLOCKS • BLOCKS Database is based on InterPro entries with sequences from Swiss-Prot and TrEMBL • Blocks are multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins • BLOCKS entries are cross-referenced to PROSITE, PRINTS, SMART, Pfam and/or ProDom entries • BLOCKS Database was constructed by the PROTOMAT system using the MOTIF algorithm http://blocks.fhcrc.org/
  • 27. Pfam • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). • Pfam version 31.0 was produced at the EBI using a sequence database called Pfamseq, which is based on UniProtKB. • Pfam 31.0 has 16,712 families. • The descriptions of Pfam families are managed by the general public using Wikipedia. • The Pfam database contains information about protein domains and families. • Pfam-A is the manually curated portion of the database. • Pfam-B contains a large number of small families generated automatically from clusters produced by an algorithm called ADDA. • Pfam-B families can be useful when no Pfam-A families are found, although they are of lower quality. http://pfam.xfam.org/
  • 28. Classification of Pfam Entries • Family - A collection of related protein regions • Domain - A structural unit • Repeat - A short unit which is unstable in isolation but forms a stable structure when multiple copies are present • Motif - A short unit found outside globular domains • Coiled-Coil - Regions that predominantly contain coiled-coil motifs, i.e., alpha-helices that are coiled together in bundles of 2-7 • Disordered - Regions that are conserved, yet are either shown or predicted to contain biased sequence composition and/or are intrinsically disordered (non-globular) • Clan - A collection of families that have arisen from a single evolutionary origin • Related Pfam entries are grouped together into clans; the relationship may be defined by similarity of sequence, structure or profile-HMM.
  • 32. NRDB/NRDB90 • NRDB (Non-Redundant DataBase) is a so-called non-redundant composite of the following sources: PDB, RefSeq, UniProtKB/Swiss-Prot, DDBJ, EMBL, GenBank, and PIR • NRDB is similar in content to OWL, but contains more up-to-date information • Strictly speaking, NRDB is non-identical rather than non-redundant, i.e., only identical sequence copies are removed from the database • The NRDB algorithm was written by Warren Gish at Washington University and is used to construct a database called NRDB90 • NRDB90 contains only sequences which do not have homologues with sequence identity of 90% or more • NRDB is currently maintained by NCBI http://www.ebi.ac.uk/~holm/nrdb90/ [MOVED] http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
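The distinction between "non-identical" and "non-redundant" can be illustrated with a short sketch that drops only exact sequence copies, which is the behaviour the slide attributes to NRDB. This is not the actual NRDB program; the function name `remove_identical` and the toy records are hypothetical, and a 90% identity filter such as NRDB90 would additionally need pairwise alignment, which is not attempted here.

```python
def remove_identical(records):
    """Keep the first (name, sequence) record for each distinct sequence.

    Only exact string copies are treated as duplicates; near-identical
    sequences (for example one substitution apart) are all retained.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


# Toy records (made up): two exact copies and one near-identical variant.
toy_records = [
    ("sourceA|1", "MKTAYIAKQR"),
    ("sourceB|7", "MKTAYIAKQR"),  # exact copy, removed
    ("sourceC|3", "MKTAYIAKQK"),  # one residue differs, kept
]
non_identical = remove_identical(toy_records)
print([name for name, _ in non_identical])
```

The near-identical third record survives, which is why a composite built this way is non-identical but can still be highly redundant.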
  • 35. OWL • OWL is a non-redundant composite of 4 publicly available primary sources: Swiss-Prot, PIR, GenBank (translation) and NRL-3D • Swiss-Prot is the highest priority source, all others being compared against it to eliminate identical and trivially different sequences • The strict redundancy criteria render OWL relatively “small” and hence efficient in similarity searches http://www.bioinf.man.ac.uk/dbbrowser/OWL http://130.88.97.239/OWL/
  • 39. PDB • The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. • The PDB was established in 1971 at Brookhaven National Laboratory (BNL) under the leadership of Walter Hamilton and originally contained 7 structures. • In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the PDB. • In 2003, the wwPDB was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community. • The RCSB PDB supports a website where visitors can perform simple and complex queries on the data, and analyze and visualize the results. • Members of wwPDB are: RCSB PDB (USA), PDBe (Europe), PDBj (Japan), and the Biological Magnetic Resonance Data Bank, BMRB (USA). http://rcsb.org/pdb/
  • 42. SCOP2 • The SCOP (Structural Classification of Proteins) database is a large manual classification of protein structural domains based on similarities of their structures and amino acid sequences. • A motivation for this classification is to determine the evolutionary relationship between proteins. • Proteins with the same shapes but having little sequence or functional similarity are placed in different “superfamilies”, and are assumed to have only a very distant common ancestor. • Proteins having the same shape and some similarity of sequence and/or function are placed in “families”, and are assumed to have a closer common ancestor. • Development of the original SCOP has been discontinued; its last official version is 1.75, and SCOP2 is its successor. • SCOP2 offers two different ways of accessing data: SCOP2-browser and SCOP2-graph. • SCOP2-browser allows navigation in a traditional way by browsing pages displaying the node information. • SCOP2-graph is a graph-based web tool for display and navigation. • The source of protein structures is the Protein Data Bank. http://scop2.mrc-lmb.cam.ac.uk/
  • 43. Classification of SCOP Entries • The unit of classification of structure in SCOP is the protein domain. • The levels of SCOP are as follows. 1. Class: Types of folds, e.g., all α, all β, α/β, α+β, α&β, etc. 2. Fold: The different shapes of domains within a class, e.g., 2 helices; antiparallel hairpin, left-handed twist, etc. 3. Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant common ancestor. 4. Family: The domains in a superfamily are grouped into families, which have a more recent common ancestor. 5. Protein domain: The domains in families are grouped into protein domains, which are essentially the same protein. 6. Species: The domains in “protein domains” are grouped according to species. 7. Domain: A part of a protein. For simple proteins, it can be the entire protein.
  • 47. CATH • CATH (Class, Architecture, Topology, and Homologous superfamily) is a semi-automatic, hierarchical classification of protein domains. • CATH shares many broad features with its principal rival, SCOP. • The four main levels of the CATH hierarchy are as follows: 1. Class: the overall secondary-structure content of the domain, e.g., all α, all β, α/β, α+β, α&β, etc. 2. Architecture: a large-scale grouping of topologies which share particular structural features. 3. Topology: high structural similarity but no evidence of homology. Equivalent to a fold in SCOP. 4. Homologous superfamily: indicative of a demonstrable evolutionary relationship. Equivalent to the superfamily level of SCOP. http://www.cathdb.info/
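The four-level hierarchy maps naturally onto a dotted identifier with one number per level. The sketch below assumes such a four-part numeric code (the format and the example code 3.40.50.300 are assumptions for illustration and are not taken from the slides) and splits it into the prefix that names each level. The function name `parse_cath_code` is hypothetical.

```python
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def parse_cath_code(code):
    """Split a dotted four-part code into one prefix per hierarchy level.

    Assumes the code has exactly four numeric parts, one for each of the
    Class, Architecture, Topology and Homologous superfamily levels.
    """
    parts = code.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected four numeric parts, got {code!r}")
    # Each level is identified by the code truncated at that depth.
    return {level: ".".join(parts[: i + 1]) for i, level in enumerate(LEVELS)}


node = parse_cath_code("3.40.50.300")
for level in LEVELS:
    print(f"{level:>24}: {node[level]}")
```

Two domains share a level exactly when their prefixes at that depth are equal, so comparing `node["topology"]` values tests for the same topology without implying homology.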
  • 52. NDB • Nucleic Acid Database (NDB) is a repository of 3D nucleic acid structures and their complexes • Structures available in the NDB include RNA and DNA oligonucleotides with two or more bases, either alone or complexed with proteins or small-molecule ligands • NDB contains both primary and derived information about the structures ◦ Primary information includes X-ray crystallography or NMR coordinate data ◦ Derived information includes valence geometry, torsion angles and intermolecular contacts data • NDB offers a variety of online and offline tools for analyzing nucleic acid structures. The featured tools include ◦ RNA 3D Motif Atlas, a representative collection of RNA 3D internal and hairpin loop motifs ◦ Non-redundant Lists of RNA-containing 3D structures ◦ RNA Base Triple Atlas, a collection of motifs consisting of two RNA basepairs ◦ WebFR3D, a webserver for symbolic and geometric searching of RNA 3D structures ◦ R3D Align, an application for detailed nucleotide-to-nucleotide alignments of RNA 3D structures http://ndbserver.rutgers.edu
  • 55. PQS/PDBePISA/PISA • PISA (Proteins, Interfaces, Structures, and Assemblies), formerly known as PQS (Protein Quaternary Structure) database, was constructed by EMBL-EBI • PISA is an interactive tool for the exploration of macromolecular interfaces • PISA presents results calculated by certain physico-chemical models for PDB and/or uploaded macromolecular structures • PISA provides probable quaternary structures (assemblies), their structural and chemical properties and probable dissociation pattern http://www.ebi.ac.uk/pdbe/pisa/
  • 60. SYSTERS • SYSTERS (SYSTEmatic Re-Searching) is a collection of graph-based algorithms to hierarchically partition a large set of protein sequences into homologous families and super-families • SYSTERS is based on an all-against-all database search (using Smith-Waterman comparisons on a GeneMatcher machine) • The resulting set of protein families contains four different types of clusters, based on the connectivity within their family distance graph, listed here in order of decreasing reliability: ◦ Perfect Clusters (P): all sequences are connected to all other sequences in the cluster ◦ Single Sequence Cluster (S): a special case of perfect cluster ◦ Nested Clusters (N): at least one sequence is connected to all other sequences in the cluster ◦ Overlapping Clusters (O): no sequence is connected to all other sequences in the cluster http://systers.molgen.mpg.de/ [DISCONTINUED]
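The four cluster types are defined purely by connectivity, so they can be checked on a small graph. The sketch below is not the SYSTERS implementation; it only applies the P/S/N/O definitions from the slide to a cluster given as a list of members and undirected edges. The function name `classify_cluster` and the toy graphs are hypothetical.

```python
def classify_cluster(members, edges):
    """Return 'S', 'P', 'N' or 'O' for a cluster of sequences.

    `members` is an iterable of sequence names; `edges` is an iterable of
    (a, b) pairs meaning the two sequences are connected in the graph.
    """
    members = set(members)
    if len(members) == 1:
        return "S"  # single sequence: special case of a perfect cluster

    neighbours = {m: set() for m in members}
    for a, b in edges:
        if a in members and b in members and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Sequences that are connected to every other member of the cluster.
    fully_connected = [m for m in members if neighbours[m] == members - {m}]
    if len(fully_connected) == len(members):
        return "P"  # all sequences connected to all others
    if fully_connected:
        return "N"  # at least one sequence connected to all others
    return "O"      # no sequence connected to all others


print(classify_cluster("abc", [("a", "b"), ("b", "c"), ("a", "c")]))   # triangle
print(classify_cluster("abc", [("a", "b"), ("a", "c")]))               # star around a
print(classify_cluster("abcd", [("a", "b"), ("b", "c"), ("c", "d")]))  # chain
print(classify_cluster("a", []))                                       # singleton
```

The order of the checks mirrors the slide's reliability ranking: a cluster is assigned the most reliable type whose condition it satisfies.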
  • 64. Motif • Motif is a search service provided by GenomeNet to search with a protein query sequence against Motif Libraries • Supports several motif databases such as PROSITE, BLOCKS, ProDom, Pfam, and PRINTS • Allows you to search protein sequence libraries with your own patterns, written as follows: • Each residue must be separated with - (minus sign) • x represents any amino acid • [DE] means either D or E • {FWY} means any amino acid except F, W and Y • A(2,3) means that A appears 2 to 3 times consecutively • The pattern string must be terminated with . (period). For example, C-x-{C}-[DN]-x(2)-C-x(5)-C-C. • Generates a profile from a set of multiple aligned sequences using PFMake or HMMBuild http://www.genome.jp/tools/motif/
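The pattern rules listed on this slide translate almost one-to-one into a regular expression. The following sketch converts a pattern written with those rules into a Python regex; it covers only the elements named on the slide (single residues, x, [..], {..}, repeat counts and the terminating period), not the full PROSITE syntax. The function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re

# One pattern element: x, a residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (2) or (2,3).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a pattern such as 'C-x-{C}-[DN]-x(2)-C.' into a regex string."""
    if not pattern.endswith("."):
        raise ValueError("pattern must be terminated with a period")
    pieces = []
    for element in pattern[:-1].split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                       # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any amino acid except these
        else:
            regex = core                      # residue or [..] is valid as-is
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        pieces.append(regex)
    return "".join(pieces)


regex = prosite_to_regex("C-x-{C}-[DN]-x(2)-C-x(5)-C-C.")
print(regex)

# Made-up sequence containing one occurrence of the pattern.
hit = re.search(regex, "MKCAGDAACAAAAACCLE")
print(hit.group(0) if hit else "no match")
```

Because `.` in a regex matches any character, a stricter version could replace it with an explicit class of the 20 amino acid letters.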