O slideshow foi denunciado.
Seu SlideShare está sendo baixado. ×
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Próximos SlideShares
Nucleic Acid Sequence databases
Nucleic Acid Sequence databases
Carregando em…3
×

Confira estes a seguir

1 de 24 Anúncio

Mais Conteúdo rRelacionado

Diapositivos para si (20)

Semelhante a Proteins databases (20)

Anúncio

Mais de Hafiz Muhammad Zeeshan Raza (12)

Mais recentes (20)

Anúncio

Proteins databases

  1. 1. Proteins Databases Hafiz.M.Zeeshan.Raza Research Assistant_HEC_NRPU hafizraza26@gmail.com COMSATS UNIVERSITY Islamabad SAHIWAL Campus, Punjab, Pakistan
  2. 2. Overview • Introduction • Sections of Database • Importance of GenBank
  3. 3. Introduction • Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. • These data cannot be handled without using computer databases. Searching databases is often the first step in the study of a new protein. • Without the prior knowledge obtained from such searches, known information about the protein could be missed, or an experiment could be repeated unnecessarily. • Comparison between proteins and protein classification provide information about the relationship between proteins within a genome or across different species, and hence offer much more information than can be obtained by studying only an isolated protein.
  4. 4. Continue… • Thanks to the Human Genome Project and other sequencing efforts, new sequences have been generated at a prodigious rate. • These sequences provide a rich information source and are the core of the revolutionary movement toward “large-scale biology.” • The protein sequences can be computationally annotated from these genomic sequences. Various databases contain protein sequences with different focuses. • Most protein databases have interactive search engines so that users can specify their needs and obtain the related information interactively. • Many protein databases also allow submitters to deposit data, and database servers can check the format of the data and provide immediate feedback.
  5. 5. Protein Sequence Databases • Protein bioinformatics databases can be primarily classified as sequence databases, 2D gel databases, 3D structure databases, chemistry databases, enzyme and pathway databases, family and domain databases, gene expression databases, genome annotation databases, organism specific databases, phylogenomic databases, polymorphism and mutation database , protein-protein interaction databases, proteomic databases, PTM databases, ontologies, specialized protein databases, and other (miscellaneous) databases.
  6. 6. Protein Sequence Databases • Among all protein sequence databases, UniProt is the most widely used one. It provides more annotations than any other sequence database with a minimal level of redundancy through human input or integration with other databases. UniProtKB has three components: 1. Protein knowledgebase, including Swiss-Prot (manually annotated and reviewed) and TrEMBL (automatically annotated). 2. UniRef (sequence clusters for fast sequence similarity searches). 3. UniParc (sequence archive for keeping track of sequences and their identifiers).
  7. 7. Continue… • In addition to Swiss-Prot and TrEMBL, UniProtKB includes information from Protein Sequence Database (PSD) in the Protein Identification Resource, which builds a complete and non-redundant database from a number of protein and nucleic acid sequence databases together with bibliographic and annotated information. • The National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov) also provides rich information and a number of useful tools for protein sequences. • It includes entries from the non-redundant GenBank translations, UniProt, PIR, Protein Research Foundation (PRF) in Japan, and the Protein Data Bank (PDB).
  8. 8. Continue… • UniProt, as a curated protein sequence database, offers a portal to a wide range of annotations, covering areas such as function, family, domain parsing, post-translational modifications, and variants. UniProt can be accessed at http://www.uniprot.org. • Human vitronectin is used here as an example for searching protein sequence databases. To locate the UniProt entry for this protein, one can search either the entry name (VTNC_HUMAN) or the accession number (P04004) obtained from a BLAST search.
  9. 9. Continue… • Each entry contains the following items shown in table format in the NiceProt View layout: 1. Name and origin 2. Protein attributes 3. General annotation 4. Ontologies (gene functions) 5. Binary protein-protein interactions 6. Sequence annotation (features) 7. Sequence 8. References (literature citation) 9. Web resources 10. Cross-references (links to other databases) 11. Entry information, and 12. Relevant documents.
  10. 10. RefSeq Database • The National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database provides curated non-redundant sequences of genomic regions, transcripts and proteins for taxonomically diverse organisms including Archaea, Bacteria, Eukaryotes, and Viruses. • RefSeq database is derived from the sequence data available in the redundant archival database GenBank. RefSeq sequences include coding regions, conserved domains, variations etc. and enhanced annotations such as publications, names, symbols, aliases, Gene IDs, and database cross-references. • The sequences and annotations are generated using a combined approach of collaboration, automated prediction, and manual curation.
  11. 11. Continue… • The RefSeq records can be directly accessed from NCBI web sites by search of the Nucleotide or Protein databases, BLAST searches against selected databases and FTP downloads. • RefSeq records are also available through indirect links from other NCBI resources such as Gene, Genome, BioProject, dbSNP, ClinVar and Map Viewer etc. • In addition, RefSeq supports programmatic access through Entrez Programming Utilities.
  12. 12. PROTEIN STRUCTURAL DATABASES • Searching structure databases is becoming more and more popular in molecular biology. • The three-dimensional structures of proteins not only define their biological functions, but also hold a key in rational drug design. • Traditionally, protein structures were solved at a low throughput mode. • However, advances in new technologies, such as synchrotron radiation sources and high-resolution nuclear magnetic resonance (NMR), accelerate the rate of protein structure determination substantially. • The only international repository for the processing and distribution of protein structures is the PDB.
  13. 13. Continue… • The worldwide PDB (wwPDB, http://www.wwpdb.org) was established in 2003 as an international collaboration to maintain a single and publicly available Protein Data Bank Archive (PDB Archive) of macro-molecular structural data. • The wwPDB member includes Protein Data Bank in Europe (PDBe), Protein Data Bank Japan (PDBj), Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), and Biological Magnetic Resonance Bank (BMRB). • The “PDB Archive” is a collection of flat files in three different formats: the legacy PDB format; the PDBx/mmCIF (http:// deposit.pdb.org/mmcif/) format; and the Protein Data Bank Markup Language (PDBML) format. • Each member site serves as a deposition, data processing and distribution site for the PDB Archive, and each provides its own view of the primary data and a variety of tools and resources.
  14. 14. Protein Family Databases • Proteins can be classified according to their sequence, evolutionary, structural, or functional relationships. • A protein in the context of its family is much more informative than the single protein itself. • For example, residues conserved across the family often indicate special functional roles. • Two proteins classified in the same functional family may suggest that they share similar structures, even when their sequences do not have significant similarity.
  15. 15. Continue… • There is no unique way to classify proteins into families. Boundaries between different families may be subjective. • The choice of classification system depends in part on the problem; in general, the author suggests looking into classification systems from different databases and comparing them. • Three types of classification methods are widely adopted based upon the similarity of sequence, structure, or function. • Sequence-based methods are applicable to any proteins whose sequences are known, while structure-based methods are limited to the proteins of known structures, and function-based methods depend on the functions of proteins being annotated.
  16. 16. Continue… • Sequence- and structure-based classifications can be automated and are scalable to high-throughput data, whereas function-based classification is typically carried out manually. • Structure- and function-based methods are more reliable, while sequence-based methods may result in a false positive result when sequence similarity is weak (i.e., two proteins are classified into one family by chance rather than by any biological significance). • In addition, since protein structure and function are better conserved than sequence, two proteins having similar structures or similar functions may not be identified through sequence-based methods
  17. 17. InterPro Database • InterPro is an integrated resource of predictive models or ‘signatures’ representing protein domains, families, regions, repeats and sites from major protein signature databases including CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom , PROSITE, SMART, SUPERFAMILY and TIGRFAMs. • Each entry in the InterPro database is annotated with a descriptive abstract name and cross references to the original data sources, as well as to specialized functional databases. • The search by sequence or domain architecture is provided by InterPro web site. The InterPro signatures in XML format are available via anonymous FTP download. • InterPro also provides a software package InterProScan that can be used locally to scan protein sequences against InterPro’s signatures.
  18. 18. Pfam Database • Pfam is a database of protein families represented as multiple sequence alignments and Hidden Markov Models (HMMs). • Pfam entries can be classified as Family (related protein regions), Domain (protein structural unit), Repeat (multiple short protein structural units), Motifs (short protein structural unit outside global domains). • Related Pfam entries are grouped into clans based on sequence, structure or profile- HMM similarity. • The Pfam database web site provides search interface for querying by sequence, keyword, domain architecture, taxonomy, and browse interfaces for analyzing protein sequences for Pfam matches and viewing Pfam annotations in domain architectures, sequence alignments, interactions, species and protein structures in PDB.
  19. 19. PIRSF Database • The PIRSF classification system provides comprehensive and non overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships based on whole proteins rather than on the component domains. • The PIRSF system classifies the protein sequences into families, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full- length sequence similarity and a common domain architecture). • The PIRSF family classification results are expert-curated based on literature review and integrative sequence and functional analysis. • The current release of PIRSF contains 11,800 families, which cover 5,407,000 UniProtKB protein sequences.
  20. 20. PROSITE • PROSITE is a database of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them. • The entries are derived from multiple alignments of homologous sequences and have the advantage of identifying distant relationships between sequences. • PROSITE includes a collection of ProRules based on profiles and patterns of functionally and/or structurally critical amino acids that can be used to increase PROSITE’s discriminatory power. • The PROSITE web site provides keyword-based search and allows browsing by documentation entry, ProRule description, taxonomic scope and number of positive hits.
  21. 21. PRIDE Database • The PRoteomics IDEntifications database (PRIDE) is a repository for mass- spectrometry based proteomics data including identifications of proteins, peptides and post-translational modifications that have been described in the scientific literature, together with supporting mass spectra and related technical and biological metadata. • PRIDE supports tandem MS (MS/MS) and Peptide Fingerprinting datasets with search/analysis workflows originally analyzed by the submitters. • PIRDE provides several services such as the Protein Identifier Cross-Reference (PICR), the Ontology Lookup Service (OLS) and Database on Demand.
  22. 22. MEROPS (Metalloprotease) Database • MEROPS is an integrated database of information about peptidases (also termed proteases, proteinases and proteolytic enzymes) and the proteins that inhibit them. • A homologous set of peptidases and protein inhibitors are grouped into peptidase and inhibitor species. • Protein species are grouped into family that contains statistically significant similarities in amino acid sequence. Families are grouped into clans that contain related structures. • Both family (sub-family) and clan can be browsed by index page with links to their summary page. Each peptidase has a summary page that can be browsed by Name, Identifier, Gene name, Organism and Substrates. • The peptidase summary page includes information of Gene Structure, Alignment, Tree, Sequences and their features, Distribution, Structure, Literature, Human EST, Mouse EST, Substrates, Inhibitors and Pharma.
  23. 23. Importance of Databases • Protein databases have become a crucial part of modern biology. Huge amounts of data for protein structures, functions, and particularly sequences are being generated. • Searching databases is often the first step in the study of a new protein. Comparison between proteins or between protein families provides information about the relationship between proteins within a genome or across different species, and hence offers much more information than can be obtained by studying only an isolated protein. • In addition, secondary databases derived from experimental databases are also widely available. These databases reorganize and annotate the data or provide predictions. The use of multiple databases often helps researchers understand the structure and function of a protein. • Although some protein databases are widely known, they are far from being fully utilized in the protein science community.

×