Biological databases
1. Types of Biological data, Biological
databases: Nucleic acid and Protein
sequences and Protein structure databases
Presented By :
Syeda Tamanna Yasmin
Doctoral Research Scholar
Department of Microbiology
2. INTRODUCTION
Data : A collection of facts from which conclusions may be drawn
Biological Data: Relating to, caused by, or affecting life or living organisms
TYPES OF BIOLOGICAL DATA
4. BIOLOGICAL DATABASES
■ Database: a collection of data that is structured, searchable, and
periodically updated
■ Biological databases: libraries of life sciences information,
collected from scientific experiments, published literature,
high-throughput experimental technology, and computational analysis.
■ The data stored in biological databases consists of two types:
o Raw and
o Curated (or annotated)
■ Type and Content of Data
o Sequence or Structure
o Nucleic acid or protein
5. ■ The databases can be classified into three categories on the basis of the information
stored: primary, secondary, and composite databases.
Primary Databases: These contain data that are derived experimentally.
■ They can be further divided into protein or nucleotide databases which can be further
divided as sequence or structure databases.
■ The most commonly used primary databases are:
o DNA Data Bank of Japan (DDBJ),
o European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database,
o GenBank,
o Protein Data Bank (PDB),
o SWISS-PROT, and
o Protein Information Resource (PIR)
6. Secondary Databases: These contain data obtained through the
analysis or treatment of data present in primary databases.
■ They can contain conserved protein sequences, signature sequences, and active-site
residues of protein families.
■ These databases can be further classified as
o metabolic pathways database,
o protein family database, etc.
■ The most common examples are :
o Class Architecture Topology Homology (CATH),
o Kyoto Encyclopedia of Genes and Genomes (KEGG),
o Protein Families (Pfam) and
o Structural Classification of Proteins (SCOP).
7. Composite Databases: Composite databases are collections of several
(usually more than two) primary database resources.
■ This lessens the tedious task of searching through multiple
databases that refer to the same data.
■ For example:
o DrugBank offers details on drugs and their targets,
o BioGraph incorporates assorted knowledge of biomedical science,
o BioModels is a storehouse of computational models of biological
processes, etc.
o NCBI, being a composite database, stores a large number of nucleotide
and protein sequences on its servers and therefore suffers from high
redundancy in the deposited data (IASRI, n.d.).
9. Primary Nucleotide databases:
GenBank
■ The GenBank sequence database is an open-access, annotated collection of all publicly
available nucleotide sequences and their protein translations.
■ This database is produced and maintained by the National Center for Biotechnology Information (NCBI)
as part of the International Nucleotide Sequence Database Collaboration (INSDC).
■ The database was started in 1982 by Walter Goad and Los Alamos National Laboratory.
EMBL (European Molecular Biology Laboratory)
■ The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a
comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics
Institute (EBI).
■ Data are received from genome sequencing centres, individual scientists and patent offices.
■ EMBL was created in 1974 and is an intergovernmental organization funded by public research money
from its member states. It was the idea of Leó Szilárd, James Watson and John Kendrew.
DDBJ (DNA Data Bank of Japan)
■ It is located at the National Institute of Genetics (NIG) in Shizuoka Prefecture, Japan. It is the only
nucleotide sequence data bank in Asia.
■ DDBJ began data bank activities in 1986 at NIG and is funded by the Japanese Ministry of Education, Culture,
Sports, Science and Technology.
13. Secondary Nucleotide databases
Omniome Database:
■ The Omniome Database is a comprehensive microbial resource maintained by TIGR (The
Institute for Genomic Research).
■ It facilitates meaningful multi-genome searches and analyses, for instance
alignment of entire genomes and comparison of the physical properties of proteins and
genes from different genomes.
FlyBase Database:
■ A consortium sequenced the entire genome of the fruit fly Drosophila melanogaster to a high
degree of completeness and quality.
■ FlyBase is one of the organizations contributing to the Generic Model Organism
Database (GMOD).
16. Primary databases of protein
Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):
• The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich Information Centre for Protein Sequences, Germany)
and the JIPID (Japan International Protein Information Database, Japan).
• A unique characteristic of the PIR-PSD is its classification of protein sequences based on the superfamily concept; sequences are
also classified by homology domain and sequence motifs.
Protein Databank (PDB):
• It is a crystallographic database for the three-dimensional structure of large biological molecules, such as proteins.
• The PDB was established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton and originally
contained 7 structures. After Hamilton's untimely death, Tom Koetzle began to lead the PDB in 1973, and then Joel Sussman in 1994.
• The database holds data derived mainly from three sources: structures determined by X-ray crystallography, NMR experiments, and
molecular modeling.
SWISS-PROT
• UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase.
• It is a high-quality, annotated and non-redundant protein sequence database. Since 2002, it has been maintained by the UniProt
consortium and is accessible via the UniProt website.
• The data in each entry can be considered separately as core data and annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is released as a supplement to SWISS-PROT.
■ It contains the translation of all coding sequences present in the EMBL Nucleotide database, which have not been fully annotated.
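The translation step mentioned above can be sketched in a few lines of Python. This is only an illustration of translating a coding sequence with the standard genetic code, not the pipeline used to build TrEMBL; the function name `translate_cds` and the example sequence are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon.

    Hypothetical helper: assumes the sequence starts in frame and contains
    only the letters A, C, G and T (upper or lower case).
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence: ATG GCC AAA TAA
print(translate_cds("ATGGCCAAATAA"))
```

Real entries also need handling for ambiguous bases, alternative genetic codes and partial coding sequences, which this sketch leaves out.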
20. The secondary databases of protein
PROSITE:
• PROSITE collects patterns found in protein sequences rather than the complete
sequences.
• PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since
July 2018, the director of PROSITE and Swiss-Prot is Alan Bridge.
• The protein motif and pattern are encoded as “regular expressions”.
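To make the "regular expression" idea concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It covers only the common syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > for the sequence ends). The function name `prosite_to_regex` and the test sequence are hypothetical; the pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper covering a common subset of the syntax only.
    """
    pattern = pattern.rstrip(".")  # patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # [ABC] and single letters are already valid regex fragments.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print(re.findall(regex, "MKNGSAANPTA"))  # made-up sequence
```

In the made-up sequence, the second asparagine is followed by a proline, so only the first site satisfies the pattern.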
PRINTS:
• In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’.
• The information contained in a PRINTS entry may be divided into three sections.
o The first section contains cross-links to other databases that have more information about the
characterized family.
o The second section provides a table showing how many of the motifs that make up the fingerprint
occur in how many of the sequences in that family.
o The last section of the entry contains the actual fingerprints, which are stored as multiple aligned sets
of sequences.
23. MHCPep:
• MHCPep is a database comprising over 13000 peptide sequences known to bind the
Major Histocompatibility Complex of the immune system.
• It was established in 1994.
Pfam
• Pfam contains profiles built using hidden Markov models (HMMs).
• Each Pfam entry consists of four elements.
o The first is the annotation, which has the information on the source to make the entry, the method used and
some numbers that serve as figures of merit.
o The second is the seed alignment that is used to bootstrap the rest of the sequences.
o The third is the HMM profile.
o The fourth element is the complete alignment of all the sequences identified in that family.
• Pfam 33.1, released in May 2020, contains 18,259
families.
26. The Cambridge Structural Database (CSD)
■ It was originally a project of the University of Cambridge, set up to collect the
published three-dimensional structures of small organic molecules.
■ All these crystal structures have been obtained using X-ray or neutron diffraction techniques.
■ For each entry in the CSD there are three distinct types of information stored. These are categorized
as bibliographic information, chemical connectivity information and the three-dimensional
coordinates.
The Structural Classification of Proteins database (SCOP)
■ It is a largely manual classification of protein structural domains based on
similarities of their structures and amino acid sequences.
■ SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular
Biology. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein
Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in
Cambridge, England.
Example of some structural databases
27. CATH
■ The CATH Protein Structure Classification database is a free, publicly available online resource that
provides information on the evolutionary relationships of protein domains.
■ It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet
Thornton and David Jones.
■ Protein domains are classified within the CATH structural hierarchy at four levels:
o the Class (C) level,
o the Architecture (A) level,
o the Topology/fold (T) level, and
o the Homologous superfamily (H) level.
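The four-level hierarchy lends itself to a simple data structure. CATH identifiers are conventionally written as four dot-separated numbers, one per level (C.A.T.H); that notation is not shown on the slide, and the names `CathCode`, `parse_cath_code` and `deepest_shared_level` below are hypothetical. The sketch parses such a code and reports the deepest level two domains share.

```python
from typing import NamedTuple, Optional

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


class CathCode(NamedTuple):
    """One number per CATH level, in C.A.T.H order (hypothetical container)."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted four-part code such as '3.40.50.300'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Return the name of the deepest level at which two codes agree."""
    shared = None
    for level, a, b in zip(LEVELS, parse_cath_code(code_a), parse_cath_code(code_b)):
        if a != b:
            break
        shared = level
    return shared


print(parse_cath_code("3.40.50.300"))
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))
```

Two domains that agree on the first three numbers but differ in the fourth share a fold (topology) without being placed in the same homologous superfamily.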
The CluSTr (Cluster of SWISS-PROT and TrEMBL proteins): This database offers an automatic
classification of the entries in the SWISS-PROT and TrEMBL databases into groups of related proteins.
The clustering is based on the analysis of all pairwise comparisons between protein sequences.
The ProDom protein domain database: This database is a compilation of homologous domains that have been
automatically identified by sequence comparison and clustering methods using the program PSI-BLAST. The
focus here is on complete and self-contained structural domains, and the search methods include
signals for such features.
28. Retrieval Databases
Data Retrieval : data retrieval is the process of identifying and extracting data from a database, based on a
query provided by the user or application.
■ Three widely used retrieval systems are SRS, Entrez and DBGET. They differ in the databases they search and
the links they have to other information:
Sequence Retrieval System (SRS) is a homogeneous interface to over 80 biological databases, developed at
the European Bioinformatics Institute (EBI) at Hinxton, UK. It includes databases of
sequences, metabolic pathways, transcription factors, application results (like BLAST, SSEARCH, FASTA),
protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations.
Entrez is a molecular biology database and retrieval system developed by the National Center for
Biotechnology Information (NCBI). It is an entry point for exploring distinct but integrated databases.
DBGET is an integrated database retrieval system for handling the web of molecular biology databases.
It is used as a backbone system in GenomeNet and KEGG and was developed at the University of Tokyo.
It provides access to 20 databases, one at a time.
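The definition of data retrieval given at the start of this slide (identifying and extracting data based on a query) can be illustrated with a toy in-memory database. This sketch uses Python's built-in sqlite3 module and has nothing to do with how SRS, Entrez or DBGET are implemented; the table layout, accession numbers and sequences are all made up.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory sequence table with made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, organism TEXT, molecule TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?, ?, ?)",
        [
            ("DEMO0001", "Escherichia coli", "DNA", "ATGGCCAAATAA"),
            ("DEMO0002", "Escherichia coli", "protein", "MAK"),
            ("DEMO0003", "Drosophila melanogaster", "DNA", "ATGTTTGGGTGA"),
        ],
    )
    return conn


def retrieve(conn: sqlite3.Connection, organism: str, molecule: str):
    """Return (accession, sequence) pairs matching the user's query."""
    cursor = conn.execute(
        "SELECT accession, sequence FROM sequences "
        "WHERE organism = ? AND molecule = ? ORDER BY accession",
        (organism, molecule),
    )
    return cursor.fetchall()


conn = build_demo_db()
print(retrieve(conn, "Escherichia coli", "DNA"))
```

The query parameters are passed separately from the SQL text, which is the usual way to keep user-supplied search terms from being interpreted as SQL.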
29. BLAST and FASTA
■ BLAST (basic local alignment search tool)
A BLAST search enables a researcher to compare a query protein or
nucleotide sequence with a library or database of sequences, and identify
library sequences that resemble the query sequence above a certain
threshold.
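The idea of reporting library sequences that resemble a query above a threshold can be sketched with a plain local alignment score. Note that this is not the BLAST algorithm: BLAST uses a fast seed-and-extend heuristic and reports statistical significance, whereas the sketch below runs an exhaustive Smith-Waterman-style dynamic program with linear gap penalties. The scoring values, function names and sequences are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman style)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def search_library(query: str, library: dict, threshold: int) -> list:
    """Return (name, score) for library sequences scoring at or above threshold."""
    scored = ((name, local_alignment_score(query, seq)) for name, seq in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up library of two sequences.
library = {"seq1": "TTACGTTT", "seq2": "GGGGGG"}
print(search_library("ACGT", library, threshold=6))
```

Only seq1 contains a stretch resembling the query closely enough to pass the threshold; seq2 shares a single matching base and is filtered out.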
■ FASTA and the FASTA format
FASTA is a DNA and protein sequence alignment software package first
described by David J. Lipman and William R. Pearson in 1985. The FASTA
format, which originated with this package, is a text-based format for
representing either nucleotide sequences or amino acid (protein)
sequences, in which nucleotides or amino acids are represented
using single-letter codes. The format also allows for sequence names and
comments to precede the sequences.
31. A sequence in FASTA format consists of:
• One line starting with a ">" sign, followed by a
sequence identification code.
• One or more lines containing the sequence itself.
A file in FASTA format may comprise more than one sequence.
• The FASTA format is sometimes also referred to as the "Pearson"
format (after the author of the FASTA program and of the format).
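A minimal reader for this format might look like the sketch below. It assumes each record is a ">" header line followed by one or more sequence lines, and that a file may hold several records; the function name `parse_fasta` and the example records are hypothetical.

```python
def parse_fasta(text: str) -> list:
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:  # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up two-record example.
example = """>demo1 first example sequence
ATGGCCAAA
TAA
>demo2 second example sequence
MAK
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Sequence lines belonging to one record are joined, so line wrapping in the file does not affect the parsed sequence.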