Types of Biological data, Biological
databases: Nucleic acid and Protein
sequences and Protein structure databases
Presented By :
Syeda Tamanna Yasmin
Doctoral Research Scholar
Department of Microbiology
INTRODUCTION
Data: A collection of facts from which conclusions may be drawn.
Biological data: Data relating to, caused by, or affecting life or living organisms.
TYPES OF BIOLOGICAL DATA
BIOLOGICAL DATABASES
■ Database: A collection of structured, searchable, periodically updated data.
■ Biological databases: Libraries of life sciences information, collected from
scientific experiments, published literature, high-throughput experimental
technology, and computational analysis.
■ The data stored in biological databases are of two types:
o Raw and
o Curated (or annotated)
■ Type and Content of Data
o Sequence or Structure
o Nucleic acid or protein
■ The databases can be classified into three categories on the basis of the information
stored: primary, secondary, and composite.
Primary Databases: These contain data that is derived experimentally.
■ They can be further divided into protein or nucleotide databases, which can in turn be
divided into sequence or structure databases.
■ The most commonly used primary databases are:
o DNA Data Bank of Japan (DDBJ)
o European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database
o GenBank
o Protein Data Bank (PDB)
o SWISS-PROT
o Protein Information Resource (PIR)
Secondary Databases: These contain data obtained through the
analysis or treatment of data present in primary databases.
■ They can contain conserved protein sequences, signature sequences, and
active-site residues of protein families.
■ These databases can be further classified as
o metabolic pathway databases,
o protein family databases, etc.
■ The most common examples are:
o Class, Architecture, Topology, Homologous superfamily (CATH),
o Kyoto Encyclopedia of Genes and Genomes (KEGG),
o Protein Families (Pfam) and
o Structural Classification of Proteins (SCOP).
Composite Databases: Composite databases are collections of several
(usually more than two) primary database resources.
■ This lessens the tedious task of searching through multiple
databases that refer to the same data.
■ For example
o DrugBank offers details on drugs and their targets.
o BioGraph incorporates assorted knowledge of biomedical science.
o BioModels is a storehouse of computational models of biological
processes.
o NCBI, being a composite database, stores a large number of nucleotide
and protein sequences on its servers and therefore suffers from high
redundancy in the deposited data (IASRI, n.d.).
Biological Databases
■ Nucleotide databases
o GenBank
o EMBL
o DDBJ
■ Protein databases
o Sequence: PROSITE, Pfam, Swiss-Prot, TrEMBL, PIR
o Structure: PDB, SCOP, CATH, CSD
Primary Nucleotide databases:
GenBank
■ The GenBank sequence database is an open-access, annotated collection of all publicly
available nucleotide sequences and their protein translations.
■ This database is produced and maintained by the National Center for Biotechnology Information (NCBI)
as part of the International Nucleotide Sequence Database Collaboration (INSDC).
■ The database was started in 1982 by Walter Goad and Los Alamos National Laboratory.
EMBL (European Molecular Biology Laboratory)
■ The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a
comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics
Institute (EBI).
■ Data are received from genome sequencing centres, individual scientists and patent offices.
■ EMBL was created in 1974 and is an intergovernmental organization funded by public research money
from its member states. It was the idea of Leó Szilárd, James Watson and John Kendrew.
DDBJ (DNA Data Bank of Japan)
■ It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is the only
nucleotide sequence data bank in Asia.
■ DDBJ began data bank activities in 1986 at NIG and is funded by the Japanese Ministry of Education, Culture,
Sports, Science and Technology.
Secondary Nucleotide databases
Omniome Database:
■ Omniome Database is a comprehensive microbial resource maintained by TIGR (The
Institute for Genomic Research).
■ It facilitates meaningful multi-genome searches and analyses, for instance
alignment of entire genomes and comparison of the physical properties of proteins and
genes from different genomes.
FlyBase Database:
■ A consortium sequenced the entire genome of the fruit fly Drosophila melanogaster to a high
degree of completeness and quality.
■ FlyBase is one of the organizations contributing to the Generic Model Organism
Database (GMOD).
Primary databases of protein
Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD):
• The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich Information Centre for Protein Sequences, Germany)
and the JIPID (Japan International Protein Information Database, Japan).
• A unique characteristic of the PIR-PSD is its classification of protein sequences based on the superfamily concept; sequences are
also classified by homology domain and sequence motifs.
Protein Databank (PDB):
• It is a crystallographic database for the three-dimensional structure of large biological molecules, such as proteins.
• The PDB was established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton and originally
contained 7 structures. After Hamilton's untimely death, Tom Koetzle began to lead the PDB in 1973, and then Joel Sussman in 1994.
• The database holds data derived mainly from three sources: structures determined by X-ray crystallography, NMR experiments, and
molecular modeling.
SWISS-PROT
• UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase.
• It is a high-quality, annotated, non-redundant protein sequence database. Since 2002, it has been maintained by the UniProt
consortium and is accessible via the UniProt website.
• The data in each entry can be considered separately as core data and annotation.
TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is released as a supplement to SWISS-PROT.
■ It contains the translation of all coding sequences present in the EMBL Nucleotide database, which have not been fully annotated.
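The idea of translating a coding sequence into a protein sequence can be sketched in a few lines. This is only an illustration using the standard genetic code; the function name `translate_cds` is hypothetical, and the sketch does not reproduce the actual TrEMBL pipeline (which also handles annotation, alternative codon tables, and partial features).

```python
# Minimal sketch: translate a coding sequence (CDS) with the standard genetic code.
# Assumption: the CDS starts in frame; unknown codons are reported as "X".

BASES = "TCAG"
# Amino acids for all 64 codons, ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))
```

A trailing incomplete codon is ignored, which keeps the sketch short but would need explicit handling in real use.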
The secondary databases of protein
PROSITE:
• A set of databases collects together patterns found in protein sequences rather than the complete
sequences.
• PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since
July 2018, the director of PROSITE and Swiss-Prot is Alan Bridge.
• Protein motifs and patterns are encoded as “regular expressions”.
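To make the "regular expression" idea concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression. It assumes the usual PROSITE pattern notation (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" for sequence ends). The function name `prosite_to_regex` and the example sequence are hypothetical, and less common syntax variants are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")  # patterns are conventionally terminated by "."
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its residue part and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# A zinc-finger-like pattern, used here purely as an illustration.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(regex)
print(bool(re.search(regex, "AACAACAAALAAAAAAAAHAAAHAA")))
```

Translating the pattern once and then scanning many sequences with `re.search` mirrors how a pattern database is typically used: one pattern, many candidate sequences.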
PRINTS:
• In the PRINTS database, the protein sequence patterns are stored as ‘fingerprints’.
• The information contained in a PRINTS entry may be divided into three sections.
o The first section contains cross-links to other databases that have more information about the
characterized family.
o The second section provides a table showing how many of the motifs that make up the fingerprint
occur in how many of the sequences in that family.
o The last section of the entry contains the actual fingerprints, which are stored as multiple aligned sets
of sequences.
MHCPep:
• MHCPep is a database comprising over 13000 peptide sequences known to bind the
Major Histocompatibility Complex of the immune system.
• It was established in 1994.
Pfam
• Pfam contains profiles built using hidden Markov models (HMMs).
• Each Pfam entry consists of four elements.
o The first is the annotation, which has information on the source used to make the entry, the method used and
some numbers that serve as figures of merit.
o The second is the seed alignment that is used to bootstrap the rest of the sequences.
o The third is the HMM profile.
o The fourth element is the complete alignment of all the sequences identified in that family.
• Pfam 33.1, released in May 2020, contains 18,259
families.
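The step from a seed alignment to a profile can be illustrated with a much-simplified stand-in for a profile HMM. The sketch below builds per-column log-odds scores from an ungapped seed alignment and scores candidate sequences against them. It models match states only (no insert or delete states), uses a uniform background, and is not the method Pfam itself uses; the names `build_profile` and `score_sequence` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(seed_alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    total = len(seed_alignment) + pseudocount * len(ALPHABET)
    profile = []
    for column in range(len(seed_alignment[0])):
        counts = Counter(sequence[column] for sequence in seed_alignment)
        profile.append({
            residue: math.log(((counts[residue] + pseudocount) / total) / background)
            for residue in ALPHABET
        })
    return profile


def score_sequence(profile, sequence):
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(profile, sequence))


seed = ["ACDE", "ACDE", "ACDF"]  # toy seed alignment, all rows the same length
profile = build_profile(seed)
print(score_sequence(profile, "ACDE") > score_sequence(profile, "WWWW"))
```

Sequences that resemble the seed score higher, which is the intuition behind using the seed alignment to pull in the remaining family members.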
The Cambridge Structural Database (CSD)
■ It was originally a project of the University of Cambridge, set up to collect together the
published three-dimensional structures of small organic molecules.
■ All these crystal structures have been obtained using X-ray or neutron diffraction techniques.
■ For each entry in the CSD there are three distinct types of information stored. These are categorized
as bibliographic information, chemical connectivity information and the three-dimensional
coordinates.
The Structural Classification of Proteins database (SCOP)
■ It is a largely manual classification of protein structural domains based on
similarities of their structures and amino acid sequences.
■ SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular
Biology. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein
Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in
Cambridge, England.
Examples of some structural databases
CATH
■ The CATH Protein Structure Classification database is a free, publicly available online resource that
provides information on the evolutionary relationships of protein domains.
■ It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet
Thornton and David Jones.
■ The domains are classified within the CATH structural hierarchy:
o at the Class (C) level,
o at the Architecture (A) level,
o at the Topology/fold (T) level,
o at the Homologous superfamily (H) level.
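The four-level hierarchy can be represented as a simple record. The sketch below assumes the dotted four-number notation commonly used for CATH superfamily codes (Class.Architecture.Topology.Homologous superfamily); the example codes and the names `CathCode`, `parse_cath_code`, and `shared_levels` are for illustration only.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int                  # C level
    architecture: int            # A level
    topology: int                # T level
    homologous_superfamily: int  # H level


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted code such as '3.40.50.300' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_levels(a: CathCode, b: CathCode) -> int:
    """Count how many levels, starting from Class, two domains have in common."""
    depth = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        depth += 1
    return depth


first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
print(shared_levels(first, second))  # same C, A and T levels, different H level
```

Comparing codes level by level reflects the hierarchy: two domains that agree down to the T level share a fold but are placed in different homologous superfamilies.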
The CluSTr (Clusters of SWISS-PROT and TrEMBL proteins): This database offers an automatic
classification of the entries in the SWISS-PROT and TrEMBL databases into groups of related proteins.
The clustering is based on the analysis of all pairwise comparisons between protein sequences.
The ProDom protein domain database: This database is a compilation of homologous domains that have been
automatically identified by sequence comparison and clustering methods using the program PSI-BLAST. The
focus here is on finding complete and self-contained structural domains, and the search methods include
signals for such features.
Retrieval Databases
Data retrieval: The process of identifying and extracting data from a database, based on a
query provided by the user or application.
■ The three retrieval systems described here differ in the databases they search and the links they have to other information:
Sequence Retrieval System (SRS) is a homogeneous interface to over 80 biological databases that was
developed at the European Bioinformatics Institute (EBI) at Hinxton, UK. It includes databases of
sequences, metabolic pathways, transcription factors, application results (like BLAST, SSEARCH, FASTA),
protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations.
Entrez is a molecular biology database and retrieval system developed by the National Center for
Biotechnology Information (NCBI). It is an entry point for exploring distinct but integrated databases.
DBGET is an integrated database retrieval system for handling the web of molecular biology databases.
It is used as a backbone system in GenomeNet and KEGG and was developed at the University of Tokyo.
It provides access to 20 databases, one at a time.
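As an illustration of query-based retrieval, the sketch below builds a request URL for NCBI's E-utilities `efetch` service, the programmatic interface to Entrez. It only constructs the URL and does not contact the server. The helper name `efetch_url` and the accession used are illustrative assumptions, and real use should follow NCBI's usage guidelines.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, accession: str, rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an efetch URL that asks for one record in the given format."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen only for illustration.
print(efetch_url("nucleotide", "NM_000546"))
```

The query is fully described by the database name, the record identifier, and the requested format, which is the same information a user supplies through the Entrez web interface.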
BLAST and FASTA
■ BLAST (Basic Local Alignment Search Tool)
A BLAST search enables a researcher to compare a query protein or
nucleotide sequence with a library or database of sequences, and identify
library sequences that resemble the query sequence above a certain
threshold.
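The "compare against a library and keep hits above a threshold" idea can be sketched with a plain Smith-Waterman local alignment score. This is not the BLAST algorithm, which relies on word seeding and statistical significance estimates to run far faster; the scoring values, the function names, and the toy library below are all assumptions made for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between two sequences (Smith-Waterman, linear gaps)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below zero.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def search_library(query: str, library: dict, threshold: int) -> list:
    """Return (name, score) pairs for library sequences scoring at or above the threshold."""
    scored = [(name, local_alignment_score(query, sequence)) for name, sequence in library.items()]
    return sorted((hit for hit in scored if hit[1] >= threshold), key=lambda hit: -hit[1])


toy_library = {"s1": "TTACGTTT", "s2": "GGGGGG"}
print(search_library("ACGT", toy_library, threshold=6))
```

Only the sequence containing a strong local match to the query is reported; raising or lowering the threshold trades sensitivity against the number of spurious hits.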
■ FASTA format
FASTA is a DNA and protein sequence alignment software package first
described by David J. Lipman and William R. Pearson in 1985. The FASTA
format is a text-based format for representing either nucleotide sequences
or amino acid (protein) sequences, in which nucleotides or amino acids are
represented using single-letter codes. The format also allows for sequence
names and comments to precede the sequences.
A sequence in FASTA format consists of:
• One line starting with a ">" sign, followed by a
sequence identification code.
• One or more lines containing the sequence itself.
A file in FASTA format may comprise more than one sequence.
• The FASTA format is sometimes also referred to as the "Pearson"
format (after the author of the FASTA program and format).
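The layout described above is simple enough to parse by hand. The sketch below reads FASTA-formatted text into a dictionary keyed by the identification code (the first word after ">"); the function name `parse_fasta` and the sample records are hypothetical, and established libraries offer more robust parsers.

```python
def parse_fasta(text: str) -> dict:
    """Map each sequence identifier to its sequence, joining wrapped lines."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            identifier = fields[0] if fields else ""  # first word is the ID
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)  # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}


sample = ">seq1 first example\nACGT\nAC\n>seq2\nMKW\n"
print(parse_fasta(sample))
```

Because a file may hold several sequences, each ">" line starts a new record and all following lines are appended to it until the next ">" line.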
  • 1. Types of Biological data, Biological databases: Nucleic acid and Protein sequences and Protein structure databases Presented By : Syeda Tamanna Yasmin Doctoral Research Scholar Department of Microbiology
  • 2. INTRODUCTION Data : A collection of facts from which conclusions may be drawn Biological Data: Relating to, caused by, or affecting life or living organisms TYPES OF BIOLOGICAL DATA
  • 3.
  • 4. BIOLOGICAL DATABASES ■ Database: A collection of ,structured ,searchable, updated periodically data ■ Biological databases : libraries of life sciences information, collected from scientific experiments, published literature, high- throughput experiment technology, and computational analysis. ■ The data stored in biological databases consists of two types: o Raw and o Curated (or annotated) ■ Type and Content of Data o Sequence or Structure o Nucleic acid or protein
  • 5. ■ The databases can be classified into three categories on the basis of the information stored. They are Primary Databases: It contains data that is derived experimentally. ■ They can be further divided into protein or nucleotide databases which can be further divided as sequence or structure databases. ■ The most commonly used primary databases are: o DNA Data Bank of Japan (DDBJ), o European Molecular Biology Laboratory (EMBL) o Nucleotide Sequence Database, o GenBank, and o Protein Data Bank (PDB) o SWISS-PROT o Protein information Resource (PIR)
  • 6. Secondary Databases: It contains the data that is obtained through the analysis or treatment of data present in primary databases. ■ It can contain conserved protein sequence, signature sequence active site residues of protein families. ■ These databases can be further classified as o metabolic pathways database, o protein family database, etc. ■ The most common examples are : o Class Architecture Topology Homology (CATH), o Kyoto Encyclopedia of Genes and Genomics (KEGG), o Protein Families (Pfam) and o Structural Classification of Proteins (SCOP).
  • 7. Composite Databases: Composite databases are collections of several (usually more than two) primary database resources. ■ This helps in the lessening the tedious task of searching through multiple databases referring to the same data. ■ For example o DrugBank offers details on drug and their targets, o BioGraph incorporates assorted knowledge of biomedical science o Bio Model is a storehouse of computational models of the biological developments, etc. o NCBI being a composite database has stored a lot of sequence of nucleotide and protein within its server and thereby suffers from high redundancy in the data deposited (IASRI, (N.D.).
  • 9. Primary Nucleotide databases: GenBank ■ The GenBank sequence database is open access, annotated collection of all publicly available nucleotide sequences and their protein translations. ■ This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). ■ The database started in 1982 by Walter Goad and Los Alamos National Laboratory. EMBL (European Molecular Biology Laboratory) ■ The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is a comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics Institute (EBI). ■ Data are received from genome sequencing centres, individual scientists and patent offices. ■ EMBL was created in 1974 and is an intergovernmental organization funded by public research money from its member states. It was the idea of Leó Szilárd, James Watson and John Kendrew. DDBJ (DNA databank of Japan) ■ It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is the only nucleotide sequence data bank in Asia. ■ DDBJ began data bank activities in 1986 at NIG and funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology.
  • 10.
  • 11.
  • 12.
  • 13. Secondary Nucleotide databases Omniome Database: ■ Omniome Database is a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research). ■ It facilitates the meaningful multi-genome searches and analysis, for instance, alignment of entire genomes, and comparison of the physical proper of proteins and genes from different genomes etc. FlyBase Database: ■ A consortium sequenced the entire genome of the fruit fly D. Melanogaster to a high degree of completeness and quality. ■ FlyBase is one of the organizations contributing to the Generic Model Organism Database (GMOD).
  • 14.
  • 15.
  • 16. Primary databases of protein Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD): • The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich Information Centre for Protein Sequences, Germany) and the JIPID (Japan International Protein Information Database, Japan). • A unique characteristic of the PIR-PSD is its classification of protein sequences based on the superfamily concept and also classified based on homology domain and sequence motifs. Protein Databank (PDB): • It is a crystallographic database for the three-dimensional structure of large biological molecules, such as proteins. • The PDB was established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton and originally contained 7 structures. After Hamilton's untimely death, Tom Koetzle began to lead the PDB in 1973, and then Joel Sussman in 1994. • The database holds data derived from mainly three sources: Structure determined by X-ray crystallography, NMR experiments, and molecular modeling. SWISS-PROT • UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase . • It is a high quality annotated and non-redundant protein sequence database, Since 2002, it is maintained by the UniProt consortium and is accessible via the UniProt website. • The data in each entry can be considered separately as core data and annotation. TrEMBL (for Translated EMBL) is a computer-annotated protein sequence database that is released as a supplement to SWISS-PROT. ■ It contains the translation of all coding sequences present in the EMBL Nucleotide database, which have not been fully annotated.
  • 20. The secondary databases of protein PROSITE: • PROSITE belongs to a set of databases that collect patterns found in protein sequences rather than the complete sequences. • PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2018, the director of PROSITE and Swiss-Prot has been Alan Bridge. • Protein motifs and patterns are encoded as “regular expressions”. PRINTS: • In the PRINTS database, protein sequence patterns are stored as ‘fingerprints’. • The information contained in a PRINTS entry may be divided into three sections. o The first section contains cross-links to other databases that have more information about the characterized family. o The second section provides a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences in that family. o The last section contains the actual fingerprints, which are stored as multiple aligned sets of sequences.
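To make the "regular expressions" point about PROSITE concrete, here is a minimal sketch that translates a pattern written in PROSITE-style syntax into a Python regular expression. The function name `prosite_to_regex`, the zinc-finger-style pattern and the peptide are illustrative assumptions, and only the basic syntax elements (`x`, `[...]`, `{...}`, repeats, `<` and `>` anchors) are handled.

```python
import re

# One pattern element: a core symbol plus an optional repeat such as (3) or (2,4).
_ELEMENT = re.compile(r"(?P<core>[^()]+)(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]

        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")

        core = match["core"]
        if core == "x":                  # 'x' stands for any residue
            core = "."
        elif core.startswith("{"):       # {AG} means any residue except A or G
            core = "[^" + core[1:-1] + "]"
        # [ST] (S or T) and single letters are already valid regex syntax.

        repeat = ""
        if match["lo"] is not None:
            upper = "," + match["hi"] if match["hi"] else ""
            repeat = "{" + match["lo"] + upper + "}"
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)


# Illustrative pattern and made-up peptide (assumptions, not database entries).
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
hit = re.search(regex, "MKCPECGKSFSQKSNLQKHQRTHTG")
```

Once translated, a pattern can be scanned against any sequence with the standard `re` module, which is why motif databases of this kind are cheap to search.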
  • 23. MHCPep: • MHCPep is a database comprising over 13,000 peptide sequences known to bind the Major Histocompatibility Complex of the immune system. • It was established in 1994. Pfam • Pfam contains profiles built using hidden Markov models (HMMs). • Each Pfam entry consists of four elements. o The first is the annotation, which records the source used to make the entry, the method used and some numbers that serve as figures of merit. o The second is the seed alignment, which is used to bootstrap the alignment of the rest of the sequences. o The third is the HMM profile. o The fourth is the complete alignment of all the sequences identified in that family. • Pfam 33.1, released in May 2020, contains 18,259 families.
  • 26. Examples of structural databases The Cambridge Structural Database (CSD) ■ It was originally a project of the University of Cambridge, set up to collect the published three-dimensional structures of small organic molecules. ■ All these crystal structures have been obtained using X-ray or neutron diffraction techniques. ■ For each entry in the CSD, three distinct types of information are stored: bibliographic information, chemical connectivity information and the three-dimensional coordinates. The Structural Classification of Proteins database (SCOP) ■ It is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. ■ SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England.
  • 27. CATH ■ The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. ■ It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones. ■ Protein domains are classified within the CATH structural hierarchy: o at the Class (C) level, o at the Architecture (A) level, o at the Topology/fold (T) level, o at the Homologous superfamily (H) level. The CluSTr (Cluster of SWISS-PROT and TrEMBL proteins): This database offers an automatic classification of the entries in the SWISS-PROT and TrEMBL databases into groups of related proteins. The clustering is based on the analysis of all pairwise comparisons between protein sequences. The ProDom protein domain: This database is a compilation of homologous domains that have been automatically identified by sequence comparison and clustering methods using the program PSI-BLAST. The focus is on finding complete, self-contained structural domains, and the search methods include signals for such features.
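The four CATH levels are usually written as a dotted four-number code, one number per level. The sketch below, with hypothetical names `CathCode`, `parse_cath_code` and `shared_depth`, splits such a code into its C, A, T and H parts and reports how many levels two domains share. The two codes are used purely as illustrations.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One number per CATH level, from the broadest to the most specific."""
    class_: int                  # C: Class
    architecture: int            # A: Architecture
    topology: int                # T: Topology/fold
    homologous_superfamily: int  # H: Homologous superfamily


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code such as '3.40.50.300' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_depth(a: CathCode, b: CathCode) -> int:
    """Count how many levels two domains share, starting from the Class level."""
    depth = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        depth += 1
    return depth


first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
depth = shared_depth(first, second)  # same C, A and T; different H
```

Comparing codes level by level mirrors the hierarchy itself: two domains can only share a homologous superfamily if they already share class, architecture and topology.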
  • 28. Retrieval Databases Data Retrieval: data retrieval is the process of identifying and extracting data from a database, based on a query provided by the user or application. ■ The three systems below differ in the databases they search and the links they have to other information: Sequence Retrieval System (SRS) is a homogeneous interface to over 80 biological databases that was developed at the European Bioinformatics Institute (EBI) at Hinxton. It includes databases of sequences, metabolic pathways, transcription factors, application results (like BLAST, SSEARCH, FASTA), protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations. Entrez is a molecular biology database and retrieval system developed by the National Center for Biotechnology Information (NCBI). It is an entry point for exploring distinct but integrated databases. DBGET is an integrated database retrieval system for handling the web of molecular biology databases; it is used as the backbone system of GenomeNet and KEGG, developed at the University of Tokyo. It provides access to 20 databases, one at a time.
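As an illustration of query-based retrieval, the sketch below builds a request URL for EFetch, part of NCBI's E-utilities interface to Entrez. The helper name `efetch_url` and the example accession are assumptions for illustration. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, accession: str,
               rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an EFetch URL asking one Entrez database for one record."""
    query = urlencode({
        "db": db,            # which Entrez database to search, e.g. "protein"
        "id": accession,     # identifier of the record to retrieve
        "rettype": rettype,  # record type to return, here FASTA
        "retmode": retmode,  # plain text rather than XML
    })
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Example accession chosen for illustration only.
url = efetch_url("protein", "NP_000509.1")
```

The resulting URL can be opened in a browser or passed to `urllib.request.urlopen`. The query (database plus identifier) is supplied by the user, and the retrieval system returns the matching record.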
  • 29. BLAST and FASTA ■ BLAST (Basic Local Alignment Search Tool): a BLAST search enables a researcher to compare a query protein or nucleotide sequence with a library or database of sequences, and to identify library sequences that resemble the query sequence above a certain threshold. ■ FASTA: FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. ■ FASTA format: a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
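The idea of reporting library sequences that resemble the query "above a certain threshold" can be sketched with a small local-alignment scorer. This is not the BLAST algorithm: BLAST uses fast heuristics, whereas the code below is the exact Smith-Waterman dynamic programme that such heuristics approximate. The scoring values, function names and toy sequences are all assumptions.

```python
from typing import Dict, List, Tuple


def local_alignment_score(query: str, subject: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of two sequences."""
    previous = [0] * (len(subject) + 1)   # scores for the previous query residue
    best = 0
    for q in query:
        current = [0]
        for j, s in enumerate(subject, start=1):
            diagonal = previous[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def search(query: str, library: Dict[str, str], threshold: int) -> List[Tuple[str, int]]:
    """Score the query against every library sequence and keep hits >= threshold."""
    scored = ((name, local_alignment_score(query, seq)) for name, seq in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library for illustration only.
library = {"hit": "GGACGTGG", "miss": "TTTTTTTT"}
hits = search("ACGT", library, threshold=6)
```

Scoring every library sequence exactly becomes slow for large databases, which is the motivation for the heuristic shortcuts used by BLAST and FASTA.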
  • 31. A sequence in FASTA format consists of: • One line starting with a ">" sign, followed by a sequence identification code. • One or more lines containing the sequence itself. A file in FASTA format may comprise more than one sequence. • The FASTA format is sometimes also referred to as the "Pearson" format (after the author of the FASTA program and of the format).
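The layout described above is simple enough to read with a few lines of code. The following is a minimal sketch: the function name `parse_fasta` and the two records are made up for illustration, and real files may need extra handling (for example duplicate identifiers).

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its sequence, joined across lines."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>',
            # anything after it is a free-text description.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks[current].append(line)  # a sequence may span several lines
    return {name: "".join(parts) for name, parts in chunks.items()}


# Two made-up records: one protein split over two lines, one nucleotide.
example = """>seq1 first example record
MKTAYIAKQR
QISFVKSHFS
>seq2
ACGTACGT
"""
records = parse_fasta(example.splitlines())
```

Because each record starts with its own ">" line, a single file can hold any number of sequences and still be read in one pass.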