Introduction to databases.pptx

Introduction to databases
and tools used in
bioinformatics
SWORNA KUMARI.C. Ph.D.,
Hexacara Lifesciences Pvt ltd

PRELUDE
• Biological sequences
• Biological databases
• Types of bioinformatic databases
• Bioinformatic tools
• Some examples of bioinformatic tools
• Bioinformatic resources on the web
• Biological database links
• Importance

Biological sequences
• A biological sequence is a single, continuous molecule of nucleic acid or
protein. Sequences are classified first by the underlying molecule type: DNA,
RNA, or protein.
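
The classification by molecule type can be sketched in a few lines of Python. This is only a rough illustration: the function name and the letter sets are assumptions made for this example, and a short sequence such as "ACG" is ambiguous, so the sketch simply prefers DNA, then RNA, then protein.

```python
# Letter sets are assumptions for this sketch, not an official standard.
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")


def guess_molecule_type(sequence: str) -> str:
    """Guess whether a sequence is DNA, RNA, or protein from its letters."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    # Check the most restrictive alphabets first.
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


print(guess_molecule_type("ATGCGT"))
print(guess_molecule_type("AUGCGU"))
print(guess_molecule_type("MKVLAAGIV"))
```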

Biological Databases
• Biological databases are libraries of life sciences information, collected from
scientific experiments, published literature, etc.
• They hold information on gene function, structure, localization, and the clinical
effects of mutations, as well as similarities between biological sequences and
structures.
• These databases hold biological data such as protein sequences, molecular
structures, and DNA sequences in an organized form.
• Several computer tools are available to manipulate the biological data, for
example to update, delete, or insert records.
• Scientists and researchers from all over the world enter their experimental
data and results into biological databases.
• Many are free to use and contain a huge collection of a variety of biological data.

Types of Bioinformatic databases
• There are basically three types of biological databases:
Primary Databases
Secondary Databases
Composite Databases

Primary databases
• A primary database can also be called an archival database, since it archives the
experimental results submitted by scientists. It is populated with
experimentally derived data such as genome sequences and macromolecular
structures.
• The data entered here remain uncurated (no modifications are performed on
the data). They are the original data obtained in the laboratory, made
accessible to ordinary users without any change.
• Each primary record is assigned an accession number. The same data can later
be retrieved using that accession number, which identifies each record
uniquely and never changes.
• Examples:
• Nucleic acid databases: GenBank, EMBL, and DDBJ
• Protein databases: PDB, SwissProt, PIR, TrEMBL, MetaCyc, etc.
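
The role of accession numbers in an archival database can be illustrated with a small Python sketch. The class name, the "HYP" prefix, and the numbering scheme are hypothetical and only show the idea: every submission is stored unchanged and receives a permanent, unique identifier.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryArchive:
    """Toy archival store; the 'HYP' prefix is a made-up accession format."""
    prefix: str = "HYP"
    _records: dict = field(default_factory=dict)
    _next_id: int = 1

    def submit(self, sequence: str, description: str) -> str:
        # Each submission gets a new accession number that is never reused.
        accession = f"{self.prefix}{self._next_id:06d}"
        self._next_id += 1
        # The data are stored exactly as submitted (no curation).
        self._records[accession] = {"sequence": sequence, "description": description}
        return accession

    def fetch(self, accession: str) -> dict:
        return self._records[accession]


archive = PrimaryArchive()
first = archive.submit("ATGC", "submission from lab A")
second = archive.submit("ATGC", "submission from lab B")
print(first, second, archive.fetch(first)["sequence"])
```

Note that the same sequence submitted twice yields two separate records, which is why archival databases tend to contain redundant entries.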

Secondary Databases
• The data stored in these databases are the analyzed results of primary
database entries.
• The data here are highly curated (processed before being presented in the
database). Through this curation a secondary database adds annotation and
derived knowledge that the primary database does not contain.
• Examples of secondary databases:
• InterPro (protein families, motifs, and domains)
• UniProt Knowledgebase (sequence and functional information on proteins)

Composite Databases
• The data entered in these databases are first compared and then filtered
based on desired criteria.
• The initial data are taken from several primary databases and then merged
according to certain conditions.
• Because composite databases contain non-redundant data, they help in
searching sequences rapidly.
• Examples of composite databases:
• OWL, NRDB, and SwissProt + TrEMBL
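
A minimal sketch of the merge-and-filter step, assuming that "non-redundant" simply means "one entry per distinct sequence". The function name and the record identifiers are invented for illustration; real composite databases apply more elaborate criteria.

```python
def build_composite(*sources):
    """Merge (source_name, records) pairs, keeping the first record per sequence."""
    composite = {}
    for source_name, records in sources:
        for identifier, sequence in records:
            key = sequence.upper()
            # Earlier sources take priority; later duplicates are dropped.
            if key not in composite:
                composite[key] = (source_name, identifier)
    return [(src, ident, seq) for seq, (src, ident) in composite.items()]


# Hypothetical identifiers and sequences, for illustration only.
swissprot = ("SwissProt", [("P1", "MKV"), ("P2", "MAA")])
trembl = ("TrEMBL", [("T1", "mkv"), ("T2", "MLL")])

for entry in build_composite(swissprot, trembl):
    print(entry)
```

Listing the curated source first means its record is the one kept whenever the same sequence appears in both.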

GenBank:
• GenBank (Genetic Sequence Databank) is one of the fastest-growing
repositories of known genetic sequences.
• It has a flat-file structure: each entry is an ASCII text file, readable by both
humans and computers.
• In addition to sequence data, GenBank files contain information such as
accession numbers, gene names, phylogenetic classification, and references to
published literature. An early release contained approximately 191,400,000
bases in 183,000 sequences; the database has grown many times over since then.
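
Because the flat file is plain ASCII text, a few fields can be read with very little code. The sketch below parses a simplified, made-up record (the accession "HYP000001" and the sequence are not real data) and handles only the LOCUS, DEFINITION, ACCESSION, and ORIGIN lines; it is not a complete GenBank parser.

```python
SAMPLE_RECORD = """\
LOCUS       HYP000001                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   HYP000001
ORIGIN
        1 atggcgtacg ttagcctagg ctaa
//
"""


def parse_genbank_record(text: str) -> dict:
    """Extract a few header fields and the sequence from one flat-file record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            record["sequence"] += "".join(line.split()[1:]).upper()
        elif line.startswith("ORIGIN"):
            in_origin = True
        else:
            keyword, _, value = line.partition(" ")
            if keyword in ("LOCUS", "DEFINITION", "ACCESSION"):
                record[keyword.lower()] = value.strip()
    return record


record = parse_genbank_record(SAMPLE_RECORD)
print(record["accession"], len(record["sequence"]), record["sequence"])
```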

EMBL
• The EMBL Nucleotide Sequence Database is a comprehensive database of
DNA and RNA sequences collected from the scientific literature and patent
applications, and submitted directly by researchers and sequencing groups.
• Data collection is done in collaboration with GenBank (USA) and the DNA
Data Bank of Japan (DDBJ).
• The database doubles in size roughly every 18 months and contains
nearly 2 million bases from 182,615 sequence entries.

SwissProt, EC-ENZYME and PROSITE
• SwissProt: a protein sequence database that provides a high level of
integration with other databases and has a very low level of redundancy
(few identical sequences are present in the database).
• PROSITE: a protein database consisting of entries that describe protein
families, domains, and functional sites, together with the amino acid patterns
and profiles that identify them. Entries are manually curated by a team at the
Swiss Institute of Bioinformatics; the database was started by Amos Bairoch
at the University of Geneva.
• EC-ENZYME: the ENZYME data bank contains the following data for each
type of characterized enzyme for which an EC number has been provided:
• EC number, recommended name, alternative names, catalytic activity,
cofactors, pointers to the SWISS-PROT entry or entries that correspond to the
enzyme, and pointers to disease(s) associated with a deficiency of the enzyme.

PDB and GDB
• PDB: The Protein Data Bank (PDB) is an archive of macromolecular structures,
many of them determined by X-ray crystallography, originally compiled at the
Brookhaven National Laboratory.
• GDB: The GDB Human Genome Data Base provides storage and
dissemination of data about genes and other DNA markers, map location,
genetic disease and locus information, and bibliographic information.

OMIM AND PIR-PSD
• OMIM: The Mendelian Inheritance in Man data bank (MIM) is prepared by Victor
McKusick with the assistance of Claire A. Francomano and Stylianos E. Antonarakis at
Johns Hopkins University.
• Online Mendelian Inheritance in Man (OMIM) is a continuously updated catalog of
human genes and genetic disorders and traits, with a particular focus on the
gene-phenotype relationship. As of now approximately 9,000 of the over 25,000 entries
are registered.
• PIR-PSD: PIR (Protein Information Resource) produces and distributes the
PIR-International Protein Sequence Database (PSD).
• Protein sequence databases are classified as primary, secondary, or composite
depending on the content stored in them.
• PIR and SwissProt are primary databases that contain protein sequences as 'raw'
data. Secondary databases (like PROSITE) contain information derived from
protein sequences.

Sequence database links
Database | URL | Feature
GenBank | http://www.ncbi.nlm.nih.gov/ | NIH’s archival genetic sequence database
EMBL | http://www.ebi.ac.uk/embl/ | EBI’s archival genetic sequence database
DDBJ | http://www.ddbj.nig.ac.jp/ | NIG’s archival genetic sequence database
SGD | http://www.yeastgenome.org/ | A repository for baker’s yeast genome and biological data
EBI genomes | http://www.ebi.ac.uk/genomes/ | Provides access to and statistics for the completed genomes
Ensembl | http://www.ensembl.org/ | Database that maintains automatic annotation on selected eukaryotic genomes
UniGene | http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene | Each UniGene cluster contains sequences that represent a unique gene, as well as related information
dbEST | http://www.ncbi.nlm.nih.gov/dbEST/ | Division of GenBank that contains expressed sequence tag (EST) data

Uses of bioinformatic databases:
• They help researchers study the available data.
• They help scientists understand biological phenomena.
• A database acts as a store of information.
• They help remove redundancy from the data.

Bioinformatics Programs & Tools
• Bioinformatic tools are software programs designed to extract
meaningful information from the mass of data and to carry out
the analysis step.
• Data-mining software retrieves data from genomic sequence
databases, and visualization tools analyze and retrieve
information from proteomic databases.

Bioinformatic tools
• Bioinformatics tools may be grouped into the following
categories:
• Homology and Similarity Tools
• Protein Function Analysis
• Structural Analysis
• Sequence Analysis

Some examples of Bioinformatics Tools
• BLAST
• BLAST (Basic Local Alignment Search Tool) comes under the category of
homology and similarity tools.
• It is a set of search programs designed to perform fast similarity searches for
a protein or DNA query sequence.
• A nucleotide query can be compared against the nucleotide sequences in a
database.
• A protein database can likewise be searched for matches to a protein query
sequence.
• NCBI has also introduced a queuing system for BLAST (QBLAST) that
allows users to retrieve results at their convenience and to format their results
multiple times with different formatting options.

BLAST programs
• Depending on the type of sequences to compare, there are different programs:
• blastp compares an amino acid query sequence against a protein sequence
database
• blastn compares a nucleotide query sequence against a nucleotide sequence
database
• blastx compares a nucleotide query sequence translated in all reading frames
against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence
database dynamically translated in all reading frames
• tblastx compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database.
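
The choice among the five programs listed above can be written as a small lookup. The function name and the "protein"/"nucleotide" labels are assumptions for this sketch; only the program names come from BLAST itself.

```python
# (query type, database type) -> program, following the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program from the query and database sequence types."""
    if translate_both:
        # tblastx translates both the query and the database in six frames.
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```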

FASTA
• FASTA (short for "FAST-All", a fast homology search for all sequence types)
is an alignment program for protein and nucleotide sequences created by
Pearson and Lipman in 1988.
• The program is one of the many heuristic algorithms proposed to speed up
sequence comparison.
• The basic idea is to add a fast prescreen step that locates highly matching
segments between two sequences, and then to extend these matching segments
to local alignments using more rigorous algorithms such as Smith-Waterman.
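
The two-stage idea can be sketched as follows. This is a simplified illustration and not the actual FASTA implementation: the prescreen counts shared k-tuples per diagonal, and when enough hits are found a plain Smith-Waterman score is computed over the whole sequences (real FASTA restricts the rigorous step to a band around the best diagonals). The function names, the scoring values, and the `min_hits` threshold are all assumptions.

```python
from collections import Counter, defaultdict


def best_diagonal(query, target, k=2):
    """Prescreen: return (diagonal, hit_count) with the most shared k-tuples."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            hits[i - j] += 1  # matches on the same diagonal share i - j
    if not hits:
        return None, 0
    return hits.most_common(1)[0]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diag = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def fasta_like_score(query, target, k=2, min_hits=2):
    """Run the rigorous step only if the prescreen finds enough k-tuple hits."""
    _, hit_count = best_diagonal(query, target, k)
    if hit_count < min_hits:
        return 0
    return smith_waterman_score(query, target)


print(best_diagonal("GATTACA", "TTGATTACATT"))
print(fasta_like_score("GATTACA", "TTGATTACATT"))
```

The prescreen is cheap (a dictionary lookup per k-tuple), so unrelated sequences are rejected before the quadratic alignment step is ever run.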

EMBOSS
• EMBOSS (European Molecular Biology Open Software Suite) is a software
analysis package.
• It can work with data in a range of formats and can also retrieve sequence data
transparently from the Web.
• Extensive libraries are provided with this package, allowing other
scientists to release their software as open source. It provides a set of
sequence-analysis programs and supports all UNIX platforms.

ClustalW and RasMol
• ClustalW: a fully automated multiple sequence alignment tool for DNA and
protein sequences. It returns the best match over the total length of the input
sequences, whether protein or nucleic acid.
• RasMol: a powerful research tool for displaying the structure of DNA, proteins,
and smaller molecules. Protein Explorer, a derivative of RasMol, is an
easier-to-use program.

Bioinformatics Resources on the Web
• General Nucleotide Sequence Databases: some general nucleotide sequence
databases
• Specific Human Genome Databases: collection of human genome databases
• Specific Genome Databases of all Other Species: collection of genome
databases of all other species
• Online Tools and Protocols: links to online tools and protocols
• Bio-Journals: a big collection, combining Pedro's Collection, Springer, Oxford,
and APNet, updated by us.
• NCBI: Established in 1988 as a national resource for molecular biology
information, NCBI creates public databases, conducts research in
computational biology, develops software tools for analyzing genome data, and
disseminates biomedical information, all for the better understanding of
molecular processes affecting human health and disease.
• EBI: The European Bioinformatics Institute (EBI) is a non-profit academic
organisation that forms part of the European Molecular Biology Laboratory
(EMBL).
• DDBJ: DDBJ (DNA Data Bank of Japan) began DNA data bank activities in
earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been
functioning as the international nucleotide sequence database in collaboration
with EBI/EMBL and NCBI/GenBank. DNA sequence records organismic
evolution more directly than other biological materials and thus is invaluable
not only for research in life sciences but also for human welfare in general. The
databases are, so to speak, a common treasure of human beings. With this in
mind, we make the databases online accessible to anyone in the world.
• Feature Table Definition: the format of entries in these databases. DNA Data
Bank of Japan, Mishima, Japan. EMBL Nucleotide Sequence Database,
Cambridge, UK. GenBank, NCBI, Bethesda, MD, USA.

Biological Database Links
• NCBI Home
Established in 1988 as a national resource for molecular biology information,
NCBI creates public databases, conducts research in computational biology,
develops software tools for analyzing genome data, and disseminates
biomedical information - all for the better understanding of molecular processes
affecting human health and disease.
• Entrez Search and Retrieval System
Entrez Programming Utilities are tools that provide access to Entrez data
outside of the regular web query interface and may be helpful for retrieving
search results for future use in another environment.
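
As a rough illustration of access outside the web interface, the sketch below only builds an E-utilities EFetch request URL and does not contact the server. The base URL and the `db`, `id`, `rettype`, and `retmode` parameters belong to the public E-utilities interface; the helper name and the example accession are choices made for this sketch.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(database, record_id, rettype="fasta", retmode="text"):
    """Build an EFetch URL for one record (no network request is made)."""
    params = {
        "db": database,      # Entrez database name, e.g. "nuccore"
        "id": record_id,     # accession number or UID of the record
        "rettype": rettype,  # requested record format
        "retmode": retmode,  # requested output mode
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen for illustration only.
print(efetch_url("nuccore", "NM_000546"))
```

The resulting URL can be opened in a browser or passed to any HTTP client to save the record for later use in another environment.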

• KEGG: Kyoto Encyclopedia of Genes and Genomes
A complete computer representation of the cell and the organism, which will
enable computational prediction of higher-level complexity of cellular processes
and organism behaviors from genomic information.
• TIGR Gene Indices
The TIGR Gene Index databases (TGI) (http://www.tigr.org/tdb/tgi) are
constructed using all publicly available expressed sequence tags (EST) and
known gene sequence data stored in GenBank for each target species

• Gramene: A Comparative Mapping Resource for Grains
Gramene is a curated, open-source, Web-accessible data resource for
comparative genome analysis in the grasses. Our goal is to facilitate the study
of cross-species homology relationships using information derived from public
projects involved in genomic and EST sequencing, protein structure and
function analysis, genetic and physical mapping, interpretation of biochemical
pathways, gene and QTL localization and descriptions of phenotypic characters
and mutations.
• MaizeDB
The goals of this project are to provide a central repository for public maize
information and present it in a way that creates intuitive biological connections
for the researcher with minimal effort as well as provide a series of
computational tools that directly address the questions of the biologist in an
easy-to-use form.

• Barley Genomics
Genome Mapping , Map-Based Cloning, Molecular Breeding, Mutant Isolation &
Characterization, Functional Genomics, BAC Address Calculator,
Developmental Mutants
• EMBL European Bioinformatics Institute
The European Bioinformatics Institute (EBI) is a non-profit academic
organisation that forms part of the European Molecular Biology Laboratory
(EMBL). Databases of biological data including nucleic acid, protein sequences
and macromolecular structures.

• A Catalog of Genes for Plant Glycerol Lipid Biosynthesis
Has genomic, cDNA, EST and GSS sequences for 62 plant polypeptides
involved in lipid metabolism in higher plant species. This version of the dataset
accounts for approximately 70% of the Arabidopsis genome.
• GrainGenes: A Small Grains and Sugarcane Database
GBrowse, developed by the GMOD group, is a genome browser that provides
a wealth of genome annotation for maps in the GrainGenes collection. Users
can easily manipulate the view of the chromosome and the type of data displayed.

• PathDB Pathways
PathDB is a beta level research tool for scientists interested in analyzing their
experimental or computational data in the context of biological pathways and
networks.
• Enzymes and Metabolic Pathways Database
The Enzymes and Metabolic Pathways database (EMP) is a highly
comprehensive electronic source of biochemical data. It covers all aspects of
enzymology and metabolism and represents the whole factual content of
original journal publications.
• ExPASy Molecular Biology Server
The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss
Institute of Bioinformatics (SIB) is dedicated to the analysis of protein
sequences and structures as well as 2-D PAGE.

• Nucleic Acids Research: 2000 Biological Database Issue
Nucleic Acids Research (NAR) publishes the results of leading-edge research
into physical, chemical, biochemical and biological aspects of nucleic acids and
proteins involved in nucleic acid metabolism and/or interactions. It enables the
rapid publication of papers under the following categories: chemistry,
computational biology, genomics, molecular biology, RNA and structural biology.
• Yeast Protein Database HOME PAGE
Six database volumes of biological information about proteins comprise Incyte's
Proteome BioKnowledge Library. Each volume focuses on a different organism
important in pharmaceutical research.

• Saccharomyces Genome Database
SGD is a scientific database of the molecular biology and genetics of the
yeast Saccharomyces cerevisiae, which is commonly known as baker's or
budding yeast.
• The Breast Cancer Gene Database
A database of genes involved in breast cancer.
• The Mammary Transgene Interactive Database
This is an interactive database of literature on research designed to target
transgene proteins to the mammary gland. Current emphasis is on
biotechnology applications. Addition of tumor model and developmental model
literature is planned.

• The Small RNA database
Small RNAs are broadly defined as the RNAs not directly involved in protein
synthesis. These are grouped under three categories: 1) Capped small RNAs; 2)
Noncapped small RNAs; and 3) Viral small RNAs.
• The Tumor Gene Database
A database of genes associated with tumorigenesis and cellular transformation.
This database includes oncogenes, proto-oncogenes, tumor suppressor
genes/anti-oncogenes, regulators and substrates of the above, regions believed
to contain such genes, such as tumor-associated chromosomal break points
and viral integration sites, and other genes and chromosomal regions that
seem relevant.

Importance of Databases
• Databases act as a storehouse of information.
• Databases store and organize data in such a way that information can be
retrieved easily via a variety of search criteria.
• They allow knowledge discovery, which refers to the identification of
connections between pieces of information that were not known when
the information was first entered. This facilitates the discovery of new
biological insights from raw data.
• Secondary databases have become the molecular biologist’s reference
library over the past decade or so, providing a wealth of information on
just about any gene or gene product that has been investigated by the
research community.
• They handle cases where many users want to access the same
data entries.
• They allow the indexing of data.
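
Indexing and retrieval by search criteria can be illustrated with a tiny keyword index. The record identifiers and descriptions are invented for this sketch, and real databases use far more sophisticated indexes; the point is only that an index turns a search into a lookup instead of a scan over every record.

```python
from collections import defaultdict


def build_keyword_index(records):
    """Map each lowercase word in a description to the set of matching accessions."""
    index = defaultdict(set)
    for accession, description in records.items():
        for word in description.lower().split():
            index[word].add(accession)
    return index


def search(index, *keywords):
    """Return accessions whose descriptions contain all of the given keywords."""
    matches = [index.get(word.lower(), set()) for word in keywords]
    return set.intersection(*matches) if matches else set()


# Hypothetical records, for illustration only.
records = {
    "HYP1": "human insulin precursor",
    "HYP2": "mouse insulin receptor",
    "HYP3": "human hemoglobin beta",
}
index = build_keyword_index(records)
print(sorted(search(index, "insulin")))
print(sorted(search(index, "human", "insulin")))
```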
