2. •Bioinformatics relies heavily on vast amounts of data on key
biological molecules
•Enormous amounts of biological data are generated every
day
•These raw data form the base from which biological information
is obtained
•From this primary information, logical interpretations can be
drawn by applying known principles of molecular biology
•This secondary information forms the basis for 'secondary'
databases, which are used to create still more information
•Databases store, manage, connect and distribute these
data effectively.
•Handling this rich data resource is a challenging task that
requires considerable knowledge and skill.
This is the role of the bioinformatician
3. Creating Databases
•Large quantities of data are produced daily
•To make the data easily accessible, a filing system and
network of biological information is needed
•Biological database: a collection of files containing
records of biological data in machine-readable form,
arranged in fields, which can be accessed, added to,
retrieved, manipulated and modified
•Data are arranged as files in fields
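The idea of records arranged in fields can be sketched in a few lines of Python. This is only an illustration: the field names, the identifier REC0001 and the helper functions are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One database record; each attribute is a field."""
    record_id: str   # hypothetical identifier, not a real accession number
    organism: str
    sequence: str


# The "database" here is simply a dict keyed by record_id.
database = {}


def add_record(rec):
    """Add a record to the database."""
    database[rec.record_id] = rec


def retrieve(record_id):
    """Return the record with this identifier, or None if it is absent."""
    return database.get(record_id)


def modify(record_id, **changes):
    """Change one or more fields of an existing record."""
    rec = database[record_id]
    for name, value in changes.items():
        setattr(rec, name, value)
    return rec


add_record(Record("REC0001", "Escherichia coli", "ATGGCC"))
modify("REC0001", sequence="ATGGCCTAA")
print(retrieve("REC0001").sequence)
```

A real DBMS adds storage on disk, indexing and access control on top of these basic operations.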
4. •The data in a database are arranged by sets
of rules, which are programmed into the
software that manages the data: the Database
Management System (DBMS).
•These rules are set by the owners of the database
•Huge databases use sophisticated
methods of arranging data in the form of
arrays and tables (structured databases).
•This structure helps in efficient and rapid data mining
5. Interaction with the database is made user-friendly
by creating Graphical User Interfaces
(GUIs).
GUIs provide pictorial representations or icons
that enable the user to interact through simple and
easily understandable mouse-driven commands
•They also allow the uploading of raw data
•Each database has its own file formats and
DBMS for storing and managing these data.
•Some of the data formats are Text, Sequence,
Structure, Links, etc.
6. Biological databases can thus be genomic
databases, nucleic acid and amino acid sequence
databases, protein databases, metabolic pathway
databases, protein family databases, structure
databases, taxonomic databases, bibliographic
databases, etc.
Accession number
•Unique identifier for a sequence record.
•Does not change, even if the information in the record
is changed at the author's request.
7. Searching Databases
•The usual mode of search is through the use of
appropriate keywords
•The software that accompanies the database turns
these searches into queries, commonly written in
Structured Query Language (SQL).
•The most popular text-based search is available in
PubMed.
•This database is a repository of links to scientific
literature in refereed journals, giving details of each
publication, including hyperlinks to the original
articles.
•GenBank - DNA sequences
•UniProt- amino acid sequences
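A keyword search expressed in SQL can be sketched with Python's built-in sqlite3 module. The table layout, the DEMO accession numbers and the descriptions below are invented for illustration; they are not taken from PubMed, GenBank or UniProt, and those services do not necessarily work this way internally.

```python
import sqlite3

# In-memory database with a made-up table of sequence records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "accession TEXT PRIMARY KEY, organism TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("DEMO0001", "Homo sapiens", "insulin precursor mRNA"),
        ("DEMO0002", "Mus musculus", "insulin receptor gene"),
        ("DEMO0003", "Homo sapiens", "haemoglobin beta chain"),
    ],
)


def keyword_search(keyword):
    """Return accessions whose description contains the keyword."""
    rows = conn.execute(
        "SELECT accession FROM records "
        "WHERE description LIKE ? ORDER BY accession",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


print(keyword_search("insulin"))
```

The keyword is passed as a bound parameter (the `?` placeholder) rather than pasted into the SQL string, so user input cannot alter the query itself.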
8. Categories of Databases
1. Categories Based on Type of Data
•Primary database contains original data in the form of
primary sequence data or structural data as submitted by
the scientific community.
•The data are unique, obtained through laboratory
experiments, and retained in their original form. They are not
curated.
•Also known as archival databanks
Example: Nucleic acid databases: EMBL, GenBank,
DDBJ
Protein databases: Swiss-Prot, PDB, PIR, TrEMBL
Metabolite databases: KEGG, EcoCyc, MetaCyc
9. Secondary databases
• Also known as pattern databases.
•These contain information that has been processed &
derived from the raw data available in primary
databases.
•Here, the data are classified according to their
structure, models, common characteristics of sequence
classes, structure of domains and motifs, etc.
•Also called value-added or derivative databases
Examples: PROSITE, PRINTS, BLOCKS, Pfam, etc.
10. Composite database
•Database that amalgamates a number of primary
sources, using a set of defined criteria.
•The choice of different data sources and the
application of different criteria result in the
emergence of composite databases, each of
which has its own particular format.
Eg. OWL (protein sequences),
NRDB (protein database) and
SWISS-PROT+TrEMBL.
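How a composite database might be assembled can be sketched as follows. The criterion used here (drop a record if an identical sequence has already been seen, with sources taken in priority order) is an assumption chosen for simplicity; real composite databases define their own criteria. The source names, identifiers and sequences are made up.

```python
def build_composite(sources):
    """Merge primary sources into one non-redundant collection.

    `sources` is a list of (source_name, records) pairs in priority order;
    each record is an (identifier, sequence) pair. The first record seen
    for each distinct sequence is kept, later duplicates are dropped.
    """
    composite = {}
    for source_name, records in sources:
        for identifier, sequence in records:
            key = sequence.upper()  # compare sequences case-insensitively
            if key not in composite:
                composite[key] = (source_name, identifier)
    return composite


# Two hypothetical primary sources that share one sequence.
source_a = [("A1", "MKTAYIAK"), ("A2", "MSLLTEVE")]
source_b = [("B1", "mktayiak"), ("B2", "MADEEKLP")]

merged = build_composite([("source_a", source_a), ("source_b", source_b)])
print(len(merged))
```

Changing the order of the sources or the matching rule gives a different composite, which is why each composite database ends up with its own content and format.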
11. 2. Categories Based on Composition of Data Type
Sequence databases: They contain nucleotide sequences, amino acid
sequences, or both.
Genome databases: These are repositories of whole-genome nucleotide
sequences of various organisms.
Micro-array databases: They contain data obtained from
micro-array based experiments.
Metabolite databases: They hold data on biochemical pathways,
metabolites, enzymes, etc. in different organisms.
Structure databases: They carry data on the 3D structure of proteins
and nucleic acids.
Chemical databases: They store data on chemical structures,
their composition, functional groups, etc.
Bibliographic databases: These are repositories of scientific
publications from accredited and peer-reviewed journals.
Eg. PubMed
12. PubMed
•Bibliographic database.
•Free resource that provides access to the MEDLINE
database of citations and abstracts, and to some full-text
articles, in the life sciences and related fields
•Developed and maintained by the National
Center for Biotechnology Information (NCBI),
at the U.S. National Library of Medicine (NLM),
located at the National Institutes of Health
(NIH).
•Provides access to additional relevant Web sites
and links to the other NCBI molecular biology
resources.
14. 3. Categories based on database configuration
•Flat file databases, relational databases, object-oriented
databases and hypertext databases
•A flat file database is the simplest database model, in
which all the information is stored in text files.
•ASN.1 (Abstract Syntax Notation One) is an
International Organization for Standardization (ISO) data
representation format.
•NCBI uses this notation
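The flat file model can be illustrated with a small parser. The layout below is a made-up, simplified one (a two-letter field code, then the value, with a line containing only `//` closing each record); it is not the exact format of any particular database, and the records are invented.

```python
# A tiny flat file in a hypothetical layout: field code, value, "//" ends a record.
FLAT_FILE = """\
ID   DEMO0001
OS   Escherichia coli
SQ   ATGGCCTAA
//
ID   DEMO0002
OS   Homo sapiens
SQ   ATGAAATGA
//
"""


def parse_flat_file(text):
    """Split the text into records, each a dict mapping field code to value."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.strip() == "//":
            # End of record: store it and start a new one.
            if current:
                records.append(current)
            current = {}
        elif line.strip():
            code, value = line[:2], line[2:].strip()
            current[code] = value
    if current:  # keep a final record that lacks a closing "//"
        records.append(current)
    return records


records = parse_flat_file(FLAT_FILE)
print(records[0]["ID"], records[0]["OS"])
```

Everything lives in plain text, so the file is easy to read and exchange, but every search has to scan the records one by one.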
15. Primary Database
•Composed of an array of nucleotide sequence
entries.
•These databases are data repositories that
accept nucleic acid sequence data and make it
freely available to the public.
Eg. EMBL, DDBJ and GenBank of NCBI.
16. NCBI (GenBank)
GenBank is hosted by the National Center for Biotechnology
Information (NCBI)
•It offers all publicly available nucleotide sequences, their
protein translations, and their bibliographic and annotation
information.
•It also facilitates and encourages direct submission of sequence
data by providing a very simple and user-friendly process.
•The data at NCBI can be accessed free of cost over the Internet
through the site http://www.ncbi.nlm.nih.gov/genbank/
•Submitted data are released after a quality assurance
check
18. DDBJ, DNA Data Bank of Japan
•Started in 1986.
•It is now hosted at the National Institute of Genetics.
•DDBJ can be accessed through the Internet via the DDBJ
homepage, http://www.ddbj.nig.ac.jp/.
•It collects nucleotide sequences from researchers and
issues internationally recognized accession numbers
to data submitters.
•Each database entry includes details of sequences,
submitter's details, bibliographic references, biological
significance, and the scientific name and taxonomy of
the organism.
20. EMBL (European Molecular Biology Laboratory)
•Nucleotide sequence database (of DNA and
RNA) hosted in the UK by the EMBL European
Bioinformatics Institute.
•EMBL collects nucleotide sequence data from
individual researchers, genome sequencing
projects and patent applications.
•EMBL, the laboratory, was established in 1974
•Sequences are stored in the database as they
occur in the biological state.
•The stored data generally correspond to wild-type
sequences, without mutations or genetic
manipulations.
21. •OMIM (Online Mendelian Inheritance in Man) is a
human gene database.
•OMIM focuses on the relationship between
phenotype and genotype.
•OMIM was developed for the World Wide Web
by NCBI
•It can be accessed at the URL:
https://www.ncbi.nlm.nih.gov/omim
22. Basic Local Alignment Search Tool (BLAST) is one of the most widely used
sequence analysis tools for comparing primary biological sequence
information
The BLAST program can be accessed over the WWW or downloaded from
http://ncbi.nlm.nih.gov/BLAST/ at NCBI
Programs
BLASTn - nucleotide query sequence against nucleotide sequence
database.
BLASTp - protein query sequence against protein sequence database.
BLASTx - translated nucleotide query sequence against protein sequence
database.
tBLASTn - protein query sequence against translated nucleotide sequence
database.
tBLASTx - translated nucleotide query sequence against translated nucleotide
database.
PSI-BLAST - finds distant relatives of a protein.
MEGABLAST - Faster program used when large numbers of input sequences
are compared.
BLAST searches are generally more sensitive for protein sequences than for DNA sequences.
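The choice among the five basic programs listed above depends only on the type of the query and the type of the database. A small helper makes the rule explicit; the function name and its arguments are made up for this sketch and are not part of BLAST itself.

```python
def choose_blast_program(query, database, translate_both=False):
    """Pick a BLAST program from the query and database types.

    `query` and `database` are "nucleotide" or "protein".
    `translate_both` only matters when both are nucleotide: it asks for
    the query and the database to be compared as translated sequences.
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tBLASTx" if translate_both else "BLASTn"
    if query == "protein" and database == "protein":
        return "BLASTp"
    if query == "nucleotide" and database == "protein":
        return "BLASTx"   # the nucleotide query is translated
    if query == "protein" and database == "nucleotide":
        return "tBLASTn"  # the nucleotide database is translated
    raise ValueError(f"unknown sequence types: {query!r}, {database!r}")


print(choose_blast_program("nucleotide", "protein"))
```

PSI-BLAST and MEGABLAST are left out because they are chosen for the kind of search (distant relatives, many input sequences) rather than by sequence type.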
23. FASTA is a popular DNA and protein sequence
alignment and database scanning program created by W. R.
Pearson and D. J. Lipman in 1988.
Programs
fasta: compares a query sequence and a group of
sequences of the same type (nucleotide or protein).
fastx: compares a translated nucleotide query sequence
and a group of protein sequences.
fasty: compares a DNA sequence to a protein sequence
database.
fasts: compares a set of short peptide fragments against a
protein database