The document discusses biological databases. It begins by defining what a database is, including that it is a collection of related data organized in a way that allows information to be retrieved easily. It then discusses different types of biological databases, including those containing nucleotide sequences, protein sequences, 3D structures, gene expression data, and metabolic pathways. The rest of the document provides details on specific biological databases like GenBank, EMBL, DDBJ, and NCBI databases including Entrez. It emphasizes that biological data is heterogeneous, large in volume, dynamic and integrated across multiple databases.
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
Introduction to Biological databases
1.
2. BIOLOGICAL DATABASES
Dr. K. RameshKumar
Assistant Professor
PG & Research Dept. of Zoology
Vivekananda College
Tiruvedakam- West
MADURAI - 625234
3. WHAT IS DATABASE?
A database is any collection of related data.
A Computerized archive used to store and organize
data in such a way that information can be retrieved
easily.
A database is a collection of interrelated data store
together without harmful and unnecessary
redundancy (duplicate data) to serve multiple
applications
Retrieving is called firing a query
4. Structured collection of information.
Consists of basic units called records or entries.
Each record consists of fields, which hold pre-
defined data related to the record.
For example, a protein database would have
protein entries as records and protein properties as
fields (e.g., name of protein, length, amino-acid
sequence)
5. THE ‘PERFECT’ DATABASE
Comprehensive, but easy to search.
Annotated, but not “too annotated”.
A simple, easy to understand structure.
Cross-referenced.
Minimum redundancy.
Easy retrieval of data.
6.
7. BIOLOGICAL DATABASES
Data Heterogeneity – diversity in the data types
High Volume of data – highly heterogeneous-
voluminous to support comprehensive investigation in
various fields
Uncertainty – Biological phenomena are observed and
assumed to be true
Data Curation – inconsistencies are due to a lack of
knowledge in the desired field
Large scale data integration – data collected from lab
worldwide after years of research
Data sharing – Scientific community examination and
inspection
Dynamic and subject to continual change – Everyday
in various lab
8. BIOLOGICAL DATABASES
Need for storing and communicating large
datasets has grown
Make biological data available to scientists.
To make biological data available in computer-
readable form.
9.
10.
11. DIFFERENT CLASSIFICATIONS OF
DATABASES
Type of data
nucleotide sequences
protein sequences
proteins sequence patterns or motifs
macromolecular 3D structure
gene expression data
metabolic pathways
12. DIFFERENT CLASSIFICATIONS OF DATABASES….
Primary or derived databases
Primary databases: experimental results directly into
database
Secondary databases: results of analysis of primary
databases
Aggregate of many databases
Links to other data items
Combination of data
Consolidation of data
13. DIFFERENT CLASSIFICATIONS OF DATABASES….
Availability
Publicly available, no restrictions
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Proprietary, commercial; possibly free for academics
14. THE CENTRAL DOGMA & BIOLOGICAL DATA
Protein structures
-Experiments
-Models (homologues)
Literature information
Original DNA Sequences
(Genomes)
Protein Sequences
-Inferred
-Direct sequencing
Expressed DNA sequence
( = mRNA Sequences
= cDNA sequences)
Expressed Sequence Tag
(ESTs)
15.
16. NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda,MD
18. 18
NCBI AND ENTREZ
One of the largest and most comprehensive
databases belonging to the NIH – national
institute of health (USA)
Entrez is the search engine of NCBI
Search for :
genes, proteins, genomes, structures, diseases,
publications and more.
http://www.ncbi.nlm.nih.gov/
19. PRIMARY DATABASES
This databases contains the raw nucleic acid
sequence data which are produced and submitted
by researchers worldwide.
• Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)
• Protein
PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D
20. NUCLEOTIDE SEQUENCE DATABASES
EMBL, GenBank, and DDBJ are the three primary
nucleotide sequence databases
EMBL www.ebi.ac.uk/embl/
GenBank www.ncbi.nlm.nih.gov/Genbank/
DDBJ www.ddbj.nig.ac.jp
They together constitute the International
Nucleotide Sequence database callaboration.
21. GENBANK
An annotated collection of all publicly available
nucleotide and proteins
Set up in 1979 at the LANL (Los Alamos).
Maintained since 1992 NCBI (Bethesda).
http://www.ncbi.nlm.nih.gov
24. EMBL NUCLEOTIDE SEQUENCE
DATABASE
An annotated collection of all publicly available
nucleotide and protein sequences
Created in 1980 at the European Molecular
Biology Laboratory in Heidelberg.
Maintained since 1994 by EBI- Cambridge.
http://www.ebi.ac.uk/embl.html
25. DDBJ–DNA DATA BANK OF JAPAN
An annotated collection of all publicly
available nucleotide and protein sequences
Started, 1984 at the National Institute of
Genetics (NIG) in Mishima.
Still maintained in this institute a team led
by Takashi Gojobori.
http://www.ddbj.nig.ac.jp
26. NCBI DATABASES AND SERVICES
GenBank primary sequence database
Free public access to biomedical literature
PubMed free Medline (3 million searches per day)
PubMed Central full text online access
Entrez integrated molecular and literature databases
27. TYPES OF MOLECULAR DATABASES
Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, Trace, SRA, SNP, GEO
Derivative Databases
Derived from primary data
Content controlled by third party (NCBI)
Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene,
Homologene, Structure, Conserved Domain
28. PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
GenBank
Sequencing
Centers
TATAGCCG TATAGCCG
TATAGCCG TATAGCCG
Labs
Algorithms
UniGene
Curators
RefSeq
Genome
Assembly
TATAGCCG
AGCTCCGATA
CCGATGACAA
Updated
continually
by NCBI
Updated ONLY
by submitters
29. SEQUENCE DATABASES AT NCBI
Primary
GenBank: NCBI’s primary sequence database
Trace Archive: reads from capillary sequencers
Sequence Read Archive: next generation data
Derivative
GenPept (GenBank translations)
Outside Protein (UniProt—Swiss-Prot, PDB)
NCBI Reference Sequences (RefSeq)
30. GENBANK - PRIMARY SEQUENCE DB
Nucleotide only sequence database
Archival in nature
Historical
Reflective of submitter point of view (subjective)
Redundant
Data
Direct submissions (traditional records)
Batch submissions
FTP accounts (genome data)
31. GENBANK - PRIMARY SEQUENCE DB (2)
Three collaborating databases
1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database
32. TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version
Tracks changes in sequence
GI number
NCBI internal use
well annotated
the sequence is the data
34. FEATURES Location/Qualifiers
source 1..2484
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="3"
/map="3p22-p23"
gene 1..2484
/gene="MLH1"
CDS 22..2292
/gene="MLH1"
/note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
Number P14242), S. cerevisiae MLH1 (GenBank Accession
Number U07187), E. coli MUTL (Swiss-Prot Accession Number
P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
Number P14161) and Streptococcus pneumoniae (Swiss-Prot
Accession Number P14160)"
/codon_start=1
/product="DNA mismatch repair protein homolog"
/protein_id="AAC50285.1"
/db_xref="GI:463989"
/translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
GENPEPT: GENBANK CDS TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
35. REFSEQ: DERIVATIVE SEQUENCE DATABASE
Curated transcripts and proteins
Model transcripts and proteins
Assembled Genomic Regions
Chromosome records
Human genome
microbial
organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
38. REFSEQS: ANNOTATION REAGENTS
Genomic DNA
(NC, NT, NW)
Model mRNA (XM)
(XR)
Curated mRNA (NM)
(NR)
Model protein (XP)
Curated Protein (NP)
Scanning....
= ?
GenBank
Sequences
RefSeq
39. REFSEQ BENEFITS
Non-redundancy
Updates to reflect current sequence data and biology
Data validation
Format consistency
Distinct accession series
Stewardship by NCBI staff and collaborators
42. ENTREZ: A DISCOVERY SYSTEM
Gene
Taxonomy
PubMed
abstracts
Nucleotide
sequences
Protein
sequences
3-D
Structure
3 -D
Structure
Word weight
VAST
BLAST
BLAST
Phylogeny
Hard Link
Neighbors
Related Sequences
Neighbors
Related Sequences
BLink
Domains
Neighbors
Related Structures
Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.
43. GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databases
44. TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure
45. THE PROBLEM
Rapidly growing databases with complex and changing
relationships
Rapidly changing interfaces to match the above
Result
Many people don’t know:
Where to begin
Where to click on a Web page
Why it might be useful to click there
53. ‘TAKE HOME MESSAGE’ ADVANTAGES OF
DATA INTEGRATION
More relevant inter-related information in one place
Makes it easier to find additional relevant
information related to your initial query
Potentially find information indirectly linked, but
relevant to your subject of interest
uncover non-obvious genetic features that explain
phenotype or disease
Easier to build a ‘story’ based on multiple pieces of
biological evidence