Protein Database
By
KAUSHAL KUMAR SAHU
Assistant Professor (Ad Hoc)
Department of Biotechnology
Govt. Digvijay Autonomous P. G. College
Raj-Nandgaon ( C. G. )
Introduction
• Bioinformatics is the application of information technology to store, organize and analyze the vast amount of biological data available in the form of sequences and structures of proteins and nucleic acids. Nucleic acid data are available as sequences, while protein data are available as both sequences and structures.
• A biological database is a collection of data organized so that its contents can easily be accessed, managed, and updated. Preparing a database can be divided into:
– Collecting the data in a form that can be easily accessed
– Making it available to a multi-user system (always available to the user)
The network for production, construction
and accession of a database
[Figure: flow diagram linking the labels Experiments, Organization of data, Host/server, Databases, EDS (Electronic Data Storage), Network, Users, Online access, Copy, and Personal database.]
Protein databases
• Protein databases are more specialized than primary sequence
databases. They contain information derived from the primary
sequence databases. Some contain protein translations of the
nucleic acid sequences. Some contain sets of patterns and motifs
derived from sequence homologs.
History
• The first database was created shortly after the insulin protein sequence became available in 1956. Insulin was the first protein to be sequenced; its sequence consists of just 51 residues.
• In 1959, V.M. Ingram made the first attempt to compare sickle cell haemoglobin with normal haemoglobin and demonstrated their homology. This led to more protein sequencing and the accumulation of a vast amount of information, and hence to the realization that a database was needed so that proteins could be compared quickly using computational software.
• In 1965, Margaret Dayhoff established the first database of protein sequences, published annually as a series of volumes entitled “Atlas of Protein Sequence and Structure”.
• In 1971, the Protein Data Bank was established as the first protein structure database.
Primary database:
Protein Data Bank (PDB)
• Three-dimensional structures are stored in the Protein Data Bank (PDB). This is the single worldwide archive of structural data derived by X-ray crystallography, nuclear magnetic resonance spectroscopy, and other techniques, as well as structural models.
• The database is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), at Rutgers University.
• Data in the PDB are of very high quality and are extensively curated.
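To illustrate what a structural entry holds, the sketch below parses one atom record in the legacy fixed-column PDB text format. The column positions follow that format; the sample atom and its coordinates are invented for the example and do not come from a real entry, and the names `Atom`, `SAMPLE_ATOM` and `parse_atom` are hypothetical.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


# Invented sample record, built piece by piece to make the fixed columns visible.
SAMPLE_ATOM = (
    "ATOM  "    # cols 1-6   record type
    "    1"     # cols 7-11  atom serial number
    " "         # col 12     unused
    " CA "      # cols 13-16 atom name
    " "         # col 17     alternate location indicator
    "GLY"       # cols 18-20 residue name
    " "         # col 21     unused
    "A"         # col 22     chain identifier
    "   1"      # cols 23-26 residue sequence number
    "    "      # cols 27-30 insertion code and unused columns
    "  11.000"  # cols 31-38 x coordinate (angstroms)
    "  12.500"  # cols 39-46 y coordinate
    "  -3.250"  # cols 47-54 z coordinate
)


def parse_atom(line: str) -> Atom:
    """Extract the identity and coordinates of one ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


atom = parse_atom(SAMPLE_ATOM)
print(atom.name, atom.res_name, atom.chain, atom.x, atom.y, atom.z)
```

Slicing by column rather than splitting on whitespace is deliberate: in this format, adjacent fields may touch with no space between them.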
Sequence database:
SWISS-PROT protein sequence database
• SWISS-PROT was created in 1986 at the Department of Medical Biochemistry, University of Geneva.
• From 1987, the European Molecular Biology Laboratory (EMBL) and the Swiss Institute of Bioinformatics (SIB) collaborated, as equal partners, to develop and maintain this highly annotated repository of protein sequences.
• It provides high-quality annotation with minimal redundancy.
Translated EMBL (TrEMBL)
• It was created in 1996 to fill the gap between the flow of genomic data and annotated protein sequences.
• TrEMBL contains computer-annotated records generated by translating the coding sequences (CDS) available in the EMBL nucleotide sequence database.
• It has two main sections:
– SP-TrEMBL
– REM-TrEMBL
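The translation step behind TrEMBL records can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and a CDS already in the correct reading frame; it is not TrEMBL's actual pipeline, and the function name and example sequence are hypothetical.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes.

    Translation stops at the first stop codon; codons containing
    characters other than A, C, G, T/U are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    protein = []
    for pos in range(0, len(cds), 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))  # invented toy CDS
```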
Protein Information Resource (PIR)
• PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
• The database is split into four sections, PIR1 to PIR4:
– PIR1 contains fully classified and annotated entries.
– PIR2 includes preliminary entries.
– PIR3 contains unverified entries.
– PIR4 entries fall into:
• Conceptual translations of sequences
• Protein sequences
• Conceptual translations of artefactual sequences
• Sequences that are not genetically encoded and not produced on ribosomes
Secondary databases:
Structural Classification of Proteins (SCOP)
• It was created in 1995 by Murzin et al. and is maintained at Cambridge, with the aim of gathering information about structural similarities between proteins to increase our understanding of protein evolution and development.
• SCOP provides comprehensive information on the structural and evolutionary relationships of proteins with known structure, including structures available in the Protein Data Bank.
• The manually constructed SCOP classifies proteins in a hierarchy of class, fold, superfamily, family, protein and species.
Class Architecture Topology Homology (CATH)
• The CATH database, established in 1993, is a protein structure classification based on four levels: Class, Architecture, Topology and Homology.
• CATH contains a hierarchical domain classification of the protein structures present in the Protein Data Bank and is maintained at University College London.
• The classification is done by a combination of automated and manual methods.
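The four-level hierarchy can be made concrete with a small sketch. It assumes that a CATH classification is written as four dot-separated numbers, one per level (for example 3.40.50.300), and that the top-level classes carry the names listed in `CATH_CLASSES`; both are conventions of the CATH resource rather than something stated on this slide, and the helper names are hypothetical.

```python
from typing import NamedTuple

# Assumed names of the top-level CATH classes.
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathCode(NamedTuple):
    cath_class: int    # C: secondary structure content
    architecture: int  # A: overall shape of the domain
    topology: int      # T: fold, including connectivity
    homology: int      # H: homologous superfamily


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted C.A.T.H code into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated levels, got %r" % code)
    return CathCode(*(int(part) for part in parts))


def same_level(a: CathCode, b: CathCode, depth: int) -> bool:
    """True if two domains agree on the first `depth` levels (1 to 4)."""
    return a[:depth] == b[:depth]


domain = parse_cath_code("3.40.50.300")
print(domain, CATH_CLASSES.get(domain.cath_class, "unknown class"))
```

Comparing prefixes of the code, as `same_level` does, is what makes a hierarchical classification useful: two domains that share the first three numbers share a fold even if they belong to different superfamilies.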
Sequence database:
1. PROSITE
• PROSITE is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences.
• It consists of a database of biologically significant sites, patterns and profiles that help to reliably identify which known protein family (if any) a new sequence belongs to.
• It includes protein pattern motifs indicative of a protein's function, which are widely used for function prediction studies, cellular localization annotation, and sequence classification.
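How a pattern is matched against a new sequence can be sketched by converting PROSITE pattern syntax into a regular expression. The sketch assumes the usual pattern conventions (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repetition, "<" and ">" for the sequence ends) and covers only those elements. The N-glycosylation pattern is used as a commonly cited example; the test sequences are invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("cannot parse pattern element %r" % element)
        core, low, high = match.groups()
        if core == "x":              # any residue
            core = "."
        elif core.startswith("{"):   # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:          # x(2) -> .{2}, x(2,4) -> .{2,4}
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")
print(n_glyc)
print(bool(re.search(n_glyc, "AANGSAA")), bool(re.search(n_glyc, "AANPSAA")))
```

Profiles, the other PROSITE signature type mentioned above, are position-specific scoring tables and cannot be expressed as a regular expression this way.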
3. BLOCKS
• Blocks are multiply aligned, ungapped segments corresponding to the most highly conserved regions of proteins.
• The BLOCKS database itself contains more than 4000 entries.
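The idea of an ungapped segment can be illustrated with a short sketch that scans a multiple alignment for runs of columns containing no gap character. It only checks for the absence of gaps; it does not score conservation as the BLOCKS construction does. The alignment below is invented and the function name is hypothetical.

```python
from typing import List, Tuple


def ungapped_segments(alignment: List[str], min_width: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, with no '-' in any row."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    segments = []
    start = None
    # Iterate one column past the end so a run touching the last column is closed.
    for col in range(length + 1):
        gap_free = col < length and all(row[col] != "-" for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                segments.append((start, col))
            start = None
    return segments


toy_alignment = [
    "MKV-LLTAG",
    "MKVALL-AG",
    "MRV-LLSAG",
]
print(ungapped_segments(toy_alignment))
print(ungapped_segments(toy_alignment, min_width=2))
```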
4. Pfam
• Pfam uses Hidden Markov Models (HMMs) to create protein family or domain signatures.
• They are thus particularly useful when analysing multidomain proteins.
• The biggest drawback of Pfam is its lack of biological information (annotation) for the protein families.
Important database search tools:
SEARCH TOOL | FUNCTION PROVIDED
BLAST (Basic Local Alignment Search Tool) | Analyzes sequence information and detects homologous sequences.
ENTREZ | Provides access to literature, sequence and structure databases.
DNAPLOT | Sequence alignment tool.
LOCUSLINK | Provides access to information on homologous genes.
STRUCTURE | Supports the Molecular Modeling Database (MMDB) and software tools for structure analysis.
TAXONOMY BROWSER | Taxonomic classification of various species, as well as genetic information.
FASTA | Provides an algorithm that speeds up sequence comparison.
Example: studying the protein sequence of hepatitis B virus surface antigen in FASTA format, as provided by NCBI
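A record in FASTA format is a header line starting with ">" followed by one or more lines of sequence. A minimal reader for that layout might look like the sketch below. The two records are invented placeholders, not the hepatitis B surface antigen sequence, and `parse_fasta` is a hypothetical helper rather than an NCBI tool.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    chunks: Dict[str, List[str]] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            chunks[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks[header].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Invented placeholder records for illustration only.
SAMPLE_FASTA = """>demo1 hypothetical protein (invented)
MKTAYIAKQR
QISFVKSHFS
>demo2 another invented record
MENFQKVEKI
"""

for name, sequence in parse_fasta(SAMPLE_FASTA).items():
    print(name, len(sequence))
```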
Applications of protein databases
• Protein sequence
• Determination of macromolecular structure
• Molecular evolution
• Drug development
Conclusion
• The aim of most protein structure databases is to organize and annotate protein structures, giving the biological community access to the experimental data in a useful way, whereas sequence databases focus on sequence information and contain no structural information for the majority of entries.
• Bioinformatics tools for efficient research will therefore have a significant impact on the biological sciences and on the betterment of human lives.