2. What is bioinformatics?
• an emerging interdisciplinary research area
• deals with the computational management
and analysis of biological information: genes,
genomes, proteins, cells, ecological systems,
medical information, robots, artificial
intelligence...
3. The Core of Bioinformatics to date
• Relationships between sequence, 3D structure, and protein functions
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV
• Properties and evolution of genes, genomes, proteins, metabolic pathways in cells
• Use of this knowledge for prediction, modelling, and design
4. “The holy grail of bioinformatics”
GCTCCTCACTGTCTGTGTTTATTC
TTTTAGCTTCTTCAGATCTTTTAG
TCTGAGGAAGCCTGGCATGTGCA
AATGAAGTTAACCTAA...
> 500,000 genes sequenced to date
Expected number of unique protein structures: ~700-1,000
5. Basic concepts
• conceptual foundations of bioinformatics:
evolution
protein folding
protein function
• bioinformatics builds mathematical models
of these processes -
to infer relationships between components
of complex biological systems
6. Information processing in cells
coding regions
regulatory sites
nucleic acids
transcripts
proteins
One-to-many mappings!
Context-dependence!
7. Global approaches: Toward a new Systems Biology
Global cell state
Genome
Genome activation patterns: transcriptomics
Protein population: proteomics
Organisation: tissue, cells, molecular complexes
(imaging, EM, X-ray, NMR)
• How does the spatial and temporal organisation of living matter give rise to biological processes?
8. Global approaches: Toward a new Systems Biology
Perturbation Living cell
Dynamic response
“Virtual cell”
Biological knowledge (computerised)
Sequence information
Structural information
• Basic principles
• Practical applications
Bioinformatics
Mathematical modelling
Simulation
9. We do not yet know whether the information in the genome is sufficient
to reconstruct an entire biological system. Information on the building
blocks is not enough; information on their interactions is essential.
External environment
Internal environment
Metabolic net
Genetic networks
DNA → hRNA → mRNAs → proteins
10. Bioinformatics in context
Genomics
Molecular biology
Biophysics
Molecular evolution
Ethical, legal, and social implications
Bioinformatics
Mathematics/computer science
11. Current challenges to users
• Potential hurdles:
methods are in flux and not fully developed
resources are scattered and heterogeneous
• Remedies:
Web resources
navigation guides
integration of tools and databanks
http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
12. Example 1
Sequence homology search of the genome of Plasmodium falciparum
Target identification for antimalarial drugs
13. The search for new antimalarial drugs
• Malaria is one of the leading causes of morbidity
and mortality in the tropics.
• 300 to 500 million estimated clinical cases and 1.5
million to 2.7 million deaths per year.
• Nearly all fatal cases are caused by Plasmodium
falciparum.
• The parasite's resistance to conventional
antimalarial drugs such as chloroquine is growing
at an alarming rate.
14. • P. falciparum has a plastid-like organelle, called the apicoplast, acquired by endosymbiosis of an alga.
Jomaa et al. (1999)
• Self-replicating, maternally inherited (35 kb, circular DNA).
• Comparative genome analysis: search for orthologs.
The apicoplast contains enzymes found in plant and bacterial, but not animal, metabolic pathways.
• Potential target for antimalarial drugs: DOXP reductoisomerase
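An ortholog search of this kind is often approximated by reciprocal best hits between two proteomes. The following is a minimal sketch of that idea, not the method used in the cited study. It assumes a crude k-mer overlap score in place of a real alignment tool such as BLAST, and all protein names and sequences are invented toy data rather than real parasite, bacterial, or host entries.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a crude stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_hit(query, proteome):
    """Name of the most similar entry in proteome, or None if nothing is similar."""
    name, score = max(
        ((n, similarity(query, s)) for n, s in proteome.items()),
        key=lambda item: item[1],
    )
    return name if score > 0 else None


def reciprocal_best_hits(proteome_a, proteome_b):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs = []
    for name_a, seq_a in proteome_a.items():
        name_b = best_hit(seq_a, proteome_b)
        if name_b is not None and best_hit(proteome_b[name_b], proteome_a) == name_a:
            pairs.append((name_a, name_b))
    return pairs


# Invented toy proteomes (hypothetical names and sequences).
parasite = {"pf_enzyme_1": "MKKLAILGSTGSIGKSTLDV", "pf_enzyme_2": "MSTNPKPQRKTKRNTNRRPQ"}
bacterium = {"bac_enzyme_1": "MKQLTILGSTGSIGCSTLDV", "bac_other": "MAHHHWWYYCCFFGGDDEEN"}
host = {"host_enzyme_2": "MSTNPKPQRKTKRNTARRPQ", "host_other": "MWWCCYYHHFFGGEEDDNNA"}

# Candidate targets: parasite proteins with a bacterial ortholog but no host ortholog.
with_bacterial = {a for a, _ in reciprocal_best_hits(parasite, bacterium)}
with_host = {a for a, _ in reciprocal_best_hits(parasite, host)}
candidates = sorted(with_bacterial - with_host)
print(candidates)
```

In this toy setting only `pf_enzyme_1` is reported, because its closest relative is bacterial and nothing in the host set resembles it. A real analysis would use alignment scores with significance thresholds rather than k-mer overlap.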
17. The challenge (Boguski, 1999)
In 1995, the number of genes in the database started to exceed
the number of papers on molecular biology and genetics in the
literature!
18. Data types
• primary data: sequence
DNA, e.g., AATGCGTATAGGC
amino acid, e.g., DMPVERILEALAVE
(primary database)
• secondary data
“motifs”: regular expressions, blocks, profiles, fingerprints
secondary protein structure, e.g., alpha-helices, beta-strands
(secondary db)
• tertiary data: tertiary protein structure
atomic co-ordinates
domains, folding units
(tertiary db)
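Secondary data such as regular-expression motifs can be illustrated with a short pattern search. The sketch below converts a PROSITE-style pattern to a Python regular expression and scans the protein sequence shown on slide 3. The pattern used is the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}. The converter is a simplification that handles only the basic syntax elements (no anchors), and the function names are chosen for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Convert basic PROSITE syntax to a Python regex (simplified; no < > anchors)."""
    regex = pattern.replace("-", "")                     # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}  -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) -> x{2}
    return regex.replace("x", ".")                       # x    -> any residue


# Protein sequence as shown on slide 3.
SEQUENCE = "".join([
    "TDQAAFDTNIVTLTRFVM",
    "EQGRKARGTGEMTQLLNS",
    "LCTAVKAISTAVRKAGIA",
    "HLYGIAGSTNVTGDQVKK",
    "LDVLSNDLVINVLKSSFA",
    "TCVLVTEEDKNAIIVEPE",
    "KRGKYVVCFDPLDGSSNI",
    "DCLVSIGTIFGIYRKNST",
    "DEPSEKDALQPGRNLVAA",
    "GYALYGSATMLV",
])


def find_motif(pattern, sequence):
    """Return (1-based position, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


hits = find_motif("N-{P}-[ST]-{P}", SEQUENCE)
print(hits)  # [(64, 'NVTG'), (142, 'NSTD')]
```

A pattern match only marks a candidate site; short motifs like this one occur by chance, which is one reason secondary databases also offer blocks, profiles, and fingerprints.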
19. Primary biological databases
• Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)
• Protein
PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D
20. International nucleotide data banks
EMBL (Europe): EMBL, EBI
GenBank (USA): NLM, NCBI
DDBJ (Japan): NIG, CIB
International Advisory Meeting
Collaborative Meeting
TrEMBL, NRDB
28. Other primary protein databases
• TrEMBL (translated EMBL) in SWISS-PROT format
rapid access to sequence data from genome projects
computer-annotated supplement to SWISS-PROT
translations of all coding sequences (CDS) in EMBL
• SP-TrEMBL
• REM-TrEMBL: immunoglobulins, T-cell receptors, short
fragments, synthetic and patented sequences
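The translation step behind a translated database can be sketched in a few lines. This is a minimal illustration using the standard genetic code, not the TrEMBL pipeline itself. The function name and the example CDS are made up for the demonstration, and real entries need extra handling (alternative genetic codes, partial CDS, ambiguity codes).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    Accepts DNA or RNA letters; unknown codons become 'X';
    trailing bases that do not fill a codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up example CDS: ATG GCT TGT TAA
print(translate_cds("ATGGCTTGTTAA"))  # MAC
```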
29. Other primary protein databases
The Protein Information Resource (PIR)
• integrated system of protein sequence databases
and derived related databases, e.g., alignment
databases
• rapid searching, comparison, and pattern matching of
protein sequences
• retrieval of descriptive, bibliographic, feature, and
concurrent cross-reference information
• aims to be comprehensive and consistently
annotated
30. PIR: related databases
NRL-3D Sequence-Structure Database
• produced by PIR from sequence and annotation
information extracted from three-dimensional
structures in the Protein Databank (PDB)
• allows keyword and similarity searches
31. PIR: related databases
PATCHX integrated with PIR
• a non-redundant database of protein sequences
produced by MIPS, the European branch of PIR-International
The PIR Protein Sequence Database and PATCHX
together provide the most complete collection of
protein sequence data currently available in the
public domain.
32. Composite protein sequence dbs
Composite database: source databases
NRDB: PIR, SP, PDB, GenPept
OWL: PIR, SP, GenBank, NRL-3D
MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP
SP+TrEMBL: TrEMBL, SP
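A composite database merges several source databases and removes redundant entries. The sketch below shows one simple way to do this, assuming that sources are processed in a fixed priority order and that only exactly identical sequences count as redundant. Both are assumptions for illustration. The priority order, entry identifiers, and sequences are invented, and real composites apply their own redundancy criteria.

```python
def build_composite(sources):
    """Merge source databases into a non-redundant set.

    sources: list of (db_name, {entry_id: sequence}) in priority order.
    For each distinct sequence, the entry from the highest-priority source is kept.
    """
    composite = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()  # compare sequences case-insensitively
            composite.setdefault(key, (db_name, entry_id))
    return composite


# Invented toy sources, highest priority first (hypothetical IDs and sequences).
sources = [
    ("SP", {"sp_entry_1": "MKTAYIAKQR", "sp_entry_2": "MSTNPKPQRK"}),
    ("PIR", {"pir_entry_1": "MKTAYIAKQR", "pir_entry_2": "MLVAAGYALY"}),
    ("NRL-3D", {"nrl_entry_1": "mlvaagyaly"}),
]

composite = build_composite(sources)
for sequence, (db_name, entry_id) in composite.items():
    print(sequence, db_name, entry_id)
```

Five input entries collapse to three sequences here: the PIR duplicate of `sp_entry_1` and the NRL-3D duplicate of `pir_entry_2` are dropped in favour of the higher-priority source.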
33. OWL composite database
OWL is only released every 6-8 weeks
• By accession number
• By database code
• By text
• By sequence
• By title
• By author
• By query language
• By regular expression
Direct OWL access:
OWL Blast server
34. Two other useful sites
INFOBIOGEN: The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/
KEGG: Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to
computerize current knowledge of molecular and cellular biology in
terms of the information pathways that consist of interacting molecules
or genes and to provide links from the gene catalogs produced by
genome sequencing projects.
35. Sequence Retrieval System (SRS)
Database browser that allows
users to
•retrieve
•link
•access
entries from all interconnected
resources.
Users can formulate queries
across a range of different
database types.
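The retrieve/link idea can be illustrated with a toy in-memory model. This is not the SRS query language or its API, only a sketch of how cross-references let one query span several database types. The database names, entry identifiers, and function names are all hypothetical.

```python
# Toy databases: each entry has a description and cross-references (db, id).
DATABASES = {
    "protein": {
        "prot_1": {
            "description": "toy reductoisomerase enzyme",
            "links": [("nucleotide", "nuc_1"), ("structure", "str_1")],
        },
        "prot_2": {
            "description": "toy transport protein",
            "links": [("nucleotide", "nuc_2")],
        },
    },
    "nucleotide": {
        "nuc_1": {"description": "coding sequence for prot_1", "links": []},
        "nuc_2": {"description": "coding sequence for prot_2", "links": []},
    },
    "structure": {
        "str_1": {"description": "3D structure of prot_1", "links": []},
    },
}


def retrieve(db_name, entry_id):
    """Fetch one entry from one database."""
    return DATABASES[db_name][entry_id]


def link(db_name, entry_id, target_db):
    """Follow an entry's cross-references into another database."""
    entry = retrieve(db_name, entry_id)
    return [tid for tdb, tid in entry["links"] if tdb == target_db]


def query(db_name, keyword, target_db):
    """Keyword search in one database, then link each hit to a target database."""
    results = {}
    for entry_id, entry in DATABASES[db_name].items():
        if keyword.lower() in entry["description"].lower():
            results[entry_id] = link(db_name, entry_id, target_db)
    return results


print(query("protein", "enzyme", "structure"))
```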
36. Guide to Protein Databases:
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html
With thanks to Dr Roman Laskowski.