13. Usage of the databases
Annotation searches - Search for keywords, authors, features
14. Usage of the databases
Annotation searches - Search for keywords, authors, features
What is the protein sequence for human insulin?
What does the 3D structure of calmodulin look like?
What is the genetic location of the cystic fibrosis gene?
List all intron sequences in rat.
16. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
17. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Is there any known protein sequence that is similar to x?
Is this gene known in any other species?
Has someone already cloned this sequence?
19. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Pattern searches - Search for occurrences of patterns
20. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Pattern searches - Search for occurrences of patterns
Does my protein sequence contain any known motif
(that can give me a clue about the function)?
Which known sequences contain this motif?
Is any part of my nucleotide sequence recognized
by a transcription factor?
List all known start, splice and stop signals in my
genomic sequence.
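A pattern search of this kind can be sketched with a regular expression. The example below is a minimal sketch, assuming a PROSITE-style pattern syntax (`x` for any residue, `[ST]` for alternatives, `{P}` for exclusions, `(n)` or `(n,m)` for repeats). The function names and the test sequence are hypothetical, and the N-glycosylation pattern `N-{P}-[ST]-{P}` is used only as a familiar illustration, not as the method of any particular database.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Handles x, [..], {..} and repeat counts (n) or (n,m); anchors
    (< and >) and other extensions are deliberately not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            regex = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # excluded residues
        else:
            regex = core                     # one residue or a [..] class
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)

def find_motif(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # The lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical sequence, invented for this example.
hits = find_motif("MKNGSAANPTLNVSQ", "N-{P}-[ST]-{P}")
print(hits)
```

Answering "which known sequences contain this motif?" would amount to running `find_motif` over every sequence in a database, which is why real pattern databases rely on precomputed indexes rather than a scan like this one.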
22. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Pattern searches - Search for occurrences of patterns
Predictions - Using the databases as knowledge databases
23. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Pattern searches - Search for occurrences of patterns
Predictions - Using the databases as knowledge databases
What may the structure of my protein be?
Secondary structure prediction.
Modelling by homology.
What is the gene structure of my genomic sequence?
Which parts of my protein have a high antigenicity?
25. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Pattern searches - Search for occurrences of patterns
Predictions - Using the databases as knowledge databases
Comparisons
26. Usage of the databases
Annotation searches - Search for keywords, authors, features
Homology (similarity) searches - Search for similar sequences
Pattern searches - Search for occurrences of patterns
Predictions - Using the databases as knowledge databases
Comparisons
Gene families
Phylogenetic trees
27. Lesson 1
• Bioinformatics I Revisited in 5 slides
• Why bother making databases ?
• DataBases
– Flat files (FF)
• *.txt
• Indexed version
– Relational (RDBMS)
• Access, MySQL, PostgreSQL, Oracle
– OO (OODBMS)
• AceDB, ObjectStore
– Hierarchical
• XML
– Frame based system
• E.g. DAML+OIL
– Hybrid systems
28. GenBank Format
LOCUS LISOD 756 bp DNA BCT 30-JUN-1993
DEFINITION L.ivanovii sod gene for superoxide dismutase.
ACCESSION X64011.1 GI:37619753
NID g44010
KEYWORDS sod gene; superoxide dismutase.
SOURCE Listeria ivanovii.
ORGANISM Listeria ivanovii
Eubacteria; Firmicutes; Low G+C gram-positive bacteria;
Bacillaceae; Listeria.
REFERENCE 1 (bases 1 to 756)
AUTHORS Haas,A. and Goebel,W.
TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii
by functional complementation in Escherichia coli and
characterization of the gene product
JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992)
MEDLINE 92140371
REFERENCE 2 (bases 1 to 756)
AUTHORS Kreft,J.
TITLE Direct Submission
JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
Universitaet Wuerzburg, Biozentrum Am Hubland, 8700
Wuerzburg, FRG
30. Example of location descriptors
Location Description
476 Points to a single base in the presented sequence
340..565 Points to a continuous range of bases bounded by and
including the starting and ending bases
<345..500 The exact lower boundary point of a feature is unknown.
(102.110) Indicates that the exact location is unknown but that it
is one of the bases between bases 102 and 110.
(23.45)..600 Specifies that the starting point is one of the bases
between bases 23 and 45, inclusive, and the end base 600
123^124 Points to a site between bases 123 and 124
145^177 Points to a site anywhere between bases 145 and 177
J00193:hladr Points to a feature whose location is described in
another entry: the feature labeled 'hladr' in the
entry (in this database) with primary accession 'J00193'
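The simpler descriptors in the table above can be read mechanically. The following is a minimal sketch, assuming only the single-base, range and `^` forms are needed. The `Location` type and `parse_location` function are hypothetical names, and support for a `>` marker on the end position (the counterpart of `<`, which the table does not show) is an added assumption. Parenthesised forms and references to other entries are rejected rather than guessed at.

```python
import re
from typing import NamedTuple

class Location(NamedTuple):
    kind: str          # "base", "range" or "between"
    start: int
    end: int
    start_fuzzy: bool  # True when the start is marked with "<"
    end_fuzzy: bool    # True when the end is marked with ">"

def parse_location(text: str) -> Location:
    """Parse a simple feature location such as 476, 340..565 or 123^124."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        pos = int(text)
        return Location("base", pos, pos, False, False)
    m = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", text)
    if m:
        return Location("range", int(m.group(2)), int(m.group(4)),
                        m.group(1) == "<", m.group(3) == ">")
    m = re.fullmatch(r"(\d+)\^(\d+)", text)
    if m:
        return Location("between", int(m.group(1)), int(m.group(2)),
                        False, False)
    # (102.110), (23.45)..600 and J00193:hladr are not handled here.
    raise ValueError(f"unsupported location descriptor: {text!r}")

for descriptor in ["476", "340..565", "<345..500", "123^124"]:
    print(descriptor, "->", parse_location(descriptor))
```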
32. EMBL format
ID LISOD standard; DNA; PRO; 756 BP. IDentification
XX
AC X64011; S78972; Accession (Axxxxx, Afxxxxxx), GUID
XX
NI g44010 Nucleotide Identifier --> x.x
XX
DT 28-APR-1992 (Rel. 31, Created) DaTe
DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE L.ivanovii sod gene for superoxide dismutase DEscription
XX
KW sod gene; superoxide dismutase. KeyWord
XX
OS Listeria ivanovii Organism Species
OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC Listeria. Organism Classification
XX
RN [1]
RA Haas A., Goebel W.; Reference
RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT functional complementation in Escherichia coli and
RT characterization of the gene product.";
RL Mol. Gen. Genet. 231:313-322(1992).
XX
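Because every EMBL line starts with a two-letter code, an entry can be split into fields with very little code. The sketch below is a minimal illustration, assuming the code is separated from the content by whitespace and that `XX` spacer lines and the `//` terminator carry no data. The function name `parse_line_codes` is hypothetical, and the sample repeats part of the entry above with the slide's explanatory annotations left out.

```python
from collections import defaultdict

def parse_line_codes(entry: str) -> dict[str, list[str]]:
    """Group the lines of one EMBL-style entry by their two-letter code."""
    fields = defaultdict(list)
    for line in entry.splitlines():
        code, _, content = line.partition(" ")
        code = code.strip()
        # Skip blank lines, XX spacers and the // end-of-entry marker.
        if len(code) != 2 or code in ("XX", "//"):
            continue
        fields[code].append(content.strip())
    return dict(fields)

SAMPLE = """\
ID   LISOD standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
OC   Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC   Listeria.
//
"""

fields = parse_line_codes(SAMPLE)
# Accession numbers are separated by semicolons on the AC line(s).
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
print(accessions)
print(" ".join(fields["OC"]))
```

Continuation lines (the two `OC` lines here) are kept as separate list items, so the caller decides how to join them.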
33. Example of a SwissProt entry
ID TNFA_HUMAN STANDARD; PRT; 233 AA. IDentification
AC P01375; ACcession
DT 21-JUL-1986 (REL. 01, CREATED) DaTe
DT 21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT 15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN TNFA. Gene name
OS HOMO SAPIENS (HUMAN). Organism Species
OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC EUTHERIA; PRIMATES. Organism Classification
RN [1] Reference
RP SEQUENCE FROM N.A.
RX MEDLINE; 87217060.
RA NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN [2]
RP SEQUENCE FROM N.A.
RX MEDLINE; 85086244.
RA PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL NATURE 312:724-729(1984).
...
34. CC -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC UNDER CERTAIN CONDITIONS. Comments
CC -!- SUBUNIT: HOMOTRIMER.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC AN EXTRACELLULAR SOLUBLE FORM.
CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC PROTEOLYTIC PROCESSING.
CC -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC HEALTH AND MALNUTRITION.
CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR EMBL; X02910; G37210; -. Database Cross-references
DR EMBL; M16441; G339741; -.
DR EMBL; X01394; G37220; -.
DR EMBL; M10988; G339738; -.
DR EMBL; M26331; G339764; -.
DR EMBL; Z15026; G37212; -.
DR PIR; B23784; QWHUN.
DR PIR; A44189; A44189.
DR PDB; 1TNF; 15-JAN-91.
DR PDB; 2TUN; 31-JAN-94.
35. KW CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW MYRISTYLATION; 3D-STRUCTURE. KeyWord
FT PROPEP 1 76 Feature Table
FT CHAIN 77 233 TUMOR NECROSIS FACTOR.
FT TRANSMEM 36 56 SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT LIPID 19 19 MYRISTATE.
FT LIPID 20 20 MYRISTATE.
FT DISULFID 145 177
FT MUTAGEN 105 105 L->S: LOW ACTIVITY.
FT MUTAGEN 108 108 R->W: BIOLOGICALLY INACTIVE.
FT MUTAGEN 112 112 L->F: BIOLOGICALLY INACTIVE.
FT MUTAGEN 162 162 S->F: BIOLOGICALLY INACTIVE.
FT MUTAGEN 167 167 V->A,D: BIOLOGICALLY INACTIVE.
FT MUTAGEN 222 222 E->K: BIOLOGICALLY INACTIVE.
FT CONFLICT 63 63 F -> S (IN REF. 5).
FT STRAND 89 93
FT TURN 99 100
FT TURN 109 110
FT STRAND 112 113
FT TURN 115 116
FT STRAND 118 119
FT STRAND 124 125
36. FT STRAND 130 143
FT STRAND 152 159
FT STRAND 166 170
FT STRAND 173 174
FT TURN 183 184
FT STRAND 189 202
FT TURN 204 205
FT STRAND 207 212
FT HELIX 215 217
FT STRAND 218 218
FT STRAND 227 232
SQ SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32;
MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//
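The `SQ` line declares the sequence length, so a reader can check that nothing was lost when extracting the residues. The following is a minimal sketch, assuming the sequence lines run from the `SQ` line to the `//` terminator and that the length appears as a number followed by `AA`. The function name `read_sequence` is hypothetical; the data is the TNFA_HUMAN sequence block shown above.

```python
import re

def read_sequence(entry: str) -> tuple[int, str]:
    """Return (declared length, residues) from a SwissProt-style entry."""
    declared = None
    residues = []
    in_sequence = False
    for line in entry.splitlines():
        if line.startswith("SQ"):
            m = re.search(r"(\d+)\s+AA", line)
            if m is None:
                raise ValueError("SQ line does not state a length")
            declared = int(m.group(1))
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Drop the spaces that separate the blocks of ten residues.
            residues.append("".join(line.split()))
    if declared is None:
        raise ValueError("no SQ line found")
    return declared, "".join(residues)

ENTRY = """\
SQ   SEQUENCE 233 AA; 25644 MW; 666D7069 CRC32;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
     EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
     DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
     TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//
"""

declared, sequence = read_sequence(ENTRY)
print(declared, len(sequence), declared == len(sequence))
```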
37. Structure databases
Protein Data Bank (PDB)
Protein Data Bank - http://www.rcsb.org/pdb
Diffraction 7373 structures determined by X-ray diffraction
NMR 388 structures determined by NMR spectroscopy
Theoretical Model 201 structures proposed by modeling
44. Problems with Flat files …
• Wasted storage space
• Wasted processing time
• Data control problems
• Problems caused by changes to data
structures
• Access to data difficult
• Data out of date
• Constraints are system based
• Limited querying, e.g. all single-exon
GPCRs (<1000 bp)
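To show why the last point matters, here is how the example query might look once the same information sits in a table. This is a minimal sketch using Python's built-in `sqlite3` module. The `gene` table, its columns and its four rows are hypothetical and invented for illustration; no real database is assumed to use this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per gene, with the attributes the query needs.
conn.execute("""
    CREATE TABLE gene (
        name       TEXT PRIMARY KEY,
        family     TEXT NOT NULL,
        exon_count INTEGER NOT NULL,
        length_bp  INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [
        ("geneA", "GPCR", 1, 950),     # made-up example rows
        ("geneB", "GPCR", 1, 1400),
        ("geneC", "GPCR", 3, 800),
        ("geneD", "kinase", 1, 700),
    ],
)

# "All single-exon GPCRs (<1000 bp)" becomes a single declarative statement.
rows = conn.execute("""
    SELECT name, length_bp
    FROM gene
    WHERE family = 'GPCR' AND exon_count = 1 AND length_bp < 1000
    ORDER BY name
""").fetchall()
print(rows)
```

With flat files the same question would require a custom script that parses every record and counts exons from the feature table.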
45. • What is a relational database ?
– Sets of tables and links (the data)
– A language to query the database (Structured
Query Language, SQL)
– A program to manage the data (RDBMS)
• Flat files are not relational
– Data type (attribute) is part of the data
– Record order matters
– Multiline records
– Massive duplication
• E.g. Organism: Homo sapiens, Eukaryota, …
– Some records are hierarchical
• Xrefs
– Records contain multiple “sub-records”
– Implicit “Key”
48. Introduction to Database Systems
• Historic Background
– Hierarchical databases (IMS) - IBM 1968
• Hierarchical structures between file records
– Network databases - CODASYL Group 1969
• Network structures of record types
• Linked chains between 'Owner' and 'Member' records
• Included in Cobol, procedural language - Manual
navigation
– Relational Data Model - E. F. Codd 1970
• Mathematical foundation of databases
• New non-procedural language SQL - Automatic
navigation
– Object-relational databases
– Object-oriented databases
49. Relational
• The relational model is not only very mature; a great deal
of knowledge has also been built up on how to make a
relational back-end fast and reliable, and how to
exploit technologies such as massive SMP,
optical jukeboxes and clustering. Object
databases are nowhere near this, and I do not
expect them to get there in the short or medium term.
• Relational databases have a well-known and
proven underlying mathematical theory, a simple one
(set theory), that makes possible
– automatic cost-based query optimization,
– schema generation from high-level models and
– many other features that are now vital for mission-critical
Information Systems development and operations.
50. The Benefits of Databases
• Redundancy can be reduced
• Inconsistency can be avoided
• Conflicting requirements can be
balanced
• Standards can be enforced
• Data can be shared
• Data independence
• Integrity can be maintained
• Security restrictions can be applied
52. Relational Database Terminology
• Each row of data in a table is uniquely identified by a primary key (PK)
• Information in multiple tables can be logically related by foreign keys (FK)
Table Name: EMP
Primary Key: ID
ID   LAST_NAME   FIRST_NAME
10   Havel       Marta
11   Magee       Colin
12   Giljum      Henry
14   Nguyen      Mai
Table Name: CUSTOMER
Primary Key: ID; Foreign Key: EMP_ID (refers to ID in EMP)
ID    NAME               PHONE            EMP_ID
201   Unisports          55-2066101       12
202   Simms Atheletics   81-20101         14
203   Delhi Sports       91-10351         14
204   Womansport         1-206-104-0103   11
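The two tables on this slide can be recreated to see primary and foreign keys at work. The sketch below uses Python's built-in `sqlite3` module as a stand-in RDBMS, which is an assumption made for convenience. The column types and the rejected row (customer 205, employee 99) are hypothetical; the other rows come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript("""
    CREATE TABLE emp (
        id         INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL,
        first_name TEXT NOT NULL
    );
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        phone  TEXT,
        emp_id INTEGER REFERENCES emp(id)  -- foreign key to EMP
    );
""")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
    (10, "Havel", "Marta"), (11, "Magee", "Colin"),
    (12, "Giljum", "Henry"), (14, "Nguyen", "Mai"),
])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (201, "Unisports", "55-2066101", 12),
    (202, "Simms Atheletics", "81-20101", 14),
    (203, "Delhi Sports", "91-10351", 14),
    (204, "Womansport", "1-206-104-0103", 11),
])

# The foreign key relates each customer to exactly one employee.
pairs = conn.execute("""
    SELECT c.name, e.last_name
    FROM customer AS c JOIN emp AS e ON e.id = c.emp_id
    ORDER BY c.id
""").fetchall()
print(pairs)

# A customer pointing at a non-existent employee is rejected.
try:
    conn.execute("INSERT INTO customer VALUES (205, 'Hypothetical Ltd', NULL, 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print("foreign key enforced:", fk_enforced)
```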
53. Relational Database Terminology
Relational operators
• Relational
– select
rel WHERE boolean-xpr
– project
rel [ attr-specs ]
– join
rel JOIN rel
– divide by
rel DIVIDEBY rel
• Set-based
rel UNION rel
rel INTERSECT rel
rel MINUS rel
rel TIMES rel
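Most of these operators have a direct counterpart in SQL. The following is a minimal sketch, assuming SQLite (via Python's `sqlite3` module) as the engine; the relations `r`, `s` and `t` and their contents are hypothetical. SQLite spells MINUS as `EXCEPT` and TIMES as `CROSS JOIN`. DIVIDEBY has no single SQL keyword and is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (id INTEGER, organism TEXT);
    CREATE TABLE s (id INTEGER, organism TEXT);
    CREATE TABLE t (id INTEGER, length_bp INTEGER);
    INSERT INTO r VALUES (1, 'human'), (2, 'rat'), (3, 'mouse');
    INSERT INTO s VALUES (2, 'rat'), (4, 'yeast');
    INSERT INTO t VALUES (1, 756), (2, 1200);
""")

def q(sql: str) -> list[tuple]:
    """Run one query and return all rows."""
    return conn.execute(sql).fetchall()

# select: keep the rows that satisfy a boolean expression
selected = q("SELECT * FROM r WHERE organism = 'rat'")
# project: keep only the named attributes (DISTINCT gives set semantics)
projected = q("SELECT DISTINCT organism FROM r ORDER BY organism")
# join: combine rows of two relations on a shared attribute
joined = q("SELECT r.id, r.organism, t.length_bp "
           "FROM r JOIN t ON t.id = r.id ORDER BY r.id")
# set-based operators
union = q("SELECT * FROM r UNION SELECT * FROM s ORDER BY id")
intersect = q("SELECT * FROM r INTERSECT SELECT * FROM s")
minus = q("SELECT * FROM r EXCEPT SELECT * FROM s ORDER BY id")
times = q("SELECT COUNT(*) FROM r CROSS JOIN s")  # 3 rows x 2 rows

for name, rows in [("select", selected), ("project", projected),
                   ("join", joined), ("union", union),
                   ("intersect", intersect), ("minus", minus),
                   ("times", times)]:
    print(name, rows)
```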
55. • RDBMS products
– Free
• MySQL – very fast, widely used, easy to
jump into, but limited, non-standard SQL
• PostgreSQL – full SQL, limited OO,
steeper learning curve than MySQL
– Commercial
• MS Access – Great query builder, GUI
interfaces
• MS SQL Server – full SQL, NT only
• Oracle, everything, including the kitchen
sink
• IBM DB2, Sybase
56. Example 3-tier model in biological database
http://www.bioinformatics.be
Example of different interface to the same back-end database (MySQL)
60. Conclusions
• A database is a central component of any
contemporary information system
• The operations on the database and the maintenance
of database consistency are handled by a DBMS
• There exist stand alone query languages or
embedded languages but both deal with definition
(DDL) and manipulation (DML) aspects
• The structural properties, constraints and operations
permitted within a DBMS are defined by a data
model - hierarchical, network, relational
• Recovery and concurrency control are essential
• Linking of heterogeneous data sources is a central theme
in modern bioinformatics
61. What is to come ?
Basic outline
• Set up RDBMS
• OLTP Access through CLI, dedicated
client, PHP, Perl/Python
• OLAP Access through Perl/Python, R ..
Integration
• Cytoscape
Semantic Web
• NoSQL/Hadoop
• SPARQL