3. Lab for Bioinformatics and
computational genomics
10 “genome hackers”
mostly engineers (statistics)
42 scientists
technicians, geneticists, clinicians
>100 people
hardware engineers,
mathematicians, molecular biologists
5. What is Bioinformatics ?
• Application of information technology to
the storage, management and analysis of
biological information (Facilitated by the
use of computers)
– Sequence analysis?
– Molecular modeling (HTX) ?
– Phylogeny/evolution?
– Ecology and population studies?
– Medical informatics?
– Image Analysis ?
– Statistics ? AI ?
– Heavy-current or light-current engineering ?
6. Promises of genomics and bioinformatics
• Medicine (Pharma)
– Genome analysis allows the targeting of genetic
diseases
– The effect of a disease or of a therapeutic on
RNA and protein levels can be elucidated
– Knowledge of protein structure facilitates drug
design
– Understanding of genomic variation allows the
tailoring of medical treatment to the individual’s
genetic make-up
• The same techniques can be applied to crop (Agro)
and livestock improvement (Animal Health)
16. PCR + dye termination
Suddenly, a flash of insight caused him to pull the
car off the road and stop. He awakened his
friend dozing in the passenger seat and
excitedly explained to her that he had hit upon
a solution - not to his original problem, but to
one of even greater significance. Kary Mullis
had just conceived of a simple method for
producing virtually unlimited copies of a
specific DNA sequence in a test tube - the
polymerase chain reaction (PCR)
23. Read Length is Not As Important For Resequencing
[Figure: % of paired k-mers with uniquely assignable location (y-axis, 0% to 100%) versus length of k-mer reads in bp (x-axis, 8 to 20), with one curve for E. coli and one for human. Credit: Jay Shendure]
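The idea behind this plot can be sketched in a few lines: for a given k, count how often each k-mer occurs in a reference sequence and report the fraction of positions whose k-mer occurs exactly once. The function name `unique_kmer_fraction` and the toy sequence are illustrative assumptions, and the sketch uses single k-mers on one strand rather than the paired k-mers of the figure.

```python
from collections import Counter


def unique_kmer_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions in seq whose k-mer occurs exactly once."""
    if k <= 0 or k > len(seq):
        raise ValueError("k must be between 1 and len(seq)")
    # Enumerate every overlapping k-mer along the sequence.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    # A position is uniquely assignable if its k-mer appears only once.
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


if __name__ == "__main__":
    toy = "ACGTACGTTTGACCA"  # hypothetical toy sequence, not real genome data
    for k in (2, 4, 6):
        print(k, round(unique_kmer_fraction(toy, k), 2))
```

Longer k-mers are less likely to repeat, so the fraction rises with k; on a real genome the same count would be run over the whole reference.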
26. Paired End Reads are Important!
Known Distance
Read 1 Read 2
Repetitive DNA
Unique DNA
Paired read maps uniquely
Single read maps to
multiple positions
27. Single Molecule Sequencing
[Diagram labels: microscope slide; single DNA molecule; primer; dNTP-Cy3 (*); super-cooled TIRF microscope]
Helicos Biosciences Corp.
Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh
33. Genome Size
E. coli = 4.2 x 10^6
Yeast = 18 x 10^6
Arabidopsis = 80 x 10^6
C. elegans = 100 x 10^6
Drosophila = 180 x 10^6
Human/Rat/Mouse = 3000 x 10^6
Lily = 300 000 x 10^6
With ... : 99.9 %
To primates: 99%
DOGS: Database Of Genome Sizes
37. Definitions
Identity
The extent to which two (nucleotide or amino acid)
sequences are invariant.
Homology
Similarity attributed to descent from a common ancestor.
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
+ K ++ + + GTW++MA+ L + A V T + +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
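Identity can be made concrete with a small sketch that counts identical residues over the aligned, non-gap columns of a pairwise alignment. The function name `percent_identity` is an assumption, and so is the choice of denominator (columns without gaps); other conventions divide by the alignment length or by the shorter sequence.

```python
def percent_identity(aln1: str, aln2: str, gap: str = "-") -> float:
    """Percent of non-gap aligned columns in which both residues are identical."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aln1, aln2):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Short excerpt of the RBP / glycodelin alignment shown above.
    print(round(percent_identity("GTWYAMA", "GTWHSMA"), 1))
```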
38. Definitions
Orthologous
Homologous sequences in different species
that arose from a common ancestral gene
during speciation; may or may not be responsible
for a similar function.
Paralogous
Homologous sequences within a single species
that arose by gene duplication.
40. Overview
• Simple identity, which scores only identical amino
acids as a match.
• Genetic code changes, which scores the
minimum number of nucleotide changes to change
a codon for one amino acid into a codon for the
other.
• Chemical similarity of amino acid side chains,
which scores as a match two amino acids which
have a similar side chain, such as hydrophobic,
charged and polar amino acid groups.
• The Dayhoff percent accepted mutation (PAM)
family of matrices, which scores amino acid pairs
on the basis of the expected frequency of
substitution of one amino acid for the other during
protein evolution.
• The blocks substitution matrix (BLOSUM) amino
acid substitution tables, which scores amino acid
pairs based on the frequency of amino acid
substitutions in aligned sequence motifs called
blocks which are found in protein families
41. BLOSUM (BLOck – SUM) scoring
Block = ungapped alignment
Eg. Amino Acids D N V A
S = 3 sequences
W = 6 aa
N= (W*S*(S-1))/2 = 18 pairs
a b c d e f
1 DDNAAV
2 DNAVDD
3 NNVAVV
42. A. Observed pairs
a b c d e f
1 DDNAAV
2 DNAVDD
3 NNVAVV
Observed pair counts fij
     D     N     A     V
D    1
N    4     1
A    1     1     1
V    3     1     4     1
Relative frequency table gij = fij /18
(probability of obtaining a pair if randomly choosing pairs from block)
     D     N     A     V
D  .056
N  .222  .056
A  .056  .056  .056
V  .167  .056  .222  .056
43. B. Expected pairs
Block:
DDNAAV
DNAVDD
NNVAVV
Residue frequencies Pi:
DDDDD 5/18
NNNN 4/18
AAAA 4/18
VVVVV 5/18
P{Draw DN pair}= P{Draw D, then N or Draw N, then D}
P{Draw DN pair}= PDPN + PNPD = 2 * (5/18)*(4/18) = .123
Random rel. frequency table eij
(probability of obtaining a pair of each amino acid drawn independently from block)
     D     N     A     V
D  .077
N  .123  .049
A  .123  .099  .049
V  .154  .123  .123  .077
44. C. Summary (A/B)
sij = log2 gij/eij
(sij) is basic BLOSUM score matrix
Notes:
• Observed pairs in blocks contain information about
relationships at all levels of evolutionary distance
simultaneously (Cf: Dayhoff’s close relationships)
• Actual algorithm generates observed + expected pair
distributions by accumulation over a set of approx. 2000
ungapped blocks of varying width (w) + depth (s)
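The three steps of the worked example (observed pairs, expected pairs, log-odds) can be reproduced with a short sketch. The function names are illustrative assumptions, and the sketch works on the single toy block only; it applies no sequence clustering, scaling, or rounding, so it is not the full published procedure.

```python
from collections import Counter
from itertools import combinations
from math import log2

BLOCK = ["DDNAAV", "DNAVDD", "NNVAVV"]  # the toy block from the slides


def observed_pairs(block):
    """Count unordered residue pairs within each column (fij)."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def blosum_scores(block):
    """Return (fij, sij) with sij = log2(gij / eij) for every observed pair."""
    f = observed_pairs(block)
    total_pairs = sum(f.values())            # N = W*S*(S-1)/2
    residues = Counter("".join(block))
    total_residues = sum(residues.values())
    p = {r: n / total_residues for r, n in residues.items()}
    scores = {}
    for (a, b), n in f.items():
        g = n / total_pairs                  # observed relative frequency
        e = p[a] * p[b] if a == b else 2 * p[a] * p[b]  # expected by chance
        scores[(a, b)] = log2(g / e)
    return f, scores


if __name__ == "__main__":
    f, s = blosum_scores(BLOCK)
    for pair in sorted(s):
        print(pair, f[pair], round(s[pair], 2))
```

A positive score means the pair is seen more often in the block than chance predicts, and a negative score means it is seen less often.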
45. The BLOSUM Series
• blosum30,35,40,45,50,55,60,62,65,70,75,80,85,90
• transition frequencies observed directly by identifying
blocks that are at least
– 45% identical (BLOSUM-45)
– 50% identical (BLOSUM-50)
– 62% identical (BLOSUM-62) etc.
• No extrapolation made
• High blosum - closely related sequences
• Low blosum - distant sequences
• blosum45 roughly corresponds to pam250
• blosum62 roughly corresponds to pam160
• blosum62 is the most popular matrix
47. • Church of the Flying Spaghetti Monster
• http://www.venganza.org/about/open-letter
48. Overview
– Henikoff and Henikoff have compared the
BLOSUM matrices to PAM by evaluating how
effectively the matrices can detect known members
of a protein family from a database when searching
with the ungapped local alignment program
BLAST. They conclude that overall the BLOSUM
62 matrix is the most effective.
• However, all the substitution matrices investigated
perform better than BLOSUM 62 for a proportion of
the families. This suggests that no single matrix is
the complete answer for all sequence comparisons.
• It is probably best to complement the BLOSUM 62
matrix with comparisons using PAM250 and
Overington structurally derived matrices.
– It seems likely that as more protein three
dimensional structures are determined, substitution
tables derived from structure comparison will give
the most reliable data.
49. [Figure panels: rat versus mouse RBP; rat versus bacterial lipocalin]
50. Alignments
• Exhaustive …
– All combinations:
• Algorithm
– Dynamic programming (much faster)
• Heuristics
– Needleman-Wunsch for global
alignments
(Journal of Molecular Biology, 1970)
– Later adapted by Smith and Waterman
for local alignment
51. A metric …
GACGGATTAG, GATCGGAATAG
GA-CGGATTAG
GATCGGAATAG
+1 (a match), -1 (a mismatch),-2 (gap)
9*1 + 1*(-1)+1*(-2) = 6
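The metric can be checked with a minimal sketch that walks the two aligned rows column by column. The function name `score_alignment` is an assumption; the default parameters are the +1, -1, -2 values given on the slide.

```python
def score_alignment(row1: str, row2: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of a pairwise alignment written with '-' for gaps."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # one sequence has a gap in this column
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # different residues
    return score


if __name__ == "__main__":
    print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9 - 1 - 2 = 6
```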
52. Needleman-Wunsch-edu.pl
The Score Matrix
----------------
Seq1(j)1 2 3 4 5 6 7
Seq2 * C K H V F C R
(i) * 0 -1 -2 -3 -4 -5 -6 -7
1 C -1 1 0 -1 -2 -3 -4 -5
2 K -2 0 2 1 0 -1 -2 -3
3 K -3 -1 1 1 0 -1 -2 -3
4 C -4 -2 0 0 0 -1 0 -1
5 F -5 -3 -1 -1 -1 1 0 -1
6 C -6 -4 -2 -2 -2 0 2 1
7 K -7 -5 -3 -3 -3 -1 1 1
8 C -8 -6 -4 -4 -4 -2 0 0
9 V -9 -7 -5 -5 -3 -3 -1 -1
54. Needleman-Wunsch-edu.pl
The Score Matrix
----------------
Seq1(j)1 2 3 4 5 6 7
Seq2 * C K H V F C R
(i) * 0 -1 -2 -3 -4 -5 -6 -7
1 C -1 1 a 0 -1 -2 -3 -4 -5
2 K -2 0c 2b 1 0 -1 -2 -3
3 K -3 -1 1 1 0 -1 -2 -3
4 C -4 -2 0 0 0 -1 0 -1
5 F -5 -3 -1 -1 -1 1 0 -1
6 C -6 -4 -2 -2 -2 0 2 1
7 K -7 -5 -3 -3 -3 -1 1 1
8 C -8 -6 -4 -4 -4 -2 0 0
9 V -9 -7 -5 -5 -3 -3 -1 -1
A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
   if (substr(seq1,j-1,1) eq substr(seq2,i-1,1))
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
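The fill step on this slide can be sketched in Python rather than the Perl of the original script. The function name `nw_matrix` is an assumption, and the scores (match +1, mismatch -1, gap -1) are inferred from the numbers in the matrix shown here; note that they differ from the -2 gap used in the earlier metric example.

```python
MATCH, MISMATCH, GAP = 1, -1, -1  # assumed values, inferred from the matrix shown


def nw_matrix(seq1: str, seq2: str):
    """Fill the Needleman-Wunsch score matrix; rows follow seq2, columns seq1."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    m = [[0] * cols for _ in range(rows)]
    # Initialization: leading gaps along the first row and first column.
    for j in range(1, cols):
        m[0][j] = j * GAP
    for i in range(1, rows):
        m[i][0] = i * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            # A: diagonal move, match or mismatch
            diag = m[i - 1][j - 1] + (MATCH if seq1[j - 1] == seq2[i - 1] else MISMATCH)
            # B: move down from the cell above (gap)
            up = m[i - 1][j] + GAP
            # C: move right from the cell to the left (gap)
            left = m[i][j - 1] + GAP
            m[i][j] = max(diag, up, left)
    return m


if __name__ == "__main__":
    for row in nw_matrix("CKHVFCR", "CKKCFCKCV"):
        print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right cell holds the score of the best global alignment; a traceback from that cell (not shown) recovers the alignment itself.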
58. • Practicum: use similarity function in
initialization step -> scoring tables
• Time Complexity
• Use random proteins to generate
histogram of scores from aligned
random sequences
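The last practicum item can be sketched as follows: align pairs of random proteins and tabulate the global alignment scores. Everything here is an assumption for illustration: uniform amino-acid frequencies, a simple +1/-1/-1 scoring scheme instead of a substitution table, and the function names. The score routine keeps only two rows in memory because only the final score is needed.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nw_score(s1: str, s2: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score, keeping only the previous and current row."""
    prev = [j * gap for j in range(len(s1) + 1)]
    for i in range(1, len(s2) + 1):
        curr = [i * gap]
        for j in range(1, len(s1) + 1):
            diag = prev[j - 1] + (match if s1[j - 1] == s2[i - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def random_protein(length: int, rng: random.Random) -> str:
    """Random sequence with uniform residue frequencies (a simplification)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def score_histogram(n_pairs: int = 200, length: int = 30, seed: int = 0) -> Counter:
    """Histogram of alignment scores for n_pairs of random proteins."""
    rng = random.Random(seed)
    return Counter(
        nw_score(random_protein(length, rng), random_protein(length, rng))
        for _ in range(n_pairs)
    )


if __name__ == "__main__":
    hist = score_histogram()
    for score in sorted(hist):
        print(f"{score:4d} {'#' * hist[score]}")
```

The resulting distribution gives a baseline against which the score of a real alignment can be judged.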
59. Time complexity with needleman-wunsch.pl
Sequence Length (aa) Execution Time (s)
10 0
25 0
50 0
100 1
500 5
1000 19
2500 559
5000 Memory could not be written
61. If the sequences are similar, the path
of the best alignment should be very
close to the main diagonal.
Therefore, we may not need to fill the
entire matrix, rather, we fill a narrow
band of entries around the main
diagonal.
An algorithm that fills in a band of
width 2k+1 around the main
diagonal.
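A minimal sketch of such a banded fill, assuming +1/-1/-1 scoring and the hypothetical function name `banded_nw_score`: only cells with |i - j| <= k are computed, and anything outside the band is treated as minus infinity. The band must be at least as wide as the length difference, otherwise the final cell cannot be reached.

```python
NEG_INF = float("-inf")


def banded_nw_score(s1: str, s2: str, k: int,
                    match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment score restricted to a band of width 2k+1 around the diagonal."""
    n, m = len(s1), len(s2)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    # Row 0: only the cells inside the band are stored.
    prev = {j: j * gap for j in range(0, min(n, k) + 1)}
    for i in range(1, m + 1):
        curr = {}
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if j == 0:
                curr[j] = i * gap  # first column: leading gaps
                continue
            diag = prev.get(j - 1, NEG_INF) + (match if s1[j - 1] == s2[i - 1] else mismatch)
            up = prev.get(j, NEG_INF) + gap        # missing at the band's right edge
            left = curr.get(j - 1, NEG_INF) + gap  # missing at the band's left edge
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(banded_nw_score("GATTACA", "GATTACA", 0))
    print(banded_nw_score("CKHVFCR", "CKKCFCKCV", 9))
```

The work drops from about n*m cells to about (2k+1)*m cells. The result equals the full-matrix score whenever the best path stays inside the band, and is otherwise a lower bound on it.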
64. Examples
Phylogenetic methods may be used to
solve crimes, test purity of products, and
determine whether endangered species
have been smuggled or mislabeled:
– Vogel, G. 1998. HIV strain analysis debuts in
murder trial. Science 282(5390): 851-853.
– Lau, D. T.-W., et al. 2001. Authentication of
medicinal Dendrobium species by the internal
transcribed spacer of ribosomal DNA. Planta
Med 67:456-460.
66. Examples
– Epidemiologists use phylogenetic methods to
understand the development of
pandemics, patterns of disease transmission, and
development of antimicrobial resistance or
pathogenicity:
• Basler, C.F., et al. 2001. Sequence of the 1918
pandemic influenza virus nonstructural gene (NS)
segment and characterization of recombinant viruses
bearing the 1918 NS genes. PNAS, 98(5):2746-2751.
• Ou, C.-Y., et al. 1992. Molecular epidemiology of HIV
transmission in a dental practice. Science
256(5060):1165-1171.
• Bacillus anthracis:
74. Modeling
• Finding a structural homologue
• Blast
–versus PDB database or PSI-
blast (E<0.005)
–Domain coverage at least 60%
• Avoid Gaps
–Choose for few gaps and
reasonable similarity scores
instead of lots of gaps and high
similarity scores
79. Overview
Personalized Medicine,
Biomarkers …
… Molecular Profiling
First Generation Molecular Profiling
Next Generation Molecular Profiling
Next Generation Epigenetic Profiling
Concluding Remarks
86. Personalized Medicine
• The use of diagnostic tests (aka biomarkers) to identify in advance
which patients are likely to respond well to a therapy
• The benefits of this approach are to
– avoid adverse drug reactions
– improve efficacy
– adjust the dose to suit the patient
– differentiate a product in a competitive market
– meet future legal or regulatory requirements
• Potential uses of biomarkers
– Risk assessment
– Initial/early detection
– Prognosis
– Prediction/therapy selection
– Response assessment
– Monitoring for recurrence
87. Biomarker
First used in 1971 … An objective and
« predictive » measure … at the molecular
level … of normal and pathogenic processes
and responses to therapeutic interventions
Characteristic that is objectively measured and
evaluated as an indicator of normal biologic
or pathogenic processes or pharmacologic
response to a drug
A biomarker is valid if:
– It can be measured in a test system with well
established performance characteristics
– Evidence for its clinical significance has been
established
88. Rationale 1:
Why now ? Regulatory path becoming clearer
• FDA « critical path initiative »
• Pharmacogenomics guideline
There is more at stake than efficient drug
development. Biomarkers are the foundation of
« evidence based medicine »: who should be
treated, how and with what.
Without biomarkers, advances in targeted therapy
will be limited and treatment will remain largely
empirical. It is imperative that biomarker
development be accelerated along with
therapeutics.
89. Why now ?
First and maturing second generation molecular
profiling methodologies make it possible to stratify
clinical trial participants on a pharmacogenomic
basis: include those most likely to benefit from the
drug candidate and exclude those who likely will not.
Clinical trials should attain more specific results
with smaller numbers of patients. Smaller
numbers mean lower costs (by a factor of 2-10).
An additional benefit for trial participants and
internal review boards (IRBs) is that
stratification, given the correct biomarker, may
reduce or eliminate adverse events.
90. Molecular Profiling
The study of specific patterns (fingerprints) of proteins,
DNA, and/or mRNA and how these patterns correlate
with an individual's physical characteristics or
symptoms of disease.
91. Generic Health advice
• Exercise (Hypertrophic Cardiomyopathy)
• Drink your milk (MCM6 Lactose intolerance)
• Eat your green beans (glucose-6-phosphate
dehydrogenase Deficiency)
• & your grains (HLA-DQ2 – Celiac disease)
• & your iron (HFE - Hemochromatosis)
• Get more rest (HLA-DR2 - Narcolepsy)
92. Generic Health advice (UNLESS)
• Exercise (Hypertrophic Cardiomyopathy)
• Drink your milk (MCM6 Lactose intolerance)
• Eat your green beans (glucose-6-phosphate
dehydrogenase Deficiency)
• & your grains (HLA-DQ2 – Celiac disease)
• & your iron (HFE - Hemochromatosis)
• Get more rest (HLA-DR2 - Narcolepsy)
104. First Generation Molecular Profiling
• Flow cytometry correlates surface markers,
cell size and other parameters
• Circulating tumor cell (CTC) assays
quantify the number of tumor cells in the
peripheral blood.
• Exosomes are 30-90 nm vesicles secreted by
a wide range of mammalian cell types.
• Immunohistochemistry (IHC) measures
protein expression, usually on the cell
surface.
108. First Generation Molecular Profiling
• Gene sequencing for mutation detection
• Microarray for m-RNA message detection
• RT-PCR for gene expression
• FISH analysis for gene copy number
• Comparative Genome Hybridization (CGH) for
gene copy number
109. Basics of the “old” technology
• Clone the DNA.
• Generate a ladder of labeled (colored)
molecules that are different by 1 nucleotide.
• Separate mixture on some matrix.
• Detect fluorochrome by laser.
• Interpret peaks as string of DNA.
• Strings are 500 to 1,000 letters long
• 1 machine generates 57,000 nucleotides/run
• Assemble all strings into a genome.
111. Genetic Variation
Among People
Single nucleotide polymorphisms
(SNPs)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
0.1% difference among
people
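The single-nucleotide difference between the two example sequences can be located with a short position-by-position comparison. This is a sketch for equal-length, already aligned sequences; the function name `find_differences` is an assumption.

```python
def find_differences(a: str, b: str):
    """Return (1-based position, base in a, base in b) for every mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    diffs = find_differences("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG")
    print(diffs)  # [(11, 'G', 'T')]
```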
113. First Generation Molecular Profiling
• Gene sequencing for mutation detection
• Microarray for m-RNA message detection
• RT-PCR for gene expression
• FISH analysis for gene copy number
• Comparative Genome Hybridization (CGH) for
gene copy number
117. Overview
Personalized Medicine,
Biomarkers …
… Molecular Profiling
First Generation Molecular Profiling
Next Generation Molecular Profiling
Next Generation Epigenetic Profiling
Concluding Remarks
118. Second Generation DNA profiling
• Exome sequencing (also known as
targeted exome capture) is an
efficient strategy to selectively
sequence the coding regions of the
genome to identify novel genes
associated with rare and common
disorders.
• 160K exons
121. Second Generation RNA profiling
Besides the 6000 protein-coding genes …
140 ribosomal RNA genes
275 transfer RNA genes
40 small nuclear RNA genes
>100 small nucleolar RNA genes
Function of RNA genes
pRNA in the φ29 rotary packaging motor (Simpson
et al. Nature 408:745-750, 2000)
Cartilage-hair hypoplasia mapped to an RNA
(Ridanpää et al. Cell 104:195-203, 2001)
The human Prader-Willi critical region (Cavaille
et al. PNAS 97:14035-7, 2000)
122. Second Generation RNA profiling
RNA genes can be hard to detect
UGAGGUAGUAGGUUGUAUAGU
C. elegans let-7; 21 nt
(Pasquinelli et al. Nature 408:86-89, 2000)
Often small
Sometimes multicopy and redundant
Often not polyadenylated
(not represented in ESTs)
Immune to frameshift and nonsense
mutations
No open reading frame, no codon bias
Often evolving rapidly in primary sequence
125. Mapping Structural Variation in Humans
CNVs: >1 kb segments
- Thought to be common: 12% of the genome
(Redon et al. 2006)
- Likely involved in phenotype
variation and disease
- Until recently most methods for
detection were low resolution
(>50 kb)
128. Overview
Personalized Medicine,
Biomarkers …
… Molecular Profiling
First Generation Molecular Profiling
Next Generation Molecular Profiling
Next Generation Epigenetic Profiling
Concluding Remarks
129. Defining Epigenetics
Reversible changes in gene expression/function
Without changes in DNA sequence
Can be inherited from precursor cells
Allows integration of intrinsic with environmental signals (including diet)
[Diagram labels: Genome, DNA, Chromatin, Epigenome, Gene Expression, Phenotype]
CONFIDENTIAL
Methylation I Epigenetics | Oncology | Biomarker
I NEXT-GEN | PharmacoDX | CRC
131. Epigenetic Regulation:
Post Translational Modifications to Histones and Base Changes in DNA
Epigenetic modifications of histones and DNA include:
– Histone acetylation and methylation, and DNA methylation
[Figure labels: histone methylation (Me), histone acetylation (Ac), DNA methylation]
133. MGMT Biology
O6 Methyl-Guanine
Methyl Transferase
Essential DNA Repair Enzyme
Removes alkyl groups from damaged guanine
bases
Healthy individual:
- MGMT is an essential DNA repair enzyme
Loss of MGMT activity makes individuals susceptible
to DNA damage and prone to tumor development
Glioblastoma patient on alkylator chemotherapy:
- Patients with MGMT promoter methylation have longer
progression-free survival (PFS) and overall survival (OS)
with the use of alkylating agents as chemotherapy
134. MGMT Promoter Methylation Predicts
Benefit from DNA-Alkylating Chemotherapy
Post-hoc subgroup analysis of a temozolomide clinical trial in primary glioblastoma
patients shows benefit for patients with MGMT promoter methylation
[Bar chart: median overall survival in months (y-axis, 0 to 25) by MGMT gene status (non-methylated, methylated); values shown: 12.7 months and 21.7 months; treatment labels: radiotherapy, radiotherapy plus temozolomide. Study with 207 patients.]
Adapted from Hegi et al. NEJM 2005 352(10):1036-8.
135. Genome-wide methylation
by methylation sensitive restriction enzymes
137. Genome-wide methylation
…. by next generation sequencing
# markers
Discovery
Verification
Validation
# samples
138. MBD_Seq
Condensed Chromatin DNA Sheared
Immobilized
Methyl Binding Domain
DNA Sheared
139. MBD_Seq
Immobilized
Methyl binding domain
MgCl2
Next Gen Sequencing
GA Illumina: 100 million reads
140. MBD_Seq
MGMT = dual core
141. Genome-wide methylation
…. by next generation sequencing
# markers
1-2 million
MBD_Seq
methylation
cores
Discovery
# samples
148. Overview
Personalized Medicine,
Biomarkers …
… Molecular Profiling
First Generation Molecular Profiling
Next Generation Molecular Profiling
Next Generation Epigenetic Profiling
Concluding Remarks
149. Translational Medicine: An inconvenient truth
• 1% of genome codes for proteins, however
more than 90% is transcribed
• Less than 10% of protein experimentally
measured can be “explained” from the
genome
• 1 genome ? Structural variation
• > 200 Epigenomes ??
• Space/time continuum …
150. Translational Medicine: An inconvenient truth
• 1% of genome codes for proteins, however
more than 90% is transcribed
• Less than 10% of protein experimentally
measured can be “explained” from the
genome
• 1 genome ? Structural variation
• > 200 Epigenomes …
• “space/time” continuum