Bioinformatics life sciences_2012

Introduction to bioinformatics and
computational biology

Lab for Bioinformatics and
computational genomics
10 “genome hackers”
mostly engineers (statistics)

42 scientists
technicians, geneticists, clinicians

>100 people
hardware engineers,
mathematicians, molecular biologists

What is Bioinformatics ?

• Application of information technology to
the storage, management and analysis of
biological information (Facilitated by the
use of computers)
– Sequence analysis?
– Molecular modeling (HTX) ?
– Phylogeny/evolution?
– Ecology and population studies?
– Medical informatics?
– Image Analysis ?
– Statistics ? AI ?
– Heavy-current or light-current engineering ?

Promises of genomics and bioinformatics

• Medicine (Pharma)
– Genome analysis allows the targeting of genetic
diseases
– The effect of a disease or of a therapeutic on
RNA and protein levels can be elucidated
– Knowledge of protein structure facilitates drug
design
– Understanding of genomic variation allows the
tailoring of medical treatment to the individual’s
genetic make-up
• The same techniques can be applied to crop (Agro)
and livestock improvement (Animal Health)

Bioinformatics, a life science discipline …

Math

(Molecular)
Informatics
Biology


Math

Computer Science Theoretical Biology

(Molecular)
Informatics
Biology
Computational Biology


Math


Bioinformatics

(Molecular)
Informatics
Biology

Bioinformatics, a life science discipline … management of expectations

Math

NP AI, Image Analysis
Datamining structure prediction (HTX)
Bioinformatics

Interface Design Expert Annotation
Sequence Analysis (Molecular)
Informatics
Biology

Bioinformatics, a life science discipline … management of expectations

Math

NP AI, Image Analysis
Datamining structure prediction (HTX)
Bioinformatics
Discovery Informatics – Computational Genomics
Interface Design Expert Annotation
Sequence Analysis (Molecular)
Informatics
Biology

• Timeline: Margaret
Dayhoff …

PCR + dye termination

Suddenly, a flash of insight caused him to pull the
car off the road and stop. He awakened his
friend dozing in the passenger seat and
excitedly explained to her that he had hit upon
a solution - not to his original problem, but to
one of even greater significance. Kary Mullis
had just conceived of a simple method for
producing virtually unlimited copies of a
specific DNA sequence in a test tube - the
polymerase chain reaction (PCR)

Setting the stage …

nature
the
Human
genome

Biological Research

Adapted from John McPherson, OICR

And this is just the beginning ….

Next Generation Sequencing is
here

Read Length is Not As Important For Resequencing

[Figure: percentage of paired k-mers with a uniquely assignable location (0% to 100%) plotted against the length of the k-mer reads (8 to 20 bp), with one curve for E. coli and one for human. Source: Jay Shendure]
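
The idea behind this plot can be sketched in a few lines: for a given k, count the fraction of k-mers in a sequence that occur exactly once. This is only an illustration under stated assumptions: the function name and the toy sequence are made up here, and it uses single (not paired) k-mers on one strand.

```python
from collections import Counter


def unique_kmer_fraction(seq, k):
    """Fraction of k-mer positions in seq whose k-mer occurs exactly once."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0  # sequence shorter than k
    counts = Counter(kmers)
    unique = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique / len(kmers)


toy = "ACGTACGTTTGACCA"  # hypothetical toy sequence, not real genomic data
for k in (2, 4, 6, 8):
    print(k, round(unique_kmer_fraction(toy, k), 2))
```

On a real genome the fraction rises quickly with k and then flattens, which is the behaviour the plot describes.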

Paired End Reads are Important!

Known Distance

Read 1 Read 2

Repetitive DNA
Unique DNA

Paired read maps uniquely

Single read maps to
multiple positions

Adapted from: Barak Cohen, Washington University, Bio5488 http://tinyurl.com/6zttuq http://tinyurl.com/6k26nh

Single Molecule Sequencing

Microscope slide
* * *

Single DNA
molecule
Super-cooled
primer TIRF microscope

dNTP-Cy3 *

Helicos Biosciences Corp.

Next next generation sequencing
Third generation sequencing
Now sequencing

Pacific Biosciences: A Third Generation Sequencing Technology

Eid et al 2008

Ultra-low-cost SINGLE molecule sequencing

Genome Size

E. coli = 4.2 x 10^6
Yeast = 18 x 10^6
Arabidopsis = 80 x 10^6
C.elegans = 100 x 10^6
Drosophila = 180 x 10^6
Human/Rat/Mouse = 3000 x 10^6
Lily = 300 000 x 10^6

With ... : 99.9 %
To primates: 99%

DOGS: Database Of Genome Sizes

Definitions
Identity
The extent to which two (nucleotide or amino acid)
sequences are invariant.

Homology
Similarity attributed to descent from a common ancestor.

RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
+ K ++ + + GTW++MA+ L + A V T + +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
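
A minimal sketch of how identity can be computed for an alignment like the one above. It assumes identity is defined as identical columns divided by alignment length; other definitions (for example dividing by the shorter sequence) exist, and the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * identical / len(aligned_a)


# The two aligned rows from the slide (gaps written as "-").
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD-"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN"
print(round(percent_identity(rbp, glycodelin), 1))
```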

Definitions

Orthologous
Homologous sequences in different species
that arose from a common ancestral gene
during speciation; may or may not be responsible
for a similar function.

Paralogous
Homologous sequences within a single species
that arose by gene duplication.

Overview
• Simple identity, which scores only identical amino
acids as a match.
• Genetic code changes, which scores the
minimum number of nucleotide changes to change
a codon for one amino acid into a codon for the
other.
• Chemical similarity of amino acid side chains,
which scores as a match two amino acids which
have a similar side chain, such as hydrophobic,
charged and polar amino acid groups.
• The Dayhoff percent accepted mutation (PAM)
family of matrices, which scores amino acid pairs
on the basis of the expected frequency of
substitution of one amino acid for the other during
protein evolution.
• The blocks substitution matrix (BLOSUM) amino
acid substitution tables, which scores amino acid
pairs based on the frequency of amino acid
substitutions in aligned sequence motifs called
blocks which are found in protein families

BLOSUM (BLOck – SUM) scoring

Block = ungapped alignment
Eg. Amino Acids D N V A
S = 3 sequences
W = 6 aa
N= (W*S*(S-1))/2 = 18 pairs

a b c d e f
1 DDNAAV
2 DNAVDD
3 NNVAVV

A. Observed pairs

    a b c d e f
1   D D N A A V
2   D N A V D D
3   N N V A V V

Pair count table fij
     D    N    A    V
D    1
N    4    1
A    1    1    1
V    3    1    4    1

Relative frequency table gij = fij/18
(probability of obtaining a pair if randomly choosing pairs from block)
     D     N     A     V
D  .056
N  .222  .056
A  .056  .056  .056
V  .167  .056  .222  .056

B. Expected pairs

Amino acid frequencies Pi in the block (DDNAAV, DNAVDD, NNVAVV)
D  DDDDD  5/18
N  NNNN   4/18
A  AAAA   4/18
V  VVVVV  5/18

P{Draw DN pair}= P{Draw D, then N or Draw N, then D}
P{Draw DN pair}= PDPN + PNPD = 2 * (5/18)*(4/18) = .123

Random rel. frequency table eij
(probability of obtaining a pair of each amino acid drawn independently from block:
eii = Pi*Pi, eij = 2*Pi*Pj)
     D     N     A     V
D  .077
N  .123  .049
A  .123  .099  .049
V  .154  .123  .123  .077

C. Summary (A/B)

sij = log2 gij/eij

(sij) is basic BLOSUM score matrix
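
The worked example (steps A to C) can be reproduced with a short sketch. It follows the slide's recipe for a single block: count observed pairs per column, derive expected pair frequencies from the residue frequencies, and take log2 of the ratio. Function names are illustrative; the rounding and scaling applied to published BLOSUM matrices are not included.

```python
from collections import Counter
from itertools import combinations
from math import log2

block = ["DDNAAV", "DNAVDD", "NNVAVV"]  # the 3 x 6 block from the slide


def observed_pair_counts(block):
    """f_ij: count every pair of residues within each column of the block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(block):
    """s_ij = log2(g_ij / e_ij) for every pair observed in the block."""
    f = observed_pair_counts(block)
    n_pairs = sum(f.values())
    residues = Counter("".join(block))
    total = sum(residues.values())
    scores = {}
    for (a, b), f_ab in f.items():
        g = f_ab / n_pairs                      # observed relative frequency
        p_a, p_b = residues[a] / total, residues[b] / total
        e = p_a * p_b if a == b else 2 * p_a * p_b  # expected frequency
        scores[(a, b)] = log2(g / e)
    return scores


for pair, score in sorted(log_odds_scores(block).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than expected by chance (such as D with N in this block) get a positive score; pairs seen less often get a negative one.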

Notes:
• Observed pairs in blocks contain information about
relationships at all levels of evolutionary distance
simultaneously (Cf: Dayhoff’s close relationships)
• Actual algorithm generates observed + expected pair
distributions by accumulation over a set of approx. 2000
ungapped blocks of varying width (w) + depth (s)

The BLOSUM Series

• blosum30,35,40,45,50,55,60,62,65,70,75,80,85,90
• transition frequencies observed directly by identifying
blocks that are at least
– 45% identical (BLOSUM-45)
– 50% identical (BLOSUM-50)
– 62% identical (BLOSUM-62) etc.
• No extrapolation made

• High blosum - closely related sequences
• Low blosum - distant sequences

• blosum45 ≈ pam250
• blosum62 ≈ pam160

• blosum62 is the most popular matrix

• Church of the Flying Spaghetti Monster

• http://www.venganza.org/about/open-letter

Overview
– Henikoff and Henikoff have compared the
BLOSUM matrices to PAM by evaluating how
effectively the matrices can detect known members
of a protein family from a database when searching
with the ungapped local alignment program
BLAST. They conclude that overall the BLOSUM
62 matrix is the most effective.
• However, all the substitution matrices investigated
perform better than BLOSUM 62 for a proportion of
the families. This suggests that no single matrix is
the complete answer for all sequence comparisons.
• It is probably best to complement the BLOSUM 62
matrix with comparisons using PAM 250 and
Overington structurally derived matrices.
– It seems likely that as more protein three
dimensional structures are determined, substitution
tables derived from structure comparison will give
the most reliable data.

Rat versus mouse RBP
Rat versus bacterial lipocalin

Alignments

• Exhaustive …
– All combinations:
• Algorithm
– Dynamic programming (much faster)
• Heuristics
– Needleman-Wunsch for global
alignments
(Journal of Molecular Biology, 1970)
– Later adapted by Smith-Waterman
for local alignment

A metric …

GACGGATTAG, GATCGGAATAG

GA-CGGATTAG
GATCGGAATAG

+1 (a match), -1 (a mismatch),-2 (gap)

9*1 + 1*(-1)+1*(-2) = 6
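
The metric can be written as a small function that walks down the columns of an existing alignment. A minimal sketch; the function name and default parameter names are illustrative.

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Sum column scores of a gapped alignment ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 9 matches, 1 mismatch, 1 gap -> 6
```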

Needleman-Wunsch-edu.pl

The Score Matrix
----------------
        Seq1(j)      1   2   3   4   5   6   7
Seq2             *   C   K   H   V   F   C   R
(i)      *       0  -1  -2  -3  -4  -5  -6  -7
 1       C      -1   1   0  -1  -2  -3  -4  -5
 2       K      -2   0   2   1   0  -1  -2  -3
 3       K      -3  -1   1   1   0  -1  -2  -3
 4       C      -4  -2   0   0   0  -1   0  -1
 5       F      -5  -3  -1  -1  -1   1   0  -1
 6       C      -6  -4  -2  -2  -2   0   2   1
 7       K      -7  -5  -3  -3  -3  -1   1   1
 8       C      -8  -6  -4  -4  -4  -2   0   0
 9       V      -9  -7  -5  -5  -3  -3  -1  -1


The Score Matrix: filling cell (i,j)
------------------------------------
Each cell takes the highest of three candidate scores, computed from its
diagonal (a), upper (b) and left (c) neighbours:

A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
   MATCH if (substr(seq1,j-1,1) eq substr(seq2,i-1,1)), otherwise MISMATCH
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP


Seq1:CKHVFCRVCI
Seq2:CKKCFC-KCV
++--++--+- score = 0
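
A minimal Python sketch of the global alignment described above (the slides refer to a Perl script, which is not reproduced here). It assumes match = +1, mismatch = -1 and gap = -1, the values consistent with the score matrix shown; function names are illustrative. When several alignments share the best score, the traceback returns one of them.

```python
def needleman_wunsch_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the score matrix; rows follow seq2 (i), columns follow seq1 (j)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        matrix[0][j] = j * gap          # first row: only gaps
    for i in range(1, rows):
        matrix[i][0] = i * gap          # first column: only gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            diagonal = matrix[i - 1][j - 1] + pair   # A
            up = matrix[i - 1][j] + gap              # B
            left = matrix[i][j - 1] + gap            # C
            matrix[i][j] = max(diagonal, up, left)
    return matrix


def traceback(matrix, seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Walk back from the bottom-right cell to recover one best alignment."""
    i, j = len(seq2), len(seq1)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            if matrix[i][j] == matrix[i - 1][j - 1] + pair:
                top.append(seq1[j - 1])
                bottom.append(seq2[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            top.append("-")
            bottom.append(seq2[i - 1])
            i -= 1
        else:
            top.append(seq1[j - 1])
            bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


scores = needleman_wunsch_matrix("CKHVFCR", "CKKCFCKCV")
for row in scores:
    print(" ".join(f"{value:3d}" for value in row))
print(*traceback(scores, "CKHVFCR", "CKKCFCKCV"), sep="\n")
```

Swapping the fixed match/mismatch values for a lookup in a substitution table is the change the practicum bullet below asks for.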

• Practicum: use similarity function in
initialization step -> scoring tables

• Time Complexity

• Use random proteins to generate
histogram of scores from aligned
random sequences

Time complexity with needleman-wunsch.pl

Sequence Length (aa)   Execution Time (s)
10                     0
25                     0
50                     0
100                    1
500                    5
1000                   19
2500                   559
5000                   Memory could not be written

Average around -64 !
-80
-78
-76
-74
-72 **
-70 *******
-68 ***************
-66 *************************
-64 ************************************************************
-60 ***********************
-58 ***************
-56 ********
-54 ****
-52 *
-50
-48
-46
-44
-42
-40
-38

If the sequences are similar, the path
of the best alignment should be very
close to the main diagonal.

Therefore, we may not need to fill the
entire matrix, rather, we fill a narrow
band of entries around the main
diagonal.

An algorithm that fills in a band of
width 2k+1 around the main
diagonal.
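
A minimal sketch of the banded idea, assuming the same +1/-1/-1 scoring as before. Cells further than k from the main diagonal are treated as unreachable. For readability the full matrix is still allocated and only the band is filled; a space-saving version would store just the 2k+1 diagonals. The function name is illustrative.

```python
NEG_INF = float("-inf")


def banded_nw_score(seq1, seq2, k, match=1, mismatch=-1, gap=-1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    n, m = len(seq1), len(seq2)
    if abs(n - m) > k:
        raise ValueError("band too narrow to reach the final cell")
    matrix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    matrix[0][0] = 0
    for i in range(m + 1):
        # Only columns inside the band are visited for this row.
        for j in range(max(0, i - k), min(n, i + k) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
                best = max(best, matrix[i - 1][j - 1] + pair)
            if i > 0:
                best = max(best, matrix[i - 1][j] + gap)   # -inf if outside band
            if j > 0:
                best = max(best, matrix[i][j - 1] + gap)   # -inf if outside band
            matrix[i][j] = best
    return matrix[m][n]


print(banded_nw_score("GATTACA", "GATTACA", k=0))
print(banded_nw_score("CKHVFCR", "CKKCFCKCV", k=3))
```

The work drops from roughly n x m cells to roughly n x (2k+1) cells. The result equals the full score only when the best alignment stays inside the band, so k has to be chosen with the expected number of gaps in mind.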

Examples

Phylogenetic methods may be used to
solve crimes, test purity of products, and
determine whether endangered species
have been smuggled or mislabeled:
– Vogel, G. 1998. HIV strain analysis debuts in
murder trial. Science 282(5390): 851-853.
– Lau, D. T.-W., et al. 2001. Authentication of
medicinal Dendrobium species by the internal
transcribed spacer of ribosomal DNA. Planta
Med 67:456-460.

Examples

– Epidemiologists use phylogenetic methods to
understand the development of
pandemics, patterns of disease transmission, and
development of antimicrobial resistance or
pathogenicity:
• Basler, C.F., et al. 2001. Sequence of the 1918
pandemic influenza virus nonstructural gene (NS)
segment and characterization of recombinant viruses
bearing the 1918 NS genes. PNAS, 98(5):2746-2751.
• Ou, C.-Y., et al. 1992. Molecular epidemiology of HIV
transmission in a dental practice. Science
256(5060):1165-1171.
• Bacillus anthracis:

Modeling

• Finding a structural homologue
• Blast
– versus PDB database or PSI-BLAST (E<0.005)
– Domain coverage at least 60%
• Avoid Gaps
– Prefer few gaps and reasonable similarity scores
over many gaps and high similarity scores

Bootstrapping - an example

Ciliate SSUrDNA - parsimony bootstrap
Ochromonas (1)

Symbiodinium (2)
100
Prorocentrum (3)

Euplotes (8)
84
Tetrahymena (9)

96 Loxodes (4)
100
Tracheloraphis (5)
100
Spirostomum (6)
100
Gruberia (7)
Majority-rule consensus

Overview

Personalized Medicine,
Biomarkers …
… Molecular Profiling

First Generation Molecular Profiling
Next Generation Molecular Profiling
Next Generation Epigenetic Profiling

Concluding Remarks

Personalized Medicine
• The use of diagnostic tests (aka biomarkers) to identify in advance
which patients are likely to respond well to a therapy
• The benefits of this approach are to
– avoid adverse drug reactions
– improve efficacy
– adjust the dose to suit the patient
– differentiate a product in a competitive market
– meet future legal or regulatory requirements
• Potential uses of biomarkers
– Risk assessment
– Initial/early detection
– Prognosis
– Prediction/therapy selection
– Response assessment
– Monitoring for recurrence

Biomarker

First used in 1971 … An objective and
« predictive » measure … at the molecular
level … of normal and pathogenic processes
and responses to therapeutic interventions
Characteristic that is objectively measured and
evaluated as an indicator of normal biologic
or pathogenic processes or pharmacologic
response to a drug
A biomarker is valid if:
– It can be measured in a test system with well
established performance characteristics
– Evidence for its clinical significance has been
established

Rationale 1:
Why now ? Regulatory path becoming more clear

There is more at stake than
efficient drug
development. FDA
« critical path initiative »
Pharmacogenomics
guideline

Biomarkers are the
foundation of « evidence
based medicine » - who
should be treated, how
and with what.

Without Biomarkers
advances in targeted
therapy will be limited and
treatment will remain largely
empirical. It is imperative
that Biomarker
development be
accelerated along with
therapeutics.

Why now ?

First and maturing second generation molecular
profiling methodologies make it possible to stratify
clinical trial participants, including those most likely
to benefit from the drug candidate and excluding those
who likely will not (pharmacogenomics-based stratification).
Clinical trials should attain more specific results
with smaller numbers of patients. Smaller
numbers mean lower costs (factor 2-10).
An additional benefit for trial participants and
internal review boards (IRBs) is that
stratification, given the correct biomarker, may
reduce or eliminate adverse events.

Molecular Profiling

The study of specific patterns (fingerprints) of proteins,
DNA, and/or mRNA and how these patterns correlate
with an individual's physical characteristics or
symptoms of disease.

Generic Health advice

• Exercise (Hypertrophic Cardiomyopathy)
• Drink your milk (MCM6 Lactose intolerance)
• Eat your green beans (glucose-6-phosphate
dehydrogenase Deficiency)
• & your grains (HLA-DQ2 – Celiac disease)
• & your iron (HFE - Hemochromatosis)
• Get more rest (HLA-DR2 - Narcolepsy)

Generic Health advice (UNLESS)

• Drink your milk (MCM6 Lactose intolerance)

Before molecular profiling …


• Flow cytometry correlates surface markers,
cell size and other parameters
• Circulating tumor cell assays (CTC’s)
quantitate the number of tumor cells in the
peripheral blood.
• Exosomes are 30-90 nm vesicles secreted by
a wide range of mammalian cell types.
• Immunohistochemistry (IHC) measures
protein expression, usually on the cell
surface.


• Gene sequencing for mutation detection

• Microarray for mRNA message detection
• RT-PCR for gene expression

• FISH analysis for gene copy number
• Comparative Genomic Hybridization (CGH) for
gene copy number

Basics of the “old” technology

• Clone the DNA.
• Generate a ladder of labeled (colored)
molecules that are different by 1 nucleotide.
• Separate mixture on some matrix.
• Detect fluorochrome by laser.
• Interpret peaks as string of DNA.
• Strings are 500 to 1,000 letters long
• 1 machine generates 57,000 nucleotides/run
• Assemble all strings into a genome.

Genetic Variation
Among People
Single nucleotide polymorphisms
(SNPs)
GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG

0.1% difference among
people
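
For two sequences that are already aligned without gaps, as in the example above, finding the differing positions is a direct comparison. A minimal sketch with an illustrative function name; positions are reported 1-based.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_in_a, base_in_b) for every differing position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


print(find_snps("GATTTAGATCGCGATAGAG", "GATTTAGATCTCGATAGAG"))  # [(11, 'G', 'T')]
```

At a 0.1% difference and a genome of about 3000 x 10^6 bases (the size quoted earlier in these slides), this corresponds to roughly 3 million differing positions between two people.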

The genome fits as an e-mail attachment

Second Generation DNA profiling

• Exome Sequencing (also known as
targeted exome capture) is an
efficient strategy to selectively
sequence the coding regions of the
genome to identify novel genes
associated with rare and common
disorders.
• 160K exons

Second Generation DNA profiling

Second Generation RNA profiling

Besides the 6000 protein-coding genes …

140 ribosomal RNA genes
275 transfer RNA genes
40 small nuclear RNA genes
>100 small nucleolar RNA genes

Function of RNA genes

pRNA in φ29 rotary packaging motor (Simpson
et al. Nature 408:745-750, 2000)
Cartilage-hair hypoplasia mapped to an RNA
(Ridanpää et al. Cell 104:195-203, 2001)
The human Prader-Willi critical region (Cavaillé
et al. PNAS 97:14035-7, 2000)

Second Generation RNA profiling

RNA genes can be hard to detect

UGAGGUAGUAGGUUGUAUAGU

C.elegans let-7; 21 nt
(Pasquinelli et al. Nature 408:86-89,2000)

Often small
Sometimes multicopy and redundant
Often not polyadenylated
(not represented in ESTs)
Immune to frameshift and nonsense
mutations
No open reading frame, no codon bias
Often evolving rapidly in primary sequence

ncRNAs in human genome

tRNA         600      SRP RNA            1
18S rRNA     200      RNase P RNA        1
5.8S rRNA    200      Telomerase RNA     1
28S rRNA     200      RNase MRP          1
5S rRNA      200      Y RNA              5
snoRNA       300      Vault              4
miRNA        250      7SK RNA            1
U1           40       Xist               1
U2           30       H19                1
U4           30       BIC                1
U5           30       Antisense RNAs     1000s?
U6           20       Cis reg regions    100s?
U4atac       5        Others             ?
U6atac       5
U11          5
U12          5

Mapping Structural Variation in Humans
CNVs: >1 kb segments
- Thought to be common:
12% of the genome
(Redon et al. 2006)
- Likely involved in phenotype
variation and disease
- Until recently most methods for
detection were low resolution
(>50 kb)

Size Distribution of CNV in a Human Genome

Defining Epigenetics

• Reversible changes in gene
expression/function
• Without changes in DNA
sequence
• Can be inherited from
precursor cells
• Allows intrinsic signals to be integrated
with environmental signals
(including diet)

[Diagram labels: Genome (DNA), Epigenome (Chromatin), Gene Expression, Phenotype]

CONFIDENTIAL

Methylation I Epigenetics | Oncology | Biomarker
I NEXT-GEN | PharmacoDX | CRC


Epigenetic Regulation:
Post Translational Modifications to Histones and Base Changes in DNA

• Epigenetic modifications of histones and DNA include:
– Histone acetylation and methylation, and DNA methylation

[Figure labels: Histone Methylation (Me), Histone Acetylation (Ac), DNA Methylation]


MGMT Biology
O6 Methyl-Guanine
Methyl Transferase
Essential DNA Repair Enzyme

Removes alkyl groups from damaged guanine
bases

Healthy individual:
- MGMT is an essential DNA repair enzyme
Loss of MGMT activity makes individuals susceptible
to DNA damage and prone to tumor development

Glioblastoma patient on alkylator chemotherapy:
- Patients with MGMT promoter methylation
have longer progression-free survival (PFS) and overall
survival (OS) when alkylating agents are used as chemotherapy


MGMT Promoter Methylation Predicts
Benefit from DNA-Alkylating Chemotherapy
Post-hoc subgroup analysis of a temozolomide clinical trial with primary glioblastoma
patients shows benefit for patients with MGMT promoter methylation

[Figure: median overall survival (months, axis 0 to 25) for non-methylated versus methylated MGMT gene; bar labels: radiotherapy, radiotherapy plus temozolomide; values shown: 12.7 months and 21.7 months. Study with 207 patients. Adapted from Hegi et al. NEJM 2005 352(10):1036-8.]


Genome-wide methylation
by methylation sensitive restriction enzymes


by probes


…. by next generation sequencing

[Figure: # markers versus # samples across the stages Discovery, Verification and Validation]


MBD_Seq

Condensed Chromatin DNA Sheared

Immobilized
Methyl Binding Domain
DNA Sheared


MBD_Seq

Immobilized
Methyl binding domain

MgCl2

Next Gen Sequencing
GA Illumina: 100 million reads


MBD_Seq
MGMT = dual core


# markers

1-2 million
MBD_Seq
methylation
cores
Discovery



Data integration
Correlation tracks

expression expression

Corr =-1 Corr = 1

methylation methylation
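
A correlation track of this kind needs, per locus, one correlation value between methylation and expression across samples. The slides do not say which coefficient is used; the sketch below assumes Pearson's r, and the sample values are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample values at one locus (not real data).
methylation = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [9.0, 7.5, 5.0, 3.5, 1.0]
print(round(pearson(methylation, expression), 3))
```

A value near -1 means expression falls as methylation rises, the pattern expected when promoter methylation silences a gene; a value near +1 means the two rise together.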

Correlation track
in GBM @ MGMT

+1

-1

# markers

MBD_Seq

Discovery

454_BT_Seq
Verification MSP
Validation


Deep Sequencing

unmethylated alleles

methylated alleles less methylation

more methylation

GCATCGTGACTTACGACTGATCGATGGATGCTA

Deep MGMT
Heterogenic complexity

Translational Medicine: An inconvenient truth

• 1% of genome codes for proteins, however
more than 90% is transcribed
• Less than 10% of protein experimentally
measured can be “explained” from the
genome
• 1 genome ? Structural variation
• > 200 Epigenomes ??

• Space/time continuum …

Translational Medicine: An inconvenient truth

• 1% of genome codes for proteins, however
more than 90% is transcribed
• Less than 10% of protein experimentally
measured can be “explained” from the
genome
• 1 genome ? Structural variation
• > 200 Epigenomes …

• “space/time” continuum

Cellular programming

Epigenetic (meta)information = stem cells

Cellular reprogramming

Tumor

Tumor
Development
and
Growth

Epigenetically
altered, self-
renewing cancer
stem cells

Cellular reprogramming

Gene-specific
Epigenetic
reprogramming

biobix
wvcrieki

biobix.be
bioinformatics.be
