2. Introduction
Basic gene regulation
• Proteins (transcription factors, TFs) recognise binding sites (sequence motifs) in gene regulatory regions
• The transcription factors stabilise the transcription complex
• Distal promoters (enhancers) interact through DNA looping
[Figure label: Michael Lones]
Finn Drabløs [tare.medisin.ntnu.no]
3. Motivation
De novo prediction of binding sites
• Make a set of co-regulated genes
– E.g. from microarray experiments, normally imperfect sets
• Extract assumed regulatory regions
– Normally a fixed region upstream from TSS of each gene
• Search for overrepresented patterns in these regions
– Use a model for what a motif should look like
• Consensus sequence with mismatches
• Position Weight Matrix (PWM) based on log odds scores for occurrences
– Use a strategy to find (local) optima for this model
• E.g. Gibbs sampling, expectation maximisation …
• Problem: More than 100 different methods
– Which methods are reliable?
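A PWM based on log odds scores can be sketched as follows. This is a minimal illustration, not the model used by any particular tool: the function names (`build_pwm`, `score`, `best_hit`), the uniform background, the pseudocount and the example sites are all assumptions made for the sketch.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PWM from aligned, equal-length binding sites."""
    background = background or {b: 0.25 for b in ALPHABET}  # assumed uniform
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log odds of the observed base frequency against the background.
        pwm.append({b: math.log2((counts[b] / total) / background[b])
                    for b in ALPHABET})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores for one sequence window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Made-up example sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
hit_score, hit_start = best_hit(pwm, "CCCTGACTCAGGG")
```

A de novo method would additionally have to search for the sites themselves (e.g. by Gibbs sampling or expectation maximisation); the sketch only covers the scoring model.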
4. Motivation
Benchmarking of de novo tools
• Tompa et al., Nature Biotech 23, 137-144 (2005)
• Tested 14 different tools for motif discovery
• Used 52 data sets from fly (6), human (26), mouse (12) and yeast (8)
• Used data sets with real (Transfac) binding sites in different sequence contexts
– “real” – The actual promoter sequences
– “generic” – Randomly chosen promoter sequences from same genome
– “markov” – Sequences generated by Markov chain of order 3
• Measured performance at nucleotide level
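The “markov” sequence context can be illustrated with a small order-3 Markov chain generator. This is a sketch under assumptions: the function names, the training on a list of plain DNA strings and the uniform fallback for unseen contexts are choices made here, not details taken from the benchmark.

```python
import random
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Count which base follows each context of `order` bases."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            transitions[seq[i:i + order]][seq[i + order]] += 1
    return transitions

def generate(transitions, length, order=3, rng=None):
    """Sample a sequence of the given length from the trained chain."""
    rng = rng or random.Random()
    out = list(rng.choice(sorted(transitions)))  # start from a seen context
    while len(out) < length:
        counts = transitions.get("".join(out[-order:]))
        if not counts:
            # Assumption: unseen context falls back to a uniform choice.
            out.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(counts.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

model = train_markov(["ACGTACGTACGTACGT"], order=3)
sample = generate(model, 50, order=3, rng=random.Random(0))
```

With real promoter sequences as training data, the generated sequences keep the local 4-mer composition of the genome but lose any longer-range structure.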
6. Motivation
Can we improve performance?
• Use better motif representations
– Hidden Markov Models
• Use better algorithms
– More exhaustive searching [TODAY!]
– Discriminative motif discovery
• Use better background models
– Real sequences (not Markov models) [TODAY!]
• Filter out false positives
– Identify “motif-like” solutions
– Identify regulatory regions
– Use co-occurrence of motifs [TODAY!]
• Modules, composite motifs
7. Approach
Composite motif discovery
• TFs act together as modules
• Modules are not completely unique
8. Algorithm
Basic definitions
• Frequent modules
– Modules (and motifs) can be ranked by support
• Fraction of sequences where the module (or motif) is found
– Support is monotonic
• Adding a motif to a module can never increase module support
• Specific modules
– Modules can be ranked by hit probability
• Probability that a sequence supports the module
– Hit probability is monotonic (as for support)
– Specific modules have low hit probability in background sequences
• Significant modules
– Modules can be ranked by significance
• Probability of seeing at least the observed support in background (low values: support in sequence set ≠ background)
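The support definition and its monotonicity can be illustrated with sets. In this sketch each motif is represented only by the set of sequence indices in which it is found; the hit sets, the function names and the omission of any distance constraint between motifs are assumptions made for illustration.

```python
def support_set(module, motif_hits, n_sequences):
    """Sequences that contain every motif of the module."""
    supported = set(range(n_sequences))
    for motif in module:
        supported &= motif_hits[motif]
    return supported

def support(module, motif_hits, n_sequences):
    """Fraction of sequences where the module is found."""
    return len(support_set(module, motif_hits, n_sequences)) / n_sequences

# Hypothetical hit sets over 5 sequences (indices 0..4).
motif_hits = {"AP1": {0, 1, 2, 4}, "NFAT": {1, 2, 3}, "Ets": {2, 4}}

s1 = support(["AP1"], motif_hits, 5)
s2 = support(["AP1", "NFAT"], motif_hits, 5)
s3 = support(["AP1", "NFAT", "Ets"], motif_hits, 5)
```

Because each added motif can only intersect the support set further, the values can never increase from `s1` to `s3`.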
9. Algorithm
Search tree
• Discretized single motifs {1, 2, 3, …} organised as an implicit search tree
• Support set H and hit probability P are iteratively computed (monotonicity)
– Initially H is the full sequence set and P is 1
• Search tree is efficiently pruned (indicated with X) based on H and P
• Final output can be ranked by module significance
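A depth-first walk over such an implicit search tree might look like the sketch below. It is a simplified illustration, not the Compo implementation: here only the support set H is used for pruning, while the hit probability P is carried along and used as a reporting filter. The function name, the thresholds, the size limit and the example data are all hypothetical.

```python
def enumerate_modules(motif_hits, seq_probs, n_sequences,
                      min_support, max_hit_prob, max_size=3):
    """Enumerate motif combinations, pruning branches with too low support."""
    motifs = sorted(motif_hits)
    results = []

    def extend(start, module, H, P):
        for idx in range(start, len(motifs)):
            motif = motifs[idx]
            new_H = H & motif_hits[motif]
            if len(new_H) / n_sequences < min_support:
                continue  # prune: adding motifs can never increase support
            new_P = P * seq_probs[motif]
            new_module = module + [motif]
            if new_P <= max_hit_prob:
                results.append((tuple(new_module),
                                len(new_H) / n_sequences, new_P))
            if len(new_module) < max_size:
                extend(idx + 1, new_module, new_H, new_P)

    # Initially H is the full sequence set and P is 1.
    extend(0, [], set(range(n_sequences)), 1.0)
    return results

# Hypothetical input: hit sets and background sequence-level probabilities.
motif_hits = {"A": {0, 1, 2, 3}, "B": {0, 1, 2}, "C": {3}}
seq_probs = {"A": 0.5, "B": 0.4, "C": 0.1}
modules = enumerate_modules(motif_hits, seq_probs, 4,
                            min_support=0.5, max_hit_prob=0.3)
```

In this toy input only the module (A, B) is both frequent enough and specific enough; every branch containing C is cut as soon as its support drops below the threshold.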
10. Implementation
Module significance
• Position-level probability in background
– Probability of single motif at specific location
– Estimated from real DNA background sequences
• Sequence-level probability in background
– Probability of single motif at least once in given background sequence
– Estimated as union of position-level probabilities
• Hit-probability in background
– Probability of composite motif at least once in background sequence
– Estimated as product of the sequence-level probabilities of the individual motif components
• Significance p-value of observed support
– Probability of seeing at least observed support in background set
– Estimated as right tail of binomial distribution
• At least k out of n successes given hit-probability p
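The chain from position-level probabilities to a significance p-value can be sketched in a few lines. The sketch assumes independence between positions and between motif components, which is what makes the "union" and "product" steps simple; the function names are illustrative.

```python
from math import comb, prod

def sequence_level_prob(position_probs):
    """Probability of at least one motif hit in a sequence.

    Assumes independent positions: 1 - prod(1 - p_i).
    """
    return 1.0 - prod(1.0 - p for p in position_probs)

def hit_probability(seq_level_probs):
    """Probability that all motif components occur in one background
    sequence, assuming independent components."""
    return prod(seq_level_probs)

def binomial_right_tail(k, n, p):
    """Probability of at least k successes out of n with success prob p."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i)
               for i in range(k, n + 1))

# Example: module of two motifs, observed in 8 of 20 sequences.
p_hit = hit_probability([0.5, 0.2])
p_value = binomial_right_tail(8, 20, p_hit)
```

A small `p_value` means that the observed support would rarely be reached in a background set of the same size.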
11. Implementation
Problem specification
• Frequent and specific modules
– Use thresholds on support and specificity
– Complete solutions but multi-objective optimization
• Top-ranking modules
– Combine objectives into single measure, e.g. p-value
• Pareto-optimal modules
– Each objective is a separate dimension of optimality
– Return Pareto front of composite motifs
http://en.wikipedia.org/wiki/Pareto_efficiency
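Extracting a Pareto front can be sketched as below. The example uses two objectives only, support (higher is better) and p-value (lower is better); the candidate names and values are made up, and a real run could use more dimensions.

```python
def dominates(a, b):
    """True if a is at least as good as b on both objectives and strictly
    better on one. Objectives are (support, p_value)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {name: obj for name, obj in candidates.items()
            if not any(dominates(other, obj)
                       for other in candidates.values())}

# Hypothetical modules with (support, p_value).
candidates = {
    "m1": (0.9, 0.05),
    "m2": (0.6, 0.001),
    "m3": (0.5, 0.01),
    "m4": (0.9, 0.2),
}
front = pareto_front(candidates)
```

Here `m3` is dominated by `m2` and `m4` by `m1`, so the front contains `m1` and `m2`; neither of those is better than the other on both objectives.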
12. Implementation
Motif prediction flowchart
13. Benchmarking
Benchmark data set
• Known composite motifs from the TransCompel database
• Tests performance by adding “noise matrices” to input
– Matrices for TFs assumed not to bind in sequence set
• Will have random (false positive) hits
– Selected at random from Transfac
• Max noise level includes all Transfac matrices
– Similar to actual usage
• Searching for motifs consisting of unknown TFs
14. Benchmarking
General performance (nCC)
• Compo compared to several other tools
– TransCompel benchmark set
• Compo clearly has the best performance, in particular at realistic settings (high noise level)
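Assuming nCC here is the nucleotide-level correlation coefficient used in the Tompa et al. benchmark, it can be computed from per-nucleotide counts as in this sketch. Representing predictions and annotations as sets of positions is a choice made for the illustration, and returning 0 for an undefined coefficient is an assumption.

```python
from math import sqrt

def nucleotide_counts(predicted, annotated, length):
    """Count TP, FP, TN, FN over all nucleotide positions 0..length-1."""
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    tn = length - tp - fp - fn
    return tp, fp, tn, fn

def ncc(tp, fp, tn, fn):
    """Nucleotide-level correlation coefficient (Pearson on binary labels)."""
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumption: report 0.0 when the coefficient is undefined.
    return (tp * tn - fn * fp) / denom if denom else 0.0

# Made-up example: a 10 bp annotated site, prediction shifted by 5 bp.
annotated = set(range(10, 20))
predicted = set(range(15, 25))
value = ncc(*nucleotide_counts(predicted, annotated, 100))
```

The value ranges from -1 to 1, with 1 for a prediction that covers exactly the annotated nucleotides.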
15. Benchmarking
Background and support
• Compo gains performance from realistic background (real DNA) and support
– Random DNA based on multinomial sequence model
• Performance without real DNA background or support comparable to other tools
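Random DNA from a multinomial sequence model can be sketched as independent draws from fixed base frequencies. The function names and the tiny training input are illustrative, and the sketch assumes the input contains only A, C, G and T.

```python
import random
from collections import Counter

def base_frequencies(sequences):
    """Estimate single-base frequencies from a set of sequences."""
    counts = Counter(base for seq in sequences for base in seq)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def random_dna(freqs, length, rng=None):
    """Draw each position independently from the base frequencies."""
    rng = rng or random.Random()
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

freqs = base_frequencies(["AACG", "AATT"])
background = random_dna(freqs, 30, rng=random.Random(1))
```

Such a background matches the base composition of the input but none of its higher-order structure, which is the contrast with real DNA that the slide points at.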
16. Future development
Pareto front
• Pareto front on support, max motif distance and significance (colour)
• Compo prediction not optimal
– Compo predicted Ets and GATA
– Annotated motif is AP1 and NFAT
• Explore alternative solutions
• Explore parameter interactions
[Figure legend: X – NFAT, O – AP1]
17. Acknowledgements
The research group
• BiGR
– Drabløs, Finn
• Postdocs / Researchers
– Sætrom, Pål
– Kusnierczyk, Wacek
– Rye, Morten
– Klein, Jörn
– Anderssen, Endre
– Wang, Xinhui (ERCIM)
– Capatana, Ana (ERCIM, starting 2009)
• PhDs
– Bratlie, Marit Skyrud
– Klepper, Kjetil
– Saito, Takaya
– Lundbæk, Marie
– Håndstad, Tony
• Programmers / Technicians
– Johansen, Jostein
– Thomas, Laurent
– Olsen, Lene C.
• Others
– Solbakken, Trude
• Master students
– Bolstad, Kjersti
– Muiser, Iwe
– Sponberg, Bjørn
– Brands, Stef
– Skaland, Even
• Former members
– Sandve, Geir Kjetil
– Abul, Osman
– Schwalie, Petra
– Lones, Michael