Randomized Algorithms
and Motif Finding
Outline
1. Randomized QuickSort
2. Randomized Algorithms
3. Greedy Profile Motif Search
4. Gibbs Sampler
5. Random Projections
Outline - CHANGES
• Greedy Profile Motif Search – Animate "scoring a string with a profile" – Animate "P = most probable l-mer"
• Gibbs Sampler – Describe relative entropies
Section 1:
Randomized QuickSort
Randomized Algorithms
• Randomized Algorithm: Makes random rather than
deterministic decisions.
• The main advantage is that no input can reliably produce
worst-case results because the algorithm runs differently each
time.
• These algorithms are commonly used in situations where no
exact and fast algorithm is known.
Introduction to QuickSort
• QuickSort is a simple and efficient approach to sorting.
1. Select an element m from unsorted array c and divide the
array into two subarrays:
csmall = elements smaller than m
clarge = elements larger than m
2. Recursively sort the subarrays and combine them together
in sorted array csorted
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 1: Choose the first element as m
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { } clarge = { }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3 } clarge = { }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2 } clarge = { }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2 } clarge = { 8 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2, 4 } clarge = { 8 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2, 4, 5 } clarge = { 8 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2, 4, 5, 1 } clarge = { 8 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2, 4, 5, 1 } clarge = { 8, 7 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
csmall = { 3, 2, 4, 5, 1, 0 } clarge = { 8, 7 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 2: Split the array into csmall and clarge based on m.
csmall = { 3, 2, 4, 5, 1, 0 } clarge = { 8, 7, 9 }
c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 3: Recursively do the same thing to csmall and clarge until
each subarray has only one element or is empty.
csmall = { 3, 2, 4, 5, 1, 0 } clarge = { 8, 7, 9 }
With m = 3, csmall splits into { 2, 1, 0 } and { 4, 5 }; with m = 8, clarge splits into { 7 } and { 9 }.
With m = 2, { 2, 1, 0 } splits into { 1, 0 } and { empty }; with m = 4, { 4, 5 } splits into { empty } and { 5 }.
With m = 1, { 1, 0 } splits into { 0 } and { empty }.
QuickSort: Example
• Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9}
• Step 4: Combine the subarrays with m, working back out of the recursion as we build up the sorted array.
{ 0 } 1 { empty } → { 0, 1 }
{ 0, 1 } 2 { empty } → { 0, 1, 2 }
{ empty } 4 { 5 } → { 4, 5 }
{ 0, 1, 2 } 3 { 4, 5 } → { 0, 1, 2, 3, 4, 5 }
{ 7 } 8 { 9 } → { 7, 8, 9 }
csmall = { 0, 1, 2, 3, 4, 5 } clarge = { 7, 8, 9 }
QuickSort: Example
• Finally, we can assemble csmall and clarge with our original
choice of m, creating the sorted array csorted.
csmall = { 0, 1, 2, 3, 4, 5 }, m = 6, clarge = { 7, 8, 9 }
csorted = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
The QuickSort Algorithm
1. QuickSort(c)
2. if c consists of a single element
3. return c
4. m ← c1
5. Determine the set of elements csmall smaller than m
6. Determine the set of elements clarge larger than m
7. QuickSort(csmall)
8. QuickSort(clarge)
9. Combine csmall, m, and clarge into a single array, csorted
10. return csorted
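As a concrete illustration, here is a minimal Python sketch of the pseudocode above. It is not from the original slides: the names quicksort, c_small, and c_large are chosen for this example, and elements equal to the pivot are grouped with m so that duplicates are handled.

def quicksort(c):
    # Base case: an array of zero or one element is already sorted.
    if len(c) <= 1:
        return c
    m = c[0]                            # pivot: the first element (line 4)
    c_small = [x for x in c if x < m]   # elements smaller than m
    c_equal = [x for x in c if x == m]  # m and any duplicates of it
    c_large = [x for x in c if x > m]   # elements larger than m
    # Recursively sort the subarrays and combine them around m.
    return quicksort(c_small) + c_equal + quicksort(c_large)

print(quicksort([6, 3, 2, 8, 4, 5, 1, 7, 0, 9]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]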
QuickSort Analysis: Optimistic Outlook
• Runtime is based on our selection of m:
• A good selection will split c evenly so that |csmall | = |clarge |.
• For a sequence of good selections, the recurrence relation is:

T(n) = 2T(n/2) + const · n

where 2T(n/2) is the time it takes to sort two smaller arrays of size n/2 and const · n is the time it takes to split the array into 2 parts.
• In this case, the solution of the recurrence gives a runtime of O(n log n).
QuickSort Analysis: Pessimistic Outlook
• However, a poor selection of m will split c unevenly and in the
worst case, all elements will be greater or less than m so that
one subarray is full and the other is empty.
• For a sequence of poor selections, the recurrence relation is:

T(n) = T(n − 1) + const · n

where T(n − 1) is the time it takes to sort one array containing n − 1 elements and const · n is the time it takes to split the array into 2 parts (const is a positive constant).
• In this case, the solution of the recurrence gives a runtime of O(n²).
QuickSort Analysis
• QuickSort seems like an inefficient MergeSort.
• To improve QuickSort, we need to choose m to be a good "splitter."
• It can be proven that to achieve O(n log n) running time, we
don’t need a perfect split, just a reasonably good one. In fact,
if both subarrays are at least of size n/4, then the running time
will be O(n log n).
• This implies that half of the choices of m make good splitters.
Section 2:
Randomized Algorithms
A Randomized Approach to QuickSort
• To improve QuickSort, randomly select m.
• Since half of the elements will be good splitters, if we choose
m at random we will have a 50% chance that m will be a good
choice.
• This approach will make sure that no matter what input is
received, the expected running time is small.
The RandomizedQuickSort Algorithm
1. RandomizedQuickSort(c)
2. if c consists of a single element
3. return c
4. Choose element m uniformly at random from c
5. Determine the set of elements csmall smaller than m
6. Determine the set of elements clarge larger than m
7. RandomizedQuickSort(csmall)
8. RandomizedQuickSort(clarge)
9. Combine csmall , m, and clarge into a single array, csorted
10. return csorted
*Lines in red indicate the differences between QuickSort and RandomizedQuickSort
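For comparison, a minimal Python sketch of the randomized variant, structured like the QuickSort sketch above; the only change is how m is chosen:

import random

def randomized_quicksort(c):
    if len(c) <= 1:
        return c
    m = random.choice(c)                # line 4: the only change from QuickSort
    c_small = [x for x in c if x < m]
    c_equal = [x for x in c if x == m]
    c_large = [x for x in c if x > m]
    return (randomized_quicksort(c_small) + c_equal
            + randomized_quicksort(c_large))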
RandomizedQuickSort Analysis
• Worst-case runtime: O(n²)
• Expected runtime: O(n log n)
• Expected runtime is a good measure of the performance of
randomized algorithms; it is often more informative than worst
case runtimes.
• RandomizedQuickSort will always return the correct answer,
which offers us a way to classify Randomized Algorithms.
Two Types of Randomized Algorithms
1. Las Vegas Algorithm: Always produces the correct solution
(e.g. RandomizedQuickSort)
2. Monte Carlo Algorithm: Does not always return the correct
solution.
• Good Las Vegas Algorithms are always preferred, but they
are often hard to come by.
Section 3:
Greedy Profile Motif
Search
A New Motif Finding Approach
• Motif Finding Problem: Given a list of t sequences each of
length n, find the "best" pattern of length l that appears in each
of the t sequences.
• Previously: We solved the Motif Finding Problem using an
Exhaustive Search or a Greedy technique.
• Now: Randomly select possible locations and find a way to
greedily change those locations until we have converged to the
hidden motif.
Profiles Revisited
• Let s = (s1,...,st) be the set of starting positions for l-mers in our t sequences.
• The substrings corresponding to these starting positions will form:
• a t x l alignment matrix
• a 4 x l profile matrix P.
• We make a special note that the profile matrix will be defined
in terms of the frequency of letters, and not as the count of
letters.
Scoring Strings with a Profile
• Pr(a | P) is defined as the probability that an l-mer a was created by the profile P.
• If a is very similar to the consensus string of P, then Pr(a | P) will be high.
• If a is very different, then Pr(a | P) will be low.
• Formula for Pr(a | P):

Pr(a | P) = ∏ (i = 1 to l) P(ai, i)

where P(ai, i) is the profile entry for the letter ai in column i.
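To make the formula concrete, here is a minimal Python sketch; the name pr and the dict layout of P are choices made for this example, not part of the slides:

def pr(a, P):
    # Pr(a | P): multiply, over each position i, the profile entry
    # for the letter a[i] in column i.
    p = 1.0
    for i, nucleotide in enumerate(a):
        p *= P[nucleotide][i]
    return p

P = {'A': [1/2, 7/8, 3/8, 0,   1/8, 0],
     'C': [1/8, 0,   1/2, 5/8, 3/8, 0],
     'T': [1/8, 1/8, 0,   0,   1/4, 7/8],
     'G': [1/4, 0,   1/8, 3/8, 1/4, 1/8]}

print(pr('AAACCT', P))   # about 0.0336, the consensus string
print(pr('ATACAG', P))   # about 0.000229, a much less probable string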
Scoring Strings with a Profile
• Given a profile: P =
• The probability of the consensus string:
• Pr(AAACCT | P) = ???
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
• Given a profile: P =
• The probability of the consensus string:
• Pr(AAACCT | P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 =
0.033646
Scoring Strings with a Profile
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
• Given a profile: P =
• The probability of the consensus string:
• Pr(AAACCT | P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 =
0.033646
• The probability of a different string:
• Pr(ATACAG | P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = 0.000229
Scoring Strings with a Profile
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
P-Most Probable l-mer
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
P =
• Define the P-most probable l-mer from a sequence as the l-mer
contained in that sequence which has the highest probability of
being generated by the profile P.
• Example: Given a sequence = CTATAAACCTTACATC, find
the P-most probable l-mer.
• Find Pr(a | P) of every possible 6-mer:
• First Try: the 6-mer starting at position 1, CTATAA
• Second Try: the 6-mer starting at position 2, TATAAA
• Third Try: the 6-mer starting at position 3, ATAAAC
• Continue this process to evaluate every 6-mer.
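This sliding-window scan is straightforward to code; a minimal sketch, building on the pr function and profile P from the earlier example:

def p_most_probable(sequence, P, l):
    # Slide a window of length l across the sequence and keep the
    # l-mer with the highest probability under profile P.
    best_lmer, best_prob = None, -1.0
    for i in range(len(sequence) - l + 1):
        a = sequence[i:i + l]
        if pr(a, P) > best_prob:
            best_lmer, best_prob = a, pr(a, P)
    return best_lmer

print(p_most_probable('CTATAAACCTTACATC', P, 6))   # AAACCT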
P-Most Probable l-mer
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
P-Most Probable l-mer
• Compute Pr(a | P) for every possible 6-mer:
String, Highlighted in Red Calculations prob(a|P)
CTATAAACCTTACAT 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0
CTATAAACCTTACAT 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336
CTATAAACCTTACAT 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299
CTATAAACCTTACAT 1/2 x 0 x 1/2 x 0 x 1/4 x 0 0
CTATAAACCTTACAT 1/8 x 0 x 0 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0
CTATAAACCTTACAT 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004
P-Most Probable l-mer
• The P-Most Probable 6-mer in the sequence is thus AAACCT:
String, Highlighted in Red Calculations Prob(a|P)
CTATAAACCTTACAT 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/2 x 7/8 x 0 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 0
CTATAAACCTTACAT 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 .0336
CTATAAACCTTACAT 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 .0299
CTATAAACCTTACAT 1/2 x 0 x 1/2 x 0 x 1/4 x 0 0
CTATAAACCTTACAT 1/8 x 0 x 0 x 0 x 1/8 x 0 0
CTATAAACCTTACAT 1/8 x 1/8 x 0 x 0 x 3/8 x 0 0
CTATAAACCTTACAT 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 .0004
Dealing with Zeroes
• In our toy example Pr(a | P)=0 in many cases.
• In practice, there will be enough sequences so that the number
of elements in the profile with a frequency of zero is small.
• To avoid many entries with Pr(a | P) = 0, there exist techniques that replace zero with a very small number, so that a single zero in the profile matrix does not make the entire probability of a string zero (we will not address these techniques in detail here).
P-Most Probable l-mers in Many Sequences
• Find the P-most probable l-
mer in each of the
sequences.
CTATAAACGTTACATC
ATAGCGATTCGACTG
CAGCCCAGAACCCT
CGGTATACCTTACATC
TGCATTCAATAGCTTA
TATCCTTTCCACTCAC
CTCCAAATCCTTTACA
GGTCATCCTTTATCCT
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
P=
P-Most Probable l-mers in Many Sequences
• The P-Most Probable l-mers form a new profile.
CTATAAACGTTACATC
ATAGCGATTCGACTG
CAGCCCAGAACCCT
CGGTGAACCTTACATC
TGCATTCAATAGCTTA
TGTCCTGTCCACTCAC
CTCCAAATCCTTTACA
GGTCTACCTTTATCCT
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
Comparing New and Old Profiles
1 a a a c g t
2 a t a g c g
3 a a c c c t
4 g a a c c t
5 a t a g c t
6 g a c c t g
7 a t c c t t
8 t a c c t t
A 5/8 5/8 4/8 0 0 0
C 0 0 4/8 6/8 4/8 0
T 1/8 3/8 0 0 3/8 6/8
G 2/8 0 0 2/8 1/8 2/8
A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8
• Red = frequency increased, Blue = frequency decreased
Greedy Profile Motif Search
• Use P-most probable l-mers to adjust start positions until we reach a "best" profile; this is the motif.
1. Select random starting positions.
2. Create a profile P from the substrings at these starting
positions.
3. Find the P-most probable l-mer a in each sequence and
change the starting position to the starting position of a.
4. Compute a new profile based on the new starting positions
after each iteration and proceed until we cannot increase
the score anymore.
GreedyProfileMotifSearch Algorithm
1. GreedyProfileMotifSearch(DNA, t, n, l)
2. Randomly select starting positions s=(s1,…,st) from DNA
3. bestScore ← 0
4. while Score(s, DNA) > bestScore
5. Form profile P from s
6. bestScore ← Score(s, DNA)
7. for i ← 1 to t
8. Find a P-most probable l-mer a from the ith sequence
9. si ← starting position of a
10. return bestScore
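For concreteness, a runnable Python sketch of this pseudocode, reusing the p_most_probable and profile_with_pseudocounts sketches from earlier; the consensus-count Score below is one reasonable choice of scoring function, not necessarily the one the slides assume:

import random

def score(s, dna, l):
    # Consensus score: in each column of the alignment, count the most
    # common nucleotide among the l-mers starting at positions s.
    lmers = [seq[p:p + l] for seq, p in zip(dna, s)]
    return sum(max(col.count(n) for n in 'ACGT') for col in zip(*lmers))

def greedy_profile_motif_search(dna, t, n, l):
    s = [random.randint(0, n - l) for _ in range(t)]      # line 2
    best_score = 0                                        # line 3
    while score(s, dna, l) > best_score:                  # line 4
        lmers = [seq[p:p + l] for seq, p in zip(dna, s)]
        P = profile_with_pseudocounts(lmers)              # line 5
        best_score = score(s, dna, l)                     # line 6
        for i in range(t):                                # lines 7-9
            a = p_most_probable(dna[i], P, l)
            s[i] = dna[i].index(a)
    return best_score                                     # line 10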
GreedyProfileMotifSearch Analysis
• Since we choose starting positions randomly, there is little
chance that our guess will be close to an optimal motif,
meaning it will take a very long time to find the optimal motif.
• It is actually unlikely that the random starting positions will
lead us to the correct solution at all.
• Therefore this is a Monte Carlo algorithm.
• In practice, this algorithm is run many times with the hope that
random starting positions will be close to the optimal solution
simply by chance.
Section 4:
Gibbs Sampler
Gibbs Sampling
• GreedyProfileMotifSearch is probably not the best way to find
motifs.
• However, we can improve the algorithm by introducing Gibbs
Sampling, an iterative procedure that discards one l-mer after
each iteration and replaces it with a new one.
• Gibbs Sampling proceeds more slowly and chooses new l-
mers at random, increasing the odds that it will converge to the
correct solution.
Gibbs Sampling Algorithm
1. Randomly choose starting positions s = (s1,...,st) and form the set
of l-mers associated with these starting positions.
2. Randomly choose one of the t sequences.
3. Create a profile P from the other t – 1 sequences.
4. For each position in the removed sequence, calculate the
probability that the l-mer starting at that position was generated
by P.
5. Choose a new starting position for the removed sequence at
random based on the probabilities calculated in Step 4.
6. Repeat steps 2-5 until there is no improvement.
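The resampling loop is short in code. A minimal sketch of steps 2-5, reusing the pr and profile_with_pseudocounts sketches from earlier (the slides' example builds the profile from raw frequencies; pseudocounts are used here so every candidate position keeps a positive probability):

import random

def gibbs_sampler_step(dna, s, l):
    t = len(dna)
    i = random.randrange(t)                                 # step 2
    others = [dna[j][s[j]:s[j] + l] for j in range(t) if j != i]
    P = profile_with_pseudocounts(others)                   # step 3
    # Step 4: probability under P of the l-mer at each position of sequence i.
    weights = [pr(dna[i][p:p + l], P)
               for p in range(len(dna[i]) - l + 1)]
    # Step 5: choose the new start position in proportion to Pr(a | P).
    s[i] = random.choices(range(len(weights)), weights=weights)[0]
    return s

Step 6 is then an outer loop that calls gibbs_sampler_step until the score stops improving.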
Gibbs Sampling: Example
• Input: t = 5 sequences, motif length l = 8
1. GTAAACAATATTTATAGC
2. AAAATTTACCTTAGAAGG
3. CCGTACTGTCAAGCGTGG
4. TGAGTAAACGACGTCCCA
5. TACTTAACACCCTGTCAA
Gibbs Sampling: Example
1. Randomly choose starting positions, s =(s1,s2,s3,s4,s5) in the 5
sequences:
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
Gibbs Sampling: Example
2. Choose one of the sequences at random
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
Gibbs Sampling: Example
2. Choose one of the sequences at random: Sequence 2
s1=7 GTAAACAATATTTATAGC
s2=11 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
Gibbs Sampling: Example
3. Create profile P from l-mers in remaining 4 sequences:
1 A A T A T T T A
3 T C A A G C G T
4 G T A A A C G A
5 T A C T T A A C
A 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C 0 1/4 1/4 0 0 2/4 0 1/4
T 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G 1/4 0 0 0 1/4 0 3/4 0
Consensus String: T A A A T C G A
Gibbs Sampling: Example
4. Calculate Pr(a | P) for every possible 8-mer in the removed
sequence:
Strings Highlighted in Red Pr(a | P)
AAAATTTACCTTAGAAGG .000732
AAAATTTACCTTAGAAGG .000122
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG .000183
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
AAAATTTACCTTAGAAGG 0
Gibbs Sampling: Example
5. Create a distribution of probabilities of l-mers Pr(a | P), and
randomly select a new starting position based on this
distribution.
• To create this distribution, divide each probability
Pr(a | P) by the lowest probability:
• Ratio = 6 : 1 : 1.5
Starting Position 1: Pr( AAAATTTA | P ) /.000122 = .000732 / .000122 = 6
Starting Position 2: Pr( AAATTTAC | P )/.000122 = .000122 / .000122 = 1
Starting Position 8: Pr( ACCTTAGA | P )/.000122 = .000183 / .000122 = 1.5
• Define probabilities of starting positions according to the
computed ratios.
• Select the start position probabilistically based on these
ratios.
Turning Ratios into Probabilities
Pr(Selecting Starting Position 1): 6/(6+1+1.5) = 0.706
Pr(Selecting Starting Position 2): 1/(6+1+1.5) = 0.118
Pr(Selecting Starting Position 8): 1.5/(6+1+1.5) = 0.176
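In code, the normalization and the probabilistic selection are a single step; a small sketch with the numbers above:

import random

ratios = {1: 6.0, 2: 1.0, 8: 1.5}                  # starting position -> ratio
total = sum(ratios.values())                       # 8.5
probs = {p: r / total for p, r in ratios.items()}  # {1: 0.706, 2: 0.118, 8: 0.176}

# Sample the new starting position according to these probabilities.
new_start = random.choices(list(probs), weights=list(probs.values()))[0]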
Gibbs Sampling: Example
• Assume we select the substring with the highest probability. Then we are left with the following new substrings and starting positions:
s1=7 GTAAACAATATTTATAGC
s2=1 AAAATTTACCTTAGAAGG
s3=9 CCGTACTGTCAAGCGTGG
s4=4 TGAGTAAACGACGTCCCA
s5=1 TACTTAACACCCTGTCAA
Gibbs Sampling: Example
6. We iterate the procedure again with the above starting
positions until we cannot improve the score.
Gibbs Sampling in Practice
• Gibbs sampling needs to be modified when applied to samples
with unequal distributions of nucleotides (relative entropy
approach).
• Gibbs sampling often converges to locally optimal motifs rather
than globally optimal motifs.
• Needs to be run with many randomly chosen seeds to achieve
good results.
Section 5:
Random Projections
Another Randomized Approach
• The Random Projection Algorithm is an alternative way to
solve the Motif Finding Problem.
• Guiding Principle: Some instances of a motif agree on a
subset of positions.
• However, it is unclear how to find these "non-mutated" positions.
• To bypass the effect of mutations within a motif, we randomly select a subset of positions in the pattern, creating a projection of the pattern.
• We then search for the projection in the hope that the selected positions are not affected by mutations in most instances of the motif.
Projections: Formation
• Choose k positions in a string of length l.
• Concatenate nucleotides at the chosen k positions to form a k-
tuple.
• This can be viewed as a projection of l-dimensional space onto
k-dimensional subspace.
• Projection = (2, 4, 5, 7, 11, 12, 13)
• Example: the l-mer ATGGCATTCAGATTC (l = 15) projects to TGCTGAT (k = 7).
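Forming a projection is a one-line operation; a minimal sketch (project is a name chosen here; positions are 1-based as on the slide):

def project(lmer, positions):
    # Keep only the nucleotides at the chosen positions.
    return ''.join(lmer[p - 1] for p in positions)

print(project('ATGGCATTCAGATTC', (2, 4, 5, 7, 11, 12, 13)))   # TGCTGAT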
Random Projections Algorithm
• Select k out of l positions
uniformly at random.
• For each l-tuple in input
sequences, hash into bucket
based on letters at k selected
positions.
• Recover motif from enriched
bucket that contains many l-
tuples.
• Example: with the projection (1, 2, 5, 7), the 7-mer TGCACCT from the input sequence …TCAATGCACCTAT... hashes into the bucket labeled TGCT.
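The hashing step sketched in Python, reusing project from above (the name hash_into_buckets and the defaultdict layout are illustrative assumptions):

from collections import defaultdict

def hash_into_buckets(sequences, l, positions):
    # Hash every l-mer x of every input sequence into the bucket
    # labeled by its projection h(x).
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            x = seq[i:i + l]
            buckets[project(x, positions)].append(x)
    return buckets

Buckets whose lists grow longer than the threshold s are the enriched ones.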
Random Projections Algorithm
• Some projections will fail to detect motifs, but if we try many of them, the probability that one of the buckets fills up increases.
• In the example below, the bucket **GC*AC is "bad" while the bucket AT**G*C is "good" for the (7,2) motif ATGCGTC:
...ccATCCGACca...
...ttATGAGGCtc...
...ctATAAGTCgc...
...tcATGTGACac...
Random Projections Algorithm: Example
• l = 7 (motif size), k = 4 (projection size)
• Projection: (1,2,5,7)
• Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
• The 7-mer ATCCGAC hashes (via positions 1, 2, 5, 7) into the bucket ATGC, and the 7-mer GCCTTAC hashes into the bucket GCTC.
Hashing and Buckets
• Hash function h(x) is obtained from k positions of projection.
• Buckets are labeled by values of h(x).
• Enriched Buckets: Contain more than s l-tuples, for some chosen threshold parameter s.
• Example bucket labels for k = 4: ATTC, CATC, GCTC, ATGC.
Motif Refinement
• How do we recover the motif from the sequences in the
enriched buckets?
• k nucleotides come from the hash value of the bucket.
• Use the information in the other l − k positions as a starting point for a local refinement scheme, e.g. the Gibbs sampler.
• Example: the enriched bucket ATGC contains the l-mers ATCCGAC, ATGAGGC, ATAAGTC, and ATGCGAC; the local refinement algorithm yields the candidate motif ATGCGAC.
Synergy Between Random Projection and Gibbs
• Random Projection is a procedure for finding good starting points:
Every enriched bucket is a potential starting point.
• Feeding these starting points into existing algorithms (like Gibbs
sampler) provides a good local search in vicinity of every starting
point.
• These algorithms work particularly well for "good" starting points.
Building Profiles from Buckets
• Example: the bucket ATGC contains the l-mers ATCCGAC, ATGAGGC, ATAAGTC, and ATGTGAC, which form the profile P:

A 1 0 .25 .50 0 .50 0
C 0 0 .25 .25 0 0 1
G 0 0 .50 0 1 .25 0
T 0 1 0 .25 0 .25 0

• Running the Gibbs sampler with starting point P produces the refined profile P*.
Motif Refinement
• For each bucket h containing more than s sequences, form
profile P(h).
• Use Gibbs sampler algorithm with starting point P(h) to
obtain refined profile P*.
Random Projection Algorithm: A Single Iteration
• Choose a random k-projection.
• Hash each l-mer x in input sequence into bucket labeled by
h(x).
• From each enriched bucket (e.g., a bucket with more than s
sequences), form profile P and perform Gibbs sampler motif
refinement.
• The candidate motif is found by selecting the best motif among the refinements of all enriched buckets.
Choosing Projection Size
• Choose k small enough so that several motif instances hash to
the same bucket.
• Choose k large enough to avoid contamination by spurious l-
mers:
4^k > t(n − l + 1)

That is, the number of possible bucket labels should exceed the total number of l-mers in the input.
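For instance (an illustrative calculation, not from the slides): with t = 20 sequences of length n = 600 and motif length l = 15, t(n − l + 1) = 20 × 586 = 11,720. Since 4^6 = 4,096 but 4^7 = 16,384, the smallest projection size satisfying the bound is k = 7.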
How Many Iterations?
• Planted Bucket: Bucket with hash value h(M), where M is the
motif.
• Choose m = number of iterations such that Pr(planted bucket contains at least s sequences in at least one of m iterations) = 0.95.
• This probability is readily computable since iterations form a
sequence of independent Bernoulli trials.
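Concretely, if a single iteration fills the planted bucket with at least s sequences with probability p, then Pr(success in at least one of m iterations) = 1 − (1 − p)^m, so choosing m ≥ log(0.05) / log(1 − p) gives the 0.95 guarantee; for example, p = 0.1 requires m ≈ 29 iterations.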
Expectation Maximization (EM)
• S = {x(1),…,x(t)}: set of input sequences.
• Given: a probabilistic motif model W(θ) depending on unknown parameters θ, and a background probability distribution P.
• Find the value θmax that maximizes the likelihood ratio:

Pr(S | W(θmax), P) / Pr(S | P)

• EM is a local optimization scheme. It requires a starting value θ0.
EM Motif Refinement
• For each input sequence x(i), return l-tuple y(i) which
maximizes likelihood ratio:
Pr(y(i) | W(θh)) / Pr(y(i) | P)

• T = { y(1), y(2),…,y(t) }
• C(T) = consensus string
Mais conteúdo relacionado

Mais procurados

Counting sort(Non Comparison Sort)
Counting sort(Non Comparison Sort)Counting sort(Non Comparison Sort)
Counting sort(Non Comparison Sort)Hossain Md Shakhawat
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...
A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...
A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...IJERD Editor
 
11. Linear Models
11. Linear Models11. Linear Models
11. Linear ModelsFAO
 
Time series data mining techniques
Time series data mining techniquesTime series data mining techniques
Time series data mining techniquesShanmukha S. Potti
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMijfls
 
Arima model (time series)
Arima model (time series)Arima model (time series)
Arima model (time series)Kumar P
 
Engineering Numerical Analysis Lecture-1
Engineering Numerical Analysis Lecture-1Engineering Numerical Analysis Lecture-1
Engineering Numerical Analysis Lecture-1Muhammad Waqas
 
Linear Regression
Linear RegressionLinear Regression
Linear RegressionVARUN KUMAR
 
Finding Top-k Similar Graphs in Graph Database @ ReadingCircle
Finding Top-k Similar Graphs in Graph Database @ ReadingCircleFinding Top-k Similar Graphs in Graph Database @ ReadingCircle
Finding Top-k Similar Graphs in Graph Database @ ReadingCirclecharlingual
 
13. Cubist
13. Cubist13. Cubist
13. CubistFAO
 
Some Thoughts on Sampling
Some Thoughts on SamplingSome Thoughts on Sampling
Some Thoughts on SamplingDon Sheehy
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientFabian Pedregosa
 
A New Approach on the Log - Convex Orderings and Integral inequalities of the...
A New Approach on the Log - Convex Orderings and Integral inequalities of the...A New Approach on the Log - Convex Orderings and Integral inequalities of the...
A New Approach on the Log - Convex Orderings and Integral inequalities of the...inventionjournals
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of AlgorithmsArvind Krishnaa
 
Poggi analytics - star - 1a
Poggi   analytics - star - 1aPoggi   analytics - star - 1a
Poggi analytics - star - 1aGaston Liberman
 
Application of interpolation and finite difference
Application of interpolation and finite differenceApplication of interpolation and finite difference
Application of interpolation and finite differenceManthan Chavda
 

Mais procurados (19)

Counting sort(Non Comparison Sort)
Counting sort(Non Comparison Sort)Counting sort(Non Comparison Sort)
Counting sort(Non Comparison Sort)
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...
A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...
A Detailed Comparative Study between Reduced Order Cumming Observer & Reduced...
 
11. Linear Models
11. Linear Models11. Linear Models
11. Linear Models
 
MATLAB - Arrays and Matrices
MATLAB - Arrays and MatricesMATLAB - Arrays and Matrices
MATLAB - Arrays and Matrices
 
Time series data mining techniques
Time series data mining techniquesTime series data mining techniques
Time series data mining techniques
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Arima model (time series)
Arima model (time series)Arima model (time series)
Arima model (time series)
 
Engineering Numerical Analysis Lecture-1
Engineering Numerical Analysis Lecture-1Engineering Numerical Analysis Lecture-1
Engineering Numerical Analysis Lecture-1
 
Linear Regression
Linear RegressionLinear Regression
Linear Regression
 
Finding Top-k Similar Graphs in Graph Database @ ReadingCircle
Finding Top-k Similar Graphs in Graph Database @ ReadingCircleFinding Top-k Similar Graphs in Graph Database @ ReadingCircle
Finding Top-k Similar Graphs in Graph Database @ ReadingCircle
 
13. Cubist
13. Cubist13. Cubist
13. Cubist
 
Some Thoughts on Sampling
Some Thoughts on SamplingSome Thoughts on Sampling
Some Thoughts on Sampling
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
A New Approach on the Log - Convex Orderings and Integral inequalities of the...
A New Approach on the Log - Convex Orderings and Integral inequalities of the...A New Approach on the Log - Convex Orderings and Integral inequalities of the...
A New Approach on the Log - Convex Orderings and Integral inequalities of the...
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of Algorithms
 
Poggi analytics - star - 1a
Poggi   analytics - star - 1aPoggi   analytics - star - 1a
Poggi analytics - star - 1a
 
Ada boost
Ada boostAda boost
Ada boost
 
Application of interpolation and finite difference
Application of interpolation and finite differenceApplication of interpolation and finite difference
Application of interpolation and finite difference
 

Semelhante a Ch12 randalgs

ICANN19: Model-Agnostic Explanations for Decisions using Minimal Pattern
ICANN19: Model-Agnostic Explanations for Decisions using Minimal PatternICANN19: Model-Agnostic Explanations for Decisions using Minimal Pattern
ICANN19: Model-Agnostic Explanations for Decisions using Minimal PatternKohei Asano
 
ADA Unit — 2 Greedy Strategy and Examples | RGPV De Bunkers
ADA Unit — 2 Greedy Strategy and Examples | RGPV De BunkersADA Unit — 2 Greedy Strategy and Examples | RGPV De Bunkers
ADA Unit — 2 Greedy Strategy and Examples | RGPV De BunkersRGPV De Bunkers
 
Brock Butlett Time Series-Great Lakes
Brock Butlett Time Series-Great Lakes Brock Butlett Time Series-Great Lakes
Brock Butlett Time Series-Great Lakes Brock Butlett
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniquestalktoharry
 
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From ScratchPPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From ScratchJisang Yoon
 
Evaluation of programs codes using machine learning
Evaluation of programs codes using machine learningEvaluation of programs codes using machine learning
Evaluation of programs codes using machine learningVivek Maskara
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Unit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptxUnit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptxssuser01e301
 
Efficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingEfficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingHsing-chuan Hsieh
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means ClusteringJunghoon Kim
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithmDarshak Mehta
 
MODULE_05-Matrix Decomposition.pptx
MODULE_05-Matrix Decomposition.pptxMODULE_05-Matrix Decomposition.pptx
MODULE_05-Matrix Decomposition.pptxAlokSingh205089
 

Semelhante a Ch12 randalgs (20)

Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
ICANN19: Model-Agnostic Explanations for Decisions using Minimal Pattern
ICANN19: Model-Agnostic Explanations for Decisions using Minimal PatternICANN19: Model-Agnostic Explanations for Decisions using Minimal Pattern
ICANN19: Model-Agnostic Explanations for Decisions using Minimal Pattern
 
ADA Unit — 2 Greedy Strategy and Examples | RGPV De Bunkers
ADA Unit — 2 Greedy Strategy and Examples | RGPV De BunkersADA Unit — 2 Greedy Strategy and Examples | RGPV De Bunkers
ADA Unit — 2 Greedy Strategy and Examples | RGPV De Bunkers
 
Brock Butlett Time Series-Great Lakes
Brock Butlett Time Series-Great Lakes Brock Butlett Time Series-Great Lakes
Brock Butlett Time Series-Great Lakes
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
Computer Science Exam Help
Computer Science Exam Help Computer Science Exam Help
Computer Science Exam Help
 
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From ScratchPPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
PPT - AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
 
Evaluation of programs codes using machine learning
Evaluation of programs codes using machine learningEvaluation of programs codes using machine learning
Evaluation of programs codes using machine learning
 
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
MUMS: Transition & SPUQ Workshop - Gradient-Free Construction of Active Subsp...
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
DAA Notes.pdf
DAA Notes.pdfDAA Notes.pdf
DAA Notes.pdf
 
ARIMA.pptx
ARIMA.pptxARIMA.pptx
ARIMA.pptx
 
Unit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptxUnit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptx
 
Efficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingEfficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketching
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
Selection K in K-means Clustering
Selection K in K-means ClusteringSelection K in K-means Clustering
Selection K in K-means Clustering
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
 
Ch8 Arrays
Ch8 ArraysCh8 Arrays
Ch8 Arrays
 
MODULE_05-Matrix Decomposition.pptx
MODULE_05-Matrix Decomposition.pptxMODULE_05-Matrix Decomposition.pptx
MODULE_05-Matrix Decomposition.pptx
 

Mais de BioinformaticsInstitute

Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsBioinformaticsInstitute
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...BioinformaticsInstitute
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкBioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр ПредеусBioinformaticsInstitute
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...BioinformaticsInstitute
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...BioinformaticsInstitute
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)BioinformaticsInstitute
 

Mais de BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Último

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Ch12 randalgs

  • 2. Outline 1. Randomized QuickSort 2. Randomized Algorithms 3. Greedy Profile Motif Search 4. Gibbs Sampler 5. Random Projections
  • 3. Outline - CHANGES • Greedy Profile Motif Search – Animate ―scoring a string with a profile‖ – Animate ―P=most probable l-mer‖ • Gibbs Sampler – Describe relative entropies
  • 5. Randomized Algorithms • Randomized Algorithm: Makes random rather than deterministic decisions. • The main advantage is that no input can reliably produce worst-case results because the algorithm runs differently each time. • These algorithms are commonly used in situations where no exact and fast algorithm is known.
  • 6. Introduction to QuickSort • QuickSort is a simple and efficient approach to sorting. 1. Select an element m from unsorted array c and divide the array into two subarrays: csmall = elements smaller than m clarge = elements larger than m 2. Recursively sort the subarrays and combine them together in sorted array csorted
  • 7. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 1: Choose the first element as m c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
  • 8. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { } clarge = { }
  • 9. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3 } clarge = { }
  • 10. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2 } clarge = { }
  • 11. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2 } clarge = { 8 }
  • 12. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2, 4 } clarge = { 8 }
  • 13. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2, 4, 5 } clarge = { 8 }
  • 14. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2, 4, 5, 1 } clarge = { 8 }
  • 15. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2, 4, 5, 1 } clarge = { 8, 7 }
  • 16. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 } csmall = { 3, 2, 4, 5, 1, 0 } clarge = { 8, 7 }
  • 17. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 2: Split the array into csmall and clarge based on m. csmall = { 3, 2, 4, 5, 1, 0 } clarge = { 8, 7, 9 } c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9 }
  • 18. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 3: Recursively do the same thing to csmall and clarge until each subarray has only one element or is empty. m = 3 m = 8 { 2, 1, 0 } { 4, 5 } { 7 } { 9 } m = 2 { 1, 0 } { empty} { empty } { 5 } m = 4 m = 1 { 0 } { empty } csmall = { 3, 2, 4, 5, 1, 0 } clarge = { 8, 7, 9 }
  • 19. QuickSort: Example • Given an array: c = { 6, 3, 2, 8, 4, 5, 1, 7, 0, 9} • Step 4: Combine the two arrays with m working back out of the recursion as we build together the sorted array. csmall = { 0, 1, 2, 3, 4, 5 } clarge = { 7, 8, 9 } { 0, 1, 2 } 3 { 4, 5 } { 7 } 8 { 9 } { 0, 1 } 2 { empty } { empty } 4 { 5 } { 0 } 1 { empty }
  • 20. QuickSort: Example • Finally, we can assemble csmall and clarge with our original choice of m, creating the sorted array csorted. csmall = { 0, 1, 2, 3, 4, 5 } clarge = { 7, 8, 9 }m = 6 csorted = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
  • 21. The QuickSort Algorithm 1. QuickSort(c) 2. if c consists of a single element 3. return c 4. m  c1 5. Determine the set of elements csmall smaller than m 6. Determine the set of elements clarge larger than m 7. QuickSort(csmall) 8. QuickSort(clarge) 9. Combine csmall, m, and clarge into a single array, csorted 10. return csorted
  • 22. QuickSort Analysis: Optimistic Outlook • Runtime is based on our selection of m: • A good selection will split c evenly so that |csmall | = |clarge |. • For a sequence of good selections, the recurrence relation is: • In this case, the solution of the recurrence gives a runtime of O(n log n). Time it takes to split the array into 2 parts The time it takes to sort two smaller arrays of size n/2 T n   2T n 2      constant  n
  • 23. QuickSort Analysis: Pessimistic Outlook • However, a poor selection of m will split c unevenly and in the worst case, all elements will be greater or less than m so that one subarray is full and the other is empty. • For a sequence of poor selection, the recurrence relation is: • In this case, the solution of the recurrence gives runtime O(n2). The time it takes to sort one array containing n-1 elements Time it takes to split the array into 2 parts where const is a positive constant  T n   T n  1   constant  n
  • 24. QuickSort Analysis • QuickSort seems like an ineffecient MergeSort. • To improve QuickSort, we need to choose m to be a good ―splitter.‖ • It can be proven that to achieve O(n log n) running time, we don’t need a perfect split, just a reasonably good one. In fact, if both subarrays are at least of size n/4, then the running time will be O(n log n). • This implies that half of the choices of m make good splitters.
  • 26. A Randomized Approach to QuickSort • To improve QuickSort, randomly select m. • Since half of the elements will be good splitters, if we choose m at random we will have a 50% chance that m will be a good choice. • This approach will make sure that no matter what input is received, the expected running time is small.
  • 27. The RandomizedQuickSort Algorithm 1. RandomizedQuickSort(c) 2. if c consists of a single element 3. return c 4. Choose element m uniformly at random from c 5. Determine the set of elements csmall smaller than m 6. Determine the set of elements clarge larger than m 7. RandomizedQuickSort(csmall) 8. RandomizedQuickSort(clarge) 9. Combine csmall , m, and clarge into a single array, csorted 10.return csorted *Lines in red indicate the differences between QuickSort and RandomizedQuickSort
  • 28. RandomizedQuickSort Analysis • Worst case runtime: O(m2) • Expected Runtime: O(m log m). • Expected runtime is a good measure of the performance of randomized algorithms; it is often more informative than worst case runtimes. • RandomizedQuickSort will always return the correct answer, which offers us a way to classify Randomized Algorithms.
  • 29. Two Types of Randomized Algorithms 1. Las Vegas Algorithm: Always produces the correct solution (ie. RandomizedQuickSort) 2. Monte Carlo Algorithm: Does not always return the correct solution. • Good Las Vegas Algorithms are always preferred, but they are often hard to come by.
  • 31. A New Motif Finding Approach • Motif Finding Problem: Given a list of t sequences each of length n, find the ―best‖ pattern of length l that appears in each of the t sequences. • Previously: We solved the Motif Finding Problem using an Exhaustive Search or a Greedy technique. • Now: Randomly select possible locations and find a way to greedily change those locations until we have converged to the hidden motif.
  • 32. Profiles Revisited • Let s=(s1,...,st) be the set of starting positions for l-mers in our t sequences. • • The substrings corresponding to these starting positions will form: • t x l alignment matrix • 4 x l profile matrix P. • We make a special note that the profile matrix will be defined in terms of the frequency of letters, and not as the count of letters.
• 33. Scoring Strings with a Profile • Pr(a | P) is defined as the probability that an l-mer a was created by the profile P. • If a is very similar to the consensus string of P, then Pr(a | P) will be high. • If a is very different, then Pr(a | P) will be low. • Formula: Pr(a | P) = ∏_{i=1}^{l} P(a_i, i), where P(a_i, i) is the profile frequency of letter a_i at position i.
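• As a sketch, Pr(a | P) is easy to compute in Python; storing the profile as a dictionary that maps each nucleotide to its list of per-position frequencies is an assumed representation, not from the slides:

def pr(a, P):
    # Multiply the profile frequency of each letter of a at its position.
    p = 1.0
    for i, nucleotide in enumerate(a):
        p *= P[nucleotide][i]
    return p

# With the profile P shown on the next slide:
# P = {'A': [1/2, 7/8, 3/8, 0, 1/8, 0], 'C': [1/8, 0, 1/2, 5/8, 3/8, 0],
#      'T': [1/8, 1/8, 0, 0, 1/4, 7/8], 'G': [1/4, 0, 1/8, 3/8, 1/4, 1/8]}
# pr('AAACCT', P) returns 0.0336..., matching the slides.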
• 34. Scoring Strings with a Profile • Given a profile:
P =
A: 1/2 7/8 3/8 0 1/8 0
C: 1/8 0 1/2 5/8 3/8 0
T: 1/8 1/8 0 0 1/4 7/8
G: 1/4 0 1/8 3/8 1/4 1/8
• The probability of the consensus string: Pr(AAACCT | P) = ?
• 35. Scoring Strings with a Profile • Given the profile P above: • The probability of the consensus string: Pr(AAACCT | P) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = 0.033646
• 36. Scoring Strings with a Profile • The probability of the consensus string: Pr(AAACCT | P) = 0.033646 • The probability of a different string: Pr(ATACAG | P) = 1/2 x 1/8 x 3/8 x 5/8 x 1/8 x 1/8 = 0.000229
• 37. P-Most Probable l-mer • Define the P-most probable l-mer from a sequence as the l-mer contained in that sequence which has the highest probability of being generated by the profile P.
P =
A: 1/2 7/8 3/8 0 1/8 0
C: 1/8 0 1/2 5/8 3/8 0
T: 1/8 1/8 0 0 1/4 7/8
G: 1/4 0 1/8 3/8 1/4 1/8
• Example: Given the sequence CTATAAACCTTACATC, find the P-most probable 6-mer.
• 38. P-Most Probable l-mer • Find Pr(a | P) for every possible 6-mer in CTATAAACCTTACATC: • First try: the 6-mer starting at position 1, CTATAA • Second try: the 6-mer starting at position 2, TATAAA • Third try: the 6-mer starting at position 3, ATAAAC • Continue this process to evaluate every 6-mer.
• 39. P-Most Probable l-mer • Compute Pr(a | P) for every possible 6-mer:
6-mer a | Calculation | Pr(a | P)
CTATAA | 1/8 x 1/8 x 3/8 x 0 x 1/8 x 0 | 0
TATAAA | 1/8 x 7/8 x 0 x 0 x 1/8 x 0 | 0
ATAAAC | 1/2 x 1/8 x 3/8 x 0 x 1/8 x 0 | 0
TAAACC | 1/8 x 7/8 x 3/8 x 0 x 3/8 x 0 | 0
AAACCT | 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 | .0336
AACCTT | 1/2 x 7/8 x 1/2 x 5/8 x 1/4 x 7/8 | .0299
ACCTTA | 1/2 x 0 x 1/2 x 0 x 1/4 x 0 | 0
CCTTAC | 1/8 x 0 x 0 x 0 x 1/8 x 0 | 0
CTTACA | 1/8 x 1/8 x 0 x 0 x 3/8 x 0 | 0
TTACAT | 1/8 x 1/8 x 3/8 x 5/8 x 1/8 x 7/8 | .0004
• 40. P-Most Probable l-mer • From the table above, the P-most probable 6-mer in the sequence is AAACCT, with Pr(AAACCT | P) = .0336.
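• A sketch of the same scan in Python, reusing the pr function from the earlier sketch (the helper name is illustrative):

def p_most_probable_lmer(sequence, P, l):
    # Slide a window of length l and keep the l-mer with the highest Pr(a | P).
    best, best_prob = None, -1.0
    for i in range(len(sequence) - l + 1):
        a = sequence[i:i + l]
        prob = pr(a, P)
        if prob > best_prob:
            best, best_prob = a, prob
    return best

# p_most_probable_lmer('CTATAAACCTTACATC', P, 6) returns 'AAACCT'.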
• 41. Dealing with Zeroes • In our toy example, Pr(a | P) = 0 in many cases. • In practice, there will be enough sequences that the number of elements in the profile with a frequency of zero is small. • To avoid many entries with Pr(a | P) = 0, there exist techniques that replace each zero with a very small number, so that a single zero in the profile matrix does not force the probability of an entire string to zero (we will not address these techniques in detail here).
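• One standard technique of this kind is Laplace smoothing, shown here as an assumption since the slide leaves the method unspecified: add 1 to every count before normalizing, so no frequency is exactly zero. A minimal sketch:

def profile_with_pseudocounts(lmers):
    # Start every count at 1 (the pseudocount) instead of 0.
    l, t = len(lmers[0]), len(lmers)
    counts = {n: [1] * l for n in 'ACGT'}
    for lmer in lmers:
        for i, n in enumerate(lmer):
            counts[n][i] += 1
    # Normalize: each column now sums to t + 4.
    return {n: [c / (t + 4) for c in col] for n, col in counts.items()}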
• 42. P-Most Probable l-mers in Many Sequences • Find the P-most probable l-mer in each of the sequences:
CTATAAACGTTACATC
ATAGCGATTCGACTG
CAGCCCAGAACCCT
CGGTATACCTTACATC
TGCATTCAATAGCTTA
TATCCTTTCCACTCAC
CTCCAAATCCTTTACA
GGTCATCCTTTATCCT
P =
A: 1/2 7/8 3/8 0 1/8 0
C: 1/8 0 1/2 5/8 3/8 0
T: 1/8 1/8 0 0 1/4 7/8
G: 1/4 0 1/8 3/8 1/4 1/8
• 43. P-Most Probable l-mers in Many Sequences • The P-most probable l-mers form a new profile:
CTATAAACGTTACATC
ATAGCGATTCGACTG
CAGCCCAGAACCCT
CGGTGAACCTTACATC
TGCATTCAATAGCTTA
TGTCCTGTCCACTCAC
CTCCAAATCCTTTACA
GGTCTACCTTTATCCT
P-most probable l-mers:
1: aaacgt
2: atagcg
3: aaccct
4: gaacct
5: atagct
6: gacctg
7: atcctt
8: tacctt
New profile:
A: 5/8 5/8 4/8 0 0 0
C: 0 0 4/8 6/8 4/8 0
T: 1/8 3/8 0 0 3/8 6/8
G: 2/8 0 0 2/8 1/8 2/8
• 44. Comparing New and Old Profiles
New profile (from the l-mers aaacgt, atagcg, aaccct, gaacct, atagct, gacctg, atcctt, tacctt):
A: 5/8 5/8 4/8 0 0 0
C: 0 0 4/8 6/8 4/8 0
T: 1/8 3/8 0 0 3/8 6/8
G: 2/8 0 0 2/8 1/8 2/8
Old profile:
A: 1/2 7/8 3/8 0 1/8 0
C: 1/8 0 1/2 5/8 3/8 0
T: 1/8 1/8 0 0 1/4 7/8
G: 1/4 0 1/8 3/8 1/4 1/8
• (In the original slide, red marked frequencies that increased and blue marked frequencies that decreased.)
• 45. Greedy Profile Motif Search • Use P-most probable l-mers to adjust start positions until we reach a "best" profile; this is the motif. 1. Select random starting positions. 2. Create a profile P from the substrings at these starting positions. 3. Find the P-most probable l-mer a in each sequence and change the starting position to the starting position of a. 4. Compute a new profile based on the new starting positions after each iteration, and proceed until we cannot increase the score anymore.
• 46. GreedyProfileMotifSearch Algorithm
1. GreedyProfileMotifSearch(DNA, t, n, l)
2. Randomly select starting positions s = (s1,…,st) from DNA
3. bestScore ← 0
4. while Score(s, DNA) > bestScore
5.     Form profile P from s
6.     bestScore ← Score(s, DNA)
7.     for i ← 1 to t
8.         Find a P-most probable l-mer a from the i-th sequence
9.         si ← starting position of a
10. return bestScore
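• A Python sketch of the pseudocode above, reusing pr from the earlier sketch; Score is implemented here as the usual consensus score (an assumption, since the slides do not restate its definition):

import random

def profile_from(lmers):
    # Frequency profile, as on the slides (no pseudocounts).
    l, t = len(lmers[0]), len(lmers)
    counts = {n: [0] * l for n in 'ACGT'}
    for lmer in lmers:
        for i, n in enumerate(lmer):
            counts[n][i] += 1
    return {n: [c / t for c in col] for n, col in counts.items()}

def score(starts, dna, l):
    # Sum over columns of the count of the most common letter.
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    return sum(max(col.count(n) for n in 'ACGT') for col in zip(*lmers))

def most_probable_start(seq, P, l):
    # Starting position of the P-most probable l-mer in seq.
    return max(range(len(seq) - l + 1), key=lambda i: pr(seq[i:i + l], P))

def greedy_profile_motif_search(dna, l):
    starts = [random.randrange(len(seq) - l + 1) for seq in dna]
    best_score = 0
    while score(starts, dna, l) > best_score:
        best_score = score(starts, dna, l)
        P = profile_from([seq[s:s + l] for seq, s in zip(dna, starts)])
        starts = [most_probable_start(seq, P, l) for seq in dna]
    return best_score, starts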
  • 47. GreedyProfileMotifSearch Analysis • Since we choose starting positions randomly, there is little chance that our guess will be close to an optimal motif, meaning it will take a very long time to find the optimal motif. • It is actually unlikely that the random starting positions will lead us to the correct solution at all. • Therefore this is a Monte Carlo algorithm. • In practice, this algorithm is run many times with the hope that random starting positions will be close to the optimal solution simply by chance.
• 49. Gibbs Sampling • GreedyProfileMotifSearch is probably not the best way to find motifs. • However, we can improve the algorithm by introducing Gibbs Sampling, an iterative procedure that discards one l-mer after each iteration and replaces it with a new one. • Gibbs Sampling proceeds more slowly and chooses new l-mers at random, increasing the odds that it will converge to the correct solution.
• 50. Gibbs Sampling Algorithm
1. Randomly choose starting positions s = (s1,...,st) and form the set of l-mers associated with these starting positions.
2. Randomly choose one of the t sequences.
3. Create a profile P from the other t – 1 sequences.
4. For each position in the removed sequence, calculate the probability that the l-mer starting at that position was generated by P.
5. Choose a new starting position for the removed sequence at random, based on the probabilities calculated in Step 4.
6. Repeat Steps 2–5 until there is no improvement.
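• A minimal Python sketch of steps 1–6, reusing profile_from and pr from the earlier sketches; a fixed iteration count stands in for the slides' "until there is no improvement" test (a simplifying assumption):

import random

def gibbs_sampler(dna, l, iterations=1000):
    t = len(dna)
    # Step 1: random starting positions.
    starts = [random.randrange(len(seq) - l + 1) for seq in dna]
    for _ in range(iterations):                            # step 6: repeat
        i = random.randrange(t)                            # step 2: pick a sequence
        others = [seq[s:s + l]
                  for j, (seq, s) in enumerate(zip(dna, starts)) if j != i]
        P = profile_from(others)                           # step 3: profile from t - 1
        seq = dna[i]
        probs = [pr(seq[p:p + l], P)                       # step 4: Pr for every window
                 for p in range(len(seq) - l + 1)]
        if sum(probs) == 0:
            continue                                       # degenerate profile; resample
        # Step 5: sample a new start with probability proportional to Pr(a | P).
        starts[i] = random.choices(range(len(probs)), weights=probs, k=1)[0]
    return starts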
  • 51. Gibbs Sampling: Example • Input: t = 5 sequences, motif length l = 8 1. GTAAACAATATTTATAGC 2. AAAATTTACCTCGCAAGG 3. CCGTACTGTCAAGCGTGG 4. TGAGTAAACGACGTCCCA 5. TACTTAACACCCTGTCAA
• 52. Gibbs Sampling: Example 1. Randomly choose starting positions, s = (s1,s2,s3,s4,s5), in the 5 sequences: s1=7 GTAAACAATATTTATAGC s2=11 AAAATTTACCTTAGAAGG s3=9 CCGTACTGTCAAGCGTGG s4=4 TGAGTAAACGACGTCCCA s5=1 TACTTAACACCCTGTCAA
  • 53. Gibbs Sampling: Example 2. Choose one of the sequences at random s1=7 GTAAACAATATTTATAGC s2=11 AAAATTTACCTTAGAAGG s3=9 CCGTACTGTCAAGCGTGG s4=4 TGAGTAAACGACGTCCCA s5=1 TACTTAACACCCTGTCAA
  • 54. Gibbs Sampling: Example 2. Choose one of the sequences at random: Sequence 2 s1=7 GTAAACAATATTTATAGC s2=11 AAAATTTACCTTAGAAGG s3=9 CCGTACTGTCAAGCGTGG s4=4 TGAGTAAACGACGTCCCA s5=1 TACTTAACACCCTGTCAA
• 55. Gibbs Sampling: Example 3. Create profile P from the l-mers in the remaining 4 sequences:
1: AATATTTA
3: TCAAGCGT
4: GTAAACGA
5: TACTTAAC
A: 1/4 2/4 2/4 3/4 1/4 1/4 1/4 2/4
C: 0 1/4 1/4 0 0 2/4 0 1/4
T: 2/4 1/4 1/4 1/4 2/4 1/4 1/4 1/4
G: 1/4 0 0 0 1/4 0 3/4 0
Consensus string: TAAATCGA
• 56. Gibbs Sampling: Example 4. Calculate Pr(a | P) for every possible 8-mer in the removed sequence AAAATTTACCTTAGAAGG:
8-mer a (start) | Pr(a | P)
AAAATTTA (1) | .000732
AAATTTAC (2) | .000122
AATTTACC (3) | 0
ATTTACCT (4) | 0
TTTACCTT (5) | 0
TTACCTTA (6) | 0
TACCTTAG (7) | 0
ACCTTAGA (8) | .000183
CCTTAGAA (9) | 0
CTTAGAAG (10) | 0
TTAGAAGG (11) | 0
• 57. Gibbs Sampling: Example 5. Create a distribution over the probabilities Pr(a | P), and randomly select a new starting position based on this distribution. • To create this distribution, divide each probability Pr(a | P) by the lowest nonzero probability:
Starting position 1: Pr(AAAATTTA | P) / .000122 = .000732 / .000122 = 6
Starting position 2: Pr(AAATTTAC | P) / .000122 = .000122 / .000122 = 1
Starting position 8: Pr(ACCTTAGA | P) / .000122 = .000183 / .000122 = 1.5
• Ratio = 6 : 1 : 1.5
• 58. Turning Ratios into Probabilities • Define probabilities of starting positions according to the computed ratios, and select the new start position probabilistically based on these ratios:
Pr(selecting starting position 1) = 6 / (6 + 1 + 1.5) = 0.706
Pr(selecting starting position 2) = 1 / (6 + 1 + 1.5) = 0.118
Pr(selecting starting position 8) = 1.5 / (6 + 1 + 1.5) = 0.176
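• In Python, the normalization on this slide is one line, and random.choices performs the weighted draw directly from the ratios (the position labels are from the example above):

import random

ratios = {1: 6, 2: 1, 8: 1.5}                       # starting position -> ratio
total = sum(ratios.values())                         # 8.5
probs = {pos: r / total for pos, r in ratios.items()}
# probs is {1: 0.706, 2: 0.118, 8: 0.176} (rounded), as on the slide.
new_start = random.choices(list(ratios), weights=list(ratios.values()), k=1)[0]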
• 59. Gibbs Sampling: Example • Assume we select the substring with the highest probability; then we are left with the following new substrings and starting positions. s1=7 GTAAACAATATTTATAGC s2=1 AAAATTTACCTCGCAAGG s3=9 CCGTACTGTCAAGCGTGG s4=5 TGAGTAATCGACGTCCCA s5=1 TACTTCACACCCTGTCAA
  • 60. Gibbs Sampling: Example 6. We iterate the procedure again with the above starting positions until we cannot improve the score.
  • 61. Gibbs Sampling in Practice • Gibbs sampling needs to be modified when applied to samples with unequal distributions of nucleotides (relative entropy approach). • Gibbs sampling often converges to locally optimal motifs rather than globally optimal motifs. • Needs to be run with many randomly chosen seeds to achieve good results.
• 63. Another Randomized Approach • The Random Projection Algorithm is an alternative way to solve the Motif Finding Problem. • Guiding principle: Some instances of a motif agree on a subset of positions. • However, it is unclear how to find these "non-mutated" positions. • To bypass the effect of mutations within a motif, we randomly select a subset of positions in the pattern, creating a projection of the pattern. • We then search for the projection in the hope that the selected positions are not affected by mutations in most instances of the motif.
• 64. Projections: Formation • Choose k positions in a string of length l. • Concatenate the nucleotides at the chosen k positions to form a k-tuple. • This can be viewed as a projection of l-dimensional space onto a k-dimensional subspace. • Example: l = 15, k = 7, Projection = (2, 4, 5, 7, 11, 12, 13):
ATGGCATTCAGATTC → TGCTGAT
• 65. Random Projections Algorithm • Select k out of l positions uniformly at random. • For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions. • Recover the motif from an enriched bucket that contains many l-tuples. • Example: the l-tuple TGCACCT from the input sequence …TCAATGCACCTAT… hashes to the bucket labeled TGCT.
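• A sketch of the hashing step in Python; project and hash_lmers are illustrative names, and the enrichment threshold s is supplied by the caller:

from collections import defaultdict
import random

def project(lmer, positions):
    # The hash value h(x): the letters of x at the k chosen positions.
    return ''.join(lmer[p] for p in positions)

def hash_lmers(sequences, l, k):
    positions = sorted(random.sample(range(l), k))  # k of the l positions, at random
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            lmer = seq[i:i + l]
            buckets[project(lmer, positions)].append(lmer)
    return buckets

def enriched(buckets, s):
    # Buckets holding at least s l-tuples (see the following slides).
    return {label: lmers for label, lmers in buckets.items() if len(lmers) >= s}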
• 66. Random Projections Algorithm • Some projections will fail to detect motifs, but if we try many of them, the probability that one of the buckets fills up increases. • In the example below, the bucket **GC*AC is "bad," while the bucket AT**G*C is "good": the (7,2) motif ATGCGTC appears with two mutations per instance in the input sequences (…ccATCCGACca…, …ttATGAGGCtc…, …ctATAAGTCgc…, …tcATGTGACac…).
• 67. Random Projections Algorithm: Example • l = 7 (motif size), k = 4 (projection size) • Projection: (1, 2, 5, 7) • From the input sequence …TAGACATCCGACTTGCCTTACTAC…, the l-tuple ATCCGAC hashes to the bucket labeled ATGC, and the l-tuple GCCTTAC hashes to the bucket labeled GCTC.
• 68. Hashing and Buckets • The hash function h(x) is obtained from the k positions of the projection. • Buckets are labeled by values of h(x). • Enriched buckets: contain more than s l-tuples, for some decided-upon threshold s. • (The slide's figure showed example buckets labeled ATTC, CATC, GCTC, and ATGC.)
• 69. Motif Refinement • How do we recover the motif from the sequences in the enriched buckets? • k nucleotides come from the hash value of the bucket. • Use the information in the other l – k positions as the starting point for a local refinement scheme, e.g., the Gibbs sampler. • Example: the enriched bucket ATGC contains the l-tuples ATCCGAC, ATGAGGC, ATAAGTC, and ATGCGAC; local refinement yields the candidate motif ATGCGAC.
• 70. Synergy Between Random Projection and Gibbs • Random Projection is a procedure for finding good starting points: every enriched bucket is a potential starting point. • Feeding these starting points into existing algorithms (like the Gibbs sampler) provides a good local search in the vicinity of every starting point. • These algorithms work particularly well for "good" starting points.
• 71. Building Profiles from Buckets • The l-tuples in bucket ATGC (ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC) form a profile P:
A: 1 0 .25 .50 0 .50 0
C: 0 0 .25 .25 0 0 1
G: 0 0 .50 0 1 .25 0
T: 0 1 0 .25 0 .25 0
• The Gibbs sampler then turns P into a refined profile P*.
• 72. Motif Refinement • For each bucket h containing more than s sequences, form the profile P(h). • Use the Gibbs sampler algorithm with starting point P(h) to obtain a refined profile P*.
• 73. Random Projection Algorithm: A Single Iteration • Choose a random k-projection. • Hash each l-mer x in the input sequences into the bucket labeled by h(x). • From each enriched bucket (e.g., a bucket with more than s sequences), form a profile P and perform Gibbs sampler motif refinement. • The candidate motif is found by selecting the best motif among the refinements of all enriched buckets.
• 74. Choosing Projection Size • Choose k small enough so that several motif instances hash to the same bucket. • Choose k large enough to avoid contamination by spurious l-mers, i.e., so that the number of buckets exceeds the total number of l-mers in the input: 4^k > t(n – l + 1)
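• A small sketch of this rule in Python: find the smallest k with 4^k > t(n – l + 1) (the function name is illustrative; in practice k must also stay below l):

def smallest_valid_k(t, n, l):
    x = t * (n - l + 1)   # total number of l-mers in the input
    k = 1
    while 4 ** k <= x:    # need more buckets than l-mers
        k += 1
    return k

# Example: t = 20 sequences of length n = 600 with motif length l = 15
# gives x = 11720, and smallest_valid_k(20, 600, 15) returns 7 (4^7 = 16384).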
• 75. How Many Iterations? • Planted bucket: the bucket with hash value h(M), where M is the motif. • Choose m, the number of iterations, such that Pr(the planted bucket contains at least s sequences in at least one of the m iterations) = 0.95. • This probability is readily computable, since the iterations form a sequence of independent Bernoulli trials.
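• If p denotes the per-iteration probability that the planted bucket collects at least s motif occurrences, independence gives Pr(success in at least one of m iterations) = 1 – (1 – p)^m. A sketch of solving for m (computing p itself from l, k, the mutation count, and t is omitted here):

import math

def iterations_needed(p, target=0.95):
    # Smallest m with 1 - (1 - p)^m >= target for independent Bernoulli trials.
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# Example: if p = 0.1 per iteration, iterations_needed(0.1) returns 29.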
• 76. Expectation Maximization (EM) • S = {x(1),…,x(t)}: the set of input sequences • Given: a probabilistic motif model W(θ) depending on unknown parameters θ, and a background probability distribution P. • Find the value θmax that maximizes the likelihood ratio: Pr(S | W(θmax), P) / Pr(S | P) • EM is a local optimization scheme; it requires a starting value θ0.
• 77. EM Motif Refinement • For each input sequence x(i), return the l-tuple y(i) that maximizes the likelihood ratio: Pr(y(i) | W(θh)) / Pr(y(i) | P), where θh is the starting value obtained from bucket h. • T = {y(1), y(2),…,y(t)} • C(T) = consensus string
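• A sketch of the per-sequence selection in Python; W[n][i] is taken to be the motif model's probability of nucleotide n at position i, and background[n] the (nonzero) background probability of n — both representations are assumptions, not from the slides:

def best_lmer_by_likelihood_ratio(seq, W, background, l):
    # Return the l-tuple y maximizing Pr(y | W) / Pr(y | background).
    def ratio(y):
        num = den = 1.0
        for i, n in enumerate(y):
            num *= W[n][i]
            den *= background[n]
        return num / den
    return max((seq[i:i + l] for i in range(len(seq) - l + 1)), key=ratio)

# Applying this to each x(i) yields T = {y(1), ..., y(t)}, whose consensus
# string C(T) is the refined motif candidate.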