Parallel Random Projection for Motif Discovery on GPUs
1. Finding Planted (l, d)-Motifs in Parallel
using Random Projection on GPUs
Jhoirene Barasi Clemente
Algorithms and Complexity Laboratory
Department of Computer Science
University of the Philippines-Diliman
jbclemente@up.edu.ph
March 31, 2012
J.B. Clemente (ACLab, DCS, UPD) CUDA-FMURP March 31, 2012 1 / 88
2. Overview
Overview
Introduction
Definitions and Notations
Finding Motifs using Random Projection (FMURP)
Parallel Implementations of CUDA-FMURP
Results and Analysis
Conclusion
3. Introduction
In this work, we are interested in solving the Planted (l, d)-Motif Problem
using Random Projection (FMURP).
The focus of this study is on parallelization of FMURP, where we
present three versions of the parallel algorithm. Correctness of the
parallelization is also discussed.
We implement two of these parallel algorithms on GPUs. Theoretical
and actual performance analyses are also presented.
4. Introduction
Introduction
A DNA motif is a nucleic acid sequence pattern that has some
biological significance, such as being a DNA binding site for a regulatory
protein, i.e., a transcription factor [Das,2007].
5. Introduction
Introduction
DNA Sequences as Strings
6. Introduction
Introduction
The pattern is fairly short (5 to 20 base pairs (bp) long) and is known to recur
in different genes or several times within a gene [Rombauts,1999].
7. Introduction Notations
Notations
Set of t sequences S.
Example 1 (Sequences S = {S0 , S1 , . . . , S(t−1) })
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C
Set of sequences S = {S0 , S1 , S2 , S3 , S4 , S5 , S6 }
defined over ΣDNA = {A, C, T, G},
where each sequence Si in S has length ni = 40 for all i ∈ {0, . . . , (t − 1)}
8. Introduction Notations
Notations
An l-mer is a string of length l defined over ΣDNA .
To denote an l-mer in S, we use
Si,j , where i ∈ {0, 1, . . . , (t − 1)} is the sequence number
and j ∈ {0, 1, . . . , (n − l)} is the starting position in Si .
Example 2 (Si,j in S)
For instance, an 8-mer S0,7 is
ATGGAACT
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
9. Introduction Notations
Notations
Let s = (a0 , a1 , . . . , a(t−1) ) be the vector of starting positions in S,
where ai ∈ {0, 1, . . . , (n − l)}.
Let A(s) denote the alignment formed by the l-mers in the set
{S0,a0 , S1,a1 , . . . , S(t−1),a(t−1) }.
10. Introduction Notations
Notations
Example 3 (Alignment matrix A(s))
Suppose we have a starting position vector s = (7, 18, 2, 4, 30, 26, 14)
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C
11. Introduction Notations
Notations
A profile matrix P(s) with dimension equal to (|ΣDNA | × l) is derived
from the frequency of each letter in each column of the A(s).
Example 4 (Profile Matrix P(s))
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1
12. Introduction Notations
Notations
From P(s), we define MP(s) (j), where 0 ≤ j ≤ (l − 1), to be the maximum
count in the jth column of the profile matrix.
Example 5 (MP(s) (j))
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1
MP(s) (j): 7 7 7 6 6 7 7 6
13. Introduction Notations
Notations
A consensus string is an l-mer whose jth element is the
nucleotide base that attains MP(s) (j) in column j.
Example 6 (Consensus String)
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1
Consensus String: A T G C A A C T
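The notation of Examples 1 to 6 can be traced in a few lines of Python. This is a minimal sketch, not the author's implementation: the function names alignment, profile, and consensus are illustrative choices, and the sequences are those of Example 1 with the spaces removed.

```python
# Sequences S0..S6 from Example 1 (spaces removed).
S = [
    "CGGGGCTATGGAACTGGGTCGTCACATTCCCCTTTCGATA",
    "TTTGAGGGTGCCCAATAAATGCCACTCCAAAGCGGACAAA",
    "GGATGCAACTGATGCCGTTTGACGACCTAAATCAACGGCC",
    "AAGGATGCAACTCCAGGAGCGCCTTTGCTGGTTCTACCTG",
    "AATTTTCTAAAAAGATTATAATGTCGGTCCATGCAACTTC",
    "CTGCTGTACAACTGAGATCATGCTGCATGCAACTTTCAAC",
    "TACATGATCTTTTGATGCAACGTGGATGAGGGAATGATGC",
]

def alignment(seqs, s, l):
    """A(s): the l-mer S_{i,a_i} taken from each sequence S_i."""
    return [seq[a:a + l] for seq, a in zip(seqs, s)]

def profile(rows, alphabet="ATCG"):
    """P(s): per-column count of each letter, one row per letter."""
    return {b: [col.count(b) for col in zip(*rows)] for b in alphabet}

def consensus(prof):
    """Pick, for each column, the letter with the largest count."""
    l = len(next(iter(prof.values())))
    return "".join(max(prof, key=lambda b: prof[b][j]) for j in range(l))

s = (7, 18, 2, 4, 30, 26, 14)           # starting positions of Example 3
rows = alignment(S, s, l=8)
prof = profile(rows)
col_max = [max(prof[b][j] for b in prof) for j in range(8)]  # M_P(s)(j)
cons = consensus(prof)

print(rows[0], prof["A"], col_max, cons)
```

Running it reproduces the 8-mer S0,7 = ATGGAACT, the profile rows of Example 4, and the consensus string ATGCAACT.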
16. Introduction Motif Finding Problem
Motif Finding Problem
Definition 8 (Motif Finding Problem [Pevzner,2004])
INPUT:
A motif length l
A set of t sequences S = {S0 , S1 , S2 , . . . , S(t−1) },
where each Si is of length ni
OUTPUT:
An array of starting positions s = (a0 , a1 , . . . , a(t−1) )
maximizing consensus Score(s,S)
17. Introduction Motif Finding Problem
Naive MFP Solver [Pevzner,2004]
Input: DNA (sequences), motif length l
Output: Starting position vector s and the consensus string corresponding to s
1 For each possible starting position vector s in S,
  i.e. s ∈ {(0, 0, . . . , 0), . . . , ((n − l), (n − l), . . . , (n − l))}:
    1.1 Get the alignment A(s).
    1.2 Compute P(s).
    1.3 Evaluate Score(s, S).
2 From the s with the maximum Score, get the consensus string.
3 Output the consensus string.
Step 1 iterates (n − l + 1)^t times, because the set of all possible starting
position vectors is
s = (a0 , a1 , . . . , a(t−1) ), ∀ ai ∈ {0, . . . , (n − l)}.
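The exhaustive search can be sketched as follows. This is an illustration under stated assumptions rather than the solver from [Pevzner,2004]: Score(s, S) is taken to be the sum of the column maxima of P(s), the names align, score, and naive_mfp are hypothetical, and the input is the three short 8-letter sequences of Example 12, chosen so that the (n − l + 1)^t = 5^3 = 125 candidates can be enumerated instantly.

```python
from itertools import product

ALPHABET = "ACGT"

def align(seqs, s, l):
    """Alignment A(s): one l-mer per sequence, starting at s[i]."""
    return [seq[a:a + l] for seq, a in zip(seqs, s)]

def score(seqs, s, l):
    """Assumed consensus score: sum of the column maxima of P(s)."""
    return sum(max(col.count(b) for b in ALPHABET)
               for col in zip(*align(seqs, s, l)))

def naive_mfp(seqs, l):
    """Try every starting position vector and keep the best-scoring one."""
    ranges = [range(len(seq) - l + 1) for seq in seqs]
    best_s = max(product(*ranges), key=lambda s: score(seqs, s, l))
    cons = "".join(max(ALPHABET, key=col.count)
                   for col in zip(*align(seqs, best_s, l)))
    return best_s, cons

toy = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
best_s, cons = naive_mfp(toy, 4)
print(best_s, cons, score(toy, best_s, 4))
```

The loop over product(*ranges) is exactly what makes the method impractical: for t = 20 sequences of length 600 and l = 15, the number of candidates is 586^20.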
19. Introduction Planted (l, d)-Motif Finding Problem
Definitions
Definition 9 (Challenge Problem [Pevzner,2000])
INPUT:
Motif length l = 15,
Expected number of mismatches d = 4,
20 DNA sequences, each with ni = 600 nucleotide bases
OUTPUT:
A consensus string M from an alignment A(s), where each l-mer Si,ai in A(s)
has
dE (M, Si,ai ) = 4,
for all i ∈ {0, . . . , (t − 1)}.
20. Introduction Planted (l, d)-Motif Finding Problem
Why challenging?
Suppose we have A(s),
S0,a0 A C T T G G G G C A A G A G G
S1,a1 G G A C G G G G C A G A C T G
S2,a2 A C T T G C T A A A G A C T G
S3,a3 A C T G C G G G C A C A G T G
S4,a4 A C C T G G G T C G T A C T G
A: 4 0 1 0 0 0 0 1 1 4 1 4 1 0 0
C: 0 4 1 1 1 1 0 0 4 0 1 0 3 0 0
T: 0 0 3 3 0 0 1 1 0 0 1 0 0 4 0
G: 1 1 0 1 4 4 4 3 0 1 2 1 1 1 5
Consensus: A C T T G G G G C A G A C T G
dE (S0,a0 , S1,a1 ) = 2d = 8
Score(s, S) = 4 + 4 + 3 + 3 + 4 + 4 + 4 + 3 + 4 + 4 + 2 + 4 + 3 + 4 + 5 = 55
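The quantities on this slide can be recomputed directly from the five rows of A(s). The sketch below assumes dE counts mismatched positions (a Hamming distance) and that Score is the sum of column maxima; the helper name hamming is illustrative. Counting from the rows as listed gives 3 occurrences of C in column 12 (0-based) and therefore a column-maximum sum of 55.

```python
rows = [
    "ACTTGGGGCAAGAGG",  # S0,a0
    "GGACGGGGCAGACTG",  # S1,a1
    "ACTTGCTAAAGACTG",  # S2,a2
    "ACTGCGGGCACAGTG",  # S3,a3
    "ACCTGGGTCGTACTG",  # S4,a4
]

def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))

columns = list(zip(*rows))
profile = {b: [col.count(b) for col in columns] for b in "ACTG"}
consensus = "".join(max("ACTG", key=col.count) for col in columns)
score = sum(max(col.count(b) for b in "ACTG") for col in columns)

print(consensus)                                # ACTTGGGGCAGACTG
print([hamming(consensus, r) for r in rows])    # [4, 4, 4, 4, 4]
print(hamming(rows[0], rows[1]))                # 8
print(score)                                    # 55
```

Every planted instance is within d = 4 of the consensus, yet two instances can differ from each other in up to 2d = 8 of the 15 positions, which is why pairwise comparison of l-mers gives such a weak signal.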
21. Introduction Planted (l, d)-Motif Finding Problem
Definitions
Definition 10 (Planted (l, d)-Motif Finding Problem [Tompa,2001])
INPUT:
Motif length l,
Expected number of mismatches d, and
A set of t sequences S = {S0 , S1 , S2 , . . . , S(t−1) }, where each Si is of
length ni
OUTPUT:
A consensus string M from an alignment A(s), where each l-mer Si,ai in A(s)
has
dE (M, Si,ai ) = d,
for all i ∈ {0, . . . , (t − 1)}.
22. Introduction Planted (l, d)-Motif Finding Problem
Solutions for Planted (l, d)-Motif Finding
SP-STAR [Pevzner,2000]
Winnower [Pevzner,2000]
Random Projection [Tompa,2001]
Aggregation [Mohammed,2004]
GibbsDST [Shida,2006]
23. Finding Motifs using Random Projection (FMURP)
Finding Motifs using Random Projection (FMURP)
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 Projection
    1.1 Get all l-mers Si,j in S.
    1.2 Get the projection hI (Si,j ) for each Si,j in S.
    1.3 Hash each Si,j to the bucket with identifier hI (Si,j ).
    1.4 Get the enriched buckets.
2 Refine each enriched bucket using EM.
3 Refine each enriched bucket using SP-STARσ.
4 Maximize the score to output the best motif.
24. Finding Motifs using Random Projection (FMURP)
Definition 11 (Random Projection)
Given an l-mer Si,j , a projection dimension k, and a set
I ⊂ L = {0, . . . , (l − 1)} with |I| = k, whose elements are chosen at random
from L and sorted in increasing order, the k-dimensional projection of
Si,j is
hI (Si,j ) = Si,j (I0 ), Si,j (I1 ), . . . , Si,j (I(k−1) ),
where hI (Si,j ) is a k-mer and Ii denotes the ith element of I.
25. Finding Motifs using Random Projection (FMURP)
FMURP: Example
Example 12
Given a set of DNA sequences S, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3.
S0 : C G G T C A G G
S1 : T T C G A C A T
S2 : A C G A T G A A
Figure: Set of t = 3 sequences each with n = 8
Let I = {0, 1}.
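The projection step for this example can be sketched as below. The function names project and enriched_buckets are illustrative, a Python dictionary stands in for the bucket table, and a bucket is treated as enriched when it holds at least δ l-mers.

```python
from collections import defaultdict

def project(lmer, I):
    """h_I: keep only the positions listed in I (sorted in increasing order)."""
    return "".join(lmer[p] for p in I)

def enriched_buckets(seqs, l, I, delta):
    """Hash every l-mer S_{i,j} by its projection; keep buckets of size >= delta."""
    buckets = defaultdict(list)
    for i, seq in enumerate(seqs):
        for j in range(len(seq) - l + 1):
            buckets[project(seq[j:j + l], I)].append((i, j))
    return {key: members for key, members in buckets.items()
            if len(members) >= delta}

S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
result = enriched_buckets(S, l=4, I=(0, 1), delta=3)
print(result)  # {'CG': [(0, 0), (1, 2), (2, 1)]}
```

With I = {0, 1} only the bucket CG reaches the threshold: it collects the 4-mers CGGT (S0,0), CGAC (S1,2), and CGAT (S2,1).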
26. Finding Motifs using Random Projection (FMURP)
Projection
30. Parallel Motif Finding using Random Projection
How do we parallelize FMURP?
Sequential FMURP:
1 Projection
    1.1 Get all l-mers Si,j in S.
    1.2 Get the projection hI (Si,j ) for each Si,j in S.
    1.3 Hash each Si,j to the bucket with identifier hI (Si,j ).
    1.4 Get the enriched buckets.
2 Refine each enriched bucket using EM.
3 Refine each enriched bucket using SP-STARσ.
4 Maximize the score to output the best motif.
Parallel FMURP:
1 Projection
    1.1 Get all l-mers Si,j in S in parallel.
    1.2 Get the projection hI (Si,j ) for each Si,j in S in parallel.
    1.3 Hash each Si,j to the bucket with identifier hI (Si,j ) in parallel.
    1.4 Get the enriched buckets in parallel.
2 Refine each enriched bucket using EM in parallel.
3 Refine each enriched bucket using SP-STARσ in parallel.
4 Maximize the score to output the best motif.
31. Parallel Motif Finding using Random Projection
Parallel Algorithms for Motif Finding
CUDA-MEME
CUDA-Gibbs Sampling
32. Parallel Motif Finding using Random Projection
CUDA
33. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Computing Framework
Figure: Flowchart showing the processes done in the CPU and GPU
34. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-FMURP v1
Figure: Thread ID is denoted by an ordered pair (i, j), 0 ≤ i ≤ w and 0 ≤ j ≤ v, where v is
the maximum number of threads per block and w is the number of allocated thread blocks in the grid. The
algorithm uses a total of x = t · (n − l + 1) threads that are linearly arranged in the GPU.
35. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-FMURP v1
INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k,
and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set I = {î | î ∈ {0, . . . , (l − 1)}} with |I| = k.
2 In GPU, for each thread tid in {0, . . . , (x − 1)},
    2.1 Get hI (Si,j ) for each Si,j in S.
    2.2 Convert each k-mer hI (Si,j ) to its corresponding integer representation k*_{i,j}.
    2.3 Perform a linear search over all k*_{i,j} to determine which l-mers
        are 'hashed' to the same bucket. The tids of the matched k*_{i,j} are noted instead
        of the actual l-mers.
3 In CPU, identify the set of enriched buckets
  and prune duplicates in preparation for EM refinement.
4 In GPU, for each tid in {0, . . . , (e − 1)},
    4.1 Perform EM refinement for each enriched bucket.
    4.2 Perform SP-STARσ for each enriched bucket.
    4.3 Maximize the σ score to output the best motif.
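Steps 1 to 3 can be simulated on the CPU to see what each GPU thread is responsible for. The sketch below is not CUDA code and not the author's kernel: the loop over tid stands in for the x parallel threads, key_of and v1_projection are hypothetical names, all sequences are assumed to have the same length n, and the rightmost symbol of a k-mer is assumed to be the least significant base-4 digit.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def key_of(kmer):
    """Base-4 integer of a k-mer (rightmost symbol = least significant digit)."""
    return sum(CODE[c] * 4 ** i for i, c in enumerate(reversed(kmer)))

def v1_projection(seqs, l, I, delta):
    per_seq = len(seqs[0]) - l + 1        # l-mers per sequence, n - l + 1
    x = len(seqs) * per_seq               # one simulated thread per l-mer

    # Steps 2.1 and 2.2: thread tid projects its own l-mer and encodes it.
    keys = []
    for tid in range(x):
        i, j = divmod(tid, per_seq)
        lmer = seqs[i][j:j + l]
        keys.append(key_of("".join(lmer[p] for p in I)))

    # Step 2.3: thread tid scans all keys and records the tids that match.
    matches = [tuple(q for q in range(x) if keys[q] == keys[tid])
               for tid in range(x)]

    # Step 3 (CPU): keep buckets of size >= delta and prune duplicates.
    return sorted({m for m in matches if len(m) >= delta})

S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
print(v1_projection(S, l=4, I=(0, 1), delta=3))  # [(0, 7, 11)]
```

Each thread performs a scan of length x, so the work per thread is linear in x, and every member of a bucket rediscovers the same bucket, which is why the duplicates have to be pruned before refinement.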
36. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Integer Conversion
Step 2.2 converts each hI (Si,j ) to its corresponding integer representation
k*_{i,j}. Given a k-mer obtained from projection, the corresponding integer is
computed using the following mapping. Let us define
f : ΣDNA → {0, 1, 2, 3},
A → 0
C → 1
G → 2
T → 3
where each symbol in the DNA alphabet is mapped to a unique integer.
For a string v of length k,
$f^{*} : \Sigma_{DNA}^{+} \to \mathbb{Z}^{+} \cup \{0\}$
$v \mapsto \sum_{i=0}^{k-1} f(v_i) \, 4^{i}$
where vi denotes the symbol at the ith position, counted from the least significant
digit, and the integer representation takes values in the non-negative integers.
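A short sketch of the mapping and its inverse follows. The names encode and decode are illustrative, and the code assumes that the least significant digit v0 is the rightmost symbol of the k-mer; the slides do not state which end is meant.

```python
F = {"A": 0, "C": 1, "G": 2, "T": 3}   # the mapping f
F_INV = "ACGT"                          # inverse of f, indexed by digit

def encode(kmer):
    """f*: sum of f(v_i) * 4**i, with v_0 the rightmost symbol (assumption)."""
    return sum(F[c] * 4 ** i for i, c in enumerate(reversed(kmer)))

def decode(value, k):
    """Recover the k-mer of length k from its integer representation."""
    symbols = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        symbols.append(F_INV[digit])
    return "".join(reversed(symbols))

print(encode("CG"))    # 1*4 + 2 = 6
print(encode("TT"))    # 3*4 + 3 = 15
print(decode(6, 2))    # CG
```

For a fixed k the 4^k possible k-mers map onto the integers 0 to 4^k − 1 without collisions, so comparing integers is equivalent to comparing projections.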
37. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection v1: Example
Given a set of DNA sequences, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3. Projection in parallel is shown as follows
38. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection v1: Integer Conversion example
39. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection: Parallel Integer Conversion Example
40. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection: Getting enriched buckets
42. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-EM
43. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-SP-STARσ
45. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
The uniqueness of the representation we defined using f∗ follows from the
results below.
Let Σk = {0, 1, 2, . . . , k − 1}, and let Ck be the regular language
$C_k = \{\epsilon\} \cup (\Sigma_k - \{0\}) \Sigma_k^{*}$.
Theorem 4.1 (Fundamental Theorem of base-k Representation
[Allouche,2003])
Let k ≥ 2 be an integer. Then every non-negative integer has a unique
representation of the form
$N = \sum_{i=0}^{t} a_i k^{i}$,
where $a_t \neq 0$ and $0 \le a_i < k$ for $0 \le i \le t$.
46. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
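In the case of our representation f∗ , the base in Theorem 4.1 is k = 4 and ai = f (vi ), where
vi ∈ ΣDNA . Note that the mapping f is one-to-one and onto by definition.
Because every projection has the same fixed length, leading zeros (symbols A in
the most significant positions) cannot make two different projections share an
integer. Thus we have the following:
Proposition 4.1
f∗ provides a unique representation of hI (Si,j ), for each i, j, and element of I.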
48. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
We have to show that the set of enriched buckets EB obtained in FMURP is
equivalent to the set of enriched buckets $\overline{EB}$ obtained in CUDA-FMURP v1.
EB = {B | |B| ≥ δ}.
Two elements Si,j and Si′,j′ belong to the same bucket B if they satisfy the
relation R defined below.
Definition 13 (Relation R)
Si,j , Si′,j′ ∈ B ⇔ (Si,j , Si′,j′ ) ∈ R
(Si,j , Si′,j′ ) ∈ R ⇔ hI (Si,j ) = hI (Si′,j′ )
Proposition 4.2
R is an equivalence relation.
52. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
In CUDA-FMURP v1, the set of enriched buckets is defined as
$\overline{EB} = \{\bar{B} \mid |\bar{B}| \ge \delta\}$,
where $\bar{B}$ is a bucket in CUDA-FMURP and two elements p and q belong to
the same bucket $\bar{B}$ if they satisfy the relation $\bar{R}$ defined below.
Definition 14 (Relation $\bar{R}$)
$p, q \in \bar{B} \Leftrightarrow (p, q) \in \bar{R}$
$(p, q) \in \bar{R} \Leftrightarrow k^{*}_{i,j} = k^{*}_{\bar{i},\bar{j}}$
where $i = \lfloor p/(n - l + 1) \rfloor$, $j = p \bmod (n - l + 1)$, $\bar{i} = \lfloor q/(n - l + 1) \rfloor$, and
$\bar{j} = q \bmod (n - l + 1)$.
Lemma 15
Relations R and $\bar{R}$ are equivalent.
55. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness
Note that the elements of B are l-mers Si,j , while the elements of $\bar{B}$ are
integers p ∈ {0, . . . , (x − 1)}. Using the equations
$tid = i \times (n - l + 1) + j$  (2)
$i = \lfloor tid / (n - l + 1) \rfloor$  (3)
$j = tid \bmod (n - l + 1)$  (4)
we can retrieve the l-mer Si,j corresponding to tid and vice versa. The theorem
below follows from the fact that R and $\bar{R}$ are equivalent.
Theorem 4.2
The sets of enriched buckets EB and $\overline{EB}$ are equivalent.
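Equations (2) to (4) form a round trip between a thread index and an l-mer position, which the following sketch checks exhaustively for a small case. The function names to_tid and from_tid are illustrative, and all sequences are assumed to share the same length n.

```python
def to_tid(i, j, n, l):
    """Equation (2): linear thread index of the l-mer S_{i,j}."""
    return i * (n - l + 1) + j

def from_tid(tid, n, l):
    """Equations (3) and (4): recover (i, j) from the thread index."""
    return divmod(tid, n - l + 1)

# Small case: t = 3 sequences of length n = 8, l-mers of length l = 4.
t, n, l = 3, 8, 4
pairs = [(i, j) for i in range(t) for j in range(n - l + 1)]
tids = [to_tid(i, j, n, l) for i, j in pairs]

print(to_tid(1, 2, n, l))   # 1 * 5 + 2 = 7
print(from_tid(11, n, l))   # (2, 1)
```

Because the mapping is a bijection between {0, . . . , (x − 1)} and the set of (i, j) pairs, storing tids in a bucket loses no information compared with storing the l-mers themselves.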
57. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v2)
CUDA-FMURP v2
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set I = {î | î ∈ {0, . . . , (l − 1)}} with |I| = k.
2 In GPU, for each thread tid in {0, . . . , (x − 1)},
    2.1 Get hI (Si,j ) for all Si,j in S,
        where i ∈ {0, . . . , (t − 1)} and j ∈ {0, . . . , (n − l)}.
    2.2 Convert each k-mer hI (Si,j ) to its corresponding
        integer representation k*_{i,j}.
3 In CPU, hash the list of k*_{i,j}.
4 In CPU, identify the set of enriched buckets.
5 In GPU, for each tid in {0, . . . , (e − 1)},
    5.1 Perform EM refinement for each enriched bucket.
    5.2 Perform SP-STARσ for each enriched bucket.
    5.3 Maximize the σ score to output the best motif.
59. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v2)
Hash Table in CPU
63. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v2)
Hash Table in CPU
To resolve collisions between two items with different k*_{i,j}, linear probing is
implemented.
Suppose we hash item p with key k*_{i′,j′} and find that position h(k*_{i′,j′}) is not
empty,
i.e. ∃ k*_{i,j} such that h(k*_{i,j}) = h(k*_{i′,j′}) and k*_{i,j} ≠ k*_{i′,j′}.
We have to look for an empty position in the table where we can place item p.
We explore positions
$h'(k^{*}_{i',j'}, i) = (h(k^{*}_{i,j}) + i) \bmod x$
for i from 0 to (m − 1), until an empty hash table position is found.
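Linear probing as described here can be sketched with a fixed-size list. This is an illustration only: the slides do not give h, so h(key) = key mod m is an assumption, a single table size m is used for both the modulus and the probe range, and the function name insert is hypothetical. Items with the same key share one slot, which plays the role of a bucket.

```python
def insert(table, key, item):
    """Place item under key, probing (h(key) + i) mod m for i = 0..m-1."""
    m = len(table)
    home = key % m                      # assumed hash function h(key)
    for i in range(m):
        pos = (home + i) % m
        if table[pos] is None:          # empty slot: start a new bucket
            table[pos] = (key, [item])
            return pos
        if table[pos][0] == key:        # same key: extend its bucket
            table[pos][1].append(item)
            return pos
    raise RuntimeError("hash table is full")

table = [None] * 5
print(insert(table, 6, "p0"))    # 1  (6 mod 5 = 1, slot empty)
print(insert(table, 11, "p1"))   # 2  (collides with key 6, probes one slot on)
print(insert(table, 6, "p2"))    # 1  (same key, joins the existing bucket)
print(insert(table, 1, "p3"))    # 3  (slots 1 and 2 hold other keys)
```

After these four insertions slot 1 holds key 6 with items p0 and p2, so the size of each bucket can be read off directly when identifying enriched buckets.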
66. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
CUDA-FMURP v3
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set I = {î | î ∈ {0, . . . , (l − 1)}} with |I| = k.
2 In GPU, for each thread tid in {0, . . . , (t − 1)},
    2.1 Get hI (Stid,j ) for all Stid,j in S,
        where j ∈ {0, . . . , (n − l)}.
    2.2 Convert each k-mer hI (Stid,j ) to its corresponding
        integer representation k*_{tid,j}.
3 In CPU, hash the list of k*_{i,j}.
4 In CPU, identify the set of enriched buckets.
5 In GPU, for each tid in {0, . . . , (e − 1)},
    5.1 Perform EM refinement for each enriched bucket.
    5.2 Perform SP-STARσ for each enriched bucket.
    5.3 Maximize the σ score to output the best motif.
68. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
CUDA-Projection v3
69. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
CUDA-Projection v3
70. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
Integer Conversion
71. Result and Analysis
Running Time and Space Complexity
Algorithm        Time           Space              Number of Processors
FMURP            O(x log(x))    O(x)               1
SEQ-FMURP        O(x^2)         O(e(n − l + 1))    1
CUDA-FMURP v1    O(x)           O(e(n − l + 1))    x
CUDA-FMURP v2    O(x)           O(e(n − l + 1))    x
CUDA-FMURP v3    O(x)           O(e(n − l + 1))    t
Table: Total running time and space complexity of the three parallel algorithms for
CUDA-FMURP in comparison with the two sequential implementations.
72. Result and Analysis
Speedup and Efficiency
FMURP: O(x log x)
Speedup is the ratio of the sequential running time to the parallel running time.
$S_P = \frac{\text{Sequential}}{\text{Parallel}}$
The speedups $S_{P_1}$, $S_{P_2}$, and $S_{P_3}$ of CUDA-FMURP versions 1 to 3, respectively,
with respect to FMURP are compared below.
$S_{P_1} = S_{P_2} = S_{P_3} = \frac{O(x \log x)}{O(x)} = O(\log x)$
73. Result and Analysis
Speedup and Efficiency
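Processor efficiency is computed from the speedup $S_P$ and the number of
processors used, $\hat{p}$.
$E_P = \frac{1}{\hat{p}} \cdot S_P$
The efficiencies $E_{P_1}$, $E_{P_2}$, and $E_{P_3}$ of CUDA-FMURP versions 1 to 3,
respectively, are compared below.
$E_{P_1} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x}$   (5)
$E_{P_2} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x}$   (6)
$E_{P_3} = \frac{1}{t} \cdot O(\log x) = \frac{\log x}{t}$   (7)
$E_{P_1} = E_{P_2} < E_{P_3}$, which holds whenever t < x.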
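A small numeric illustration of these ratios is sketched below. It treats the asymptotic bounds as exact costs with unit constants, which is only a rough assumption, and the values x = 11820 and t = 20 are hypothetical choices for illustration (the helper names are also hypothetical).

```python
import math


def speedup(sequential, parallel):
    """S_P: sequential running time divided by parallel running time."""
    return sequential / parallel


def efficiency(sp, processors):
    """E_P: speedup divided by the number of processors used."""
    return sp / processors


def compare(x, t):
    """Efficiencies of the three versions, using x log x and x as stand-in costs."""
    sp = speedup(x * math.log2(x), x)      # reduces to log2(x)
    return {"v1": efficiency(sp, x),       # x processors
            "v2": efficiency(sp, x),       # x processors
            "v3": efficiency(sp, t)}       # t processors


result = compare(x=11820, t=20)
```

With these stand-in values, versions 1 and 2 have identical efficiency, and version 3 is higher because it divides the same speedup by far fewer processors.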
74. Result and Analysis Dataset
Dataset
t    n     l    d   Instances generated
20   600   10   2   100
20   600   11   2   100
20   600   12   3   100
20   600   13   3   100
20   600   14   4   100
20   600   15   4   100
20   600   16   5   100
20   600   17   5   100
20   600   18   6   100
20   600   19   6   100
Table: Summary of the generated dataset used to determine the accuracy of
CUDA-FMURP. For each generated instance, the OOPS (one occurrence per sequence)
search model is assumed; that is, each sequence contains exactly one occurrence of
the planted motif.
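One way such instances could be generated is sketched below. This is an assumption-laden illustration, not the generator used for the slides: it uses a uniform random background, plants one copy per sequence with exactly d substitutions, and does not check whether the background happens to contain another near-match by chance. All function names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions substituted."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


def planted_instance(t, n, l, d, seed=0):
    """Build t random sequences of length n, each holding one planted variant."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)               # planting position
        background[start:start + l] = mutate(motif, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return motif, sequences, positions


motif, sequences, positions = planted_instance(t=20, n=600, l=10, d=2)
```

Keeping the planting positions alongside the sequences makes it straightforward to check afterwards whether a motif finder recovered the planted pattern.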
75. Result and Analysis Dataset
Accuracy
t    n     l    d   FMURP   FMURP∗   SEQ-FMURP   CUDA-FMURP   m
20   600   10   2   13      100      98          98           72
20   600   11   2   99      100      100         100          16
20   600   12   3   3       96       83          83           259
20   600   13   3   81      100      100         100          62
20   600   14   4   1       86       79          79           645
20   600   15   4   49      100      100         100          172
20   600   16   5   0       77       53          53           1292
20   600   17   5   19      98       98          98           378
20   600   18   6   0       82       38          38           2217
20   600   19   6   9       98       94          94           711
Table: Number of correctly identified planted motifs over 100 random input
instances. For each instance, parameters k = 7 and s = 4 are used. The column
labelled FMURP∗ is based on the results presented in [Tompa,2001] using the
dataset they generated.
76. Result and Analysis Machine Setups
Machine Setups
System specifications       Values
Host processors (procs)     2 × Intel Quad-core 2.26GHz
Total number of cores       8
Max host RAM                12GB
Device/s (GPU/s)            2 × NVIDIA GT120
Compute capability          1.1
CUDA Cores/GPU              4 (multiprocs) × 8 (cores/proc) = 32
GPU clock rate              1.40 GHz
Memory clock rate           500 MHz
Max device global memory    512MB
Operating system            Mac OS X 10.6.8
CUDA version                3.2

System specifications       Values
Host processors (procs)     Core(TM) i7-2600 CPU 3.40GHz
Total number of cores       4 × 2 (hyperthreaded) = 8
Max host RAM                8GB
Device/s (GPU/s)            1 × NVIDIA GeForce GTX 580
Compute capability          2.0
CUDA Cores/GPU              16 (multiprocs) × 32 (cores/proc) = 512
GPU clock rate              1.54 GHz
Memory clock rate           2004 MHz
Max device global memory    1535MB
Operating system            64-bit Ubuntu 10.0.4
CUDA version                4.1
77. Result and Analysis Actual Speedup
Actual speed of CUDA-Projection v3 with respect to
CUDA-Projection v1
78. Result and Analysis Actual Speedup
Actual speed of CUDA-FMURP v1 and CUDA-Projection
v3
79. Result and Analysis Actual Speedup
Actual Speed Result: Setup1
80. Result and Analysis Actual Speedup
Memory Requirement
81. Result and Analysis Actual Speedup
Actual speed comparison and speedup of CUDA-FMURP
v1 with respect to SEQ-FMURP and FMURP using Setup 2
82. Conclusion
Conclusion
In this work, we presented three versions of parallel algorithms for FMURP.
Algorithm        Processors   SP wrt FMURP   SP wrt SEQ-FMURP   Efficiency
CUDA-FMURP v1    x            O(log x)       O(x)               (log x)/x
CUDA-FMURP v2    x            O(log x)       O(x)               (log x)/x
CUDA-FMURP v3    t            O(log x)       O(x)               (log x)/t
We implemented CUDA-FMURP v1 and CUDA-FMURP v2 and achieved maximum
actual speedups of 6.8 and 6.6, respectively, with respect to SEQ-FMURP.
84. References
References
J.P. Allouche and J. Shallit, “Automatic Sequences: Theory, Applications,
Generalizations”, Cambridge University Press, Chapter 3:
Numeration Systems, pp. 70-73, 2003.
P. Pevzner and S.H. Sze, “Combinatorial Approaches to Finding Subtle
Signals in DNA Sequences”, Proceedings of the 8th Int. Conf. on Intelligent
Systems for Molecular Biology (ISMB), pp. 269-78, 2000.
J. Buhler and M. Tompa, “Finding Motifs Using Random Projections”,
RECOMB ’01: Proceedings of the Fifth Annual International Conference on
Computational Biology, 2001.
D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on
Approach”, 1st ed., MA, USA: Morgan Kaufmann, 2010.
M. Harris, “Mapping Computational Concepts to GPUs”, ACM
SIGGRAPH 2005 Courses, NY, USA, 2005.
N. Jones and P. Pevzner, “An Introduction to Bioinformatics Algorithms”,
Massachusetts Institute of Technology Press, 2004.
85. Extra Slides
Finding Motifs using Random Projection (FMURP)
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 Projection
  1 Generate k random positions for projection.
    Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ with $|I| = k$.
  2 For each $S_{i,j}$ in S,
    1 Get $h_I(S_{i,j})$ for every $S_{i,j}$ in S,
      where $i \in \{0, \ldots, (t-1)\}$ and $j \in \{0, \ldots, (n-l)\}$.
    2 Sort the $S_{i,j}$s with respect to $h_I(S_{i,j})$.
    3 Perform a linear search over all $h_I(S_{i,j})$s to determine which l-mers
      are ‘hashed’ in the same bucket.
2 Refine each enriched bucket using Expectation Maximization (EM)
3 Refine each enriched bucket using SP-STARσ
4 Maximize score to output best motif
86. Extra Slides Projection
Projection: Example
Given a set of DNA sequences S, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3.
S0 : C G G T C A G G
S1 : T T C G A C A T
S2 : A C G A T G A A
Figure: Set of t = 3 sequences, each of length n = 8
We generate the set of k random positions used in the actual projection.
Suppose we have the set I = {0, 1}.
For all $S_{i,j}$ in S, we get $h_I(S_{i,j})$ using the random positions in I generated
in step 1.
To hash the $S_{i,j}$s to their corresponding buckets, the list of $h_I(S_{i,j})$s,
together with their corresponding $S_{i,j}$s, is sorted lexicographically by
$h_I(S_{i,j})$.
87. Extra Slides Projection
Projection: Example
Label   $S_{i,j}$   $h_I(S_{i,j})$      Label   Sorted $S_{i,j}$   Sorted $h_I(S_{i,j})$
S0,0    CGGT        CG                  S2,0    ACGA               AC
S0,1    GGTC        GG                  S1,4    ACAT               AC
S0,2    GTCA        GT                  S2,3    ATGA               AT
S0,3    TCAG        TC                  S0,4    CAGG               CA
S0,4    CAGG        CA                  S0,0    CGGT               CG
S1,0    TTCG        TT                  S2,1    CGAT               CG
S1,1    TCGA        TC                  S1,2    CGAC               CG
S1,2    CGAC        CG                  S1,3    GACA               GA
S1,3    GACA        GA                  S2,2    GATG               GA
S1,4    ACAT        AC                  S0,1    GGTC               GG
S2,0    ACGA        AC                  S0,2    GTCA               GT
S2,1    CGAT        CG                  S1,1    TCGA               TC
S2,2    GATG        GA                  S0,3    TCAG               TC
S2,3    ATGA        AT                  S2,4    TGAA               TG
S2,4    TGAA        TG                  S1,0    TTCG               TT
Figure: Illustration showing the set of $h_I(S_{i,j})$s computed from step 2, together with the sorted list.
88. Extra Slides Projection
Projection: Example
To get the list of buckets, we perform a linear search over the sorted
$h_I(S_{i,j})$s and group the $S_{i,j}$s that have the same $h_I(S_{i,j})$.
$h_I(S_{i,j})$   Count   $S_{i,j}$
AC               2       {ACGA, ACAT}
AT               1       {ATGA}
CA               1       {CAGG}
CG               3       {CGGT, CGAT, CGAC}
GA               2       {GACA, GATG}
GG               1       {GGTC}
GT               1       {GTCA}
TC               2       {TCGA, TCAG}
TG               1       {TGAA}
TT               1       {TTCG}
Figure: Buckets obtained from Projection
89. Extra Slides Projection
Projection: Example
From the set of buckets obtained, we identify those that contain at
least δ hashed l-mers and consider them enriched. With δ = 3, only bucket CG
is enriched.
$h_I(S_{i,j})$   Count   $S_{i,j}$
AC               2       {ACGA, ACAT}
AT               1       {ATGA}
CA               1       {CAGG}
CG               3       {CGGT, CGAT, CGAC}
GA               2       {GACA, GATG}
GG               1       {GGTC}
GT               1       {GTCA}
TC               2       {TCGA, TCAG}
TG               1       {TGAA}
TT               1       {TTCG}
Figure: Buckets obtained from Projection
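The sort-then-scan bucketing walked through in this example can be sketched as follows. It is a minimal CPU illustration under the example's parameters (l = 4, I = {0, 1}, δ = 3); the function names are hypothetical, and sorting ties are broken by the l-mer itself, so the order inside a bucket may differ from the figure.

```python
from itertools import groupby


def lmers(sequence, l):
    """All length-l substrings S_{i,j} of one sequence."""
    return [sequence[j:j + l] for j in range(len(sequence) - l + 1)]


def project(lmer, positions):
    """h_I: keep only the symbols at the projection positions."""
    return "".join(lmer[p] for p in positions)


def buckets_by_sorting(sequences, l, positions):
    """Sort (projection, l-mer) pairs, then group equal projections in one pass."""
    keyed = sorted((project(m, positions), m)
                   for s in sequences for m in lmers(s, l))
    return {key: [m for _, m in group]
            for key, group in groupby(keyed, key=lambda pair: pair[0])}


def enriched(buckets, threshold):
    """Keep buckets that received at least `threshold` l-mers."""
    return {k: v for k, v in buckets.items() if len(v) >= threshold}


S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
buckets = buckets_by_sorting(S, 4, [0, 1])
print(enriched(buckets, 3))   # {'CG': ['CGAC', 'CGAT', 'CGGT']}
```

Because equal projections end up adjacent after sorting, the grouping step needs only a single linear pass over the sorted list.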
91. Extra Slides Expectation Maximization (EM)
Expectation Maximization (EM)
INPUT: Motif model $\theta_0$ from one enriched bucket, maximum number of
iterations y, and threshold for convergence $\delta_{EM}$
OUTPUT: Motif model $\theta_y$
1 For j in {1, . . . , y} or until convergence
  1 (E-step) For every l-mer in each sequence $S_i$,
    compute $E(S_{i,a_i} \mid \theta_j)$ given the current motif model.
  2 (M-step) For all $S_i$ in S,
    get the vector of starting positions s such that, for each $a_i \in s$,
    $E(S_{i,a_i} \mid \theta_j)$ is maximum over all $a_i$ in $\{0, \ldots, (n-l)\}$.
  3 (Test for Convergence) Compute $L(\theta_j)$. Compare the previous
    likelihood $L(\theta_{j-1})$ to the current $L(\theta_j)$.
    If the difference satisfies the threshold $\delta_{EM}$, stop iteration.
  4 (Update step) For the alignment made by the starting position vector s
    identified in the M-step,
    get motif model $\theta_{j+1}$.
92. Extra Slides Expectation Maximization (EM)
EM: Example
From the set of enriched buckets from Projection, EM performs the following
operations.
From $E_B$, get the alignment made by the hashed l-mers.
C G G T
C G A C
C G A T
From the alignment made, a profile matrix is computed.
     C G G T
     C G A C
     C G A T
A:   0 0 2 0
C:   3 0 0 1
G:   0 3 1 0
T:   0 0 0 2
93. Extra Slides Expectation Maximization (EM)
EM: Example
Normalize the profile matrix obtained.
A: 0.00 0.00 0.66 0.00
C: 1.00 0.00 0.00 0.33
G: 0.00 1.00 0.33 0.00
T: 0.00 0.00 0.00 0.66
To avoid zero values for $\Pr(S_{i,j} \mid \theta)$, [Tompa,2001] performed Laplace
correction. For each row corresponding to a symbol, say a, the
probability $p_a$ that symbol a appears in the sequence is added to every
entry of that row. Since all symbols in $\Sigma_{DNA}$ have a uniform frequency
distribution, 0.25 is added to each row.
A: 0.25 0.25 0.91 0.25
C: 1.25 0.25 0.25 0.58
G: 0.25 1.25 0.58 0.25
T: 0.25 0.25 0.25 0.91
94. Extra Slides Expectation Maximization (EM)
EM: Example
Normalize the matrix obtained and let the resulting matrix be the initial
motif model $\theta_0$.
A: 0.125 0.125 0.455 0.125
C: 0.625 0.125 0.125 0.290
G: 0.125 0.625 0.290 0.125
T: 0.125 0.125 0.125 0.455
For each $S_i$ in S, get the j in $\{0, \ldots, (n-l)\}$ for which $E(S_{i,j} \mid \theta_0)$ is
maximum. For instance, let us identify the l-mer in sequence $S_0$ with
maximum expectation $E(S_{0,j} \mid \theta_0)$.
$E(S_{0,0} \mid \theta_0) = E(\text{CGGT} \mid \theta_0) = ((0.625)(0.625)(0.290)(0.455))/(0.25^4) = 13.195$
$E(S_{0,1} \mid \theta_0) = E(\text{GGTC} \mid \theta_0) = ((0.125)(0.625)(0.125)(0.290))/(0.25^4) = 0.725$
$E(S_{0,2} \mid \theta_0) = E(\text{GTCA} \mid \theta_0) = ((0.125)(0.125)(0.125)(0.125))/(0.25^4) = 0.063$
$E(S_{0,3} \mid \theta_0) = E(\text{TCAG} \mid \theta_0) = ((0.125)(0.125)(0.455)(0.125))/(0.25^4) = 0.228$
$E(S_{0,4} \mid \theta_0) = E(\text{CAGG} \mid \theta_0) = ((0.625)(0.125)(0.290)(0.125))/(0.25^4) = 0.725$
Of all the $S_{0,j}$s in $S_0$, l-mer $S_{0,0}$ obtains the highest expectation.
95. Extra Slides Expectation Maximization (EM)
EM: Example
The set of l-mers with the highest expectation in each sequence
defines another alignment, like in Step 1. From this set of l-mers, we can
obtain the next motif model $\theta_1$.
S0,0 : C G G T : 13.195
S1,2 : C G A C : 13.195
S2,1 : C G A T : 20.703
We compute the likelihood of a motif model $\theta_y$ using the best
expectations.
$L(\theta) = 13.195 + 13.195 + 20.703 = 47.093$
Update the motif model $\theta_0$ to get $\theta_1$, using the set of l-mers from each
sequence that maximize the expectation.
Stop iteration if $L(\theta_y) - L(\theta_{y-1}) \le \delta_{EM}$.
The output of EM in this example is the consensus string CGAT.
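The first EM pass of this example can be sketched as below. It is a simplified illustration under stated assumptions: the model is rebuilt directly from the alignment with exact fractions (so scores differ slightly from hand-rounded two-decimal figures), the background is uniform at 0.25, and the function names are hypothetical rather than taken from the slides.

```python
ALPHABET = "ACGT"
BACKGROUND = 0.25   # uniform background probability per symbol


def initial_model(alignment):
    """Column frequencies, plus 0.25 per entry, renormalised per column."""
    l = len(alignment[0])
    model = {a: [0.0] * l for a in ALPHABET}
    for col in range(l):
        for a in ALPHABET:
            freq = sum(row[col] == a for row in alignment) / len(alignment)
            model[a][col] = freq + BACKGROUND
        total = sum(model[a][col] for a in ALPHABET)
        for a in ALPHABET:
            model[a][col] /= total
    return model


def expectation(lmer, model):
    """Product of model probabilities divided by 0.25 ** l."""
    score = 1.0
    for pos, symbol in enumerate(lmer):
        score *= model[symbol][pos] / BACKGROUND
    return score


def best_positions(sequences, l, model):
    """For each sequence, the start j of the l-mer with the highest expectation."""
    best = []
    for s in sequences:
        scores = [expectation(s[j:j + l], model) for j in range(len(s) - l + 1)]
        best.append(max(range(len(scores)), key=scores.__getitem__))
    return best


def consensus(alignment):
    """Most frequent symbol in each column."""
    return "".join(max(ALPHABET, key=lambda a: sum(r[c] == a for r in alignment))
                   for c in range(len(alignment[0])))


S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
theta0 = initial_model(["CGGT", "CGAC", "CGAT"])
starts = best_positions(S, 4, theta0)
motif = consensus([s[j:j + 4] for s, j in zip(S, starts)])
```

Here `starts` selects S0,0, S1,2 and S2,1, and the consensus of those three l-mers is CGAT; a full EM run would repeat the scoring with the updated model until the likelihood change falls below the threshold.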
97. Extra Slides Expectation Maximization (EM)
SP-STARσ
INPUT: Consensus string M from $\theta_y$ and expected mismatches d
OUTPUT: Refined consensus string $M^*$
1 For j in {1, . . . , y} or until convergence
  1 Compute $S_b$, where $S_b$ is the set containing, for each sequence, the l-mer
    that has the least edit distance from M.
    $S_b = \{S_{i,j} \mid d_E(M, S_{i,j}) \text{ is minimum } \forall S_{i,j} \text{ in } S_i\}$
  2 Compute the score $\sigma(S_b)$, which is equal to the number of sequences in
    $S_b$ such that
    $d_E(M, S_{i,j}) \le d$
  3 Compute the consensus string $M'$ from the alignment made by $S_b$.
  4 Compute $S_b'$ from $M'$.
  5 Compute $\sigma(S_b')$.
  6 If $\sigma(S_b') > \sigma(S_b)$, continue iteration using $M = M'$,
    else $M^* = M$.
98. Extra Slides Expectation Maximization (EM)
SP-STARσ: Example
Using M = CGAT and expected mismatches d = 1.
Compute $S_b$. For $S_0$, the closest $S_{0,j}$ is identified as follows.
$d_E(M, S_{0,0}) = d_E(\text{CGAT}, \text{CGGT}) = 1$
$d_E(M, S_{0,1}) = d_E(\text{CGAT}, \text{GGTC}) = 3$
$d_E(M, S_{0,2}) = d_E(\text{CGAT}, \text{GTCA}) = 4$
$d_E(M, S_{0,3}) = d_E(\text{CGAT}, \text{TCAG}) = 3$
$d_E(M, S_{0,4}) = d_E(\text{CGAT}, \text{CAGG}) = 3$
The set $S_b$ contains
$S_b = \{S_{0,0}, S_{1,2}, S_{2,1}\}$
$S_b = \{\text{CGGT}, \text{CGAC}, \text{CGAT}\}$
99. Extra Slides Expectation Maximization (EM)
SP-STARσ: Example
The score for $S_b$ is
$\sigma(S_b) = 3$
because the least edit distances in the three sequences are 1, 1, and 0. That is,
all 3 sequences satisfy
$d_E(M, S_{i,j}) \le 1$
The consensus string from $S_b$ is $M'$ = CGAT.
$S_b'$ from $M'$ is the same as $S_b$.
$S_b' = \{S_{0,0}, S_{1,2}, S_{2,1}\}$
$S_b' = \{\text{CGGT}, \text{CGAC}, \text{CGAT}\}$
Since $\sigma(S_b') = \sigma(S_b)$,
$M^* = M$ = CGAT.
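The refinement loop of this example can be sketched as below. It is an illustrative reading of the steps, not the original implementation: the distance $d_E$ is assumed to count mismatching positions between equal-length strings (which matches the distances listed in the example), and the function names and the `max_iter` cap are hypothetical.

```python
ALPHABET = "ACGT"


def distance(a, b):
    """Assumed d_E: number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_lmers(sequences, motif):
    """S_b: from each sequence, the l-mer closest to the motif."""
    l = len(motif)
    return [min((s[j:j + l] for j in range(len(s) - l + 1)),
                key=lambda m: distance(motif, m))
            for s in sequences]


def sigma(chosen, motif, d):
    """Number of chosen l-mers within d mismatches of the motif."""
    return sum(distance(motif, m) <= d for m in chosen)


def consensus(alignment):
    """Most frequent symbol in each column."""
    return "".join(max(ALPHABET, key=lambda a: sum(r[c] == a for r in alignment))
                   for c in range(len(alignment[0])))


def sp_star(sequences, motif, d, max_iter=10):
    """Replace the motif by the consensus of S_b while the score improves."""
    chosen = closest_lmers(sequences, motif)
    score = sigma(chosen, motif, d)
    for _ in range(max_iter):
        candidate = consensus(chosen)
        new_chosen = closest_lmers(sequences, candidate)
        new_score = sigma(new_chosen, candidate, d)
        if new_score <= score:          # no improvement: keep the current motif
            break
        motif, chosen, score = candidate, new_chosen, new_score
    return motif, chosen, score


S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
refined, chosen, score = sp_star(S, "CGAT", d=1)
```

On the three example sequences the first candidate consensus equals the input motif and the score stays at 3, so the loop stops immediately and CGAT is returned.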