Parallel Random Projection for Motif Discovery on GPUs

Finding Planted (l, d)-Motifs in Parallel
using Random Projection on GPUs

Jhoirene Barasi Clemente

Algorithms and Complexity Laboratory
Department of Computer Science
University of the Philippines-Diliman
jbclemente@up.edu.ph

March 31, 2012

J.B. Clemente (ACLab, DCS, UPD) CUDA-FMURP March 31, 2012 1 / 88

Overview

Introduction
Deﬁnitions and Notations
Finding Motifs using Random Projection (FMURP)
Parallel Implementations of CUDA-FMURP
Results and Analysis
Conclusion


Introduction

In this work, we are interested in solving the Planted (l, d)-Motif Problem
using Finding Motifs using Random Projection (FMURP).
The focus of this study is the parallelization of FMURP. We present
three versions of the parallel algorithm and discuss the correctness of
the parallelization.
We implement two of these parallel algorithms on GPUs and present both
theoretical and actual performance analyses.


Introduction

A DNA motif is defined as a nucleic acid sequence pattern that has some
biological significance, such as being a DNA binding site for a regulatory
protein, i.e., a transcription factor [Das,2007].


Introduction

DNA Sequences as Strings


Introduction

The pattern is fairly short (5 to 20 base pairs (bp) long) and is known to recur
in different genes or several times within a gene [Rombauts,1999].


Notations

Set of t sequences S.

Example 1 (Sequences S = {S0 , S1 , . . . , S(t−1) })
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C

Set of sequences S = {S0 , S1 , S2 , S3 , S4 , S5 , S6 }
deﬁned over ΣDNA = {A, C, T, G},
where each sequence Si in S has length ni = 40 for all i ∈ {0, . . . , (t − 1)}



Notations

An l-mer is a string of length l deﬁned over ΣDNA .
To denote an l-mer in S, we use
Si,j , where i ∈ {0, 1, . . . , (t − 1)} is the sequence number
and j ∈ {0, 1, . . . , (n − l)} is the starting position in Si .

Example 2 (Si,j in S)
For instance, an 8-mer S0,7 is

ATGGAACT




Notations

Let s = (a0 , a1 , . . . , a(t−1) ) be a vector of starting positions in S,
where ai ∈ {0, 1, . . . , (n − l)}.
Let A(s) denote the alignment made by the l-mers in the set
{S0,a0 , S1,a1 , . . . , S(t−1),a(t−1) }.



Notations

Example 3 (Alignment matrix A(s))
Suppose we have the starting position vector s = (7, 18, 2, 4, 30, 26, 14).

A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G

Each row Si,ai is the 8-mer of sequence Si from Example 1 that starts at position ai .



Notations

A profile matrix P(s) with dimensions (|ΣDNA | × l) is derived
from the frequency of each letter in each column of A(s).

Example 4 (Profile Matrix P(s))
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G

P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1



Notations

From P(s), we define MP(s) (j), where 0 ≤ j ≤ (l − 1), to be the maximum
count in the jth column of the profile matrix.

Example 5 (MP(s) (j))
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G

P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1

MP(s) (j): 7 7 7 6 6 7 7 6



Notations
A consensus string is an l-mer whose ith element is the nucleotide base
that attains the maximum count MP(s) (i) in column i of P(s).

Example 6 (Consensus String)
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G

P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1

Consensus String: A T G C A A C T


Notations

We define Score(s, S) to be
\[ \mathrm{Score}(s, S) = \sum_{i=0}^{l-1} M_{P(s)}(i). \qquad (1) \]

Example 7 (Consensus Score)
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1

Score(s, S) = 7 + 7 + 7 + 6 + 6 + 7 + 7 + 6 = 53
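
The notation above can be checked with a short Python sketch. It rebuilds the profile matrix, the consensus string, and Score(s, S) for the alignment of Example 3. The function names `profile` and `consensus_and_score` are illustrative and are not taken from the original implementation.

```python
ALPHABET = "ACGT"

def profile(alignment):
    """Count each symbol per column of an alignment of equal-length l-mers."""
    length = len(alignment[0])
    counts = {symbol: [0] * length for symbol in ALPHABET}
    for lmer in alignment:
        for column, symbol in enumerate(lmer):
            counts[symbol][column] += 1
    return counts

def consensus_and_score(alignment):
    """Return the consensus string and the consensus score of an alignment."""
    counts = profile(alignment)
    consensus = ""
    score = 0
    for column in range(len(alignment[0])):
        # Symbol with the maximum count in this column (M_P(s)(column)).
        best = max(ALPHABET, key=lambda symbol: counts[symbol][column])
        consensus += best
        score += counts[best][column]
    return consensus, score

# Alignment A(s) of Example 3, s = (7, 18, 2, 4, 30, 26, 14).
alignment = [
    "ATGGAACT", "ATGCCACT", "ATGCAACT", "ATGCAACT",
    "ATGCAACT", "ATGCAACT", "ATGCAACG",
]
consensus, score = consensus_and_score(alignment)
print(consensus, score)  # ATGCAACT 53
```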


Motif Finding Problem

Deﬁnition 8 (Motif Finding Problem [Pevzner,2004])

INPUT:
A motif length l
A set of t sequences S = {S0 , S1 , S2 , . . . , S(t−1) },
where each Si is of length ni
OUTPUT:
An array of starting positions s = (a0 , a1 , . . . , a(t−1) )
maximizing consensus Score(s,S)


Naive MFP Solver [Pevzner,2004]

Input: DNA sequences S, motif length l
Output: Starting positions s and the consensus string corresponding to s
1 For each possible starting position vector in S,
  i.e. s ∈ {(0, 0, . . . , 0), . . . , ((n − l), (n − l), . . . , (n − l))}:
    1 Get the alignment A(s).
    2 Compute P(s).
    3 Evaluate Score(s, S).
2 From the s with the maximum Score, get the consensus string.
3 Output the consensus string.
Step 1 iterates $(n - l + 1)^t$ times, because each of the t entries of

s = (a0 , a1 , . . . , a(t−1) ), ai ∈ {0, . . . , (n − l)},

can independently take any of the (n − l + 1) starting positions.
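
As a rough illustration of the exhaustive search, the Python sketch below enumerates every starting-position vector for three short sequences (the ones used in the later FMURP example) with l = 4. The helper names are hypothetical. The count `tried` makes the (n − l + 1)^t growth visible: 5^3 = 125 alignments even for this toy input.

```python
from itertools import product

def score(lmers):
    """Consensus score: sum over columns of the highest symbol count."""
    return sum(max(column.count(symbol) for symbol in "ACGT")
               for column in zip(*lmers))

def naive_motif_search(sequences, l):
    """Try every starting-position vector s and keep the best-scoring one."""
    ranges = [range(len(seq) - l + 1) for seq in sequences]
    best_s, best_score, tried = None, -1, 0
    for s in product(*ranges):
        tried += 1
        current = score([seq[a:a + l] for seq, a in zip(sequences, s)])
        if current > best_score:
            best_s, best_score = s, current
    return best_s, best_score, tried

sequences = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
best_s, best_score, tried = naive_motif_search(sequences, 4)
print(best_s, best_score, tried)
```

For the challenge instance (t = 20, n = 600, l = 15) the same loop would need 586^20 iterations, which is why the exhaustive approach is impractical.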


Definitions

Definition 9 (Challenge Problem [Pevzner,2000])
INPUT:
Motif length l = 15,
Expected mismatches d = 4,
20 DNA sequences each with ni = 600 nucleotide bases
OUTPUT:
A consensus string M from an alignment A(s), where each l-mer Si,ai in A(s)
has

dE (M, Si,ai ) = 4,
for all i ∈ {0, . . . , (t − 1)}.



Why challenging?

Suppose we have A(s),

S0,a0 A C T T G G G G C A A G A G G
S1,a1 G G A C G G G G C A G A C T G
S2,a2 A C T T G C T A A A G A C T G
S3,a3 A C T G C G G G C A C A G T G
S4,a4 A C C T G G G T C G T A C T G
A: 4 0 1 0 0 0 0 1 1 4 1 4 1 0 0
C: 0 4 1 1 1 1 0 0 4 0 1 0 2 0 0
T: 0 0 3 3 0 0 1 1 0 0 1 0 1 4 0
G: 1 1 0 1 4 4 4 3 0 1 2 1 1 1 5
A C T T G G G G C A G A C T G

dE (S0,a0 , S1,a1 ) = 2d = 8
Score(s, S) = 4 + 4 + 3 + 3 + 4 + 4 + 4 + 3 + 4 + 4 + 2 + 4 + 2 + 4 + 5 = 54
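
The difficulty can be reproduced numerically. The sketch below takes the consensus row of the slide as the planted motif M (an assumption made for illustration) and counts mismatches position by position. Each instance is within d = 4 of M, yet two instances differ from each other in 2d = 8 positions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

motif = "ACTTGGGGCAGACTG"          # consensus row of the slide
instances = ["ACTTGGGGCAAGAGG",    # S0,a0
             "GGACGGGGCAGACTG"]    # S1,a1

to_motif = [hamming(motif, instance) for instance in instances]
between = hamming(instances[0], instances[1])
print(to_motif, between)  # [4, 4] 8
```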



Deﬁnitions

Deﬁnition 10 (Planted (l, d)-Motif Finding Problem [Tompa,2001])

INPUT:
Motif length l,
Expected number of mismatches d, and
A set of t sequences S = {S0 , S1 , S2 , . . . , S(t−1) }, where each Si is of
length ni
OUTPUT:

A consensus string M from an alignment A(s), where each l-mer Si,ai in A(s)
has
dE (M, Si,ai ) = d,
for all i ∈ {0, . . . , (t − 1)}.



Solutions for Planted (l, d)-Motif Finding

SP-STAR [Pevzner,2000]
Winnower [Pevzner,2000]
Random Projection [Tompa,2001]
Aggregation [Mohammed,2004]
GibbsDST [Shida,2006]




INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 Projection
    1 Get all l-mers Si,j in S.
    2 Get the projection hI (Si,j ) for each Si,j in S.
    3 Hash each Si,j to the bucket with identifier hI (Si,j ).
    4 Get the enriched buckets.
2 Refine each enriched bucket using EM
3 Refine each enriched bucket using SP-STARσ
4 Maximize the score to output the best motif



Definition 11 (Random Projection)
Given an l-mer Si,j , a projection dimension k, and a set
I ⊂ L = {0, . . . , (l − 1)} with |I| = k, whose elements are chosen at random
from L and sorted in increasing order, the k-dimensional projection of Si,j is
\[ h_I(S_{i,j}) = S_{i,j}(I_0)\, S_{i,j}(I_1) \cdots S_{i,j}(I_{k-1}), \]
where hI (Si,j ) is a k-mer and Ii denotes the ith element of I.
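
A minimal Python sketch of the definition follows. The names `choose_positions` and `project` and the optional `seed` parameter are assumptions introduced for illustration.

```python
import random

def choose_positions(l, k, seed=None):
    """Pick k distinct positions from {0, ..., l-1}, sorted increasingly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(l), k))

def project(lmer, positions):
    """k-dimensional projection h_I: keep only the symbols at positions I."""
    return "".join(lmer[p] for p in positions)

print(project("CGGT", [0, 1]))         # CG
print(project("ATGGAACT", [0, 2, 5]))  # AGA
print(choose_positions(8, 3, seed=1))  # three sorted positions in 0..7
```

Two l-mers that differ only at positions outside I receive the same projection, which is what lets mutated instances of a planted motif meet in one bucket.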



FMURP: Example

Example 12
Given a set of DNA sequences S, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3.

S0 : C G G T C A G G
S1 : T T C G A C A T
S2 : A C G A T G A A
Figure: Set of t = 3 sequences each with n = 8

Let I = {0, 1}.



Projection


Parallel Motif Finding using Random Projection

How do we parallelize FMURP?

Sequential:
1 Projection
    1 Get all l-mers Si,j in S.
    2 Get the projection hI (Si,j ) for each Si,j in S.
    3 Hash each Si,j to the bucket with identifier hI (Si,j ).
    4 Get the enriched buckets.
2 Refine each enriched bucket using EM
3 Refine each enriched bucket using SP-STARσ
4 Maximize the score to output the best motif

Parallel:
1 Projection
    1 Get all l-mers Si,j in S in parallel.
    2 Get the projection hI (Si,j ) for each Si,j in S in parallel.
    3 Hash each Si,j to the bucket with identifier hI (Si,j ) in parallel.
    4 Get the enriched buckets in parallel.
2 Refine each enriched bucket using EM in parallel
3 Refine each enriched bucket using SP-STARσ in parallel
4 Maximize the score to output the best motif.



Parallel Algorithms for Motif Finding

CUDA-MEME
CUDA-Gibbs Sampling



CUDA


Parallel Motif Finding using Random Projection (CUDA-FMURP v1)

Computing Framework

Figure: Flowchart showing the processes done in the CPU and GPU



CUDA-FMURP v1

Figure: Thread ID is denoted by an ordered pair (i, j), 0 ≤ i ≤ w and 0 ≤ j ≤ v, where v is
the maximum number of threads per block and w is the number of thread blocks allocated in the
grid. The algorithm uses a total of x = t · (n − l + 1) threads that are arranged linearly on the GPU.



CUDA-FMURP v1
INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k,
and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ with |I| = k.
2 In GPU, for each thread tid in {0, . . . , (x − 1)},
    1 Get hI (Si,j ) for each Si,j in S.
    2 Convert each k-mer hI (Si,j ) to its corresponding integer representation $k^*_{i,j}$.
    3 Perform a linear search over all $k^*_{i,j}$s to determine which l-mers
      are ‘hashed’ in the same bucket. The tids of the matched $k^*_{i,j}$s are noted instead
      of the actual l-mers.
3 In CPU, identify the set of enriched buckets,
  and prune duplicates in preparation for EM refinement.
4 In GPU, for each tid in {0, . . . , (e − 1)},
    1 Perform EM refinement for each enriched bucket.
    2 Perform SP-STARσ for each enriched bucket.
    3 Maximize the σ score to output the best motif.


Integer Conversion
Step 2.2 converts each hI (Si,j ) to its corresponding integer representation
$k^*_{i,j}$. Given a k-mer from the projection, the corresponding integer is
computed using the following mapping. Let us define
\[ f : \Sigma_{DNA} \to \{0, 1, 2, 3\}, \qquad A \mapsto 0,\; C \mapsto 1,\; G \mapsto 2,\; T \mapsto 3, \]
where each symbol in the DNA alphabet is mapped to a unique integer.
For a string v of length k,
\[ f^* : \Sigma_{DNA}^{+} \to \mathbb{Z}^{+} \cup \{0\}, \qquad v \mapsto \sum_{i=0}^{k-1} f(v_i)\, 4^i, \]
where vi denotes the symbol at the ith position, counted from the least significant
digit, and the integer representation takes values in the non-negative integers.
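
The conversion can be sketched in Python as follows. The sketch assumes that the rightmost symbol of the k-mer is the least significant digit. The function names and the inverse mapping are added for illustration only.

```python
F = {"A": 0, "C": 1, "G": 2, "T": 3}
INVERSE = "ACGT"

def kmer_to_int(kmer):
    """Base-4 value of a k-mer, rightmost symbol least significant."""
    value = 0
    for symbol in kmer:
        value = value * 4 + F[symbol]
    return value

def int_to_kmer(value, k):
    """Recover the k-mer of fixed length k from its integer representation."""
    symbols = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        symbols.append(INVERSE[digit])
    return "".join(reversed(symbols))

print(kmer_to_int("CG"))   # 1*4 + 2 = 6
print(int_to_kmer(6, 2))   # CG
```

Because every projection has the same length k, leading A symbols (digit 0) cause no ambiguity: the 4^k possible k-mers map to the 4^k integers 0, ..., 4^k − 1.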


CUDA-Projection v1: Example
Given a set of DNA sequences, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3. Projection in parallel is shown as follows



CUDA-Projection v1: Integer Conversion example



CUDA-Projection: Parallel Integer Conversion Example



CUDA-Projection: Getting enriched buckets



CUDA-EM



CUDA-SP-STARσ





Correctness v1

The uniqueness of the representation we defined using f ∗ follows from the
results below.
Let Σk = {0, 1, 2, . . . , k − 1}, and let Ck be the regular language
\[ C_k = \{\epsilon\} \cup (\Sigma_k - \{0\})\,\Sigma_k^{*}. \]

Theorem 4.1 (Fundamental Theorem of base-k Representation
[Allouche,2003])
Let k ≥ 2 be an integer. Then every non-negative integer has a unique
representation of the form
\[ N = \sum_{i=0}^{t} a_i k^i, \]
where $a_t \neq 0$ and $0 \le a_i < k$ for $0 \le i \le t$.



Correctness v1

In the case of our representation f ∗ , we have k = 4 and ai = f (vi ), where
vi ∈ ΣDNA . Note that the mapping f is one-to-one and onto by deﬁnition. Thus
we have the following:
Proposition 4.1

f ∗ provides a unique representation of hI (Si,j ), for each i, j, and element of I.





Correctness v1

We have to show that the set of enriched buckets EB obtained in FMURP is
equivalent to the set of enriched buckets $\overline{EB}$ obtained in CUDA-FMURP v1.

EB = {B | |B| ≥ δ}.
Two elements Si,j and Si′,j′ belong to the same bucket B if they satisfy the
relation R defined below.
Definition 13 (Relation R)
\[ (S_{i,j}, S_{i',j'}) \in B \iff (S_{i,j}, S_{i',j'}) \in R \]
\[ (S_{i,j}, S_{i',j'}) \in R \iff h_I(S_{i,j}) = h_I(S_{i',j'}) \]

Proposition 4.2
R is an equivalence relation.



Correctness v1
In CUDA-FMURP v1, the set of enriched buckets is defined as
\[ \overline{EB} = \{\bar{B} \mid |\bar{B}| \ge \delta\}, \]
where $\bar{B}$ is a bucket in CUDA-FMURP and two elements p and q belong to
the same bucket $\bar{B}$ if they satisfy the relation $\bar{R}$ defined below.

Definition 14 (Relation $\bar{R}$)
\[ (p, q) \in \bar{B} \iff (p, q) \in \bar{R} \]
\[ (p, q) \in \bar{R} \iff k^*_{i,j} = k^*_{\bar{i},\bar{j}} \]
where $i = \lfloor p/(n-l+1) \rfloor$, $j = p \bmod (n-l+1)$, $\bar{i} = \lfloor q/(n-l+1) \rfloor$, and
$\bar{j} = q \bmod (n-l+1)$.

Lemma 15
Relations R and $\bar{R}$ are equivalent.



Correctness
Note that the elements of B are l-mers Si,j , while the elements of $\bar{B}$ are
integers p ∈ {0, . . . , (x − 1)}. Using the equations
\[ tid = i \times (n - l + 1) + j \qquad (2) \]
\[ i = \left\lfloor \frac{tid}{n - l + 1} \right\rfloor \qquad (3) \]
\[ j = tid \bmod (n - l + 1) \qquad (4) \]
we can retrieve the l-mer Si,j corresponding to tid and vice versa. The theorem
below follows from the fact that R and $\bar{R}$ are equivalent.
Theorem 4.2
The sets of enriched buckets EB and $\overline{EB}$ are equivalent.
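
Equations (2) to (4) describe a bijection between thread IDs and (sequence, position) pairs. A small Python sketch, with hypothetical function names, checks the round trip for the toy instance t = 3, n = 8, l = 4:

```python
def to_tid(i, j, n, l):
    """Equation (2): linear thread ID of l-mer S_{i,j}."""
    return i * (n - l + 1) + j

def from_tid(tid, n, l):
    """Equations (3) and (4): recover (i, j) from a thread ID."""
    return divmod(tid, n - l + 1)

t, n, l = 3, 8, 4
print(to_tid(1, 2, n, l))   # 1*5 + 2 = 7
print(from_tid(7, n, l))    # (1, 2)

# Every (i, j) maps to a distinct tid in {0, ..., x-1} and back again.
pairs = [(i, j) for i in range(t) for j in range(n - l + 1)]
tids = [to_tid(i, j, n, l) for i, j in pairs]
```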


Parallel Motif Finding using Random Projection (CUDA-FMURP v2)

CUDA-FMURP v2
OUTPUT: Motif
1 Get hI (Si,j )s for all Si,j s in S,
where i ∈ {0, . . . , (t − 1)} and j ∈ {0, . . . , (n − l)}.
2 Convert each k-mer hI (Si,j ) to its corresponding
integer representation $k^*_{i,j}$.
3 In CPU, hash the list of $k^*_{i,j}$s.
4 In CPU, identify the set of enriched buckets.




Hash Table in CPU



Hash Table in CPU

To avoid collisions between two items with different $k^*_{i,j}$s, linear probing is
implemented.
Suppose we hash item p with key $k^*_{i',j'}$ and find that position $h(k^*_{i',j'})$ is not
empty,
i.e. there exists $k^*_{i,j}$ such that $h(k^*_{i,j}) = h(k^*_{i',j'})$ and $k^*_{i,j} \neq k^*_{i',j'}$.

We then have to look for an empty position in the table where we can place item p.
We explore positions
\[ h'(k^*_{i',j'}, i) = (h(k^*_{i',j'}) + i) \bmod x \]
for i from 0 to (m − 1), until an empty hash table position is found.
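
The probing scheme can be sketched in Python as below. The base hash `h` (key modulo table size), the table layout, and the choice to append items that share a key to the same slot are assumptions made for illustration. The slides do not specify them.

```python
def h(key, size):
    """Hypothetical base hash function: key modulo table size."""
    return key % size

def insert(table, key, item):
    """Place item under key, probing linearly past slots held by other keys."""
    size = len(table)
    for i in range(size):
        slot = (h(key, size) + i) % size
        if table[slot] is None:            # empty position found
            table[slot] = (key, [item])
            return slot
        if table[slot][0] == key:          # same key: same bucket
            table[slot][1].append(item)
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 4
print(insert(table, 6, 0))  # h(6) = 2, slot 2 is empty
print(insert(table, 2, 1))  # h(2) = 2 is taken by key 6, probe to slot 3
print(insert(table, 6, 2))  # key 6 again, joins the bucket in slot 2
```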



CUDA-FMURP v3
OUTPUT: Motif
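1 In CPU, generate k random positions for projection.
  Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ with |I| = k.
2 In GPU, for each thread tid in {0, . . . , (t − 1)},
    1 Get hI (Stid,j )s for all Stid,j s in S,
      where j ∈ {0, . . . , (n − l)}.
    2 Convert each k-mer hI (Stid,j ) to its corresponding
      integer representation $k^*_{tid,j}$.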
3 ∗


CUDA-Projection v3



Integer Conversion


Result and Analysis

Running Time and Space Complexity

Algorithm        Time         Space              Number of Processors
FMURP            O(x log x)   O(x)               1
SEQ-FMURP        O(x^2)       O(e(n − l + 1))    1
CUDA-FMURP v1    O(x)         O(e(n − l + 1))    x
CUDA-FMURP v2    O(x)         O(e(n − l + 1))    x
CUDA-FMURP v3    O(x)         O(e(n − l + 1))    t

Table: Total running time and space complexity of the three parallel algorithms for
CUDA-FMURP in comparison with the two sequential implementations.


Speedup and Efficiency

FMURP: O(x log x)
Speedup is the ratio of the sequential running time to the parallel running
time:
\[ S_P = \frac{\text{Sequential}}{\text{Parallel}} \]
The speedups $S_{P_1}$, $S_{P_2}$, and $S_{P_3}$ of CUDA-FMURP versions 1 to 3,
respectively, are compared below.
\[ S_{P_1} = S_{P_2} = S_{P_3} = \frac{O(x \log x)}{O(x)} = O(\log x) \]


Speedup and Efficiency
The processor efficiency is computed from the speedup $S_P$ and the
number of processors used $\hat{p}$:
\[ E_P = \frac{1}{\hat{p}} \cdot S_P \]
The efficiencies $E_{P_1}$, $E_{P_2}$, and $E_{P_3}$ of CUDA-FMURP versions 1 to
3, respectively, are compared below.
\[ E_{P_1} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x} \qquad (5) \]
\[ E_{P_2} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x} \qquad (6) \]
\[ E_{P_3} = \frac{1}{t} \cdot O(\log x) = \frac{\log x}{t} \qquad (7) \]
\[ E_{P_1} = E_{P_2} < E_{P_3} \]
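
To give a feel for the magnitudes, the sketch below plugs the challenge instance (t = 20, n = 600, l = 15) into the expressions above. This is only a rough illustration: the asymptotic bounds hide constant factors, and the base-2 logarithm is an assumption.

```python
import math

t, n, l = 20, 600, 15
x = t * (n - l + 1)        # total number of l-mers, one thread each in v1 and v2

log_x = math.log2(x)       # base 2 chosen for illustration
eff_v1 = log_x / x         # versions 1 and 2 use x processors
eff_v3 = log_x / t         # version 3 uses t processors

print(x)                   # 11720
print(round(log_x, 2), eff_v1, eff_v3)
```

Version 3 spreads the same asymptotic speedup over far fewer processors, which is why its efficiency expression is the largest of the three.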

Dataset

t n l d Instances generated
20 600 10 2 100
20 600 11 2 100
20 600 12 3 100
20 600 13 3 100
20 600 14 4 100
20 600 15 4 100
20 600 16 5 100
20 600 17 5 100
20 600 18 6 100
20 600 19 6 100
Table: Summary of the generated dataset used to determine the accuracy of
CUDA-FMURP. For each instance generated, the OOPS search model is
assumed, that is, each sequence contains exactly one occurrence of the planted motif.


Accuracy

t n l d FMURP FMURP∗ SEQ-FMURP CUDA-FMURP m
20 600 10 2 13 100 98 98 72
20 600 11 2 99 100 100 100 16
20 600 12 3 3 96 83 83 259
20 600 13 3 81 100 100 100 62
20 600 14 4 1 86 79 79 645
20 600 15 4 49 100 100 100 172
20 600 16 5 0 77 53 53 1292
20 600 17 5 19 98 98 98 378
20 600 18 6 0 82 38 38 2217
20 600 19 6 9 98 94 94 711

Table: Number of correctly identified planted motifs over 100
random input instances. For each instance, parameters k = 7 and s = 4 are
used. The column labelled FMURP∗ is based on the results presented in
[Tompa,2001] using the dataset they generated.


Machine Setups

System specifications       Values
Host processors (procs)     2 × Intel Quad-core 2.26GHz
Total number of cores       8
Max host RAM                12GB
Device/s (GPU/s)            2 × NVIDIA GT120
Compute capability          1.1
CUDA Cores/GPU              4 (multiprocs) × 8 (cores/proc) = 32
GPU clock rate              1.40 GHz
Memory clock rate           500 MHz
Max device global memory    512MB
Operating system            Mac OS X 10.6.8
CUDA version                3.2

System specifications       Values
Host processors (procs)     Core(TM) i7-2600 CPU 3.40GHz
Total number of cores       4 × 2 (hyperthreaded) = 8
Max host RAM                8GB
Device/s (GPU/s)            1 × NVIDIA GeForce GTX 580
Compute capability          2.0
CUDA Cores/GPU              16 (multiprocs) × 32 (cores/proc) = 512
GPU clock rate              1.54 GHz
Memory clock rate           2004 MHz
Max device global memory    1535MB
Operating system            64-bit Ubuntu 10.0.4
CUDA version                4.1


Actual speed of CUDA-Projection v3 with respect to
CUDA-Projection v1



Actual speed of CUDA-FMURP v1 and CUDA-Projection
v3



Actual Speed Result: Setup1



Memory Requirement



Actual speed comparison and speedup of CUDA-FMURP
v1 with respect to SEQ-FMURP and FMURP using Setup 2


Conclusion

In this work, we presented three versions of parallel algorithms for FMURP.

Algorithm Processors SP wrt FMURP SP wrt SEQ-FMURP Efﬁciency
CUDA-FMURP v1 x O(log x) O(x) (log x/x)
CUDA-FMURP v2 x O(log x) O(x) (log x/x)
CUDA-FMURP v3 t O(log x) O(x) (log x/t)

We implemented CUDA-FMURP v1 and CUDA-FMURP v2 and achieved a
maximum actual speedup of 6.8 and 6.6 respectively with respect to the
SEQ-FMURP.




References

J.P. Allouche and J. Shallit, “Automatic Sequences: Theory, Applications, Generalizations”, Cambridge University Press, Chapter 3: Numeration Systems, pp. 70-73, 2003.
P. Pevzner and S.H. Sze, “Combinatorial Approaches to Finding Subtle Signals in DNA Sequences”, Proceedings of the 8th Int. Conf. on Intelligent Systems for Molecular Biology (ISMB), pp. 269-278, 2000.
J. Buhler and M. Tompa, “Finding Motifs Using Random Projections”, RECOMB ’01: Proceedings of the Fifth Annual International Conference on Computational Biology, 2001.
D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach”, 1st ed., MA, USA: Morgan Kaufmann, 2010.
M. Harris, “Mapping Computational Concepts to GPUs”, ACM SIGGRAPH 2005 Courses, NY, USA, 2005.
N. Jones and P. Pevzner, “An Introduction to Bioinformatics Algorithms”, Massachusetts Institute of Technology Press, 2004.

Extra Slides


OUTPUT: Motif
1 Projection
    1 Generate k random positions for projection.
      Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ with |I| = k.
    2 For each Si,j in S,
        1 Get hI (Si,j )s from all Si,j s in S,
          where i ∈ {0, . . . , (t − 1)} and j ∈ {0, . . . , (n − l)}.
        2 Sort the Si,j s with respect to hI (Si,j ).
        3 Perform a linear search over all hI (Si,j )s to determine which l-mers
          are ‘hashed’ in the same bucket.
2 Refine each enriched bucket using Expectation Maximization (EM)
3 Refine each enriched bucket using SP-STARσ
4 Maximize the score to output the best motif


Extra Slides Projection

Projection: Example
Given a set of DNA sequences S, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3.

S0 : C G G T C A G G
S1 : T T C G A C A T
S2 : A C G A T G A A
Figure: Set of t = 3 sequences each with n = 8

We generate the set of k random positions used in the actual projection.
Suppose we have the set I = {0, 1}.
For all Si,j in S, we get hI (Si,j ) using the random positions in I generated
in step 1.
To hash the Si,j s to their corresponding buckets, the list of hI (Si,j )s is
sorted lexicographically, together with the corresponding Si,j s.


Projection: Example
Label   Si,j   hI (Si,j )      Label   Sorted Si,j   Sorted hI (Si,j )
S0,0    CGGT   CG              S2,0    ACGA          AC
S0,1    GGTC   GG              S1,4    ACAT          AC
S0,2    GTCA   GT              S2,3    ATGA          AT
S0,3    TCAG   TC              S0,4    CAGG          CA
S0,4    CAGG   CA              S0,0    CGGT          CG
S1,0    TTCG   TT              S2,1    CGAT          CG
S1,1    TCGA   TC              S1,2    CGAC          CG
S1,2    CGAC   CG              S1,3    GACA          GA
S1,3    GACA   GA              S2,2    GATG          GA
S1,4    ACAT   AC              S0,1    GGTC          GG
S2,0    ACGA   AC              S0,2    GTCA          GT
S2,1    CGAT   CG              S1,1    TCGA          TC
S2,2    GATG   GA              S0,3    TCAG          TC
S2,3    ATGA   AT              S2,4    TGAA          TG
S2,4    TGAA   TG              S1,0    TTCG          TT
Figure: Illustration showing the set of hI (Si,j )s computed from step 2 (left) and the sorted list (right).


Projection: Example
To get the list of buckets, we perform a linear search over the sorted hI (Si,j )s
and group the Si,j s that have equal hI (Si,j )s.

hI (Si,j ) Count Si,j
AC 2 {ACGA, ACAT}
AT 1 {ATGA}
CA 1 {CAGG}
CG 3 {CGGT, CGAT, CGAC}
GA 2 {GACA, GATG}
GG 1 {GGTC}
GT 1 {GTCA}
TC 2 {TCGA, TCAG}
TG 1 {TGAA}
TT 1 {TTCG}
Figure: Buckets obtained from Projection



Projection: Example
From the set of buckets obtained, we identify those that contain at
least δ hashed l-mers and consider them enriched. With δ = 3, only bucket CG is enriched.

AC 2 {ACGA, ACAT}
AT 1 {ATGA}
CA 1 {CAGG}
CG 3 {CGGT, CGAT, CGAC} (enriched)
GA 2 {GACA, GATG}
GG 1 {GGTC}
GT 1 {GTCA}
TC 2 {TCGA, TCAG}
TG 1 {TGAA}
TT 1 {TTCG}
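
The projection example can be reproduced with a few lines of Python. This sketch groups l-mers with a dictionary instead of the sort-and-scan used in the slides, so it is a swapped-in technique that yields the same buckets. The function name is hypothetical.

```python
from collections import defaultdict

def enriched_buckets(sequences, l, positions, delta):
    """Hash every l-mer by its projection; keep buckets with >= delta l-mers."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            lmer = seq[j:j + l]
            key = "".join(lmer[p] for p in positions)   # h_I(S_{i,j})
            buckets[key].append((i, j))
    return {key: members for key, members in buckets.items()
            if len(members) >= delta}

sequences = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
enriched = enriched_buckets(sequences, l=4, positions=[0, 1], delta=3)
print(enriched)  # {'CG': [(0, 0), (1, 2), (2, 1)]}
```

The bucket stores (i, j) index pairs rather than the l-mers themselves, in the same spirit as CUDA-FMURP v1 recording thread IDs instead of strings.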


Extra Slides Expectation Maximization (EM)

Expectation Maximization (EM)

INPUT: Motif model θ0 from one enriched bucket, maximum number of
iterations y, and threshold for convergence δEM
OUTPUT: Motif model θy
1 For j in {1, . . . , y} or until convergence
    1 (E-step) For all l-mers in each sequence Si ,
      compute E(Si,ai |θj ) given the current motif model.
    2 (M-step) For all Si in S,
      get the starting positions s such that for each ai ∈ s,
      E(Si,ai |θj ) is maximum over all ai in {0, . . . , (n − l)}.
    3 (Test for Convergence) Compute L(θj ). Compare the previous
      likelihood L(θj−1 ) to the current L(θj ).
      If the difference satisfies the threshold δEM , stop the iteration.
    4 (Update step) For the alignment made by the starting position vector s
      identified in the M-step,
      get the motif model θj+1 .



EM: Example
From an enriched bucket obtained from Projection, EM performs the following
operations.
From EB , get the alignment made by the hashed l-mers.
C G G T
C G A C
C G A T
From the alignment, a profile matrix is computed.
A: 0 0 2 0
C: 3 0 0 1
G: 0 3 1 0
T: 0 0 0 2



EM: Example

Normalize the profile matrix obtained.
A: 0.00 0.00 0.66 0.00
C: 1.00 0.00 0.00 0.33
G: 0.00 1.00 0.33 0.00
T: 0.00 0.00 0.00 0.66
To avoid zero values for Pr(Si,j |θ), [Tompa,2001] performed a Laplace
correction. For each row corresponding to a symbol a, the
probability pa that symbol a appears in the sequence is added to the
row. Since all symbols in ΣDNA have a uniform frequency
distribution, 0.25 is added to each row.
A: 0.25 0.25 0.91 0.25
C: 1.25 0.25 0.25 0.58
G: 0.25 1.25 0.58 0.25
T: 0.25 0.25 0.25 0.91



EM: Example

Normalize the matrix obtained and let the resulting matrix be the initial
motif model θ0 .
A: 0.125 0.125 0.455 0.125
C: 0.625 0.125 0.125 0.290
G: 0.125 0.625 0.290 0.125
T: 0.125 0.125 0.125 0.455
For each Si in S, get the position j ∈ {0, . . . , (n − l)} for which E(Si,j |θ0 ) is
maximum. For instance, let us identify the l-mer in sequence S0 with
maximum expectation E(S0,j |θ0 ).

E(S0,0 |θ0 ) = E(CGGT|θ0 ) = ((0.625)(0.625)(0.290)(0.455))/(0.25^4) = 13.195
E(S0,1 |θ0 ) = E(GGTC|θ0 ) = ((0.125)(0.625)(0.125)(0.290))/(0.25^4) = 0.725
E(S0,2 |θ0 ) = E(GTCA|θ0 ) = ((0.125)(0.125)(0.125)(0.125))/(0.25^4) = 0.063
E(S0,3 |θ0 ) = E(TCAG|θ0 ) = ((0.125)(0.125)(0.455)(0.125))/(0.25^4) = 0.228
E(S0,4 |θ0 ) = E(CAGG|θ0 ) = ((0.625)(0.125)(0.290)(0.125))/(0.25^4) = 0.725

Among all S0,j s in S0 , the l-mer S0,0 obtains the highest expectation.



EM: Example

The set of l-mers with the highest expectation in each sequence
defines another alignment, as in Step 1. From this set of l-mers, we can
obtain the next motif model θ1 .
S0,0 : C G G T : 13.20
S1,2 : C G A C : 13.20
S2,1 : C G A T : 20.70
We compute the likelihood of a motif model θy using the best
expectations.

L(θ) = 13.20 + 13.20 + 20.70 = 47.10

Update the motif model θ0 to get θ1 , using the set of l-mers from each
sequence that maximize the expectation.
Stop iteration if L(θy ) − L(θy−1 ) ≤ δEM .
The output of EM in this example is the consensus string CGAT.
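
The first EM iteration of this example can be sketched in Python as below. The function names are illustrative, the uniform background probability 0.25 follows the example, and the sketch keeps exact thirds instead of the two-decimal values used on the slides, so the printed expectations differ slightly in the last digits.

```python
ALPHABET = "ACGT"

def initial_model(lmers, background=0.25):
    """Column-wise frequencies with the background added, then renormalized."""
    model = []
    for column in zip(*lmers):
        raw = {a: column.count(a) / len(lmers) + background for a in ALPHABET}
        total = sum(raw.values())
        model.append({a: value / total for a, value in raw.items()})
    return model

def expectation(lmer, model, background=0.25):
    """Likelihood ratio of an l-mer under the motif model vs. the background."""
    ratio = 1.0
    for column, symbol in enumerate(lmer):
        ratio *= model[column][symbol] / background
    return ratio

def best_positions(sequences, l, model):
    """For each sequence, the start position with the highest expectation."""
    return [max(range(len(seq) - l + 1),
                key=lambda j: expectation(seq[j:j + l], model))
            for seq in sequences]

sequences = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
theta0 = initial_model(["CGGT", "CGAC", "CGAT"])
s = best_positions(sequences, 4, theta0)
print(s)  # [0, 2, 1], i.e. S0,0, S1,2, S2,1
```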


SP-STARσ

INPUT: Consensus string M from θy and expected mismatches d
OUTPUT: Refined consensus string M ∗
1 For j in {1, . . . , y} or until convergence
    1 Compute Sb , the set containing, for each sequence, the l-mer that
      has the least edit distance from M.

      Sb = {Si,j | dE (M, Si,j ) is minimum over all Si,j in Si }

    2 Compute the score σ(Sb ), which is equal to the number of sequences in
      Sb such that
      dE (M, Si,j ) ≤ d
    3 Compute the consensus string M′ from the alignment made by Sb .
    4 Compute S′b from M′ .
    5 Compute σ(S′b ).
    6 If σ(S′b ) > σ(Sb ), continue the iteration using M = M′ ;
      otherwise M ∗ = M.



SP-STARσ: Example

Using M =CGAT and expected mismatches d = 1.
Compute for Sb . For S0 the S0,j is identiﬁed as follows.
dE (M, S0,0 ) = dE (CGAT, CGGT) = 1
dE (M, S0,1 ) = dE (CGAT, GGTC) = 3
dE (M, S0,2 ) = dE (CGAT, GTCA) = 4
dE (M, S0,3 ) = dE (CGAT, TCAG) = 3
dE (M, S0,4 ) = dE (CGAT, CAGG) = 3
The set Sb contains

Sb = {S0,0 , S1,2 , S2,1 }
Sb = {CGGT, CGAC, CGAT}



SP-STARσ: Example

Score for Sb is
σ(Sb ) = 3
because the least edit distances in the three sequences are 1, 1, and 0. That is, all 3
sequences satisfy
dE (M, Si,j ) ≤ 1
The consensus string from Sb is M′ = CGAT.
S′b from M′ is identical to Sb .

S′b = {S0,0 , S1,2 , S2,1 }

S′b = {CGGT, CGAC, CGAT}
Since σ(S′b ) = σ(Sb ),
M ∗ = M = CGAT.
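
The refinement loop can be sketched in Python as follows. The distance dE is implemented here as a position-by-position mismatch count, which reproduces the distances listed in the example. The function names and the `max_iter` bound are assumptions for illustration.

```python
def mismatches(a, b):
    """Position-by-position mismatch count between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def closest_lmers(sequences, motif):
    """S_b: for each sequence, the l-mer with the fewest mismatches to motif."""
    l = len(motif)
    return [min((seq[j:j + l] for j in range(len(seq) - l + 1)),
                key=lambda lmer: mismatches(motif, lmer))
            for seq in sequences]

def sigma(lmers, motif, d):
    """Number of selected l-mers within d mismatches of the motif."""
    return sum(mismatches(motif, lmer) <= d for lmer in lmers)

def consensus(lmers):
    """Most frequent symbol in each column of the alignment."""
    return "".join(max("ACGT", key=column.count) for column in zip(*lmers))

def sp_star(sequences, motif, d, max_iter=10):
    best = closest_lmers(sequences, motif)
    for _ in range(max_iter):
        candidate = consensus(best)
        candidate_best = closest_lmers(sequences, candidate)
        if sigma(candidate_best, candidate, d) > sigma(best, motif, d):
            motif, best = candidate, candidate_best   # keep refining
        else:
            break                                     # no improvement: stop
    return motif

sequences = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
print(closest_lmers(sequences, "CGAT"))  # ['CGGT', 'CGAC', 'CGAT']
print(sp_star(sequences, "CGAT", d=1))   # CGAT
```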

