Accelerating Machine Learning Algorithms by integrating GPUs into MapReduce Clusters
1. ACCELERATING MACHINE LEARNING ALGORITHMS BY INTEGRATING
GPUS INTO MAPREDUCE CLUSTERS
Sergio Herrero-Lopez
Intelligent Engineering Systems Laboratory (IESL)
November 30, 2011
1 Accelerating ML algorithms by integrating GPUs in MR Clusters
2. INTRODUCTION
ABOUT ME:
Ph.D. (December 2011), Massachusetts Institute of Technology (USA)
M.Sc. (2007) and B.Sc. (2005) in Electrical Engineering, University of Navarra (Spain)
Microsoft Research (Redmond, WA, 2008), Tampere University of Technology (Finland, 2005), and IKUSI (Spain, 2003)
ABOUT PROF. WILLIAMS' RESEARCH GROUP (ENGINEERING SYSTEMS DIVISION):
High Performance Price Analytics for the Smart Grid (2008-2009)
Large-Scale Simulator for Global Data Infrastructure Optimization (2009-2011)
Music Event Detection from Tweets in New York (2010-2011)
Accelerating Machine Learning Algorithms by integrating GPUs into
MapReduce Clusters
3. AGENDA
o PROBLEM STATEMENT: Big Data & Need for scale and/or speed
o PROPOSITION: Modify MapReduce runtime to
o Satisfy the particular requirements of ML algorithms
o Integrate Massively Parallel Processors in the system
o PREVIOUS WORK: MapReduce for ML on Multicore / Single-GPU / Multi-GPU / GPU-Cluster / FPGA
o IMPLEMENTATION of new MR runtime using Port abstractions
o PERFORMANCE results running SVMs on the proposed system
o CONCLUSIONS: Contributions and Limitations. Lessons learned
o FUTURE WORK
4. MACHINE LEARNING PARALLELIZATION
Given a labeled sample { (x_i, y_i) }, i = 1…n, with x_i ∈ R^d and y_i ∈ Y = {1…k}:
  n - representative sample size      1. Does not fit in resources
  d - feature selection               2. Takes too long
  k - consolidated classes            3. Accuracy was sacrificed
Three levels of parallelization for Algorithm 1:
  L1 - Independent runs on Workers X and Y (Cluster)
  L2 - Summation form (MapReduce)
  L3 - Structural parallelism (MPPs)
Machine Learning algorithms decomposable into MR primitives:
  Naïve Bayes, K-means, Expectation Maximization, Neural Networks,
  Support Vector Machine Classification, Principal Component Analysis,
  Hidden Markov Models
5. MAPREDUCE PRIMITIVES & RUNTIME
Map:    M: [k1, v1] -> [k2, v2]
Reduce: R: [k2, { v2,i : k2,i = k2 }] -> v3
Pipeline: Input -> Split -> Map (Workers 1…M) -> Sort -> Reduce (Workers 1…N) -> Merge -> Output
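Concretely, the two primitives can be sketched in a few lines of Python (an illustrative toy runtime of my own, not the Hadoop implementation discussed in these slides):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, map_fn, reduce_fn):
    # Map: [k1, v1] -> [k2, v2] (possibly several pairs per record)
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Sort: group intermediate pairs by k2
    intermediate.sort(key=itemgetter(0))
    # Reduce: [k2, {v2_i : k2_i = k2}] -> v3
    return {k2: reduce_fn(k2, [v for _, v in group])
            for k2, group in groupby(intermediate, key=itemgetter(0))}

# Word count as the canonical illustration
records = [(1, "map reduce map"), (2, "reduce")]
result = map_reduce(records,
                    lambda k, line: [(w, 1) for w in line.split()],
                    lambda k, vals: sum(vals))
print(result)  # {'map': 2, 'reduce': 2}
```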
6. MAPREDUCE REPRESENTATION OF K-MEANS
Map:    M: [k_i^t, x_i] -> [k_i'^t, x_i]
        k_i'^t = { x_j : ||x_j − m_i^t|| ≤ ||x_j − m_i'^t|| ∀ i' = 1…k }
Reduce: R: [k'^t, { x_i : k_i^t = k'^t }] -> m_k'^{t+1}
        m_k'^{t+1} = (1 / |{ x_i : k_i^t = k'^t }|) Σ_{x ∈ { x_i : k_i^t = k'^t }} x
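The K-means decomposition above can be sketched as a toy single-machine map/reduce pass (plain Python for illustration; the function names are mine, not the runtime's):

```python
def kmeans_map(x, centroids):
    # Map step: assign point x to its nearest centroid, [k^t, x] -> [k'^t, x]
    dists = [sum((a - b) ** 2 for a, b in zip(x, m)) for m in centroids]
    return dists.index(min(dists)), x

def kmeans_reduce(points):
    # Reduce step: recompute centroid as the mean of its assigned points
    n = len(points)
    return [sum(c) / n for c in zip(*points)]

centroids = [[0.0], [10.0]]
data = [[1.0], [2.0], [9.0], [11.0]]
buckets = {}
for x in data:
    k, x = kmeans_map(x, centroids)
    buckets.setdefault(k, []).append(x)
new_centroids = [kmeans_reduce(buckets[k]) for k in sorted(buckets)]
print(new_centroids)  # [[1.5], [10.0]]
```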
7. MAPREDUCE REPRESENTATION OF EM FOR MIXTURE OF GAUSSIANS
Map:      M: [(i,k), x_i] -> [(i,k), p_i,k]
          p_i,k = α_k^t f(x_i | μ_k^t, Σ_k^t) / Σ_{k=1}^{K} α_k^t f(x_i | μ_k^t, Σ_k^t)
Reduce 1: R: [k, { p_i,k' : k' = k }] -> α_k^{t+1}
          α_k^{t+1} = (Σ_{i=1}^{n} p_i,k) / n
Reduce 2: R: [k, { x_i, p_i,k' : k' = k }] -> μ_k^{t+1}
          μ_k^{t+1} = (Σ_{i=1}^{n} x_i p_i,k) / (n α_k^{t+1})
Reduce 3: R: [k, { x_i, p_i,k' : k' = k }] -> Σ_k^{t+1}
          Σ_k^{t+1} = (Σ_{i=1}^{n} p_i,k (x_i − μ_k^{t+1})(x_i − μ_k^{t+1})^T) / (n α_k^{t+1})
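A minimal one-dimensional sketch of the same decomposition (illustrative Python; the real runtime handles full covariance matrices and distributes the reduces by component key):

```python
import math

def gauss(x, mu, var):
    # 1-D Gaussian density f(x | mu, sigma^2)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step_map(x, alphas, mus, vars_):
    # Map: [(i,k), x_i] -> [(i,k), p_ik], responsibilities for one sample
    w = [a * gauss(x, m, v) for a, m, v in zip(alphas, mus, vars_)]
    s = sum(w)
    return [wk / s for wk in w]

def m_step_reduce(data, resp, k):
    # Reduce keyed by component k: alpha_k^{t+1} and mu_k^{t+1}
    n = len(data)
    pk = [resp[i][k] for i in range(n)]
    alpha = sum(pk) / n
    mu = sum(x * p for x, p in zip(data, pk)) / (n * alpha)
    return alpha, mu

data = [0.0, 0.2, 3.8, 4.0]
alphas, mus, vars_ = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
resp = [e_step_map(x, alphas, mus, vars_) for x in data]
print([round(m_step_reduce(data, resp, k)[0], 3) for k in range(2)])  # [0.5, 0.5]
```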
8. MAPREDUCE REPRESENTATION OF SVM (SMO)
Map 1:  M: [i, f_i] -> [i, f_i']
        f_i' = f_i + Δα_Iup y_Iup k(x_Iup, x_i) + Δα_Ilow y_Ilow k(x_Ilow, x_i)
Map 2:  M: [i, α_i] -> [i, k_i]
        I_0 = { i : y_i ∈ {1, −1}, 0 < α_i < C }
        I_1 = { i : y_i = 1, α_i = 0 } ∪ { i : y_i = −1, α_i = C }
        I_2 = { i : y_i = 1, α_i = C } ∪ { i : y_i = −1, α_i = 0 }
        k_up = { i ∈ I_0 ∪ I_1 },  k_low = { i ∈ I_0 ∪ I_2 },  k_i ∈ {k_up, k_low}
Reduce: R: [k, { f_i : k_i = k }] -> (b, I)
        b_up = min{ f_i : k_i = k_up },   I_up = argmin_{k_i = k_up} f_i
        b_low = max{ f_i : k_i = k_low }, I_low = argmax_{k_i = k_low} f_i
Map 3:  M: [i, α_i] -> [i, α_i']
        α'_Iup = α_Iup − y_Iup (f_Ilow − f_Iup) / (2 k(x_Ilow, x_Iup) − k(x_Ilow, x_Ilow) − k(x_Iup, x_Iup))
        α'_Ilow = α_Ilow + y_Ilow y_Iup (α_Iup − α'_Iup)
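The global reduce step, which recovers (b_up, I_up) and (b_low, I_low) from the per-sample (k_i, f_i') pairs, can be sketched as follows (toy Python with my own naming; set membership is encoded as flags per index):

```python
def smo_reduce(f, k_flag):
    # Reduce: from gradient values f_i and index-set flags k_i, recover
    # b_up = min f_i over k_up (with its argmin I_up) and
    # b_low = max f_i over k_low (with its argmax I_low)
    up = [(fi, i) for i, (fi, kf) in enumerate(zip(f, k_flag)) if 'up' in kf]
    low = [(fi, i) for i, (fi, kf) in enumerate(zip(f, k_flag)) if 'low' in kf]
    b_up, I_up = min(up)
    b_low, I_low = max(low)
    return (b_up, I_up), (b_low, I_low)

# Toy data: indices in I_0 belong to both k_up and k_low
f = [-1.2, 0.4, -0.3, 0.9]
k_flag = [('up',), ('up', 'low'), ('low',), ('up', 'low')]
print(smo_reduce(f, k_flag))  # ((-1.2, 0), (0.9, 3))
```

SMO iterates this map/reduce cycle until b_up and b_low meet within the stopping tolerance.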
9. MAPREDUCE FOR ML WISHLIST
Static vs Variable data
  Static: largest, fixed, used in every iteration, e.g. the training set (x_i, y_i)
  Variable: results of each iteration, consumed in the next iteration, e.g.
  m_k (K-means), (α_k^{t+1}, μ_k^{t+1}, Σ_k^{t+1}) (EM), (f_i, α_i) (SVM)
Iterate until convergence
  Avoid reloading static data between iterations (MEM instead of DFS)
  Utilize the memory hierarchy as opposed to DFS or LFS
Massively Threaded MapReduce Tasks
  Map is embarrassingly parallel
  Reduce is highly parallelizable (CPU -> MPP)
Dimensionality & Algebra
  Map Tasks may encapsulate high-dimensional matrix-vector or matrix-matrix
  operations, e.g. k(x_i, x_j) = e^{−β ||x_i − x_j||²}, i = 1…n, j ∈ {I_up, I_low}
  Interleave multithreaded BLAS operations using static data
  Sparse data structures
10. COMPUTING ECOSYSTEM
COMMODITY COMPUTING: Relational DB, BigTable, Cassandra, Dynamo, Column DB;
Hadoop, Dryad; 1/10 Gb Ethernet
HIGH PERFORMANCE / SUPERCOMPUTING: OpenMPI; Infiniband; GPU; FPGA
DATA APPLIANCE / WAREHOUSE COMPUTING: Hadoop; 20 Gb Infiniband; SSD; GPU
11. MAPREDUCE CLUSTER: ARCHITECTURE
1) Distributed File System (DFS)
   - Unstructured data
   - Scales to thousands of nodes
   - High reliability through replication
2) MapReduce Framework (MRF) Runtime
   - Batch processing system
   - Load balancing
(Diagram: a Client submits a File to the NameNode (DFS) and a Job to the
JobTracker (MRF); DataNodes 1-3 each store Blocks and run a TaskTracker
that executes Tasks.)
12. MAPREDUCE CLUSTER: LIMITATIONS
(Diagram: DataNodes 1 and 2, each with an MRF TaskTracker, DFS Blocks, and a
CPU running Map and Reduce Tasks against HD and DFS blocks.)
One (or two) tasks per node
One Task <-> One Data Block
One Core <-> One Thread
Synchronization by materialization of intermediate results
No support for iterative jobs
13. MASSIVELY PARALLEL PROCESSORS: NVIDIA TESLA ARCHITECTURE
(Diagram: Host and Device. The device contains Stream Multiprocessors 1…N,
each with Processors 1…M, Registers, Shared Memory, an Instruction Unit,
a Constant Cache, and a Texture Cache.)
Memory access latencies:
  Registers: 0 cycles
  Shared Memory: 1 cycle coalesced, ~10 cycles uncoalesced
  Constant / Texture Cache: ~10 cycles on a cache hit
  Constant / Texture / Device Memory: ~400 cycles, 102 GB/s
  Host Memory <-> Device Memory: PCI-E 16x (8 GB/s)
14. NVIDIA TESLA: REPRESENTATIONS
Logical Representation -> Physical Representation:
  Thread -> Processor
  Block -> MultiProcessor: maximum block dimensions (512, 512, 64),
  but at most 512 threads per block
  Grid -> Device: maximum grid dimensions (65535, 65535)
(Each MultiProcessor holds Registers, Shared Memory, Processors 1…M, and the
Constant and Texture Caches, as in the previous slide.)
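In practice the logical-to-physical mapping comes down to flattening a (block, thread) coordinate into a global element index; a one-line sketch (plain Python standing in for the CUDA built-in variables):

```python
def global_thread_id(block_idx, thread_idx, block_dim):
    # Flatten a (block, thread) coordinate into a global element index,
    # mirroring CUDA's blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks x 512 threads covers 2048 elements
ids = [global_thread_id(b, t, 512) for b in range(4) for t in range(512)]
print(ids[0], ids[-1])  # 0 2047
```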
15. PROPOSED RUNTIME: MR + GPU
(Diagram: per-node pipeline of the proposed runtime.)
DFS Blocks -> Split -> MRF Task Tracker
HState/HMem -> H->D Transfers -> DMem/DState
Pre-Map BLAS -> GPU Map -> Post-Map -> D->H Transfers
HState/HMem -> Cross-Node Sort -> H->D Transfers
Pre-Reduce BLAS -> Local GPU Reduce -> D->H Transfers -> Post-Reduce
HState/HMem -> Cross-Node Global Reduce -> DFS Blocks
State Snapshot to DFS every x iterations
17. PREVIOUS WORK
MAPREDUCE ON SINGLE GPU / SINGLE FPGA:
 • Mars (He et al., PACT 2008)
 • NVIDIA (Catanzaro et al., STMCS 2008)
 • Cell (de Kruijf and Sankaralingam, IBM Journal R&D 2009)
MAPREDUCE ON MULTICORE:
 • Phoenix (Ranger et al., HPCA 2007)
 • Phoenix 2 (Yoo et al., IISWC 2009)
 • Phoenix++ (Talbot et al., MAPREDUCE 2011)
MAPREDUCE ON MULTI-GPU / GPU CLUSTERS:
 • CellMR (Rafique et al., IPDPS 2009)
 • GPMR (Stuart and Owens, IPDPS 2011)
MAPREDUCE FOR MACHINE LEARNING:
 • Mahout (Apache)
 • Multicore (Chu et al., NIPS 2006)
 • FPGA (Xu, NIPS 2009)
 • Twister (Ekanayake et al., MAPREDUCE 2010)
 • SystemML (Ghoting et al., ICDE 2011)
Recurring ideas: interleaved multithreaded BLAS; massively multithreaded MR
tasks; shared memory; fault-tolerance relaxation; intermediate data in-memory;
local/global reduction; long-running (iterative) tasks; static vs variable data.
18. PORT-BASED PROGRAMMING: ABSTRACTION
(Diagram: building blocks of the port abstraction.)
Message, Port, Queue
Arbiter: Single Item Receiver, Multiple Item Receiver, Join Receiver, Choice Receiver
Dispatcher -> Handler(s) -> Task
State, Teardown
Handlers: Concurrent / Exclusive
Scatter / Gather
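A minimal sketch of the port abstraction in Python (my own simplification: one queue, one single-item-receiver handler, and a dispatcher thread standing in for the arbiter):

```python
import queue
import threading

class Port:
    # Minimal port: posted messages are queued; a dispatcher thread
    # delivers each item to the registered handler (single-item receiver).
    def __init__(self, handler):
        self.q = queue.Queue()
        self.handler = handler
        threading.Thread(target=self._dispatch, daemon=True).start()

    def post(self, msg):
        self.q.put(msg)

    def _dispatch(self):
        while True:
            msg = self.q.get()
            if msg is None:          # teardown sentinel
                break
            self.handler(msg)

results = []
done = threading.Event()

def handler(msg):
    results.append(msg * 2)
    if len(results) == 3:
        done.set()

port = Port(handler)
for i in range(3):
    port.post(i)
done.wait(timeout=5)
print(results)  # [0, 2, 4]
```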
21. BINARY SVM
Binary Classification:
Given l samples (x_1, y_1), …, (x_l, y_l) with x_i ∈ R^n, y_i ∈ Y ∀i, and Y = {−1, 1},
a binary classifier predicts the label y ∈ Y of an unseen sample x ∈ R^n.
(Figure: maximum-margin separator f*.)
RBF kernel: k(x_i, x_j) = e^{−γ ||x_i − x_j||²}
22. PRIMAL & DUAL FORM OF THE SVM
Find the function f that solves the following regularization problem:

  min_f  C Σ_{i=1}^{l} (1 − y_i f(x_i))_+ + (1/2) ||f||²_H,   where (k)_+ = max(k, 0) and C > 0

Then slack variables ξ_i are introduced to classify non-separable data:

Primal form:
  min_f  C Σ_{i=1}^{l} ξ_i + (1/2) ||f||²_H
  subject to: y_i f(x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, l

Dual form:
  max_{α ∈ R^l}  Σ_{i=1}^{l} α_i − (1/2) α^T K α
  subject to: Σ_{i=1}^{l} y_i α_i = 0,  0 ≤ α_i ≤ C,  i = 1, …, l
  where K_ij = y_i y_j k(x_i, x_j) and k is the kernel function

Solving the dual: f(x) = Σ_{i=1}^{l} y_i α_i k(x, x_i) + b, where b is an unregularized bias term
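As a sanity check on the dual form, the objective Σ_i α_i − ½ α^T K α with K_ij = y_i y_j k(x_i, x_j) can be evaluated directly (illustrative Python of my own, using the RBF kernel from the previous slide):

```python
import math

def rbf(xi, xj, gamma):
    # RBF kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def dual_objective(alpha, X, y, gamma):
    # Dual objective: sum_i alpha_i - 1/2 * alpha^T K alpha,
    # with K_ij = y_i * y_j * k(x_i, x_j)
    l = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * rbf(X[i], X[j], gamma)
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad

X = [[0.0], [1.0]]
y = [1, -1]
alpha = [0.5, 0.5]
print(round(dual_objective(alpha, X, y, 1.0), 4))  # 0.842
```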
23. MULTICLASS CLASSIFICATION
Multiclass Classification:
Given l samples (x_1, y_1), …, (x_l, y_l) with x_i ∈ R^n, y_i ∈ Y ∀i, and Y = {1, …, M},
a multiclass classifier predicts the label y ∈ Y of an unseen sample x ∈ R^n.
Multiclass SVM: combination of N independent binary classification tasks. Binary
tasks are defined by an output code matrix R of size M×N with R_ij ∈ {−1, 0, 1}.
All vs All (AVA), N = M(M−1)/2:
  R = [  1  1  0 ]
      [ −1  0  1 ]
      [  0 −1 −1 ]
One vs All (OVA), N = M:
  R = [  1 −1 −1 ]
      [ −1  1 −1 ]
      [ −1 −1  1 ]
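Both code matrices are mechanical to construct; a short sketch (illustrative Python; for M = 3 it reproduces the matrices above):

```python
from itertools import combinations

def ova_code_matrix(M):
    # One vs All: M binary tasks; task j labels class j as +1, the rest -1
    return [[1 if i == j else -1 for j in range(M)] for i in range(M)]

def ava_code_matrix(M):
    # All vs All: M*(M-1)/2 tasks; task (a,b) uses +1 for class a,
    # -1 for class b, and 0 (ignored) for every other class
    pairs = list(combinations(range(M), 2))
    return [[1 if i == a else (-1 if i == b else 0) for a, b in pairs]
            for i in range(M)]

print(ova_code_matrix(3))  # [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
print(ava_code_matrix(3))  # [[1, 1, 0], [-1, 0, 1], [0, -1, -1]]
```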
24. BINARY SVM AS MAP REDUCE PRIMITIVES IN A SINGLE-GPU
(Diagram: one SMO iteration mapped onto a single GPU with Processors 1…P.)
MAP:           f_i -> f_i'
MAP:           α_i -> k_i
LOCAL REDUCE:  per-processor partials over (k_i, f_i')
GLOBAL REDUCE: (b_up, I_up), (b_low, I_low)
Pre-MAP:       kernel rows k(x_i, x_j) = e^{−β ||x_i − x_j||²}, i = 1…n, j ∈ {I_up, I_low}
MAP:           update α'_up, α'_low
Device State:  static (x_i, y_i); variable (f_i, α_i, k_i, b, I, K) with an LRU kernel cache
26. EXPERIMENTS AND HARDWARE
Host                                   Device
Ubuntu 8.10 64-bit                     4x Tesla C1060
Dual-socket Intel Xeon E5520           # Stream Processors: 240
Frequency of cores: 2.26 GHz           Frequency of processors: 1.3 GHz
145 GFlops                             933 GFlops
Memory: 32 GB DDR3                     Memory: 4 GB DDR3
Memory bandwidth: 25.6 GB/s            Memory bandwidth: 102 GB/s
Host <-> Device: PCIe x16 (8 GB/s)

LIBSVM:     single threaded; double precision; sparse
Hadoop:     4 VMs with one DataNode each; Pegasos SVM; double precision; sparse
Multicore:  8 worker threads in H-Dispatch; 1 block - 1 thread; double precision; dense
Single GPU: 1 worker thread; 1 GPU; single precision; dense-sparse
Multi GPU:  4 worker threads; 4 GPUs; single precision; dense-sparse
27. PERFORMANCE RESULTS: DATASETS
SVM experiment setup: same kernel type (RBF); same regularization parameter C;
same stopping criterion: 0.001; SMO-based (except the Hadoop version);
One vs All in multiclass problems; 1 GB kernel cache.

Dataset  | # Training Points | # Testing Points | (Features, Classes) | (C, β)
WEB      | 49749             | 14951            | (300, 2)            | (64, 7.8125)
MNIST    | 60000             | 10000            | (780, 10)           | (10, 0.125)
RCV1     | 518571            | 15564            | (47236, 53)         | (1, 0.1)
PROTEIN  | 17766             | 6621             | (357, 3)            | (10, 0.05)
SENSIT   | 78823             | 19705            | (100, 3)            | (1, 0.7)
28. PERFORMANCE RESULT COMPARISON
Dataset (Non-Zero %)        | LIBSVM   | Hadoop  | Multicore | Single GPU (Dense) | Multi GPU (Dense)
WEB (3%)       Time (s)     | 2364.2   | 1698.7  | 912.81    | 154.3              | 73.6
               Gain (x)     | 1.00     | 1.39    | 2.59      | 15.32              | 32.12
               Accuracy (%) | 82.69 across all configurations
MNIST (19%)    Time (s)     | 118943.5 | 66753.5 | 22873.75  | 2010.3             | 726.9
               Gain (x)     | 1.00     | 1.78    | 5.20      | 59.17              | 163.63
               Accuracy (%) | 95.76 across all configurations
RCV1 (0.1%)    Time (s)     | 710664   | 231486  | N/A       | N/A                | N/A
               Gain (x)     | 1.00     | 3.07    | N/A       | N/A                | N/A
               Accuracy (%) | 94.67 across all configurations
PROTEIN (29%)  Time (s)     | 861      | 717.5   | 260.12    | 32.93              | 16.06
               Gain (x)     | 1.00     | 1.20    | 3.31      | 26.15              | 53.61
               Accuracy (%) | 70.03 across all configurations
SENSIT (100%)  Time (s)     | 8162     | 4295.78 | 2005.4    | 134.67             | 58.29
               Gain (x)     | 1.00     | 1.90    | 4.07      | 60.61              | 140.02
               Accuracy (%) | 83.46 across all configurations
29. ELLPACK-R (Vazquez et al. IEEE CIT 2010)
Dataset (Non-Zero %)        | Single GPU (Sparse) | Multi GPU (Sparse)
WEB (3%)       Time (s)     | 107.35              | 57.3
               Gain (x)     | 22.02 (1.43)        | 41.26 (1.26)
               Accuracy (%) | 82.69               | 82.69
RCV1 (0.1%)    Time (s)     | N/A                 | 3686
               Gain (x)     | N/A                 | 192.80
               Accuracy (%) | 94.67               | 94.67
RCV1 training time: ~8.2 days -> ~1 hour
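ELLPACK-R stores each row's nonzeros left-justified, together with a per-row length array, so the sparse matrix-vector product stops at the real row length instead of multiplying zero padding. A scalar sketch of the format (illustrative Python; the Vazquez et al. version runs one GPU thread per row over column-major arrays):

```python
def ellpack_r_spmv(data, cols, rl, x):
    # ELLPACK-R sparse matrix-vector product: row i keeps its nonzeros
    # left-justified in data[i] / cols[i]; rl[i] is the true row length,
    # so the inner loop skips the padded entries entirely.
    y = []
    for i in range(len(rl)):
        acc = 0.0
        for j in range(rl[i]):  # only rl[i] iterations, not the padded width
            acc += data[i][j] * x[cols[i][j]]
        y.append(acc)
    return y

# 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] padded to width 2
data = [[4.0, 1.0], [2.0, 0.0], [3.0, 5.0]]
cols = [[0, 2], [1, 0], [0, 2]]
rl = [2, 1, 2]
print(ellpack_r_spmv(data, cols, rl, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```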
30. CONCLUSIONS
CONCLUSIONS:
Constructed an MR runtime that satisfies the requirements of many ML algorithms and integrates GPUs.
Iterative stateful jobs
Multithreaded BLAS to prepare Map or Reduce Tasks
Static/Variable data
Tested the runtime solving popular classification problems.
Delivered up to two orders of magnitude of acceleration using 4 GPUs
Compared different runtimes
LIMITATIONS:
H-Dispatch (Pull) dependent on H->D state transfers
Relaxation of Fault-tolerance must be acceptable
d>>n -> MapReduce will have little benefit
31. FUTURE WORK
FUTURE:
GPU Technology:
Concurrent Kernel Execution -> Maximize utilization
GPUDirect -> Facilitate Sort operation
Distributed Memory -> Intermediate Results
Shared memory space CPU-GPU
Communication
Cross-Node performance
GPU-Port-Abstraction
In-node: Cross-Thread pointer exchange
Out-node: MVAPICH2 and MVAPICH2-GPU
Algorithms
Requirements for incremental classification and clustering
32. CONCURRENT KERNEL EXECUTION
(Diagram: CPU Threads 1 and 2 post Tasks through a Port/Queue to the GPU.)
• CUDA Compute Capability 2.0 allows up to sixteen concurrent kernels.
• Concurrent kernels need to run in the same context.
33. INTEGRATING THE MPP IN THE MR CLUSTER ARCHITECTURE
(Diagram: the slide-15 pipeline, with Cross-Node exchanges moving DMem/DState
directly between GPUs.)
GPUDirect:
• GPU-to-GPU memory copy
• Communication with network devices
Minimal communication to HState; State Snapshot to DFS every x iterations.
34. PIPELINING/MEMCACHED
(Diagram: DataNodes 1 and 2 as in slide 12, but Map Tasks place intermediate
results in MEM on Memcached nodes, from which Reduce Tasks consume them,
instead of materializing them in DFS blocks.)
35. QUESTIONS
36. APPLICATION I: EVENT DETECTION USING TWEETS
Sakaki et al.: detect Tweet outbreaks about large-scale and infrequent events.
Natural disasters: earthquakes, floods. Accidents: fires, road accidents.
INFREQUENT EVENTS
37. APPLICATION I: EVENT DETECTION USING TWEETS
Goal: detect popular events at locations with a high volume of tweets.
Example tweets:
• "Listening to the New York Philarmonic, amazing performance"
• "Lots of people trying to enter the MSG for the Alice in Chains concert. I wish I had tickets."
• "Nassau County Museum of Art is looking for volunteers to greet, work in gift shop or perform clerical support."
38. APPLICATION I: FEATURE VECTOR
It/PRP is/VBZ a/DT good/JJ day/NN when/WRB the/DT CEO/NN
of/IN a/DT multinational/JJ ,/, multi-million/JJ
dollar/NN company/NN tells/VBZ you/PRP you/PRP 're/VBP
a/DT genius/NN ./.:/: D/NNP
Lots/NNS of/IN people/NNS trying/VBG to/TO enter/VB
the/DT MSG/NNP for/IN the/DT Alice/NNP in/IN
Chains/NNP concert/NN ./.I/PRP wish/VBP I/PRP
had/VBD tickets/NNS ./.
Feature Vectors:
h_i(x, y) = { 1 if (x, y) contains ___ ; 0 otherwise }
  - Has unigram with POS
  - Has bigram with POSs
  - Has trigram with POSs
  - X1 is subject of X2
  - …
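The indicator features h_i(x, y) can be sketched as a lookup against a feature vocabulary (illustrative Python; the vocabulary entries and token examples here are hypothetical, not the ~400 features of the experiment):

```python
def extract_features(tagged_tokens, vocabulary):
    # Binary indicator features h_i(x, y): 1 if the POS-tagged tweet
    # contains the given unigram-with-POS or bigram-with-POSs, else 0
    unigrams = set(tagged_tokens)
    bigrams = set(zip(tagged_tokens, tagged_tokens[1:]))
    return [1 if (f in unigrams or f in bigrams) else 0 for f in vocabulary]

tweet = [("concert", "NN"), ("at", "IN"), ("MSG", "NNP")]
vocab = [("concert", "NN"),                      # unigram with POS
         (("at", "IN"), ("MSG", "NNP")),         # bigram with POSs
         ("opera", "NN")]                        # absent feature
print(extract_features(tweet, vocab))  # [1, 1, 0]
```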
39. APPLICATION I: EXPERIMENT
Used the NYC.com event calendar (Oct 9-11, 2009). Extracted ~400 features.
Title: Alice in Chains
Location: Madison Square Garden, 2 Penn Plaza, New York, NY, 10001
Description: Alice in Chains has sold more than twenty million albums in the
United States (and an estimated 40 million worldwide), released two number-one
albums and 19 top-40 singles, and has received six Grammy nominations…
EXPERIMENT 1:
• 2000 Tweets from the same weekend (160 (8%) "Concert", 1840 (92%) "Background")
• RBF Kernel (C=10, gamma=1.0). Testing 20% -> Accuracy of 97%
• "False positives"
EXPERIMENT 2:
• 2000 Tweets from the next weekend (160 (8%) "Concert", 1840 (92%) "Background")
• RBF Kernel (C=10, gamma=1.0). Testing 100% -> Accuracy of 93%
• "False positives" + "False negatives"
• After using NYC.com again -> Accuracy of 96%
40. APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD
30 x 96 = 2880 Values
41. APPLICATION II: PRICE CALCULATIONS FOR EACH HOUSEHOLD