MALT: Distributed Data Parallelism for
Existing ML Applications
Hao Li*, Asim Kadav, Erik Kruus, Cristian Ungureanu
* University of Maryland-College Park
NEC Laboratories, Princeton
Data, data everywhere…
 | Software | User-generated content | Hardware
Data generated by | Transactions, website visits, other metadata | facebook, twitter, reviews, emails | Camera feeds, Sensors
Applications (usually based on ML) | Ad-click/Fraud prediction, Recommendations | Sentiment analysis, Targeted advertising | Surveillance, Anomaly detection
2
Timely insights depend on updated models
• Surveillance/Safe Driving: usually trained in real time; expensive to train on HD videos
• Advertising (ad prediction, ad-bidding): usually trained hourly; expensive to train on millions of requests
• Knowledge banks (automated answering): usually trained daily; expensive to train on a large corpus
(Diagram: data, such as image and label pairs, is fed to Training to produce model parameters, which are then used to Test/Deploy, e.g. labeling an image as "labrador")
3
Model training challenges
• Large amounts of data to train
• Explosion in types, speed and scale of data
• Types : Image, time-series, structured, sparse
• Speed : Sensor, Feeds, Financial
• Scale : Amount of data generated growing exponentially
• Public datasets: Processed splice genomic dataset is 250 GB and
data subsampling is unhelpful
• Private datasets: Google, Baidu perform learning over TBs of data
• Model sizes can be huge
• Models with billions of parameters do not fit in a single machine
• E.g. : Image classification, Genome detection
Model accuracy generally improves by using
larger models with more data
4
Properties of ML training workloads
• Fine-grained and Incremental:
• Small, repeated updates to model vectors
• Asynchronous computation:
• E.g. Model-model communication, back-propagation
• Approximate output:
• Stochastic algorithms, exact/strong consistency may be overkill
• Need rich developer environment:
• Require rich set of libraries, tools, graphing abilities
5
MALT: Machine Learning Toolset
• Run existing ML software in data-parallel fashion
• Efficient shared memory over RDMA writes to
communicate model information
• Communication: Asynchronously push (scatter) model information,
gather locally arrived information
• Network graph: Specify which replicas to send updates to
• Representation: SPARSE/DENSE hints to store model vectors
• MALT integrates with existing C++ and Lua applications
• Demonstrate fault-tolerance and speedup with SVM, matrix
factorization and neural networks
• Re-uses existing developer tools
6
Outline
Introduction
Background
MALT Design
Evaluation
Conclusion
7
Distributed Machine Learning
• ML algorithms learn incrementally from data
• Start with an initial guess of model parameters
• Compute gradient over a loss function, and update the model
• Data Parallelism: Train over large data
• Data split over multiple machines
• Model replicas train over different parts of data and
communicate model information periodically
• Model parallelism: Train over large models
• Models split over multiple machines
• A single training iteration spans multiple machines
8
Stochastic Gradient Descent (SGD)
• SGD trains over one (or a few) training examples at a time
• Each data example processed is an iteration
• The update to the model is the gradient
• The number of examples processed per gradient update is the batch size
• One pass over the entire data is an epoch
• Acceptable performance over the test set after multiple epochs is convergence
Can train a wide range of ML methods: k-means, SVM, matrix factorization, neural networks, etc.
9
Data-parallel SGD: Mini-batching
• Machines train in parallel over a batch and exchange model
information
• Iterate over data examples faster (in parallel)
• May need more passes over data than single SGD (poor convergence)
10
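To make the batch-size idea concrete, here is a minimal serial mini-batch SGD sketch in C++. It is an illustration only: the toy_gradient linear-regression gradient, the learning rate, and the function names are assumptions for this sketch, not code from MALT or the slides. In MALT, model information would be exchanged after such a batch (the "communication batch size" used later in the evaluation).

#include <algorithm>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Toy gradient of squared loss for one example (x, y): g = (w.x - y) * x
static Vec toy_gradient(const Vec& w, const Vec& x, double y) {
  double pred = 0.0;
  for (std::size_t j = 0; j < w.size(); ++j) pred += w[j] * x[j];
  Vec g(w.size());
  for (std::size_t j = 0; j < w.size(); ++j) g[j] = (pred - y) * x[j];
  return g;
}

// Serial mini-batch SGD: accumulate gradients over batch_size examples,
// then apply one model update.
void minibatch_sgd(Vec& w, const std::vector<Vec>& xs,
                   const std::vector<double>& ys,
                   int max_epochs, std::size_t batch_size, double lr) {
  for (int epoch = 0; epoch < max_epochs; ++epoch) {
    Vec acc(w.size(), 0.0);
    std::size_t in_batch = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
      Vec g = toy_gradient(w, xs[i], ys[i]);
      for (std::size_t j = 0; j < w.size(); ++j) acc[j] += g[j];   // accumulate over the batch
      if (++in_batch == batch_size || i + 1 == xs.size()) {
        for (std::size_t j = 0; j < w.size(); ++j) w[j] -= lr * acc[j] / in_batch;  // one update per batch
        std::fill(acc.begin(), acc.end(), 0.0);
        in_batch = 0;
      }
    }
  }
}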
Approaches to data-parallel SGD
• Hadoop/Map-reduce: A variant of bulk-synchronous parallelism
• Synchronous averaging of model updates every epoch (during reduce)
(Diagram: data partitions go through a map phase producing model parameters 1-3, which a reduce phase merges into a single model parameter)
1) Infrequent communication produces low accuracy models
2) Synchronous training hurts performance due to stragglers
11
Parameter server
• Central server to merge updates every few iterations
• Workers send updates asynchronously and receive whole models from the server
• Central server merges incoming models and returns the latest model
• Example: Distbelief (NIPS 2012), Parameter Server (OSDI 2014), Project Adam (OSDI 2014)
(Diagram: workers train model parameters 1-3 on data partitions and exchange them with a central server holding the merged model parameter)
12
Peer-to-peer approach (MALT)
• Workers send updates to one another asynchronously
• Workers communicate every few iterations
• No separate master/slave code needed to port applications
• No central server/manager: simpler fault tolerance
(Diagram: four workers train model parameters 1-4 on their data partitions and exchange them directly with one another)
13
Outline
Introduction
Background
MALT Design
Evaluation
Conclusion
14
MALT framework
(Stack diagram, replicated across machines: existing ML applications run over MALT VOL (Vector Object Library), which runs over MALT dStorm (distributed one-sided remote memory), which runs over an InfiniBand communication substrate such as MPI or GASPI)
Model replicas train in parallel. Use shared memory to communicate. Distributed file system to load datasets in parallel.
16
dStorm: Distributed one-sided remote memory
• RDMA over InfiniBand allows high-throughput/low-latency networking
• RDMA over Converged Ethernet (RoCE) support for non-RDMA hardware
• Shared memory abstraction based on RDMA one-sided writes (no reads)
(Diagram: S1.create(size, ALL), S2.create(size, ALL), S3.create(size, ALL) create objects O1, O2, O3 on Machines 1-3; each machine holds a primary copy of its own object and receive queues for the other machines' objects)
Similar to partitioned global address space languages: local vs global memory
17
scatter() propagates using one-sided RDMA
• Updates propagate based on communication graph
(Diagram: S1.scatter(), S2.scatter(), S3.scatter() write each machine's primary copy of O1, O2, O3 into the receive queues on the other machines)
Remote CPU not involved: Writes over RDMA. Per-sender copies do not need to be immediately merged by receiver
18
gather() function merges locally
• Takes a user-defined function (UDF) as input such as average
(Diagram: S1.gather(AVG), S2.gather(AVG), S3.gather(AVG) merge the per-sender copies received on each machine into its local object)
Useful/General abstraction for data-parallel algos: Train and scatter() the model vector, gather() received updates
19
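As a rough, single-process illustration of the create/scatter/gather shape described on these slides (not the actual dStorm implementation, which writes over one-sided RDMA into registered receive queues), the following sketch mimics per-sender receive queues and a user-defined merge function; all names are invented for the sketch:

#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Mock of a dStorm-style segment: a primary copy plus per-sender receive queues.
struct Segment {
  Vec local;                       // primary copy owned by this "machine"
  std::vector<Vec> recv_queue;     // copies written by peers
  std::vector<Segment*> peers;     // communication graph (who we scatter to)

  explicit Segment(std::size_t size) : local(size, 0.0) {}

  // scatter(): push our primary copy into every peer's receive queue.
  void scatter() {
    for (Segment* p : peers) p->recv_queue.push_back(local);
  }

  // gather(udf): merge received copies into the local copy with a UDF.
  void gather(const std::function<Vec(const Vec&, const std::vector<Vec>&)>& udf) {
    if (!recv_queue.empty()) local = udf(local, recv_queue);
    recv_queue.clear();
  }
};

// AVG-style UDF: average the local copy with all received copies.
static Vec avg_udf(const Vec& local, const std::vector<Vec>& received) {
  Vec out = local;
  for (const Vec& r : received)
    for (std::size_t j = 0; j < out.size(); ++j) out[j] += r[j];
  for (double& v : out) v /= static_cast<double>(received.size() + 1);
  return out;
}

A replica would call s.scatter() after training on a batch and s.gather(avg_udf) before the next update. In the real system the receive queues live in RDMA-registered memory, so scatter() completes without involving the receiver's CPU, and gather() merges whatever has arrived when it is called.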
VOL: Vector Object Library
• Expose vectors/tensors instead of memory objects
• Provide representation optimizations
• sparse/dense parameters stored as arrays or key-value stores
• Inherits scatter()/gather() calls from dStorm
• Can use vectors/tensors in existing applications
(Diagram: a vector V1 layered over a dStorm memory object O1)
20
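The SPARSE/DENSE hint above is described only at the level of "arrays or key-value stores"; the snippet below is a generic illustration of that distinction, not MALT's actual types:

#include <unordered_map>
#include <vector>

// Dense model vectors as flat arrays, sparse ones as index -> value maps.
using DenseModel  = std::vector<double>;                  // DENSE hint
using SparseModel = std::unordered_map<long, double>;     // SPARSE hint

// Apply a sparse gradient to a dense model: touch only the non-zero entries.
static void apply_sparse_update(DenseModel& w, const SparseModel& grad, double lr) {
  for (const auto& kv : grad) w[kv.first] -= lr * kv.second;
}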
Propagating updates to everyone
(Diagram: six replicas, model 1 through model 6, each training on its own data partition and sending its updates to every other replica)
21
O(N²) communication rounds for N nodes
(Diagram: the same six replicas, each node sending its model updates to all other nodes)
22
Indirect propagation of model updates
(Diagram: six replicas, each sending its model updates to only a subset of the other replicas)
Use a uniform random sequence to determine where to send updates to ensure all updates propagate uniformly. Each node sends to fewer than N nodes (such as log N)
23
O(N log N) communication rounds for N nodes
(Diagram: six replicas, each sending updates to a subset of about log N peers)
• MALT proposes sending models to fewer nodes (log N instead of N)
• Requires the node graph to be connected
• Use any uniform random sequence
• Reduces processing/network times
• Network communication time reduces
• Time to update the model reduces
• Iteration speed increases but may need more epochs to converge
• Key idea: Balance communication with computation
• Send to fewer/more than log(N) nodes
Trade-off model information recency with savings in network and update processing time
25
Converting serial algorithms to parallel
Serial SGD
  Gradient g;
  Parameter w;
  for epoch = 1:maxEpochs do
    for i = 1:N do
      g = cal_gradient(data[i]);
      w = w + g;
Data-Parallel SGD
• scatter() performs one-sided RDMA writes to other machines.
• “ALL” signifies communication with all other machines.
• gather(AVG) applies average to the received gradients.
• Optional barrier() makes the training synchronous.
  maltGradient g(sparse, ALL);
  Parameter w;
  for epoch = 1:maxEpochs do
    for i = 1:N/ranks do
      g = cal_gradient(data[i]);
      g.scatter(ALL);
      g.gather(AVG);
      w = w + g;
26
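The loop bound 1:N/ranks above means each replica walks only its own shard of the training data. Below is a minimal sketch of that split; the function name and the half-open-range convention are assumptions for illustration, not MALT API:

#include <cstddef>
#include <utility>

// Half-open range [begin, end) of example indices owned by `rank`,
// splitting n_examples as evenly as possible across `ranks` replicas.
std::pair<std::size_t, std::size_t> shard(std::size_t n_examples,
                                          std::size_t ranks, std::size_t rank) {
  std::size_t base  = n_examples / ranks;
  std::size_t extra = n_examples % ranks;
  std::size_t begin = rank * base + (rank < extra ? rank : extra);
  std::size_t end   = begin + base + (rank < extra ? 1 : 0);
  return {begin, end};
}

Each rank then runs the data-parallel loop above over its [begin, end) range, calling scatter(ALL)/gather(AVG) each iteration (or once per communication batch), with an optional barrier() for synchronous training.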
Consistency guarantees with MALT
• Problem: Asynchronous scatter/gather may cause models to
diverge significantly
• Problem scenarios:
• Torn reads: Model information may get re-written while being read
• Stragglers: Slow machines send stale model updates
• Missed updates: Sender may overwrite its queue if receiver is slow
• Solution: All incoming model updates carry iteration count in their header
and trailer
• Read header-trailer-header to skip torn reads
• Slow down current process if the incoming updates are too stale to limit
stragglers and missed updates (Bounded Staleness [ATC 2014])
• Few inconsistencies are OK for model training (Hogwild [NIPS 2011])
• Use barrier to train in BSP fashion for stricter guarantees
27
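As a minimal sketch of the header/trailer check and staleness bound described above: the buffer layout and names are invented here, and real code would also need appropriate memory ordering for concurrent RDMA writes. The slides describe a header-trailer-header read order; the idea is the same, namely detecting an overlapping write by comparing the two iteration stamps.

#include <cstdint>
#include <vector>

struct UpdateBuffer {
  uint64_t header_iter;          // sender's iteration count, written first
  std::vector<double> payload;   // model update
  uint64_t trailer_iter;         // same iteration count, written last
};

// Copy the payload only if the buffer was not torn, i.e. header and trailer
// carry the same iteration count.
bool read_consistent(const UpdateBuffer& buf, std::vector<double>& out,
                     uint64_t& iter_out) {
  uint64_t h = buf.header_iter;               // read header
  out = buf.payload;                          // read payload
  if (h != buf.trailer_iter) return false;    // torn read: writer overlapped us
  iter_out = h;
  return true;
}

// Bounded staleness: skip (or stall on) updates that lag the local iteration
// count by more than `bound`.
bool too_stale(uint64_t local_iter, uint64_t update_iter, uint64_t bound) {
  return local_iter > update_iter + bound;
}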
MALT fault tolerance: Remove failed peers
(Diagram: application + VOL replicas over MALT dStorm (distributed one-sided remote memory) over a distributed file system such as NFS or HDFS, with a fault monitor attached to each replica)
• Each replica has a fault monitor
• Detects local failures (processor
exceptions such as divide-by-zero)
• Detects failed remote writes
(timeouts)
• When failure occurs
• Locally: Terminate local training,
other monitors see failed writes
• Remote: Communicate with other
monitors and create a new group
• Survivor nodes: Re-register queues,
re-assign data, and resume training
• Cannot detect byzantine failures
(such as corrupt gradients)
28
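For illustration only, here is a sketch of the timeout-based detection of failed remote writes that the fault monitor performs; the peer bookkeeping, the threshold, and the regrouping hook are invented for this sketch, and the real monitor also catches local processor exceptions such as divide-by-zero.

#include <chrono>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Peer {
  int id;
  Clock::time_point last_successful_write;
  bool alive = true;
};

// Mark peers whose writes have not succeeded within `timeout` as failed and
// report whether the surviving group needs to be re-formed.
bool detect_failures(std::vector<Peer>& peers, std::chrono::seconds timeout) {
  bool group_changed = false;
  auto now = Clock::now();
  for (Peer& p : peers) {
    if (p.alive && now - p.last_successful_write > timeout) {
      p.alive = false;           // failed remote write: treat peer as dead
      group_changed = true;
    }
  }
  return group_changed;          // survivors re-register queues, re-assign data, resume
}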
Outline
Introduction
Background
MALT Design
Evaluation
Conclusion
29
Integrating existing ML applications with MALT
• Support Vector Machines (SVM)
• Application: Various classification applications
• Existing application: Leon Bottou’s SVM SGD
• Datasets: RCV1, PASCAL suite, splice (700M - 250 GB size, 47K -
16.6M parameters)
• Matrix Factorization (Hogwild)
• Application: Movie recommendation (Netflix)
• Existing application: HogWild (NIPS 2011)
• Datasets: Netflix (1.6 G size, 14.9M parameters)
• Neural networks (NEC RAPID)
• Application: Ad-click prediction (KDD 2012)
• Existing application: NEC RAPID (Lua frontend, C++ backend)
• Datasets: KDD 2012 (3.1 G size, 12.8M parameters)
(Diagrams: labeled + and - examples separated by an SVM; matrix factorization M = U × V)
30
Cluster: eight Intel 2.2 GHz machines with 64 GB RAM, connected with a Mellanox 56 Gbps InfiniBand backplane
Speedup using SVM-SGD with RCV1 dataset
goal: Loss value as achieved by single rank SGD
cb size: Communication batch size (data examples processed before model communication)
31
Speedup using RAPID with KDD 2012 dataset
32
MALT provides speedup over single process SGD
Speedup with the Halton scheme
Indirect propagation of model improves performance
33
Goal: 0.01245
BSP: Bulk-Synchronous Processing
ASYNC: Asynchronous (6X)
ASYNC Halton: Asynchronous with the Halton network (11X)
Data transferred over the network for different designs
34
MALT-Halton provides network efficient learning
Conclusions
• MALT integrates with existing ML software to provide data-
parallel learning
• General purpose scatter()/gather() API to send model updates using
one-sided communication
• Mechanisms for network/representation optimizations
• Supports applications written in C++ and Lua
• MALT provides speedup and can process large datasets
• More results on speedup, network efficiency, consistency models, fault
tolerance, and developer efforts in the paper
• MALT uses RDMA support to reduce model propagation costs
• Additional primitives such as fetch_and_add() may further reduce model
processing costs in software
35
Thanks
Questions?
36
Extra slides
37
Other approaches to parallel SGD
• GPUs
• Orthogonal approach to MALT, MALT segments can be created over
GPUs
• Excellent for matrix operations, smaller sized models
• However, a single GPU memory is limited to 6-10 GB
• Communication costs dominate for larger models
• Small caches: Require regular memory access for good performance
• Techniques like sub-sampling/dropout perform poorly
• Hard to implement convolution (need matrix expansion for good performance)
• MPI/Other global address space approaches
• Offer a low level API (e.g. perform remote memory management)
• A system like MALT can be built over MPI
38
Evaluation setup
Application | Model | Dataset/Size | # training items | # testing items | # parameters
Document Classification | SVM | RCV1/480M | 781K | 23K | 47K
Image Classification | SVM | Alpha/1G | 250K | 250K | 500
DNA Classification | SVM | DNA/10G | 23M | 250K | 800
Webspam detection | SVM | webspam/10G | 250K | 100K | 16.6M
Genome classification | SVM | splice-site/250G | 10M | 111K | 11M
Collaborative filtering | Matrix factorization | netflix/1.6G | 100M | 2.8M | 14.9M
Click Prediction | Neural networks | KDD2012/3.1G | 150M | 100K | 12.8M
• Eight Intel 8-core, 2.2 GHz Ivy-Bridge machines with 64 GB RAM
• All machines connected via Mellanox 56 Gbps InfiniBand
39
Speedup using NMF with netflix dataset
40
(Diagram: six replicas, model 1 through model 6; node n sends its updates to the nodes given by the Halton sequence)
Halton sequence: n → n/2, n/4, 3n/4, 5n/8, 7n/8
41
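For illustration, a base-2 Halton (van der Corput) generator produces the fractions 1/2, 1/4, 3/4, 1/8, 5/8, ... which can be scaled by N and turned into roughly log2(N) destination nodes per sender. The exact offsets and ordering used by MALT's Halton network may differ from this sketch, and the function names are invented here.

#include <algorithm>
#include <cmath>
#include <set>
#include <vector>

// k-th element of the base-2 van der Corput sequence: 1/2, 1/4, 3/4, 1/8, 5/8, ...
static double van_der_corput(unsigned k) {
  double v = 0.0, denom = 1.0;
  for (unsigned n = k; n > 0; n >>= 1) {
    denom *= 2.0;
    v += (n & 1) / denom;
  }
  return v;
}

// Destination peers for node `self` among N nodes: offsets of N * h_k,
// deduplicated, until roughly log2(N) distinct peers are selected.
std::vector<int> halton_destinations(int self, int N) {
  std::set<int> dests;
  int want = std::max(1, static_cast<int>(std::ceil(std::log2(static_cast<double>(N)))));
  for (unsigned k = 1; static_cast<int>(dests.size()) < want && k <= static_cast<unsigned>(4 * N); ++k) {
    int peer = (self + static_cast<int>(van_der_corput(k) * N)) % N;
    if (peer != self) dests.insert(peer);     // skip sending to ourselves
  }
  return std::vector<int>(dests.begin(), dests.end());
}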
Speedup with different consistency models
Benefit with asynchronous training
42
• BSP: Bulk-synchronous
parallelism (training with
barriers)
• ASYNC: Fully asynchronous
training
• SSP: Stale synchronous parallelism (limits stragglers by stalling forerunners)
Developer efforts to make code data-parallel
Application | Dataset | LOC Modified | LOC Added
SVM | RCV1 | 105 | 107
Matrix Factorization | Netflix | 76 | 82
Neural Network | KDD 2012 | 82 | 130
On average, about 15% of lines were modified/added
44
Fault tolerance
45
MALT provides robust model training

Speaker notes
  1. This is the overall design of the MALT framework. At the lowest layer, there is the InfiniBand communication framework to provide an RDMA interface. This can also be Ethernet if RoCE is used.
  2. Halton is a uniform random sequence as shown here for N. Hence, if we mark the individual nodes in training cluster as 1, ..., N, Node 1 sends its updates to nodes generated using the Halton sequence which is N/2, N/4, 3N/4,… ENTER and so on limited to log (N) nodes.
  3. smallish workloads: general idea is good
  4. The GPU approach is orthogonal. GPUs are limited to smallish models + data (MALT DP is also limited to models that fit in a single machine). GPUs need simple memory access patterns for good performance.
  5. We run 7 applications over SVM, MF and NN over small and large datasets. The small datasets help us observe convergence behavior and measure speedup over single machines. The large datasets help us test scalability of different network efficiency and consistency protocols. We run multiple processes across these 8 machines, and we refer to each process as a rank (from HPC terminology). Each machine has an Intel Xeon 8-core, 2.2 GHz Ivy-Bridge processor with support for SSE 4.2/AVX instructions, and 64 GB DDR3 DRAM. Each machine is connected via a Mellanox Connect-V3 56 Gbps InfiniBand card and all machines are connected using a Mellanox managed switch with copper cables. All machines share storage using a 10TB NFS partition that we use for loading input data. For all our experiments, we randomize the input data and assign random subsets to each node. In this talk, I will only show a few results, and more results are in the paper.
  6. This is how the whole network looks like for 6 nodes with each node sending its update to only 2 nodes. Eventually after multiple rounds, all nodes send their updates to all other nodes. Hence, in this scheme, the total updates sent in every iteration is only log (n).
  7. This figure shows the speedup with MALT's different consistency methods: bulk-synchronous (BSP), asynchronous (ASP) and bounded staleness (SSP) models for the splice-site dataset over 8 machines. This training dataset consists of 10M examples (250 GB) and does not fit in a single machine. Each machine loads a portion of the training data (about 30 GB). SSP was proposed using the parameter server approach [21] and we implement it for MALT. Our implementation of bounded staleness (SSP) only merges the updates if they are within a specific range of iteration counts of its neighbor. Furthermore, if a specific rank lags too far behind, other ranks stall their training for a fixed time, waiting for the straggler to catch up. Our ASP implementation skips merging of updates from the stragglers to avoid incorporating stale updates. For the splice-site dataset, we find that SSP converges to the desired value first, followed by ASP and BSP.
  8. The traditional gradient descent algorithms compute the gradient of a loss function over the entire set of training examples. This gradient is used to compute the final model parameters to minimize the loss function [11]. PROBABLY DROP
  9. We evaluate the ease of implementing parallel learning in MALT by adding support to the four applications listed in Table 3. For each application we show the amount of code we modified as well as the number of new lines added. In Section 4, we described the specific changes required. The new code adds support for creating MALT objects, to scatter-gather the model updates. In comparison, implementing a whole new algorithm takes hundreds of lines new code assuming underlying data parsing and arithmetic libraries are provided by the processing framework. On average, we moved 87 lines of code and added 106 lines, representing about 15% of overall code.