Thanks for coming to our meetup today. My colleague Darren and I will present training deep neural network models on multiple GPU instances using Apache MXNet with Horovod.
First, I will give an overview of distributed model training. Next, I will briefly introduce MXNet, a deep learning library, and Horovod, a framework for distributed training. After that, I will describe how we support running MXNet on Horovod and show you some performance results we achieved. Finally, we will give you a short demo of running MXNet with Horovod on multiple hosts.
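To make that concrete before we dive in, here is a minimal sketch of what a Gluon training step looks like under Horovod. This is illustrative only; the model, data, and hyperparameters are placeholders rather than our actual demo code.

```python
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd, gluon

hvd.init()                                   # initialize the Horovod context
ctx = mx.gpu(hvd.local_rank())               # pin each process to one GPU

# Placeholder model; in_units is fixed so parameters allocate immediately.
net = gluon.nn.Dense(10, in_units=20)
net.initialize(ctx=ctx)

# Common convention: scale the learning rate by the number of workers.
opt = mx.optimizer.SGD(learning_rate=0.01 * hvd.size())
trainer = hvd.DistributedTrainer(net.collect_params(), opt)

# Broadcast initial parameters from rank 0 so every worker starts identically.
hvd.broadcast_parameters(net.collect_params(), root_rank=0)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(32, 20), ctx=ctx)  # dummy batch
label = mx.nd.zeros((32,), ctx=ctx)                   # dummy labels

with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(32)  # allreduce gradients across workers, then update
```

We will come back to a full version of this in the demo at the end.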
This is the typical flow of today's model training, especially for deep neural networks.
As DNNs have become popular models for machine learning applications, model training has become a challenging task.
There are two trends in today's model training tasks. First, GPUs have become the dominant hardware architecture for training due to their massively parallel computing capability for matrix operations. Second, more training jobs are running on multiple nodes than on a single node.
Ring-allreduce utilizes the network optimally when the tensors are large enough, but it becomes much less efficient when they are very small. Horovod addresses this with tensor fusion: small tensors are batched into a fusion buffer before the allreduce, which can yield up to a 65% improvement.
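If you want to experiment with fusion yourself, the buffer is tunable through Horovod's environment variables; a minimal sketch, assuming HOROVOD_FUSION_THRESHOLD (buffer size in bytes) and HOROVOD_CYCLE_TIME (fusion cycle interval in milliseconds), with illustrative values rather than recommendations:

```python
import os

# Tensor fusion settings must be in the environment before hvd.init(),
# which is when Horovod's background thread reads them.
os.environ["HOROVOD_FUSION_THRESHOLD"] = str(64 * 1024 * 1024)  # 64 MB buffer
os.environ["HOROVOD_CYCLE_TIME"] = "5"                          # 5 ms cycle

import horovod.mxnet as hvd
hvd.init()
```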
Hierarchical allreduce can further boost performance by 10% to 30%.
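Hierarchical allreduce is opt-in; a minimal sketch, assuming the HOROVOD_HIERARCHICAL_ALLREDUCE environment variable that Horovod exposes for this:

```python
import os

# Opt into hierarchical allreduce: reduce within each host first,
# then allreduce across hosts, then broadcast back within each host.
os.environ["HOROVOD_HIERARCHICAL_ALLREDUCE"] = "1"

import horovod.mxnet as hvd
hvd.init()
```

When launching with Open MPI, the same variable can instead be forwarded to every process with mpirun's -x flag.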