Thanks for coming to our meetup today. My colleague Darren and I will present training deep neural network models on multiple GPU instances using Apache MXNet with Horovod.
First, I will give an overview of distributed model training. Next, I will briefly introduce MXNet, a deep learning library, and Horovod, a framework for distributed training. After that, I will describe how we support running MXNet on Horovod and show you some performance results we achieved. Finally, we will give you a short demo of running MXNet with Horovod on multiple hosts.
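To make that concrete before we dive in, here is a minimal sketch of what a Gluon training step looks like under Horovod. This is illustrative only; the model, data, and hyperparameters are placeholders rather than our actual demo code.

```python
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd, gluon

hvd.init()                                   # initialize the Horovod context
ctx = mx.gpu(hvd.local_rank())               # pin each process to one GPU

# Placeholder model; in_units is fixed so parameters allocate immediately.
net = gluon.nn.Dense(10, in_units=20)
net.initialize(ctx=ctx)

# Common convention: scale the learning rate by the number of workers.
opt = mx.optimizer.SGD(learning_rate=0.01 * hvd.size())
trainer = hvd.DistributedTrainer(net.collect_params(), opt)

# Broadcast initial parameters from rank 0 so every worker starts identically.
hvd.broadcast_parameters(net.collect_params(), root_rank=0)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(32, 20), ctx=ctx)  # dummy batch
label = mx.nd.zeros((32,), ctx=ctx)                   # dummy labels

with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(32)  # allreduce gradients across workers, then update
```

We will come back to a full version of this in the demo at the end.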
This is the typical flow of today's model training, especially for deep neural networks.
As DNNs have become popular models for machine learning applications, model training has become a challenging task.
There are two trends in today's model training tasks. First, GPUs have become the dominant hardware architecture for training due to their massively parallel computing capability for matrix operations. Second, more training jobs are running on multiple nodes than on a single node.
Ring-allreduce utilizes the network optimally when the tensors are large enough, but it becomes much less efficient when they are very small. Horovod addresses this with tensor fusion: small tensors are batched into a fusion buffer before the allreduce, which can yield up to a 65% improvement.
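If you want to experiment with fusion yourself, the buffer is tunable through Horovod's environment variables; a minimal sketch, assuming HOROVOD_FUSION_THRESHOLD (buffer size in bytes) and HOROVOD_CYCLE_TIME (fusion cycle interval in milliseconds), with illustrative values rather than recommendations:

```python
import os

# Tensor fusion settings must be in the environment before hvd.init(),
# which is when Horovod's background thread reads them.
os.environ["HOROVOD_FUSION_THRESHOLD"] = str(64 * 1024 * 1024)  # 64 MB buffer
os.environ["HOROVOD_CYCLE_TIME"] = "5"                          # 5 ms cycle

import horovod.mxnet as hvd
hvd.init()
```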
Hierarchical allreduce can further boost performance by 10% to 30%.
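Hierarchical allreduce is opt-in; a minimal sketch, assuming the HOROVOD_HIERARCHICAL_ALLREDUCE environment variable that Horovod exposes for this:

```python
import os

# Opt into hierarchical allreduce: reduce within each host first,
# then allreduce across hosts, then broadcast back within each host.
os.environ["HOROVOD_HIERARCHICAL_ALLREDUCE"] = "1"

import horovod.mxnet as hvd
hvd.init()
```

When launching with Open MPI, the same variable can instead be forwarded to every process with mpirun's -x flag.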