Training modern deep neural networks requires big data and large amounts of computational power. Though HPCC Systems excels at handling both, it is currently limited to utilizing the CPU only. It has been shown that GPU acceleration vastly improves deep learning training time. In this talk, Robert explains how he built the first GPU-accelerated deep learning library for HPCC Systems, greatly expanding its deep neural network capabilities.
Expanding HPCC Systems Deep Neural Network Capabilities

1. 2019 HPCC Systems® Community Day – Challenge Yourself, Challenge the Status Quo
Robert Kennedy, PhD Candidate at Florida Atlantic University
Taghi M. Khoshgoftaar, PhD | Advisor
Timothy Humphrey | LexisNexis Mentor
2. Overview
• Both topics covered here are a result of my summer internship
• Work is available on GitHub
• Tool for creating “Standard” HPCC Systems Platform Virtual Machines
• Hyper-V, AWS, Azure, VirtualBox, etc…
• https://github.com/xwang2713/cloud-image-build
• In addition, used for creating NVIDIA GPU Enabled VMs (AWS AMI)
• Started a GPU Enabled Deep Learning Bundle
• Demonstrating GPU accelerated Deep Learning on HPCC Systems
• https://github.com/hpcc-systems/GPU-Deep-Learning
3. HPCC Systems on Hyper-V
• Used Packer.io to generate machine images
• To create a Hyper-V Image:
• https://github.com/xwang2713/cloud-image-build/tree/master/packer/hyper-v
• Hyper-V VMs can be used similarly to the VirtualBox VMs you might already be using
• Hyper-V images are built locally, on a Hyper-V enabled machine
• The list of installed programs can easily be modified in the .json config
• HPCC Systems Platform running on Hyper-V allows for Docker Desktop (Windows) use
• Docker Desktop uses Hyper-V, and Hyper-V and VirtualBox can’t run concurrently
4. Config File
• Packer.io uses a .json file as its config
• Defines the network (e.g., for VirtualBox)
• Defines the size of the machine (for cloud providers)
• The config defines which software to install via standard Linux commands (see the sketch below)
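For illustration, a skeletal Packer config in that spirit; all field values here are placeholders, not the repo's actual settings (see the cloud-image-build repo above for the real files):

    {
      "builders": [
        {
          "type": "hyperv-iso",
          "vm_name": "hpccsystems-vm",
          "iso_url": "http://example.com/ubuntu.iso",
          "iso_checksum": "sha256:<checksum>"
        }
      ],
      "provisioners": [
        {
          "type": "shell",
          "inline": [
            "sudo apt-get update",
            "sudo apt-get install -y docker.io"
          ]
        }
      ]
    }

Swapping the builder type (e.g., to an AWS or Azure builder) and editing the shell provisioner's package list is what changes the target platform and installed software.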
5. GPU Enabled Virtual Machines
• Using the same tool, GPU enabled VMs can be created
• Cloud images are built in the cloud; local images are built locally
• This work supports the use of Python 3.6, CUDA 10.0, TensorFlow 1.14, and PyTorch 1.1
• AWS GPU Instances:
• K80s, V100s
• Azure GPU Instances:
• K80s [12 GB VRAM]
• V100s [16 GB VRAM] (with and without NVLink)
• P100s [16 GB VRAM]
7. HPCC Systems and GPU Accelerated Deep Learning
• The current HPCC Systems Platform is CPU only, and so are its DL runtimes
• My previous work was with distributed DL on HPCC Systems using only CPUs
• Traditional HPCC Systems clusters use commodity computers connected via standard network protocols
• With respect to deep learning, this presents a large communication bottleneck, partly due to training’s iterative nature
• Graphics Processing Units (GPUs) are used to decrease the computation time for neural networks
• Single or multiple GPUs are connected to the CPU (central node) via much faster hardware connections
• A new bundle was started to enable GPU accelerated deep learning on the HPCC Systems Platform
8. GPU Accelerated Deep Learning
• With this bundle, you can train NN models on the GPU
• Sprayed data is used as training data
• The bundle is in its infancy, but you can build, train, and use neural networks
• Using only ECL
• Using ECL and Python, which allows for more customized NN architectures and training routines
• A trained model (either ECL-only or ECL+Python) can be used to predict on sprayed data
• It returns its predictions as records in a one-hot-encoded format
9. Bundle Implementation Overview
• Current work uses only one Thor node
• A single Thor node can still use multiple GPUs
• ECL/HPCC Systems handles the data storage and the execution of the NN runtimes
• The implementation uses data parallelism across one or more GPUs
• Currently limited to a single physical computer
• The pyembed plugin allows Python to run on the HPCC Systems Platform (see the sketch below)
• We use Python 3, as Python 2 is nearing EOL
• Python code handles the NN training and interfaces with GPUs directly using NVIDIA’s CUDA language
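As a minimal sketch of the pyembed mechanism (a toy function, not the bundle's code), ECL can call embedded Python like this:

    IMPORT Python3;

    // Toy embedded-Python function; the bundle uses this same
    // mechanism to hand training work to TensorFlow/Keras
    INTEGER addOne(INTEGER x) := EMBED(Python3)
        return x + 1
    ENDEMBED;

    OUTPUT(addOne(41)); // 42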
10. TensorFlow | Keras
• The Python code is in the form of TensorFlow
• TensorFlow
• Google’s popular deep learning library
• Keras
• Deep learning library API – uses TensorFlow or another ‘backend’
• Much less code to produce the same model
12. Biological Neuron
• Basis for artificial neural networks
• Such as the ones in deep learning
• Dendrites
• Input vector, from previous neurons
• Weights
• Soma
• Summation function
• Axon
• Activation function
• A neuron “fires” when there is enough input stimulus
13. Artificial Neuron
• First proposed in 1943 (McCulloch and Pitts)
• The inputs of the neuron are the outputs of the previous layer’s neurons
• The weighted inputs are summed together with a bias
• The sum is then passed into an activation function
• Activation functions are like the biological neuron ‘deciding’ to fire
• ReLU activation – outputs x if x > 0 and 0 otherwise, where x is the input (see the sketch below)
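A minimal NumPy sketch of one artificial neuron with a ReLU activation (names and values are illustrative, not from the bundle):

    import numpy as np

    def relu(x):
        # ReLU: pass x through if positive, otherwise output 0
        return np.maximum(0, x)

    def neuron(inputs, weights, bias):
        # Weighted sum of the inputs plus a bias, then the activation
        return relu(np.dot(inputs, weights) + bias)

    x = np.array([0.5, -1.2, 3.0])   # outputs of the previous layer
    w = np.array([0.4, 0.1, -0.6])   # input weights
    print(neuron(x, w, bias=0.2))    # weighted sum is -1.52, so ReLU outputs 0.0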
14. A Fully Connected Network
• Fully Connected Network
• Each neuron is connected to every neuron in the subsequent layer
• Neural network visualization
• 2 hidden layers, fully connected, 3-class classification output
• A Multi-Layer Perceptron is an example
15. Neural Network Training
• Forward propagation
• Backpropagation
• Optimize the model with respect to a loss function
• A quantification of how “right or wrong” the model is for any given datum
• Gradient descent
• Stochastic Gradient Descent (SGD)
• Mini-batch SGD (see the sketch below)
• Right: visualization of gradient descent over an example loss function
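To make the update rule concrete, here is a toy mini-batch SGD loop in NumPy fitting a two-parameter linear model (a sketch for intuition, unrelated to the bundle's optimizer code):

    import numpy as np

    np.random.seed(0)
    X = np.random.normal(size=1000)
    y = 3.0 * X + np.random.normal(scale=0.1, size=1000)

    w, b = 0.0, 0.0
    lr, batch_size = 0.1, 128

    for epoch in range(20):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            err = w * X[batch] + b - y[batch]   # prediction error on the mini-batch
            w -= lr * np.mean(err * X[batch])   # MSE gradient w.r.t. w
            b -= lr * np.mean(err)              # MSE gradient w.r.t. b

    print(w, b)  # w converges toward 3.0, b toward 0.0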
16. Where Exactly Do the GPUs Come Into Play?
• Training an NN model is the most time-consuming part; this is where the GPU is used to dramatically reduce computation time
• Two main training steps
• Forward pass – weights and errors
• Backward pass – gradients and weight updates
• Computationally expensive convolutions are offloaded onto GPUs (see the sketch below)
• These steps are done for each data point, multiple times
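For illustration, TensorFlow 1.x lets you pin an expensive op to a specific GPU explicitly; Keras performs this placement automatically when a GPU is visible:

    import tensorflow as tf

    # Place a large matrix multiply on the first GPU (TF 1.x graph mode)
    with tf.device('/GPU:0'):
        a = tf.random_normal([4096, 4096])
        b = tf.matmul(a, a)

    with tf.Session() as sess:
        sess.run(b)  # the multiply executes on the GPU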
17. Parallel Paradigms
• Data Parallelism
• Model Parallelism
• Synchronous and Asynchronous
• Parallel SGD
(Figure: data parallelism vs. model parallelism)
18. Model Parallelism
• The neural network model is split across nodes
• For models larger than a single GPU’s memory
• Requires significantly higher communication bandwidth between nodes
• Not well suited for a cluster system
• However, this paradigm is feasible for a multi-GPU system due to faster hardware connections
19. Data Parallelism
• Data is partitioned and distributed to nodes
• A single NN model is replicated onto each node
• Only weight updates are communicated and aggregated
• As defined by the specific parallel training method
• Suitable for parallelizing across multiple nodes in an HPCC Systems cluster or across GPUs in a single system
• This is the paradigm used here (see the sketch below)
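A conceptual NumPy sketch of synchronous data parallelism: each "node" computes a gradient on its own data shard, and a parameter server averages them (for intuition only, not the bundle's implementation):

    import numpy as np

    np.random.seed(1)
    X = np.random.normal(size=1024)
    y = 3.0 * X

    n_nodes = 4
    shards = np.array_split(np.arange(len(X)), n_nodes)  # partition the data

    w, lr = 0.0, 0.1
    for step in range(100):
        # Each shard's gradient would be computed on its own GPU
        grads = [np.mean((w * X[s] - y[s]) * X[s]) for s in shards]
        w -= lr * np.mean(grads)  # parameter server averages the updates

    print(w)  # converges toward 3.0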
20. Not Your Average HPCC Systems
• Slightly different than traditional HPCC Systems topologies
• The whole figure represents a single physical computer and Thor node
• Parameter Server
• This is the CPU on the system
• Nodes (blue)
• Each node represents a single physical GPU
• Connections are high-speed hardware
• PCI Express 3.0 provides roughly 985 MB/s per lane (about 15.8 GB/s across 16 lanes)
• NVLink is roughly 10x faster than PCIe Gen 3
22. Bundle Usage Example Architecture
• We will create a Convolutional Neural Network (CNN) and train it on the MNIST dataset
• MNIST is a 10-class image classification dataset of handwritten digits 0-9
• The CNN takes 784 pixels as input (each with range 0-255)
• Two convolutional layers
• One fully connected layer with 128 neurons
• 10 output neurons (one for each class)
• A total of 1,199,882 trainable parameters (see the Keras sketch below)
• Processing through 720,000 MNIST images
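For reference, a Keras definition matching the architecture above; we assume a 2x2 max-pooling layer between the convolutions and the dense layer (not listed on the slide, but it reproduces the stated 1,199,882 trainable parameters). The bundle builds the equivalent model from ECL:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        # Two convolutional layers over 28x28x1 grayscale inputs
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        # One fully connected layer with 128 neurons
        layers.Dense(128, activation='relu'),
        # 10 output neurons, one per digit class
        layers.Dense(10, activation='softmax'),
    ])
    model.summary()  # Total params: 1,199,882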
23. Spray MNIST Dataset
• MNIST is included in the bundle
• Train and test sets, fixed-length records of 785 values (784 pixels plus 1 label)
• Train: 60,000 28x28 grayscale images
• Test: 10,000 28x28 grayscale images
• Both are labeled as one of 10 classes, 0-9
24. Image Visualization
• Imported RAW MNIST data
• Visualization of a single MNIST image in the “data” format
• Each pixel has a value between 0-255, represented as a 2-digit hex number
• Each pixel is a feature
25. Preparing the Data
• Currently, the bundle demonstrates how to train on image data
• Includes an example NN and the example datasets (MNIST and Fashion-MNIST)
• Training data and labels are molded into a NumPy array with a specified shape before training
• Here, the shape is the dimensions of the image
• i.e., the dimensions of the input features
• These get flattened to an array of 784 inputs for 784 input neurons (see the sketch below)
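A sketch of that molding step in NumPy, assuming records arrive as 785 raw bytes each (label first, then 784 pixels; the field order is our assumption, not the bundle's documented layout):

    import numpy as np

    def mold(records):
        # Turn raw 785-byte MNIST records into training arrays.
        # Assumes each record is the label byte followed by 784 pixel bytes.
        raw = np.frombuffer(b''.join(records), dtype=np.uint8).reshape(-1, 785)
        labels = raw[:, 0]
        # Normalize pixels to [0, 1] and shape them as 28x28x1 images
        images = raw[:, 1:].astype(np.float32) / 255.0
        return images.reshape(-1, 28, 28, 1), labels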
26. Creating a CNN – model.add() method
• First, we define the optimizer and its parameters
• Next, we define the training scheme
• Batch size = 128
• We’ll train for 20 epochs
27. Creating a CNN – model.add() method
• Next, we define the NN architecture
• Input shape: 28x28x1 grayscale images
• Initialize the model
• The “nnOutputLayer” is the final layer and is, at this point, the entire NN model thus far
28. Train the CNN – model.train() method
• “nnOutputLayer” is passed into model.train() along with hyperparameters and training data (a Keras-level sketch of this step follows)
(Screenshots: training run output on GPU vs. CPU)
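In Keras terms, and continuing from the `model`, `images`, and `labels` of the earlier sketches, the step wrapped by model.train() corresponds roughly to compiling and fitting the model (the optimizer and loss here are illustrative assumptions; the bundle's actual ECL signature may differ):

    from tensorflow import keras

    # Optimizer choice is an assumption for illustration
    model.compile(optimizer=keras.optimizers.SGD(lr=0.01),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(images, labels, batch_size=128, epochs=20)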
29. Create CNN – ECL and Python
30. Example Input and Output
(Figure: an image input and its one-hot-encoded output; see the illustration below)
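For instance, a predicted class of 3 out of 10 comes back as a one-hot record like this (a NumPy illustration):

    import numpy as np

    label = 3
    one_hot = np.eye(10, dtype=int)[label]  # 1 at index 3, 0 elsewhere
    print(one_hot)  # [0 0 0 1 0 0 0 0 0 0]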
32. Performance Evaluation
• A case study was performed to measure the performance improvements
• 5 identical convolutional neural networks were trained on the MNIST dataset
• 10 times each to provide statistical significance
• Measuring the required training time for the same model on the same data using fixed training parameters
• Faster training time is desired
• CPU alone, and 1, 2, 3, and 4 GPUs
• Older K80s are used
• Newer GPUs would only increase performance and efficiency
• Compared against each other and against the “optimal” speedup
• i.e., linear speedup
33. Performance Boost: GPU vs. CPU
• Time, in seconds, to train a CNN on the MNIST dataset
• Training is 5.4x faster on a K80 GPU than on a Xeon CPU
• The speedup is large even for a simple model on small, simple data
• The measurement covers NN training time only, excluding HPCC-specific computations that are the same whether running on CPU or GPU
34. Performance Boost: CPU vs. GPU vs Optimal Speedup
• Optimal speedup is linear
• i.e., twice the nodes is twice as fast
• Speedup is not expected to be linear due to communication overheads
• Results show that additional GPUs have minimal cost
35. Conclusion
• A tool was used to create HPCC Systems virtual images on various new platforms
• A good use case is creating GPU enabled images
• Gave a brief overview of neural networks and their optimization
• Demonstrated that GPU accelerated deep learning is possible on the HPCC Systems Platform
• Demonstrated that GPUs provide a significant performance increase, even on a non-traditional cluster
36. Future Work
• Implementing generalizable data loaders
• To allow for training on data with less knowledge of NumPy (Python)
• Continue adding to the supported methods and ECL modeling functions
• Research and development on integrating model parallelism
• Research on NN training on multi-node clusters where each node can have one or more GPUs
37. Links
• GitHub
• https://github.com/hpcc-systems/GPU-Deep-Learning
• https://github.com/xwang2713/cloud-image-build
• NVIDIA CUDA
• https://developer.nvidia.com/cuda-toolkit
• TensorFlow
• https://www.tensorflow.org/
• Keras
• https://keras.io/
• NumPy
• https://numpy.org/
38. Questions?
Robert Kennedy
PhD Candidate, Florida Atlantic University
rkennedy@fau.edu
39. View this presentation on YouTube:
https://www.youtube.com/watch?v=GMt-_Io4Jys&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=8&t=0s (4:02)