Training modern deep neural networks requires big data and large amounts of computational power. Though HPCC Systems excels at handling both, it is currently limited to utilizing the CPU only. It has been shown that GPU acceleration vastly improves deep learning training time. In this talk, Robert explains how he built the first GPU-accelerated deep learning library for HPCC Systems, greatly expanding its deep neural network capabilities.
Expanding HPCC Systems Deep Neural Network Capabilities

1. 2019 HPCC Systems® Community Day – Challenge Yourself, Challenge the Status Quo
Robert Kennedy, PhD Candidate at Florida Atlantic University
Taghi M. Khoshgoftaar, PhD | Advisor
Timothy Humphrey | LexisNexis Mentor
2. Overview
• Both topics covered here are a result of my summer internship
• Work is available on GitHub
• Tool for creating “Standard” HPCC Systems Platform Virtual Machines
• Hyper-V, AWS, Azure, VirtualBox, etc…
• https://github.com/xwang2713/cloud-image-build
• In addition, used for creating NVIDIA GPU Enabled VMs (AWS AMI)
• Started a GPU Enabled Deep Learning Bundle
• Demonstrating GPU accelerated Deep Learning on HPCC Systems
• https://github.com/hpcc-systems/GPU-Deep-Learning
3. HPCC Systems on Hyper-V
• Used Packer.io to generate machine images
• To create a Hyper-V Image:
• https://github.com/xwang2713/cloud-image-build/tree/master/packer/hyper-v
• Hyper-V VMs can be used similarly to the VirtualBox VMs you might already be using
• Hyper-V images are built locally, on a Hyper-V enabled machine
• The list of installed programs can easily be modified in the .json config
• HPCC Systems Platform running on Hyper-V allows for Docker Desktop (Windows) use
• Docker Desktop uses Hyper-V, and Hyper-V and VirtualBox can’t run concurrently
4. Config File
• Packer.io uses a .json file as its config
• Defines the network (e.g., for VirtualBox)
• Defines the size of the machine (for cloud providers)
• The config defines which software to install via standard Linux commands (see the sketch below)
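For illustration, a skeletal Packer config in that spirit; all field values here are placeholders, not the repo's actual settings (see the cloud-image-build repo above for the real files):

    {
      "builders": [
        {
          "type": "hyperv-iso",
          "vm_name": "hpccsystems-vm",
          "iso_url": "http://example.com/ubuntu.iso",
          "iso_checksum": "sha256:<checksum>"
        }
      ],
      "provisioners": [
        {
          "type": "shell",
          "inline": [
            "sudo apt-get update",
            "sudo apt-get install -y docker.io"
          ]
        }
      ]
    }

Swapping the builder type (e.g., to an AWS or Azure builder) and editing the shell provisioner's package list is what changes the target platform and installed software.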
5. GPU Enabled Virtual Machines
• Using the same tool, GPU enabled VMs can be created
• Cloud images are built in the cloud; local images are built locally
• This work supports the use of Python 3.6, CUDA 10.0, TensorFlow 1.14, and PyTorch 1.1
• AWS GPU Instances:
• K80s, V100s
• Azure GPU Instances:
• K80s [12 GB VRAM]
• V100s [16 GB VRAM] (with and without NVLink)
• P100s [16 GB VRAM]
7. HPCC Systems and GPU Accelerated Deep Learning
• The current HPCC Systems Platform is CPU only, and so are its DL runtimes
• My previous work was with distributed DL on HPCC Systems using only CPUs
• Traditional HPCC Systems clusters use commodity computers connected via standard network protocols
• With respect to deep learning, this presents a large communication bottleneck, partly due to training’s iterative nature
• Graphics Processing Units (GPUs) are used to decrease the computation time for neural networks
• Single or multiple GPUs are connected to the CPU (central node) via much faster hardware connections
• A new bundle was started to enable GPU accelerated deep learning on the HPCC Systems Platform
8. GPU Accelerated Deep Learning
• With this bundle, you can train NN models on the GPU
• Sprayed data is used as training data
• The bundle is in its infancy, but you can build, train, and use neural networks
• Using only ECL
• Using ECL and Python, which allows for more customized NN architectures and training routines
• A trained model (either ECL-only or ECL+Python) can be used to predict on sprayed data
• It returns its predictions as records in a one-hot-encoded format
9. Bundle Implementation Overview
• Current work uses only one Thor node
• A single Thor node can still use multiple GPUs
• ECL/HPCC Systems handles the data storage and the execution of the NN runtimes
• The implementation uses data parallelism across one or more GPUs
• Currently limited to a single physical computer
• The pyembed plugin allows Python to run on the HPCC Systems Platform (see the sketch below)
• We use Python 3, as Python 2 is nearing EOL
• Python code handles the NN training and interfaces with GPUs directly using NVIDIA’s CUDA language
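As a minimal sketch of the pyembed mechanism (a toy function, not the bundle's code), ECL can call embedded Python like this:

    IMPORT Python3;

    // Toy embedded-Python function; the bundle uses this same
    // mechanism to hand training work to TensorFlow/Keras
    INTEGER addOne(INTEGER x) := EMBED(Python3)
        return x + 1
    ENDEMBED;

    OUTPUT(addOne(41)); // 42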
10. TensorFlow | Keras
• The Python code is in the form of TensorFlow
• TensorFlow
• Google’s popular deep learning library
• Keras
• Deep learning library API – uses TensorFlow or another ‘backend’
• Much less code to produce the same model
12. Biological Neuron
• Basis for artificial neural networks
• Such as the ones in deep learning
• Dendrites
• Input vector, from previous neurons
• Weights
• Soma
• Summation function
• Axon
• Activation function
• A neuron “fires” when there is enough input stimulus
13. Artificial Neuron
• First proposed in 1943 (McCulloch and Pitts)
• The inputs of the neuron are the outputs of the previous layer’s neurons
• The weighted inputs are summed together with a bias
• The sum is then passed into an activation function
• Activation functions are like the biological neuron ‘deciding’ to fire
• ReLU activation – outputs x if x > 0 and 0 otherwise, where x is the input (see the sketch below)
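A minimal NumPy sketch of one artificial neuron with a ReLU activation (names and values are illustrative, not from the bundle):

    import numpy as np

    def relu(x):
        # ReLU: pass x through if positive, otherwise output 0
        return np.maximum(0, x)

    def neuron(inputs, weights, bias):
        # Weighted sum of the inputs plus a bias, then the activation
        return relu(np.dot(inputs, weights) + bias)

    x = np.array([0.5, -1.2, 3.0])   # outputs of the previous layer
    w = np.array([0.4, 0.1, -0.6])   # input weights
    print(neuron(x, w, bias=0.2))    # weighted sum is -1.52, so ReLU outputs 0.0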
14. A Fully Connected Network
• Fully Connected Network
• Each neuron is connected to every neuron in the subsequent layer
• Neural network visualization
• 2 hidden layers, fully connected, 3-class classification output
• A Multi-Layer Perceptron is an example
15. Neural Network Training
• Forward propagation
• Backpropagation
• Optimize the model with respect to a loss function
• A quantification of how “right or wrong” the model is for any given datum
• Gradient descent
• Stochastic Gradient Descent (SGD)
• Mini-batch SGD (see the sketch below)
• Right: visualization of gradient descent over an example loss function
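To make the update rule concrete, here is a toy mini-batch SGD loop in NumPy fitting a two-parameter linear model (a sketch for intuition, unrelated to the bundle's optimizer code):

    import numpy as np

    np.random.seed(0)
    X = np.random.normal(size=1000)
    y = 3.0 * X + np.random.normal(scale=0.1, size=1000)

    w, b = 0.0, 0.0
    lr, batch_size = 0.1, 128

    for epoch in range(20):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            err = w * X[batch] + b - y[batch]   # prediction error on the mini-batch
            w -= lr * np.mean(err * X[batch])   # MSE gradient w.r.t. w
            b -= lr * np.mean(err)              # MSE gradient w.r.t. b

    print(w, b)  # w converges toward 3.0, b toward 0.0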
16. Where Exactly Do the GPUs Come Into Play?
• Training an NN model is the most time-consuming part; this is where the GPU is used to dramatically reduce computation time
• Two main training steps
• Forward pass – weights and errors
• Backward pass – gradients and weight updates
• Computationally expensive convolutions are offloaded onto GPUs (see the sketch below)
• These steps are done for each data point, multiple times
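For illustration, TensorFlow 1.x lets you pin an expensive op to a specific GPU explicitly; Keras performs this placement automatically when a GPU is visible:

    import tensorflow as tf

    # Place a large matrix multiply on the first GPU (TF 1.x graph mode)
    with tf.device('/GPU:0'):
        a = tf.random_normal([4096, 4096])
        b = tf.matmul(a, a)

    with tf.Session() as sess:
        sess.run(b)  # the multiply executes on the GPU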
17. Parallel Paradigms
• Data Parallelism
• Model Parallelism
• Synchronous and Asynchronous
• Parallel SGD
(Figure: data parallelism vs. model parallelism)
18. Model Parallelism
• The neural network model is split across nodes
• For models larger than a single GPU’s memory
• Requires significantly higher communication bandwidth between nodes
• Not well suited for a cluster system
• However, this paradigm is feasible for a multi-GPU system due to faster hardware connections
19. Data Parallelism
• Data is partitioned and distributed to nodes
• A single NN model is replicated onto each node
• Only weight updates are communicated and aggregated
• As defined by the specific parallel training method
• Suitable for parallelizing across multiple nodes in an HPCC Systems cluster or across GPUs in a single system
• This is the paradigm used here (see the sketch below)
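A conceptual NumPy sketch of synchronous data parallelism: each "node" computes a gradient on its own data shard, and a parameter server averages them (for intuition only, not the bundle's implementation):

    import numpy as np

    np.random.seed(1)
    X = np.random.normal(size=1024)
    y = 3.0 * X

    n_nodes = 4
    shards = np.array_split(np.arange(len(X)), n_nodes)  # partition the data

    w, lr = 0.0, 0.1
    for step in range(100):
        # Each shard's gradient would be computed on its own GPU
        grads = [np.mean((w * X[s] - y[s]) * X[s]) for s in shards]
        w -= lr * np.mean(grads)  # parameter server averages the updates

    print(w)  # converges toward 3.0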
20. Not Your Average HPCC Systems
• Slightly different than traditional HPCC Systems topologies
• The whole figure represents a single physical computer and Thor node
• Parameter Server
• This is the CPU on the system
• Nodes (blue)
• Each node represents a single physical GPU
• Connections are high-speed hardware
• PCI Express 3.0 provides roughly 985 MB/s per lane (about 15.8 GB/s across 16 lanes)
• NVLink is roughly 10x faster than PCIe Gen 3
22. Bundle Usage Example Architecture
• We will create a Convolutional Neural Network (CNN) and train it on the MNIST dataset
• MNIST is a 10-class image classification dataset of handwritten digits 0-9
• The CNN takes 784 pixels as input (each with range 0-255)
• Two convolutional layers
• One fully connected layer with 128 neurons
• 10 output neurons (one for each class)
• A total of 1,199,882 trainable parameters (see the Keras sketch below)
• Processing through 720,000 MNIST images
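For reference, a Keras definition matching the architecture above; we assume a 2x2 max-pooling layer between the convolutions and the dense layer (not listed on the slide, but it reproduces the stated 1,199,882 trainable parameters). The bundle builds the equivalent model from ECL:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        # Two convolutional layers over 28x28x1 grayscale inputs
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        # One fully connected layer with 128 neurons
        layers.Dense(128, activation='relu'),
        # 10 output neurons, one per digit class
        layers.Dense(10, activation='softmax'),
    ])
    model.summary()  # Total params: 1,199,882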
23. Spray MNIST Dataset
• MNIST is included in the bundle
• Train and test sets, fixed-length records of 785 values (784 pixels plus 1 label)
• Train: 60,000 28x28 grayscale images
• Test: 10,000 28x28 grayscale images
• Both are labeled as one of 10 classes, 0-9
24. Image Visualization
• Imported RAW MNIST data
• Visualization of a single MNIST image in the “data” format
• Each pixel has a value between 0-255, represented as a 2-digit hex number
• Each pixel is a feature
25. Preparing the Data
• Currently, the bundle demonstrates how to train on image data
• Includes an example NN and the example datasets (MNIST and Fashion-MNIST)
• Training data and labels are molded into a NumPy array with a specified shape before training
• Here, the shape is the dimensions of the image
• i.e., the dimensions of the input features
• These get flattened to an array of 784 inputs for 784 input neurons (see the sketch below)
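A sketch of that molding step in NumPy, assuming records arrive as 785 raw bytes each (label first, then 784 pixels; the field order is our assumption, not the bundle's documented layout):

    import numpy as np

    def mold(records):
        # Turn raw 785-byte MNIST records into training arrays.
        # Assumes each record is the label byte followed by 784 pixel bytes.
        raw = np.frombuffer(b''.join(records), dtype=np.uint8).reshape(-1, 785)
        labels = raw[:, 0]
        # Normalize pixels to [0, 1] and shape them as 28x28x1 images
        images = raw[:, 1:].astype(np.float32) / 255.0
        return images.reshape(-1, 28, 28, 1), labels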
26. Creating a CNN – model.add() method
• First, we define the optimizer and its parameters
• Next, we define the training scheme
• Batch size = 128
• We’ll train for 20 epochs
27. Creating a CNN – model.add() method
• Next, we define the NN architecture
• Input shape: 28x28x1 grayscale images
• Initialize the model
• The “nnOutputLayer” is the final layer and is, at this point, the entire NN model thus far
28. Train the CNN – model.train() method
• “nnOutputLayer” is passed into model.train() along with hyperparameters and training data (a Keras-level sketch of this step follows)
(Screenshots: training run output on GPU vs. CPU)
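In Keras terms, and continuing from the `model`, `images`, and `labels` of the earlier sketches, the step wrapped by model.train() corresponds roughly to compiling and fitting the model (the optimizer and loss here are illustrative assumptions; the bundle's actual ECL signature may differ):

    from tensorflow import keras

    # Optimizer choice is an assumption for illustration
    model.compile(optimizer=keras.optimizers.SGD(lr=0.01),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(images, labels, batch_size=128, epochs=20)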
29. Create CNN – ECL and Python
30. Example Input and Output
(Figure: an image input and its one-hot-encoded output; see the illustration below)
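For instance, a predicted class of 3 out of 10 comes back as a one-hot record like this (a NumPy illustration):

    import numpy as np

    label = 3
    one_hot = np.eye(10, dtype=int)[label]  # 1 at index 3, 0 elsewhere
    print(one_hot)  # [0 0 0 1 0 0 0 0 0 0]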
32. Performance Evaluation
• A case study was performed to measure the performance improvements
• 5 identical convolutional neural networks were trained on the MNIST dataset
• 10 times each to provide statistical significance
• Measuring the required training time for the same model on the same data using fixed training parameters
• Faster training time is desired
• CPU alone, and 1, 2, 3, and 4 GPUs
• Older K80s are used
• Newer GPUs would only increase performance and efficiency
• Compared against each other and against the “optimal” speedup
• i.e., linear speedup
33. Performance Boost: GPU vs. CPU
• Time, in seconds, to train a CNN on the MNIST dataset
• Training is 5.4x faster on a K80 GPU than on a Xeon CPU
• The speedup is large even for a simple model on small, simple data
• The measurement covers NN training time only, excluding HPCC-specific computations that are the same whether running on CPU or GPU
34. Performance Boost: CPU vs. GPU vs Optimal Speedup
• Optimal speedup is linear
• i.e., twice the nodes is twice as fast
• Speedup is not expected to be linear due to communication overheads
• Results show that additional GPUs have minimal cost
35. Conclusion
• A tool was used to create HPCC Systems virtual images on various new platforms
• A good use case is creating GPU enabled images
• Gave a brief overview of neural networks and their optimization
• Demonstrated that GPU accelerated deep learning is possible on the HPCC Systems Platform
• Demonstrated that GPUs provide a significant performance increase, even on a non-traditional cluster
36. Future Work
• Implementing generalizable data loaders
• To allow for training on data with less knowledge of NumPy (Python)
• Continue adding to the supported methods and ECL modeling functions
• Research and development on integrating model parallelism
• Research on NN training on multi-node clusters where each node can have one or more GPUs
37. Links
• GitHub
• https://github.com/hpcc-systems/GPU-Deep-Learning
• https://github.com/xwang2713/cloud-image-build
• NVIDIA CUDA
• https://developer.nvidia.com/cuda-toolkit
• TensorFlow
• https://www.tensorflow.org/
• Keras
• https://keras.io/
• NumPy
• https://numpy.org/
38. Questions?
Robert Kennedy
PhD Candidate, Florida Atlantic University
rkennedy@fau.edu
39. View this presentation on YouTube:
https://www.youtube.com/watch?v=GMt-_Io4Jys&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=8&t=0s (4:02)