Date: 14th November 2018
Location: AI Lab Theatre
Time: 12:30 - 13:30
Speaker: Cristiana Dinea
Organisation: NVIDIA
About: Many of the deep learning problems which industries are trying to tackle are complex. While it is relatively easy to build an early proof of concept (POC) of a system, it takes a huge amount of effort to build a solution that meets all functional and non-functional requirements.
For example, it’s straightforward to build a POC for self-driving vehicles that will drive across a small number of streets with human supervision. On the other hand, building a self-driving car which is robust and safe is an engineering feat requiring petabytes of data for training and validation.
In this session we tackle the key challenges faced when developing complex deep learning systems. We focus on the algorithmic challenges of large-scale training, including distributed training algorithms and the accuracy degradation associated with large batch sizes, as well as the engineering challenges of designing and operating large-scale training infrastructure.
4. NEURAL NETWORKS ARE NOT NEW
They just historically never worked well
[Chart: algorithm performance in the small data regime; accuracy vs. dataset size for a classical ML algorithm (ML1)]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
5. NEURAL NETWORKS ARE NOT NEW
They just historically never worked well
[Chart: algorithm performance in the small data regime; accuracy vs. dataset size for three classical ML algorithms (ML1, ML2, ML3)]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
6. NEURAL NETWORKS ARE NOT NEW
They just historically never worked well
[Chart: algorithm performance in the small data regime; a small neural network compared with ML1, ML2 and ML3]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
8. DATA
Historically we had neither large datasets nor the computers to process them
[Chart: algorithm performance in the small data regime; a small neural network compared with ML1, ML2 and ML3]
The MNIST database (1999) contains 60,000 training images and 10,000 test images.
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
12. NEURAL NETWORKS ARE NOT NEW
Availability of data and compute
[Chart: algorithm performance in the big data regime (note the larger x-axis range); a small neural network compared with ML1, ML2 and ML3]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
13. NEURAL NETWORK COMPLEXITY IS EXPLODING
To tackle increasingly complex challenges
• 2015 – Microsoft ResNet, superhuman image recognition: 7 ExaFLOPS, 60 million parameters
• 2016 – Baidu Deep Speech 2, superhuman voice recognition: 20 ExaFLOPS, 300 million parameters
• 2017 – Google Neural Machine Translation, near-human language translation: 100 ExaFLOPS, 8700 million parameters
15. NEURAL NETWORKS ARE NOT NEW
Data and model size are the key to accuracy
[Chart: algorithm performance in the big data regime; a big neural network (Big NN) keeps improving with more data, surpassing the small NN and the classical algorithms]
16. NEURAL NETWORKS ARE NOT NEW
Exceeding human-level performance
[Chart: algorithm performance in the large data regime; progressively bigger networks (Big NN, Bigger NN) keep gaining accuracy as dataset size grows]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
17. EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
Logarithmic relationship between dataset size and accuracy
• Translation
• Language Models
• Character Language Models
• Image Classification
• Attention Speech Models
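The shape of these curves is captured by a simple empirical law. As a sketch (using the paper's notation; the commentary on the exponent is only indicative), Hestness et al. model generalization error as a power law in training-set size m, which appears as a straight line on log-log axes:

```latex
% Empirical scaling law (Hestness et al., 2017):
% generalization error as a function of training-set size m
\epsilon(m) \approx \alpha \, m^{\beta_g}
% \alpha is problem-dependent; \beta_g is negative, and empirically
% its magnitude is well below the theoretically predicted 0.5
```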
25. EXPLODING DATASETS
You can calculate how much data you need
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
Step 1. Establish how much accuracy you need
26. EXPLODING DATASETS
You can calculate how much data you need
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
Step 2. Conduct a number of experiments with datasets of varying size (within the power-law region)
27. EXPLODING DATASETS
You can calculate how much data you need
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
Step 3. Fit the resulting curve and extrapolate to the dataset size that reaches the accuracy you need (a sketch of the calculation follows below)
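As an illustration of Steps 1-3, here is a minimal sketch (my own, not from the talk; all measurements are made up) that fits the power law on a handful of small-scale runs and solves for the dataset size that hits a target error:

```python
# Fit the empirical power law error(m) = alpha * m**beta on a few
# small-scale runs, then solve for the dataset size m that reaches
# a target error. Numbers below are invented for illustration.
import numpy as np

# Step 2: measured (dataset size, validation error) pairs from the
# power-law region -- hypothetical values.
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = np.array([0.42, 0.33, 0.25, 0.19, 0.145])

# Fit log(error) = log(alpha) + beta * log(m) with least squares.
beta, log_alpha = np.polyfit(np.log(sizes), np.log(errors), deg=1)
alpha = np.exp(log_alpha)

# Step 3: extrapolate -- dataset size needed for the target error
# chosen in Step 1, assuming the power law continues to hold.
target_error = 0.05
needed = (target_error / alpha) ** (1.0 / beta)
print(f"fit: error ~ {alpha:.3g} * m^{beta:.3f}")
print(f"estimated dataset size for error {target_error}: {needed:.3g}")
```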
28. EXPLODING DATASETS
The bad news: the log scale. Because accuracy improves with the logarithm of dataset size, each further constant gain in accuracy costs roughly an order of magnitude more data.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
30. IMPLICATIONS
Automotive example
The majority of useful problems are too complex to train on a single GPU
                                     VERY CONSERVATIVE       CONSERVATIVE
Fleet size (data capture per hour)   100 cars (1 TB/hour)    125 cars (1.5 TB/hour)
Duration of data collection          260 days x 8 hours      325 days x 10 hours
Data compression factor              0.0005                  0.0008
Total training set                   104 TB                  487.5 TB
InceptionV3 training time
  (with 1 Pascal GPU)                9.1 years               42.6 years
AlexNet training time
  (with 1 Pascal GPU)                1.1 years               5.4 years
Adam Grzywaczewski, "Training AI for Self-Driving Vehicles: the Challenge of Scale", NVIDIA blog
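For transparency, the "very conservative" column above can be reproduced with a few lines of arithmetic (a sketch; the constants come straight from the table, and the conservative column works the same way):

```python
# Sanity check of the "very conservative" training-set size.
cars = 100            # fleet size
tb_per_hour = 1.0     # raw data captured per car per hour (TB)
days = 260
hours_per_day = 8
compression = 0.0005  # fraction of raw data kept for training

raw_tb = cars * tb_per_hour * days * hours_per_day  # 208,000 TB raw
training_set_tb = raw_tb * compression              # 104 TB
print(f"raw capture: {raw_tb:,.0f} TB -> training set: {training_set_tb:.1f} TB")
```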
34. ITERATION TIME
Short iteration time is fundamental for success
ResNet-50 training time on ImageNet, in minutes (log scale in the original chart):

Microsoft (Dec 2015)             1740
Preferred Networks (Feb 2017)     264
Facebook (Jun 2017)                60
IBM (Aug 2017)                     48
SURFsara (Sep 2017)                40
UC Berkeley (Oct 2017)             24
Preferred Networks (Nov 2017)      15
38. END-TO-END PRODUCT FAMILY
FULLY INTEGRATED AI SYSTEMS
• DESKTOP: TITAN
• WORKSTATION: DGX Station
• DATA CENTER: Tesla V100
• HPC / TRAINING: DGX-1, DGX-2
• SERVER PLATFORM: HGX-1 / HGX-2
• INFERENCE: Tesla P4 / T4
• VIRTUAL WORKSTATION: Virtual GPU
• AUTOMOTIVE: Drive AGX Pegasus
• EMBEDDED: Jetson AGX Xavier
42. BALANCED HARDWARE
10x performance gain in less than a year
Time to train (days); workload: FairSeq, 55 epochs to solution, PyTorch training performance:

DGX-1 with V100 (Sep 2017)    15 days
DGX-2 (Q3 2018)                1.5 days (10 times faster)

The gain combines the new hardware with software improvements across the stack, including NCCL, cuDNN, etc.
43. MULTI-NODE SCALING WHITEPAPER
Use this asset to aid the design process, ensuring you develop an optimized architecture for your multi-node cluster, following NVIDIA best practices learned from our customer deployments and our own DGX SATURNV.
45. CHALLENGES WITH COMPLEX SOFTWARE
Current DIY GPU-accelerated AI and HPC deployments are complex and time-consuming to build, test and maintain.
Development of the software frameworks by the community is moving very fast.
A high level of expertise is required to manage driver, library and framework dependencies.
46. INNOVATE IN MINUTES, NOT WEEKS, WITH DEEP LEARNING CONTAINERS
Benefits of containers:
• Simplify deployment of GPU-accelerated applications, eliminating time-consuming software integration work
• Isolate individual frameworks or applications
• Share, collaborate, and test applications across different environments
52. NGC – EFFICIENT SOFTWARE
ngc.nvidia.com
Regardless of the hardware solution you choose, whether it sits in the datacenter or in the cloud, make sure you use NGC. Every month, free of charge, we provide a set of Docker containers with the latest, highly optimized versions of all of the major deep learning frameworks.
55. THE CHOICE OF ALGORITHM HAS A HIGH IMPACT ON ACCURACY
Naive approaches lead to degraded accuracy
You, Y., Zhang, Z., Hsieh, C., Demmel, J., & Keutzer, K. (2017). ImageNet training in minutes. CoRR, abs/1709.05011.
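One widely used remedy, described in Goyal et al. (2017, cited later in this deck for the 1-hour ImageNet result), is the linear learning-rate scaling rule with gradual warmup. A minimal sketch in plain Python (the defaults mirror the paper's ResNet-50/ImageNet setup; treat the exact numbers as illustrative):

```python
# Linear scaling rule (Goyal et al., 2017): scale the base learning
# rate proportionally to the batch size, and ramp it up linearly over
# the first few epochs ("warmup") to avoid early divergence.
def learning_rate(epoch, batch_size,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """LR for a given epoch under linear scaling + gradual warmup.

    base_lr is the rate tuned for base_batch; the defaults are the
    values used in the paper for ResNet-50 on ImageNet.
    """
    peak_lr = base_lr * batch_size / base_batch  # linear scaling rule
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr to peak_lr during warmup.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    return peak_lr

# Example: with a global batch of 8192 (32 nodes x 256), the
# post-warmup rate becomes 0.1 * 8192 / 256 = 3.2.
for e in range(7):
    print(e, round(learning_rate(e, batch_size=8192), 3))
```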
59. NVIDIA DEEP LEARNING INSTITUTE
Helping the world solve challenging problems using AI and deep learning
Hands-on training for data scientists and software engineers:
• On-site workshops and online courses presented by certified instructors
• Covering complete workflows for proven application use cases: Fundamentals, Autonomous Vehicles, Finance, Healthcare, Video Analytics, IoT/Robotics, …
www.nvidia.com/dli
60. ADVANCE YOUR DEEP LEARNING TRAINING AT GTC
Don't miss the world's most important event for GPU developers. Come join us at a GTC near you: www.gputechconf.com
• DISCOVER: see how GPU technologies are creating amazing breakthroughs in important fields such as deep learning
• CONNECT: connect with technology experts from NVIDIA and other leading organizations
• LEARN: gain insight and valuable hands-on training through hundreds of sessions and research posters
• INNOVATE: hear about disruptive innovations as early-stage companies and startups present their work
61. DEEP LEARNING INSTITUTE WORKSHOP
TRAINING FOR MULTI-GPU
• Full-day workshop (8 hrs)
• Theory of data parallelism
• Algorithmic challenges of multi-GPU training
• Engineering challenges of multi-GPU training
[Diagram: synchronous data parallelism; after each step the gradients from GPU1 and GPU2 are averaged so both replicas stay in sync]
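To make the diagram concrete, here is a minimal sketch (my illustration, not workshop material) of synchronous data parallelism: each worker computes gradients on its own shard of the batch, the gradients are averaged, and all replicas apply the same update.

```python
# Synchronous data-parallel SGD in miniature: two "GPUs" (simulated
# here with NumPy) each compute a gradient on their own data shard;
# the gradients are averaged (the "Averaging" step in the diagram)
# so every replica applies an identical weight update.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                      # shared model weights (replicated)
X = rng.normal(size=(8, 3))          # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])   # linear targets

def gradient(w, X_shard, y_shard):
    """Gradient of mean squared error on one worker's shard."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

lr = 0.1
for step in range(100):
    shards = np.array_split(np.arange(len(y)), 2)      # one shard per GPU
    grads = [gradient(w, X[s], y[s]) for s in shards]  # compute in parallel
    g = np.mean(grads, axis=0)                         # allreduce-average
    w -= lr * g                                        # identical update
print(w)  # converges toward [1.0, -2.0, 0.5]
```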
63. READY TO GET STARTED?
Project checklist
What problem are you solving, what are the AI/DL tasks?
What data do you have/need, and how is it labeled?
Which tools & environment will you use?
On what platform(s) will you train and deploy?
Example application areas: Autonomous Vehicles, Digital Content Creation, Internet Services, Media & Entertainment, Security & Defense, Intelligent Video Analytics, Finance, Healthcare, Genomics
64. INCEPTION: VIRTUAL ACCELERATOR FOR AI STARTUPS
Benefits:
• Technology access
• AI expertise
• Go-to-market support
• Showcasing innovation
• Creating a global community
Membership impact: ~10x
www.nvidia.com/inception
65. TALK TO US, WE CAN HELP!
AT THE SCAN BOOTH – STAND 123
ONLINE AT NVIDIA.COM/DLI
69. IBM
64 IBM Power Systems
Training ImageNet with ResNet-101 in 7 hours
70. FACEBOOK
• 128 * DGX-1
• 10.5 PFLOPS total FP32
• 21 PFLOPS total FP16
• Non-blocking IB fabric
Training ImageNet with ResNet-50 in 1 hour
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., ... & He, K. (2017).
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
71. PREFERRED NETWORKS
The cluster consists of 128 nodes with 8 NVIDIA P100 GPUs each, 1024 GPUs in total.
The nodes are connected with two FDR InfiniBand links (56 Gbps x 2).
Training ImageNet in 15 minutes
Akiba, T., Suzuki, S., & Fukuda, K. (2017). Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. arXiv preprint arXiv:1711.04325.
72. BAIDU SVAIL
11 PFLOPS across 1500 GPUs
Investigating the log-linear nature of the relationship between dataset size and generalization accuracy