Date: 14th November 2018
Location: AI Lab Theatre
Time: 12:30 - 13:30
Speaker: Cristiana Dinea
Organisation: NVIDIA
About: Many of the deep learning problems which industries are trying to tackle are complex. While it is relatively easy to build an early proof of concept (POC) of a system, it takes a huge amount of effort to build a solution that meets all functional and non-functional requirements.
For example, it’s straightforward to build a POC for self-driving vehicles that will drive across a small number of streets with human supervision. On the other hand, building a self-driving car which is robust and safe is an engineering feat requiring petabytes of data for training and validation.
In this session we tackle the key challenges faced when developing complex deep learning systems. We focus on the algorithmic challenges of large-scale training, including distributed training algorithms and the accuracy degradation associated with large batch sizes, as well as the engineering challenges of designing and operating large-scale training infrastructure.
4. NEURAL NETWORKS ARE NOT NEW
They just historically never worked well
[Chart: algorithm performance in the small data regime; accuracy vs. dataset size for a classical ML algorithm (ML1)]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
5. NEURAL NETWORKS ARE NOT NEW
They just historically never worked well
[Chart: algorithm performance in the small data regime; accuracy vs. dataset size for three classical ML algorithms (ML1, ML2, ML3)]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
6. NEURAL NETWORKS ARE NOT NEW
They just historically never worked well
[Chart: algorithm performance in the small data regime; a small neural network compared with ML1, ML2 and ML3]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
8. DATA
Historically we had neither large datasets nor the computers to process them
[Chart: algorithm performance in the small data regime; a small neural network compared with ML1, ML2 and ML3]
The MNIST database (1999) contains 60,000 training images and 10,000 test images.
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
12. NEURAL NETWORKS ARE NOT NEW
Availability of data and compute
[Chart: algorithm performance in the big data regime (note the larger x-axis range); a small neural network compared with ML1, ML2 and ML3]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
13. NEURAL NETWORK COMPLEXITY IS EXPLODING
To tackle increasingly complex challenges
• 2015 – Microsoft ResNet, superhuman image recognition: 7 ExaFLOPS, 60 million parameters
• 2016 – Baidu Deep Speech 2, superhuman voice recognition: 20 ExaFLOPS, 300 million parameters
• 2017 – Google Neural Machine Translation, near-human language translation: 100 ExaFLOPS, 8700 million parameters
15. NEURAL NETWORKS ARE NOT NEW
Data and model size are the key to accuracy
[Chart: algorithm performance in the big data regime; a big neural network (Big NN) keeps improving with more data, surpassing the small NN and the classical algorithms]
16. NEURAL NETWORKS ARE NOT NEW
Exceeding human-level performance
[Chart: algorithm performance in the large data regime; progressively bigger networks (Big NN, Bigger NN) keep gaining accuracy as dataset size grows]
Andrew Ng, “Nuts and Bolts of Applying Deep Learning”, https://www.youtube.com/watch?v=F1ka6a13S9I
17. EXPLODING DATASETS
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409.
Logarithmic relationship between dataset size and accuracy
• Translation
• Language Models
• Character Language Models
• Image Classification
• Attention Speech Models
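The shape of these curves is captured by a simple empirical law. As a sketch (using the paper's notation; the commentary on the exponent is only indicative), Hestness et al. model generalization error as a power law in training-set size m, which appears as a straight line on log-log axes:

```latex
% Empirical scaling law (Hestness et al., 2017):
% generalization error as a function of training-set size m
\epsilon(m) \approx \alpha \, m^{\beta_g}
% \alpha is problem-dependent; \beta_g is negative, and empirically
% its magnitude is well below the theoretically predicted 0.5
```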
25. EXPLODING DATASETS
You can calculate how much data you need
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
Step 1. Establish how much accuracy you need
26. EXPLODING DATASETS
You can calculate how much data you need
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
Step 2. Conduct a number of experiments with datasets of varying size (within the power-law region)
27. EXPLODING DATASETS
You can calculate how much data you need
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
Step 3. Fit the resulting curve and extrapolate to the dataset size that reaches the accuracy you need (a sketch of the calculation follows below)
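As an illustration of Steps 1-3, here is a minimal sketch (my own, not from the talk; all measurements are made up) that fits the power law on a handful of small-scale runs and solves for the dataset size that hits a target error:

```python
# Fit the empirical power law error(m) = alpha * m**beta on a few
# small-scale runs, then solve for the dataset size m that reaches
# a target error. Numbers below are invented for illustration.
import numpy as np

# Step 2: measured (dataset size, validation error) pairs from the
# power-law region -- hypothetical values.
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = np.array([0.42, 0.33, 0.25, 0.19, 0.145])

# Fit log(error) = log(alpha) + beta * log(m) with least squares.
beta, log_alpha = np.polyfit(np.log(sizes), np.log(errors), deg=1)
alpha = np.exp(log_alpha)

# Step 3: extrapolate -- dataset size needed for the target error
# chosen in Step 1, assuming the power law continues to hold.
target_error = 0.05
needed = (target_error / alpha) ** (1.0 / beta)
print(f"fit: error ~ {alpha:.3g} * m^{beta:.3f}")
print(f"estimated dataset size for error {target_error}: {needed:.3g}")
```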
28. EXPLODING DATASETS
The bad news: the log scale. Because accuracy improves with the logarithm of dataset size, each further constant gain in accuracy costs roughly an order of magnitude more data.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., ... & Zhou, Y. (2017). Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409
30. IMPLICATIONS
Automotive example
The majority of useful problems are too complex to train on a single GPU
                                     VERY CONSERVATIVE       CONSERVATIVE
Fleet size (data capture per hour)   100 cars (1 TB/hour)    125 cars (1.5 TB/hour)
Duration of data collection          260 days x 8 hours      325 days x 10 hours
Data compression factor              0.0005                  0.0008
Total training set                   104 TB                  487.5 TB
InceptionV3 training time
  (with 1 Pascal GPU)                9.1 years               42.6 years
AlexNet training time
  (with 1 Pascal GPU)                1.1 years               5.4 years
Adam Grzywaczewski, "Training AI for Self-Driving Vehicles: the Challenge of Scale", NVIDIA blog
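For transparency, the "very conservative" column above can be reproduced with a few lines of arithmetic (a sketch; the constants come straight from the table, and the conservative column works the same way):

```python
# Sanity check of the "very conservative" training-set size.
cars = 100            # fleet size
tb_per_hour = 1.0     # raw data captured per car per hour (TB)
days = 260
hours_per_day = 8
compression = 0.0005  # fraction of raw data kept for training

raw_tb = cars * tb_per_hour * days * hours_per_day  # 208,000 TB raw
training_set_tb = raw_tb * compression              # 104 TB
print(f"raw capture: {raw_tb:,.0f} TB -> training set: {training_set_tb:.1f} TB")
```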
34. ITERATION TIME
Short iteration time is fundamental for success
ResNet-50 training time on ImageNet, in minutes (log scale in the original chart):

Microsoft (Dec 2015)             1740
Preferred Networks (Feb 2017)     264
Facebook (Jun 2017)                60
IBM (Aug 2017)                     48
SURFsara (Sep 2017)                40
UC Berkeley (Oct 2017)             24
Preferred Networks (Nov 2017)      15
38. END-TO-END PRODUCT FAMILY
FULLY INTEGRATED AI SYSTEMS
• DESKTOP: TITAN
• WORKSTATION: DGX Station
• DATA CENTER: Tesla V100
• HPC / TRAINING: DGX-1, DGX-2
• SERVER PLATFORM: HGX-1 / HGX-2
• INFERENCE: Tesla P4 / T4
• VIRTUAL WORKSTATION: Virtual GPU
• AUTOMOTIVE: Drive AGX Pegasus
• EMBEDDED: Jetson AGX Xavier
42. BALANCED HARDWARE
10x performance gain in less than a year
Time to train (days); workload: FairSeq, 55 epochs to solution, PyTorch training performance:

DGX-1 with V100 (Sep 2017)    15 days
DGX-2 (Q3 2018)                1.5 days (10 times faster)

The gain combines the new hardware with software improvements across the stack, including NCCL, cuDNN, etc.
43. MULTI-NODE SCALING WHITEPAPER
Use this asset to aid the design process, ensuring you develop an optimized architecture for your multi-node cluster, following NVIDIA best practices learned from our customer deployments and our own DGX SATURNV.
45. CHALLENGES WITH COMPLEX SOFTWARE
Current DIY GPU-accelerated AI and HPC deployments are complex and time-consuming to build, test and maintain.
Development of the software frameworks by the community is moving very fast.
A high level of expertise is required to manage driver, library and framework dependencies.
46. INNOVATE IN MINUTES, NOT WEEKS, WITH DEEP LEARNING CONTAINERS
Benefits of containers:
• Simplify deployment of GPU-accelerated applications, eliminating time-consuming software integration work
• Isolate individual frameworks or applications
• Share, collaborate, and test applications across different environments
52. NGC – EFFICIENT SOFTWARE
ngc.nvidia.com
Regardless of the hardware solution you choose, whether it sits in the datacenter or in the cloud, make sure you use NGC. Every month, free of charge, we provide a set of Docker containers with the latest, highly optimized versions of all of the major deep learning frameworks.
55. THE CHOICE OF ALGORITHM HAS A HIGH IMPACT ON ACCURACY
Naive approaches lead to degraded accuracy
You, Y., Zhang, Z., Hsieh, C., Demmel, J., & Keutzer, K. (2017). ImageNet training in minutes. CoRR, abs/1709.05011.
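One widely used remedy, described in Goyal et al. (2017, cited later in this deck for the 1-hour ImageNet result), is the linear learning-rate scaling rule with gradual warmup. A minimal sketch in plain Python (the defaults mirror the paper's ResNet-50/ImageNet setup; treat the exact numbers as illustrative):

```python
# Linear scaling rule (Goyal et al., 2017): scale the base learning
# rate proportionally to the batch size, and ramp it up linearly over
# the first few epochs ("warmup") to avoid early divergence.
def learning_rate(epoch, batch_size,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """LR for a given epoch under linear scaling + gradual warmup.

    base_lr is the rate tuned for base_batch; the defaults are the
    values used in the paper for ResNet-50 on ImageNet.
    """
    peak_lr = base_lr * batch_size / base_batch  # linear scaling rule
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr to peak_lr during warmup.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    return peak_lr

# Example: with a global batch of 8192 (32 nodes x 256), the
# post-warmup rate becomes 0.1 * 8192 / 256 = 3.2.
for e in range(7):
    print(e, round(learning_rate(e, batch_size=8192), 3))
```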
59. NVIDIA DEEP LEARNING INSTITUTE
Helping the world solve challenging problems using AI and deep learning
Hands-on training for data scientists and software engineers:
• On-site workshops and online courses presented by certified instructors
• Covering complete workflows for proven application use cases: Fundamentals, Autonomous Vehicles, Finance, Healthcare, Video Analytics, IoT/Robotics, …
www.nvidia.com/dli
60. ADVANCE YOUR DEEP LEARNING TRAINING AT GTC
Don't miss the world's most important event for GPU developers. Come join us at a GTC near you: www.gputechconf.com
• DISCOVER: see how GPU technologies are creating amazing breakthroughs in important fields such as deep learning
• CONNECT: connect with technology experts from NVIDIA and other leading organizations
• LEARN: gain insight and valuable hands-on training through hundreds of sessions and research posters
• INNOVATE: hear about disruptive innovations as early-stage companies and startups present their work
61. DEEP LEARNING INSTITUTE WORKSHOP
TRAINING FOR MULTI-GPU
• Full-day workshop (8 hrs)
• Theory of data parallelism
• Algorithmic challenges of multi-GPU training
• Engineering challenges of multi-GPU training
[Diagram: synchronous data parallelism; after each step the gradients from GPU1 and GPU2 are averaged so both replicas stay in sync]
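To make the diagram concrete, here is a minimal sketch (my illustration, not workshop material) of synchronous data parallelism: each worker computes gradients on its own shard of the batch, the gradients are averaged, and all replicas apply the same update.

```python
# Synchronous data-parallel SGD in miniature: two "GPUs" (simulated
# here with NumPy) each compute a gradient on their own data shard;
# the gradients are averaged (the "Averaging" step in the diagram)
# so every replica applies an identical weight update.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                      # shared model weights (replicated)
X = rng.normal(size=(8, 3))          # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])   # linear targets

def gradient(w, X_shard, y_shard):
    """Gradient of mean squared error on one worker's shard."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

lr = 0.1
for step in range(100):
    shards = np.array_split(np.arange(len(y)), 2)      # one shard per GPU
    grads = [gradient(w, X[s], y[s]) for s in shards]  # compute in parallel
    g = np.mean(grads, axis=0)                         # allreduce-average
    w -= lr * g                                        # identical update
print(w)  # converges toward [1.0, -2.0, 0.5]
```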
63. READY TO GET STARTED?
Project checklist
What problem are you solving, what are the AI/DL tasks?
What data do you have/need, and how is it labeled?
Which tools & environment will you use?
On what platform(s) will you train and deploy?
Example application areas: Autonomous Vehicles, Digital Content Creation, Internet Services, Media & Entertainment, Security & Defense, Intelligent Video Analytics, Finance, Healthcare, Genomics
64. INCEPTION: VIRTUAL ACCELERATOR FOR AI STARTUPS
Benefits:
• Technology access
• AI expertise
• Go-to-market support
• Showcasing innovation
• Creating a global community
Membership impact: ~10x
www.nvidia.com/inception
65. TALK TO US, WE CAN HELP!
AT THE SCAN BOOTH – STAND 123
ONLINE AT NVIDIA.COM/DLI
69. IBM
64 IBM Power Systems
Training ImageNet with ResNet-101 in 7 hours
70. FACEBOOK
• 128 * DGX-1
• 10.5 PFLOPS total FP32
• 21 PFLOPS total FP16
• Non-blocking IB fabric
Training ImageNet with ResNet-50 in 1 hour
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., ... & He, K. (2017).
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677.
71. PREFERRED NETWORKS
The cluster consists of 128 nodes with 8 NVIDIA P100 GPUs each, 1024 GPUs in total.
The nodes are connected with two FDR InfiniBand links (56 Gbps x 2).
Training ImageNet in 15 minutes
Akiba, T., Suzuki, S., & Fukuda, K. (2017). Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. arXiv preprint arXiv:1711.04325.
72. BAIDU SVAIL
11 PFLOPS across 1500 GPUs
Investigating the log-linear nature of the relationship between dataset size and generalization accuracy