Scaling Deep Learning
Bryan Catanzaro
@ctnzr
What do we want AI to do?
Drive us to work
Serve drinks?
Help us communicate (帮助我们沟通)
Keep us organized
Help us find things
Guide us to content
OCR-based Translation App
Baidu IDL
hello
Face Analysis
Baidu IDL
Gender
Age Range
Ethnicity
Mood
Medical Diagnostics App
Baidu BDL
AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
Image Captioning
Baidu IDL
A yellow bus driving down a road with green trees and green grass in the background.
Living room with white couch and blue carpeting. Room in apartment gets some afternoon sun.
Image Q&A
Baidu IDL
Sample questions and answers
Natural User Interfaces
• Goal: Make interacting with computers as
natural as interacting with humans
• AI problems:
– Speech recognition
– Emotional recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
Machine learning for computer vision (c. 2009)
“Please put away the coffee mugs!”
Machine learning for computer vision
“Mug”
Machine Learning
Cleanup-bot!
(Woohoo!)
AI applications are hard…
Machine Learning can solve challenging problems
--- but it is a lot of work!
This eventually worked ~95% of the time.
Why are applications so hard?
“Coffee Mug”
Pixel Intensity
Pixel intensity is a very difficult representation…
Why are applications so hard?
[Figure: images plotted by the intensity of two pixels (axes: pixel 1, pixel 2); + = Coffee Mug, - = Not Coffee Mug; example pixel intensities [72 160]]
Why are applications so hard?
[Figure: scatter plots over pixel 1 and pixel 2 (+ = Coffee Mug, - = Not Coffee Mug) fed to a Learning Algorithm that must answer “Is this a Coffee Mug?”]
Features
[Figure: scatter plots over the features “handle?” and “cylinder?” (+ = Coffee Mug, - = Not Coffee Mug) fed to a Learning Algorithm that must answer “Is this a Coffee Mug?”]
Machine learning in practice
[Diagram: Feature Extraction → Machine Learning (Classifier) → “Mug”; inputs labeled Prior Knowledge and Experience]
Machine learning in practice
• Enormous amounts of research time spent
inventing new features.
[Diagram: research cycle Idea → Code → Test]
– Idea: Think really hard…
– Code: Hack up in Matlab
– Test: Run on workstation
Learning features
• Deep learning: learn multiple stages of
features to achieve end goal.
[Diagram: Pixels → Features → Features → Classifier → “Mug”?]
Learning features
• “Neural networks” are one way to represent
features
[Diagram: Pixels → Features → Features → Classifier → “Mug”?]
y = g(W x)   (x: input, y: output, W: weights)
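A minimal sketch of the single layer y = g(W x) from this slide. The choice of tanh for g, the helper name `layer`, and the numbers are assumptions made for illustration; the slide does not specify them.

```python
import math

def layer(W, x, g=math.tanh):
    """Compute y = g(W x): one dot product per row of W, then the nonlinearity g."""
    return [g(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]

# Hypothetical 2x2 weight matrix and 2-element input.
W = [[0.5, -0.5],
     [1.0,  1.0]]
x = [1.0, 1.0]
y = layer(W, x)   # [tanh(0.0), tanh(2.0)]
```

Each output element is a dot product of one row of W with x, which is why the later slides describe the computation as dominated by dot products.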
Learning features
• Deep learning: learn multiple stages of
features to achieve end goal
[Diagram: Pixels → Features → Features → Classifier → “Mug”?, with learned weights W1, W2, W3 at successive stages]
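The multi-stage idea can be sketched as a chain of layers, each feeding the next. This is an illustrative assumption, not the speaker's model: the ReLU nonlinearity, the tiny weight matrices, and the function names are all hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product: one dot product per row of W."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def relu(v):
    """Elementwise nonlinearity (assumed here; the slide does not name one)."""
    return [max(0.0, a) for a in v]

def forward(pixels, W1, W2, W3):
    features1 = relu(matvec(W1, pixels))      # first stage of learned features
    features2 = relu(matvec(W2, features1))   # second stage, built on the first
    return matvec(W3, features2)              # classifier score for "Mug"?

# Hypothetical weights for a 2-pixel input.
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 1.0], [-1.0, 1.0]]
W3 = [[1.0, -2.0]]
score = forward([2.0, 1.0], W1, W2, W3)
print(score)  # [1.5]
```

In training, W1, W2, and W3 would all be adjusted together toward the end goal rather than designed by hand.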
Why Deep Learning?
1. Scale Matters
– Bigger models usually win
2. Data Matters
– More data means less
cleverness necessary
3. Productivity Matters
– Teams with better tools can try out more ideas
[Figure: Accuracy vs. Data & Compute, with curves for Deep Learning and many previous methods]
Scaling up
• Make progress on AI by focusing on systems
– Make models bigger
– Tackle more data
– Reduce research cycle time
• Accelerate large-scale
experiments
Exascale
• Strong scaling important
but difficult
– Weak scaling over time as
datasets increase
• We run our experiments
on 8-128 GPUs
• Exascale likely important
for running many “small”
experiments
Training Deep Neural Networks
• Computation dominated by dot products
• Multiple inputs, multiple outputs, batch
means GEMM
– Compute bound
• Convolutional layers even more compute
bound
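To illustrate why batching turns the work into GEMM, the sketch below applies one weight matrix to several inputs at once. It uses plain Python lists to stay dependency-free; a real system would call a tuned BLAS or GPU library. The names and sizes are made up for the example.

```python
def matvec(W, x):
    """GEMV: a single input vector, one dot product per output."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def matmul(W, X):
    """GEMM: X holds one input per column, so each weight is reused across the batch."""
    columns = list(zip(*X))
    return [[sum(w * a for w, a in zip(row, col)) for col in columns] for row in W]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                      # 3 outputs, 2 inputs
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]     # 4 separate inputs
X = [list(col) for col in zip(*batch)]                        # 2 x 4, one input per column
Y = matmul(W, X)                                              # 3 x 4, one output per column

# Column j of Y is exactly what matvec would give for input j.
for j, x in enumerate(batch):
    assert [row[j] for row in Y] == matvec(W, x)
```

The arithmetic is the same either way; the batched form loads each weight once and uses it for every input in the batch, which is what pushes the computation toward being compute bound.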
Computational Characteristics
• High arithmetic intensity
– Arithmetic operations / byte of data
– O(Exaflops) / O(Terabytes) : 10^6
• In contrast, many other ML training jobs are
O(Petaflops)/O(Petabytes) = 10^0
• Medium size datasets
– Generally fit on 1 node
– HDFS, fault tolerance, disk I/O not bottlenecks
Training 1 model: ~10 Exaflops
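The two ratios on this slide can be checked with a few lines of arithmetic. The second function is an added illustration, not from the talk: it estimates the intensity of a single GEMM under the simplifying assumption that each matrix is read or written exactly once in 4-byte floats.

```python
EXA, PETA, TERA = 1e18, 1e15, 1e12

def arithmetic_intensity(operations, bytes_moved):
    """Arithmetic operations per byte of data."""
    return operations / bytes_moved

dnn_training = arithmetic_intensity(EXA, TERA)     # 1e6, as on the slide
other_ml_jobs = arithmetic_intensity(PETA, PETA)   # 1e0, as on the slide

def gemm_intensity(m, n, k, bytes_per_element=4):
    """Assumed model: (m x k) times (k x n), each matrix touched once."""
    flops = 2 * m * n * k                           # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

print(dnn_training, other_ml_jobs)
print(round(gemm_intensity(1024, 1024, 1024), 1))   # 170.7
```

Under that model, intensity grows with matrix size, which is consistent with the slide's point that large dense layers are compute bound rather than I/O bound.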
Deep Neural Network training is HPC
[Diagram: research cycle Idea → Code → Test]
• Turnaround time is key
• Use most efficient hardware
– Parallel, heterogeneous computing
– Fast interconnect (PCIe, Infiniband)
• Push strong scalability
– Models and data have to be of commensurate size
• This is all standard HPC!
Training: Stochastic Gradient Descent
• Simple algorithm
– Add momentum to power through local minima
– Compute gradient by backpropagation
• Operates on minibatches
– This makes it a GEMM problem instead of GEMV
• Choose minibatches stochastically
– Important to avoid memorizing training order
• Difficult to parallelize
– Prefers lots of small steps
– Increasing minibatch size not always helpful
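The bullets above can be sketched as a short training loop: shuffle, take a minibatch, compute a gradient, and apply a momentum update. The toy problem (fitting y = 3x with one weight), the hyperparameters, and the function name are assumptions chosen so the example runs instantly; in a real network the gradient would come from backpropagation.

```python
import random

def sgd_momentum(data, lr=0.1, momentum=0.5, batch_size=4, epochs=100, seed=0):
    """Minibatch SGD with momentum on the loss 0.5 * (w*x - y)**2."""
    rng = random.Random(seed)
    data = list(data)
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                                   # choose minibatches stochastically
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Mean gradient over the minibatch (computed analytically for this toy loss).
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            velocity = momentum * velocity - lr * grad      # momentum accumulates past steps
            w += velocity                                   # many small steps
    return w

data = [(k / 10.0, 3.0 * k / 10.0) for k in range(-10, 11)]  # points on y = 3x
w = sgd_momentum(data)
print(round(w, 4))  # 3.0
```

Each update depends on the weights produced by the previous one, which is the source of the parallelization difficulty noted in the last bullet.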
Limitations of batching
[Figure: Error vs. Iterations for batch size = n and batch size = 2n]
Spending 2x the work picking a direction doesn’t reduce iteration count by 2x
SVAIL Infrastructure
• Software: CUDA, MPI, Majel (SVAIL internal library)
• Hardware:
– Tyan FT77CB7079 server (http://www.tyan.com)
– NVIDIA GeForce GTX Titan X, x8
– Mellanox Interconnect
Node Architecture
• All pairs of GPUs communicate
simultaneously over PCIe Gen 3 x16
• Groups of 4 GPUs form Peer to Peer domain
• Avoid moving data to CPUs or across QPI
Parallelism
Model Parallel
Data Parallel
MPI_Allreduce()
Training Data Training Data
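A minimal sketch of the data-parallel pattern, with workers simulated inside one process. `allreduce_mean` is a hypothetical stand-in for a sum-type MPI_Allreduce() followed by division by the worker count; the toy loss and numbers are assumptions.

```python
def gradient(w, examples):
    """Mean gradient of 0.5 * (w*x - y)**2 over the given examples."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)

def allreduce_mean(values):
    """Simulated all-reduce: every worker ends up with the same averaged value."""
    total = sum(values)
    return [total / len(values)] * len(values)

minibatch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
shards = [minibatch[0:2], minibatch[2:4]]            # equal-sized shard per worker
w = 0.5                                              # identical weights on every worker

local_grads = [gradient(w, shard) for shard in shards]   # computed independently
synced = allreduce_mean(local_grads)                     # one communication step
print(synced)  # [-11.625, -11.625]
```

With equal-sized shards, the averaged result matches the gradient of the whole minibatch, so every worker applies the same update and the replicas stay in sync.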
Speech Recognition: Traditional ASR
• Getting higher performance is hard
• Improve each stage by engineering
[Figure: Accuracy vs. Data + Model Size for Traditional ASR, annotated “Expert engineering.” Credit: Adam Coates]
Speech recognition: Traditional ASR
• Huge investment in features for speech!
– Decades of work to get very small improvements
Spectrogram MFCC Flux
Speech Recognition 2: Deep Learning!
• Since 2011, deep learning for features
[Diagram: Acoustic Model → HMM → Language Model → Transcription: “The quick brown fox jumps over the lazy dog.”]
Speech Recognition 2: Deep Learning!
• With more data, DL acoustic models perform
better than traditional models
[Figure: Accuracy vs. Data + Model Size, with curves for Traditional ASR and DL V1 for Speech]
Speech Recognition 3: “Deep Speech”
• End-to-end learning
Transcription: “The quick brown fox jumps over the lazy dog.”
Speech Recognition 3: “Deep Speech”
• We believe end-to-end DL works better
when we have big models and
lots of data
[Figure: Accuracy vs. Data + Model Size, with curves for Traditional ASR, DL V1 for Speech, and Deep Speech]
End-to-end speech with DL
• Deep neural network predicts characters directly
from audio
[Figure: per-frame network outputs: T H _ E … D O G]
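One common way to turn per-frame character outputs like these into text is a CTC-style collapse: merge consecutive repeats, then drop the blank symbol. The sketch below assumes `_` marks the blank and that the best character per frame has already been chosen; it is an illustration, not the Deep Speech decoder.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, as drawn on the slide

def collapse(frames, blank=BLANK):
    """Merge runs of the same symbol, then remove blanks."""
    merged = [symbol for symbol, _run in groupby(frames)]
    return "".join(symbol for symbol in merged if symbol != blank)

print(collapse("TTHH_EE"))    # THE
print(collapse("DD_OO_GG"))   # DOG
print(collapse("HEL_LO"))     # HELLO
```

The blank is what lets a genuine double letter survive: in the last call the two L characters are separated by a blank, so they are not merged.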
Recurrent Network
• RNNs model temporal dependence
• Various flavors used in many applications
– LSTM, GRU, Bidirectional, …
– Especially sequential data (time series, text, etc.)
• Sequential dependence complicates
parallelism
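The sequential dependence is easy to see in a scalar recurrent cell. This is a toy sketch with assumed names and a tanh nonlinearity, not one of the LSTM or GRU variants listed above.

```python
import math

def rnn_forward(inputs, w_in, w_rec, h0=0.0):
    """Scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = h0, []
    for x in inputs:                      # must run in order: step t needs h from step t-1
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, -1.0], w_in=1.0, w_rec=1.0)
```

The loop over time steps cannot be split across processors the way a batch of independent inputs can, so parallelism has to come from the batch dimension and from the matrix operations inside each step.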
Connectionist Temporal Classification
warp-ctc
• Recently open sourced our CTC
implementation
• Efficient, parallel CPU and GPU backend
• 100-400X faster than other implementations
• Apache license, C interface
https://github.com/baidu-research/warp-ctc
Training sets
• Train on ~1½ years of data (and growing)
• English and Mandarin
• End-to-end deep learning is key to
assembling large datasets
• Datasets drive accuracy
All-reduce
• We implemented our own all-reduce out of
send and receive
• Several algorithm choices based on size
• Careful attention to affinity and topology
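As an illustration of building all-reduce from point-to-point transfers, here is a ring all-reduce simulated in one process. The ring is one well-known algorithm and is an assumption here; the slide does not say which algorithms SVAIL chose. Each list assignment below stands in for a matched send and receive between neighbors.

```python
def ring_allreduce(buffers):
    """Sum all-reduce over a simulated ring; buffers[r] is worker r's list of P chunks."""
    p = len(buffers)
    data = [list(b) for b in buffers]
    # Phase 1 (reduce-scatter): after p-1 steps, worker r holds the full sum of chunk (r+1) % p.
    for step in range(p - 1):
        sends = [(r, (r - step) % p, data[r][(r - step) % p]) for r in range(p)]
        for r, chunk, value in sends:            # all sends in a step happen simultaneously
            data[(r + 1) % p][chunk] += value
    # Phase 2 (all-gather): pass the finished chunks around the ring.
    for step in range(p - 1):
        sends = [(r, (r + 1 - step) % p, data[r][(r + 1 - step) % p]) for r in range(p)]
        for r, chunk, value in sends:
            data[(r + 1) % p][chunk] = value
    return data

buffers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
result = ring_allreduce(buffers)
print(result[0])  # [1111, 2222, 3333, 4444]
```

In this scheme each worker sends only one chunk per step to a single neighbor, so traffic per link stays constant as workers are added, which suits large messages. For small messages a latency-oriented algorithm may be preferable, consistent with choosing the algorithm by size.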
Scalability
• Batch size is hard to increase
– algorithm, memory limits
• Performance at small batch sizes (32, 64)
leads to scalability limits
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling very efficient, albeit algorithmically
challenged
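The first two figures imply how much per-GPU throughput is retained when going from one GPU to eight. A small worked calculation, using only the percentages quoted above (the 48% is approximate on the slide):

```python
single_gpu_fraction = 0.55   # fraction of FMA peak on 1 GPU (from the slide)
eight_gpu_fraction = 0.48    # approximate fraction of peak on 8 GPUs (from the slide)

retained = eight_gpu_fraction / single_gpu_fraction
print(f"{retained:.0%}")  # 87%
```

So roughly 87% of single-GPU efficiency survives within one node, under the assumption that both percentages refer to the same per-GPU peak.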
[Figure: TFLOP/s (1 to 512, log scale) vs. Number of GPUs (1 to 128), spanning one node and multi node, with a typical training run marked]
Precision
• FP16 mostly works
– Use FP32 for softmax and weight updates
• More sensitive to labeling error
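A small experiment suggests why weight updates are kept in higher precision. It rounds values to IEEE half precision with the standard library's `struct` module; the weight, update size, and step count are made up, and the Python float accumulator (64-bit) stands in for an FP32 master copy.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

weight, update, steps = 1.0, 1e-4, 1000

# Accumulating directly in FP16: each small update is rounded away,
# because adjacent FP16 values near 1.0 are about 0.001 apart.
w16 = weight
for _ in range(steps):
    w16 = to_fp16(w16 + to_fp16(update))

# Accumulating in higher precision, rounding to FP16 only when storing.
w_master = weight
for _ in range(steps):
    w_master += update
w_stored = to_fp16(w_master)

print(w16)        # 1.0
print(w_stored)   # 1.099609375
```

The FP16-only weight never moves, while the higher-precision accumulator captures the full 0.1 change before rounding.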
[Figure: Weight Distribution, Count (1 to 100000000, log scale) vs. Magnitude (-31 to 0)]
Determinism
• Determinism very important
• So much randomness,
hard to tell if you have a bug
• Networks train despite bugs,
although accuracy impaired
• Reproducibility is important
– For the usual scientific reasons
– Progress not possible without reproducibility
• We use synchronous SGD
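One ingredient of a reproducible run is a seeded, private random generator for the minibatch order. This is a sketch with hypothetical names; full determinism on GPUs also depends on factors such as reduction order, which this example does not cover.

```python
import random

def minibatch_order(num_examples, batch_size, seed):
    """Return a shuffled list of minibatches (as index lists), fully determined by the seed."""
    rng = random.Random(seed)              # private generator, unaffected by other code
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_examples, batch_size)]

run_a = minibatch_order(10, 4, seed=1234)
run_b = minibatch_order(10, 4, seed=1234)
assert run_a == run_b                      # same seed, same order, every time
```

When two runs with the same seed disagree, the difference points to a bug or an uncontrolled source of randomness rather than to the shuffle.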
Conclusion
• Deep Learning is solving many hard
problems
• Training deep neural networks is an HPC
problem
• Scaling brings AI progress!
Thanks
• Andrew Ng, Adam Coates, Awni Hannun,
Patrick LeGresley … and all of SVAIL
Bryan Catanzaro
@ctnzr

HPC Advisory Council Stanford Conference 2016

  • 1. Scaling Deep Learning Bryan Catanzaro @ctnzr
  • 2. Bryan Catanzaro What do we want AI to do? Drive us to work Serve drinks? Help us communicate 帮助我们沟通 Keep us organized Help us find things Guide us to content
  • 4. Bryan Catanzaro Face Analysis Baidu IDL Gender Age Range Ethnicity Mood
  • 5. Bryan Catanzaro Medical Diagnostics App Baidu BDL AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
  • 6. Bryan Catanzaro Image Captioning Baidu IDL A yellow bus driving down a road with green trees and green grass in the background. Living room with white couch and blue carpeting. Room in apartment gets some afternoon sun.
  • 7. Bryan Catanzaro Image Q&A Baidu IDL Sample questions and answers
  • 8. Bryan Catanzaro Natural User Interfaces • Goal: Make interacting with computers as natural as interacting with humans • AI problems: – Speech recognition – Emotional recognition – Semantic understanding – Dialog systems – Speech synthesis
  • 9. Bryan Catanzaro Machine learning for computer vision (c. 2009) “Please put away the coffee mugs!”
  • 10. Bryan Catanzaro Machine learning for computer vision “Mug” Machine Learning Cleanup-bot! (Woohoo!)
  • 12. Bryan Catanzaro AI applications are hard… Machine Learning can solve challenging problems --- but it is a lot of work! This eventually worked ~95% of the time.
  • 13. Bryan Catanzaro Why are applications so hard? “Coffee Mug” Pixel Intensity Pixel intensity is a very difficult representation…
  • 14. Bryan Catanzaro pixel 1 pixel 2 Coffee Mug Not Coffee Mug Why are applications so hard? pixel 1 pixel 2 Pixel Intensity[72 160] -+ + - + -
  • 15. Bryan Catanzaro Why are applications so hard? + pixel 1 pixel 2 - + + - - + - + +Coffee Mug Not Coffee Mug- + pixel 1 pixel 2 - + + - - + - + Is this a Coffee Mug? Learning Algorithm
  • 16. Bryan Catanzaro Features + handle? cylinder? - + +- - + - + +Coffee Mug Not Coffee Mug- cylinder?handle? Is this a Coffee Mug? Learning Algorithm + handle? cylinder? - + - - + - ++
  • 17. Bryan Catanzaro Machine learning in practice “Mug ” Machine Learning (Classifier) Feature Extraction
  • 18. Bryan Catanzaro Machine learning in practice “Mug ” Machine Learning (Classifier) Feature Extraction Prior Knowledge Experience
  • 19. Bryan Catanzaro Machine learning in practice • Enormous amounts of research time spent inventing new features. Idea CodeTest Hack up in Matlab Run on workstation Think really hard…
  • 20. Bryan Catanzaro Learning features • Deep learning: learn multiple stages of features to achieve end goal. Features Features “Mug”?Classifier Pixels
  • 21. Bryan Catanzaro Learning features • “Neural networks” are one way to represent features Features Features “Mug”? Classif ier Pixels y = g(W x) x y W
  • 22. Bryan Catanzaro Learning features • Deep learning: learn multiple stages of features to achieve end goal “Mug”? Pixels Features Features Classifier W3 W2 W1
  • 23. Bryan Catanzaro Why Deep Learning? 1. Scale Matters – Bigger models usually win 2. Data Matters – More data means less cleverness necessary 3. Productivity Matters – Teams with better tools can try out more ideas Data & Compute Accuracy Deep Learning Many previou methods
  • 24. Bryan Catanzaro Scaling up • Make progress on AI by focusing on systems – Make models bigger – Tackle more data – Reduce research cycle time • Accelerate large-scale experiments
  • 25. Bryan Catanzaro Exascale • Strong scaling important but difficult – Weak scaling over time as datasets increase • We run our experiments on 8-128 GPUs • Exascale likely important for running many “small” experiments
  • 26. Bryan Catanzaro Training Deep Neural Networks • Computation dominated by dot products • Multiple inputs, multiple outputs, batch means GEMM – Compute bound • Convolutional layers even more compute bound
  • 27. Bryan Catanzaro Computational Characteristics • High arithmetic intensity – Arithmetic operations / byte of data – O(Exaflops) / O(Terabytes) : 10^6 • In contrast, many other ML training jobs are O(Petaflops)/O(Petabytes) = 10^0 • Medium size datasets – Generally fit on 1 node – HDFS, fault tolerance, disk I/O not bottlenecks Training 1 model: ~10 Exaflops
  • 28. Bryan Catanzaro Deep Neural Network training is HPC Idea CodeTest • Turnaround time is key • Use most efficient hardware – Parallel, heterogeneous computing – Fast interconnect (PCIe, Infiniband) • Push strong scalability – Models and data have to be of commensurate size • This is all standard HPC!
  • 29. Bryan Catanzaro Training: Stochastic Gradient Descent • Simple algorithm – Add momentum to power through local minima – Compute gradient by backpropagation • Operates on minibatches – This makes it a GEMM problem instead of GEMV • Choose minibatches stochastically – Important to avoid memorizing training order • Difficult to parallelize – Prefers lots of small steps – Increasing minibatch size not always helpful
  • 30. Bryan Catanzaro Limitations of batching • Spending 2x the work picking a direction doesn’t reduce iteration count by 2x (Plot: Error vs. Iterations; curves: Batch size = 𝑛, Batch size = 2𝑛)
  • 31. Bryan Catanzaro SVAIL Infrastructure • Software: CUDA, MPI, Majel (SVAIL internal library) • Hardware: Tyan FT77CB7079 server (http://www.tyan.com), NVIDIA GeForce GTX Titan X x8, Mellanox Interconnect
  • 32. Bryan Catanzaro Node Architecture • All pairs of GPUs communicate simultaneously over PCIe Gen 3 x16 • Groups of 4 GPUs form Peer to Peer domain • Avoid moving data to CPUs or across QPI
  • 33. Bryan Catanzaro Parallelism Model Parallel Data Parallel MPI_Allreduce() Training Data Training Data
  • 34. Bryan Catanzaro Speech Recognition: Traditional ASR • Getting higher performance is hard • Improve each stage by engineering (Plot: Accuracy vs. Data + Model Size; curve: Traditional ASR; annotation: Expert engineering. Credit: Adam Coates)
  • 35. Bryan Catanzaro Speech recognition: Traditional ASR • Huge investment in features for speech! – Decades of work to get very small improvements Spectrogram MFCC Flux
  • 36. Bryan Catanzaro Speech Recognition 2: Deep Learning! • Since 2011, deep learning for features (Diagram: Acoustic Model, HMM, Language Model, Transcription: “The quick brown fox jumps over the lazy dog.”)
  • 37. Bryan Catanzaro Speech Recognition 2: Deep Learning! • With more data, DL acoustic models perform better than traditional models (Plot: Accuracy vs. Data + Model Size; curves: Traditional ASR, DL V1 for Speech)
  • 38. Bryan Catanzaro Speech Recognition 3: “Deep Speech” • End-to-end learning “The quick brown fox jumps over the lazy dog.” Transcription
  • 39. Bryan Catanzaro Speech Recognition 3: “Deep Speech” • We believe end-to-end DL works better when we have big models and lots of data (Plot: Accuracy vs. Data + Model Size; curves: Traditional ASR, DL V1 for Speech, Deep Speech)
  • 40. Bryan Catanzaro End-to-end speech with DL • Deep neural network predicts characters directly from audio . . . . . . T H _ E … D O G
  • 41. Bryan Catanzaro Recurrent Network • RNNs model temporal dependence • Various flavors used in many applications – LSTM, GRU, Bidirectional, … – Especially sequential data (time series, text, etc.) • Sequential dependence complicates parallelism
  • 43. Bryan Catanzaro warp-ctc • Recently open sourced our CTC implementation • Efficient, parallel CPU and GPU backend • 100-400X faster than other implementations • Apache license, C interface • https://github.com/baidu-research/warp-ctc
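For readers unfamiliar with CTC (Connectionist Temporal Classification), the sketch below shows only the standard rule that maps per-frame character predictions to a transcript: merge consecutive repeats, then drop blanks. It is not the warp-ctc API, which computes the CTC loss and its gradients. Treating "_" as the blank symbol is an assumption made here for illustration.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, for illustration only

def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence using the standard CTC rule."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _run in groupby(frame_labels)]
    # Step 2: remove blank symbols.
    return "".join(label for label in merged if label != blank)

# Hypothetical per-frame predictions, one label per audio frame.
print(ctc_collapse(list("TT_HH__EE_")))
print(ctc_collapse(list("L_L")))  # a blank between repeats keeps both letters
```

Because many frame-level sequences collapse to the same transcript, the CTC loss sums over all of them, and that summation is the part an optimized implementation parallelizes.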
  • 44. Bryan Catanzaro Training sets • Train on ~1½ years of data (and growing) • English and Mandarin • End-to-end deep learning is key to assembling large datasets • Datasets drive accuracy
  • 45. Bryan Catanzaro All-reduce • We implemented our own all-reduce out of send and receive • Several algorithm choices based on size • Careful attention to affinity and topology
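The slide does not say which all-reduce algorithms SVAIL chose. As one well-known example of an all-reduce built from point-to-point sends and receives, here is a single-process simulation of a ring all-reduce (reduce-scatter followed by all-gather). The list-based message passing stands in for real send/receive calls, and every name is hypothetical.

```python
def ring_allreduce(vectors):
    """Simulate a ring all-reduce (sum) across len(vectors) ranks."""
    p = len(vectors)
    n = len(vectors[0])
    assert n % p == 0, "sketch assumes the vector splits evenly into p chunks"
    chunk = n // p
    bufs = [list(v) for v in vectors]  # each rank's private buffer

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after p-1 steps, rank r holds the full sum
    # of chunk (r + 1) % p.
    for step in range(p - 1):
        # Gather all messages first so that sends are simultaneous.
        msgs = []
        for r in range(p):
            c = (r - step) % p
            msgs.append(((r + 1) % p, c, [bufs[r][i] for i in span(c)]))
        for dst, c, payload in msgs:
            for i, v in zip(span(c), payload):
                bufs[dst][i] += v  # receiver accumulates the partial sum

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(p - 1):
        msgs = []
        for r in range(p):
            c = (r + 1 - step) % p
            msgs.append(((r + 1) % p, c, [bufs[r][i] for i in span(c)]))
        for dst, c, payload in msgs:
            for i, v in zip(span(c), payload):
                bufs[dst][i] = v  # receiver overwrites with the final value
    return bufs

# Four simulated ranks, each holding an 8-element gradient vector.
grads = [[rank * 10 + i for i in range(8)] for rank in range(4)]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each rank sends and receives about 2(p-1)/p times the vector size in total, which is why ring-style algorithms suit large, bandwidth-bound messages. Small messages are latency-bound and usually favor other algorithms, consistent with the slide's note about choosing by size.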
  • 46. Bryan Catanzaro Scalability • Batch size is hard to increase – algorithm, memory limits • Performance at small batch sizes (32, 64) leads to scalability limits
  • 47. Bryan Catanzaro Performance for RNN training • 55% of GPU FMA peak using a single GPU • ~48% of peak using 8 GPUs in one node • Weak scaling very efficient, albeit algorithmically challenged (Plot: TFLOP/s vs. Number of GPUs; series: Typical training run, one node, multi node)
  • 48. Bryan Catanzaro Precision • FP16 mostly works – Use FP32 for softmax and weight updates • More sensitive to labeling error (Histogram: Weight Distribution, Count vs. Magnitude)
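One reason to keep weight updates in FP32 is that a small update added to a much larger weight can round away entirely in FP16. The sketch below illustrates this by rounding through Python's `struct` half-precision and single-precision formats. The weight, update size, and step count are made-up values, not numbers from the talk.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0    # illustrative weight magnitude
update = 1e-4   # illustrative per-step update (learning rate * gradient)
steps = 1000

w16 = to_fp16(weight)
w32 = to_fp32(weight)
for _ in range(steps):
    # FP16 accumulation: the sum is rounded back to half precision each step.
    w16 = to_fp16(w16 + to_fp16(update))
    # FP32 accumulation: same update, rounded to single precision instead.
    w32 = to_fp32(w32 + to_fp32(update))

print("FP16 weight after updates:", w16)
print("FP32 weight after updates:", w32)
```

Near 1.0 the spacing between adjacent FP16 values is 2^-10, so an update of 1e-4 is less than half a step and every addition rounds back to the starting weight.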
  • 49. Bryan Catanzaro Determinism • Determinism very important • So much randomness, hard to tell if you have a bug • Networks train despite bugs, although accuracy impaired • Reproducibility is important – For the usual scientific reasons – Progress not possible without reproducibility • We use synchronous SGD
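The slide names synchronous SGD but gives no further detail on how runs are made reproducible. One common ingredient, shown here purely as an illustration, is to derive every random choice (such as the minibatch order) from an explicit seed so that two runs see identical data. The function and parameter names are hypothetical.

```python
import random

def minibatch_order(num_examples, batch_size, seed):
    """Return a shuffled list of minibatches (as index lists) for one epoch."""
    rng = random.Random(seed)  # private generator: no hidden global state
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size]
            for i in range(0, num_examples, batch_size)]

run_a = minibatch_order(10, 4, seed=1234)
run_b = minibatch_order(10, 4, seed=1234)
print(run_a == run_b)
```

With identical inputs and a synchronous update rule, a difference between two runs then points to a bug rather than to sampling noise.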
  • 50. Bryan Catanzaro Conclusion • Deep Learning is solving many hard problems • Training deep neural networks is an HPC problem • Scaling brings AI progress!
  • 51. Bryan Catanzaro Thanks • Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley … and all of SVAIL Bryan Catanzaro @ctnzr

Editor's Notes

  1. Fix contrast
  2. Model Parallel: Latency sensitive Data Parallel: Bandwidth sensitive