From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
Haibin Lin, Applied Scientist, AWS AI
Lin Yuan, Software Design Engineer, AWS AI
Dataset and Model Size Keep Growing
[Charts: dataset size for training (GB); model parameter size (million)]
Large Scale Distributed Training for Deep Neural Networks
[Diagrams: data parallelism; model parallelism]
Optimization for Large Scale Distributed Training
• System-Level Optimization
  • Accelerate training on a single GPU
    • fused operators, data prefetching, vectorization, cache utilization, tensor cores
  • Distributed training with multiple GPUs
    • large batch size, NCCL allreduce, Elastic Fabric Adapter
• Algorithm-Level Optimization
  • Large-batch optimization algorithm
  • Model architecture
  • Accuracy/runtime trade-off
Performance Optimization on AWS Cloud
• Leverage the Amazon EC2 P3dn.24xlarge GPU instances
  • 8 NVIDIA V100 Tensor Core GPUs with 32 GB of memory each
  • 96 Intel Xeon Scalable vCPUs
  • 1.8 TB local NVMe SSD
  • 100 Gbps network throughput
  • Support for Elastic Fabric Adapter
• Software
  • Apache MXNet
  • GluonNLP and GluonCV toolkits
  • Horovod distributed training library
Case Study: Mask R-CNN
Deep learning nowadays - Mask-RCNN
• Widely used in object detection and instance segmentation
• Target accuracy
  • bounding box AP: 37.7
  • mask AP: 33.9
GluonCV: a Deep Learning Toolkit for Computer Vision
• Training scripts that reproduce SOTA results reported in latest papers
• A large set of pre-trained models
• Carefully designed APIs and easy-to-understand implementations
• Community support
• Built on top of Apache MXNet framework
[Figure: supported tasks: image classification, object detection, semantic segmentation, pose estimation, video action recognition]
GPU Profiling
• Analyze runtime using Nvidia Visual Profiler
• Identify large kernels to optimize
[Figure: profiler timeline with three annotated regions: slow operator, NHWC layout conversion, small kernels]
GPU Optimization
Runtime Improvements
• optimize ROIAlign: +10%
• optimize NMS: +10%
• fuse RCNN target generator: +5%
• NHWC layout conversion: +10%
• pointwise operator fusion: +3%
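For readers unfamiliar with NMS (non-maximum suppression), one of the operators optimized above, the following is a minimal pure-Python sketch of the greedy algorithm. It is only an illustration of what the operator computes, not the optimized GPU kernel from the talk; the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions made for this example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already-kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical boxes: the first two overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The pairwise loop is quadratic in the number of boxes, which hints at why a dedicated kernel for this step can be worth optimizing.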
Automatic Mixed Precision
• Automatic casting of the model
  • Convolution, FullyConnected -> FP16
  • Norm, Mean, SoftMax, etc. -> FP32
  • Add, Mul, etc. -> cast to the widest input type
• AMP boosted throughput by 5-10%
• Casting the gradients to FP16 improved throughput by another 1-2% without compromising accuracy
• Utilities for dynamic loss scaling
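The slide only names dynamic loss scaling, so here is a minimal sketch of the usual idea: multiply the loss by a scale so small FP16 gradients do not underflow, shrink the scale when gradients overflow, and grow it again after a run of clean steps. This is a generic illustration, not MXNet's AMP implementation; the class name, the factor values, and the growth interval are assumptions.

```python
import math


class DynamicLossScaler:
    """Hypothetical helper that tracks a loss scale across training steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without overflow

    @staticmethod
    def has_overflow(grads):
        """True if any gradient value is inf or NaN."""
        return any(math.isinf(g) or math.isnan(g) for g in grads)

    def update(self, grads):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if self.has_overflow(grads):
            self.scale *= self.backoff_factor  # back off and skip this step
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # try a larger scale again
            self._good_steps = 0
        return True


# Toy run with a short growth interval so both branches are exercised.
scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
steps = ([0.1, 0.2], [float("inf"), 0.0], [0.3], [0.4])
applied = [scaler.update(g) for g in steps]
```

In a real training loop the gradients would be divided by `scaler.scale` before the optimizer update; that part is omitted here.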
Model Hybridization
• MXNet provides APIs to construct and debug the model using imperative programming
• Users can call the hybridize API to get performance equivalent to symbolic programming
• Hybridizing the model gave a 5% runtime improvement
• Hybridizing with static_alloc gave another 1-2% throughput improvement
Performance Tuning in AWS cluster
• Binding each GPU to 12 vCPUs (6 from each CPU) on an Amazon EC2 P3dn.24xlarge instance improved throughput by 8%
• Autotuning Horovod hyperparameters such as tensor fusion threshold, cycle time, cache capacity, and hierarchical allreduce: +9% throughput
• Increasing the number of data workers from 4 to 8 also helped accelerate data loading. Note, however, that more data workers do not necessarily mean better performance, because of context-switching overhead.
• Accelerate the dataloader through Cython
• Distributed validation significantly reduced validation compute time: 13 seconds per epoch on 24 P3dn vs. several minutes with non-distributed validation
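The CPU binding in the first bullet can be sketched as a small mapping from a GPU's local rank to a set of vCPU ids. This is an illustration under a loud assumption: it supposes that vCPU ids are numbered contiguously per socket (0-47 on the first CPU, 48-95 on the second), which should be checked on the actual instance (for example with `lscpu`). The function name and defaults are hypothetical.

```python
def vcpus_for_gpu(local_rank, num_gpus=8, num_vcpus=96, num_sockets=2):
    """Return the vCPU ids assigned to one GPU: an equal share from each socket.

    Assumes (hypothetically) that socket s owns the contiguous id range
    [s * num_vcpus / num_sockets, (s + 1) * num_vcpus / num_sockets).
    """
    per_socket = num_vcpus // num_sockets   # 48 vCPUs per CPU
    share = per_socket // num_gpus          # 6 vCPUs per GPU per CPU
    cpus = []
    for socket in range(num_sockets):
        start = socket * per_socket + local_rank * share
        cpus.extend(range(start, start + share))
    return cpus


# One entry per GPU on the instance; each gets 12 vCPUs, 6 from each CPU.
binding = {rank: vcpus_for_gpu(rank) for rank in range(8)}
```

On Linux, a training process could then pin itself with `os.sched_setaffinity(0, binding[local_rank])`; the slides do not say which mechanism the authors used.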
Case Study: BERT
Deep learning nowadays - BERT
[Figure: General Language Understanding Evaluation (GLUE) benchmark scores on an axis from 55 to 95, with BERT and the human baseline marked]
Transfer learning with BERT for NLP
• Pre-training for NLP
  • Learn text representations on a large-scale corpus
• Fine-tuning for downstream tasks
  • Named Entity Recognition
  • Question Answering
  • Search
  • Chatbot
  • Text Summarization
  • Text Classification
• Models available in the GluonNLP toolkit
[Figure: a pre-trained feature extractor reused for NLP and CV tasks; the example input "GTC is awesome!" is classified as positive. Image credit: d2l.ai]
GluonNLP: a deep learning natural language toolkit
• Open source, available on Amazon SageMaker and in AWS Deep Learning Containers
• State-of-the-art NLP models
• Easy prototyping
• Fast deployment
• Multiple built-in NLP tasks
BERT model architecture
[Figure: BERT as a stack of N multi-head attention blocks (Vaswani et al., 2017). Image credit: d2l.ai]
Pre-training objectives
1. Masked language modeling
• Estimate
• Randomly mask 15% of all tokens and predict them
• Example:
  I went to the bank to deposit some money.
  I went to the <mask> to deposit some money.
2. Next sentence prediction
• 50% of the time, replace the second sentence with a random sentence
• Learn logical coherence
• Example:
  <CLS> Haibin is obnoxious <SEP> I don’t like his shirt
  <CLS> Haibin is obnoxious <SEP> Hello world! .
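The masked language modeling objective above can be sketched in a few lines. This is a simplified illustration: it always substitutes `<mask>`, whereas the original BERT recipe sometimes keeps the token or swaps in a random one, and it does not protect special tokens such as `<CLS>` and `<SEP>`. The function name and return format are assumptions for this example.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask roughly mask_prob of the tokens (at least one).

    Returns the masked sequence and a dict {position: original token}
    that the model would be trained to predict.
    """
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = "I went to the bank to deposit some money .".split()
# A seeded generator keeps the example reproducible.
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

Calling `mask_tokens` again with a fresh random state yields a different mask for the same sentence, which is the essence of generating masks on the fly rather than once during preprocessing.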
Data loading
• Mini-batches are generated on the fly for dynamic masking[1]
• Multi-process DatasetLoader with pre-fetching in the background
• AWS FSx for Lustre: file system for compute-intensive workloads
• Profiling result visualization
[Figure: profiler timeline showing a data loading gap between the previous batch and the current batch. Image credit: d2l.ai]
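The data loading gap shrinks when the next batch is prepared while the current one is being consumed. Below is a minimal sketch of background pre-fetching with a bounded queue. Note the swap: the slide describes a multi-process loader, while this sketch uses a single background thread to stay short; `prefetch` and `make_batches` are hypothetical names, and error handling in the worker is omitted.

```python
import queue
import threading


def prefetch(batch_iter, buffer_size=2):
    """Yield batches from batch_iter while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=buffer_size)  # bounded, so the worker cannot run far ahead
    sentinel = object()                   # marks the end of the stream

    def worker():
        for batch in batch_iter:
            q.put(batch)
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch


def make_batches(n):
    """Stand-in for the expensive part (reading, tokenizing, masking)."""
    for i in range(n):
        yield [i, i + 1]


batches = list(prefetch(make_batches(3)))
```

The `buffer_size` bound is the main design choice: a larger buffer hides longer stalls but holds more batches in memory.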
Fast Multi-head Self-Attention
credit to: Clement Fuji Tsang

Baseline (separate projections):
For each layer:
  Separate projections:
    Qproj = QWq, Kproj = QWk, Vproj = QWv
  Transpose Qproj, Kproj, Vproj:
    From (N, T, H, C) to (N, H, T, C)
  Compute attention:
    score = batch_gemm(Qproj, Kproj)
    result = batch_gemm(score, Vproj)
  Transpose result:
    From (N, H, T, C) to (N, T, H, C)

Optimized (joint projections): higher cache utilization, 1.58x faster (end to end)
Transpose Q:
  From (N, T, HC) to (T, N, HC)
For each layer:
  Joint projections:
    Wqkv = concat(Wq, Wk, Wv)
    Q_K_Vproj = QWqkv
  Compute attention:
    score = strided_batch_gemm(Qproj, Kproj)
    result = strided_batch_gemm(score, Vproj)
Transpose final result:
  From (T, N, HC) to (N, T, HC)
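The joint projection relies on a simple identity: multiplying the input by the column-wise concatenation of Wq, Wk, and Wv gives the three projections side by side, so one larger matrix multiply replaces three smaller ones. The sketch below checks that identity on tiny hand-made matrices in pure Python. It is only a correctness illustration, not the strided batched GEMM kernel from the talk; the helper names and the toy values are assumptions.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def concat_cols(*mats):
    """Concatenate matrices with the same number of rows along the column axis."""
    return [sum((list(m[i]) for m in mats), []) for i in range(len(mats[0]))]


# Toy input with 2 time steps and 2 channels, plus toy projection weights.
x = [[1.0, 2.0], [3.0, 4.0]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0, 0.0], [0.0, 2.0]]
wv = [[0.0, 1.0], [1.0, 0.0]]

# Separate projections: three matrix multiplies.
separate = (matmul(x, wq), matmul(x, wk), matmul(x, wv))

# Joint projection: one multiply against the concatenated weights, then slice.
wqkv = concat_cols(wq, wk, wv)
joint = matmul(x, wqkv)
d = len(wq[0])
split = tuple([row[i * d:(i + 1) * d] for row in joint] for i in range(3))
```

In the optimized layout the slicing is not materialized; as the slide suggests, the strided GEMM reads Q, K, and V directly from the joint buffer.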
GPU memory is precious
- For each mini-batch, the gradient is synchronized across GPUs
- Gradient allreduce can overlap with backward computation
- A larger batch size leaves more time to hide communication latency
- A 1-bit dropout mask reduces memory by 20%, enabling larger batch sizes
[Figure: timeline in which Allreduce1, Allreduce2, and Allreduce3 run alongside the Forward and Backward stages, so that computation and communication overlap. Image credit: d2l.ai]
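A toy timing model makes the benefit of overlap concrete. It assumes, for illustration only, that each layer's gradient can start its allreduce as soon as that layer's backward pass finishes and that allreduces run one at a time; all durations below are made-up numbers, not measurements from the talk.

```python
def step_time(forward, backward, allreduce, overlap):
    """Time for one training step.

    backward and allreduce hold per-layer durations in the order in which
    gradients become available during the backward pass.
    """
    if not overlap:
        # Communication starts only after all computation is done.
        return forward + sum(backward) + sum(allreduce)
    compute_done = forward
    comm_done = forward
    for b, a in zip(backward, allreduce):
        compute_done += b
        # An allreduce waits for its gradient and for the previous allreduce.
        comm_done = max(comm_done, compute_done) + a
    return comm_done


backward = [3.0, 3.0, 3.0]   # hypothetical per-layer backward times
allreduce = [2.0, 2.0, 2.0]  # hypothetical per-layer allreduce times
serial = step_time(4.0, backward, allreduce, overlap=False)
overlapped = step_time(4.0, backward, allreduce, overlap=True)
```

In this model only the last allreduce is exposed when each backward stage takes at least as long as an allreduce, which is why longer compute per step (for example from a larger batch) helps hide communication.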
NCCL + Elastic Fabric Adapter
[Figure: two EC2 software stacks.
 Traditional HPC software stack in EC2: HPC application and MPI implementation (user space); TCP/IP stack and ENA network driver (kernel); ENA device.
 HPC software stack in EC2 with EFA: HPC application, MPI implementation, and Libfabric (user space); EFA kernel driver (kernel); ENA device.]
- Elastic Fabric Adapter (EFA)
  - For HPC and distributed ML
  - Bypass OS kernel
  - Integrated with MPI, NCCL
- BERT training
  - 32 p3dn.24xlarge instances
  - V100 GPUs x 256
  - 100 Gb/s networking
  - BERT-large with GluonNLP
  - Batch size 64K, phase 1
  - 90% strong scaling efficiency, with EFA enabled
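Strong scaling efficiency compares the measured speedup with the ideal speedup when the total problem size is held fixed and only the number of workers grows. The sketch below shows the arithmetic; the baseline of 8 GPUs and the timings are hypothetical values chosen to produce 90%, since the slides do not state the baseline used.

```python
def strong_scaling_efficiency(base_workers, base_time, workers, time):
    """Measured speedup divided by ideal speedup for a fixed total workload."""
    speedup = base_time / time
    ideal = workers / base_workers
    return speedup / ideal


# Hypothetical: going from 8 to 256 GPUs (32x) makes training 28.8x faster.
eff = strong_scaling_efficiency(base_workers=8, base_time=100.0,
                                workers=256, time=100.0 / 28.8)
```

An efficiency of 1.0 would mean perfectly linear scaling; communication overhead is the usual reason the value falls below that.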
Distributed Stochastic Optimization
credit to: Shuai Zheng

Gradient step:
$$x_{t+1} = x_t - \eta_t g_t$$
Step with the gradient normalized by its L2 norm:
$$x_{t+1} = x_t - \eta_t \frac{g_t}{\| g_t \|_2}$$

Framework    Batch size   #XPUs      #steps      Optimizer   F1 score   Training time
TensorFlow   64K/32K      1K TPUs    8599        LAMB [2]    90.58%     76.19m
MXNet        32K/32K      512 GPUs   7038/1563   LAMB + NG   90.60%     141.5m
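The two update rules on this slide differ only in whether the gradient is divided by its L2 norm before the step. A minimal sketch on plain Python lists follows; it illustrates the normalization alone, not the LAMB optimizer, and the small `eps` guard against a zero gradient is an assumption added for numerical safety.

```python
import math


def sgd_step(x, g, lr):
    """Plain step: x <- x - lr * g."""
    return [xi - lr * gi for xi, gi in zip(x, g)]


def normalized_sgd_step(x, g, lr, eps=1e-12):
    """Normalized step: x <- x - lr * g / ||g||_2 (eps is an added safeguard)."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [xi - lr * gi / (norm + eps) for xi, gi in zip(x, g)]


x = [1.0, 2.0]
g = [3.0, 4.0]  # L2 norm is 5
plain = sgd_step(x, g, lr=0.1)
normalized = normalized_sgd_step(x, g, lr=0.1)
```

With normalization the step length depends only on the learning rate, not on the gradient magnitude.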
References
[1] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[2] You, Yang, et al. "Large batch optimization for deep learning: Training BERT in 76 minutes." International Conference on Learning Representations. 2019.
Thank you
Haibin Lin
haibilin@amazon.com
Lin Yuan
lnyuan@amazon.com


Editor's Notes

  1. First call deck
  2. A
  3. What is special about this toolkit? Previously, each model had its own repo. Now all the SOTA models are in one place, which makes development smoother.
  4. Today we are launching Amazon FSx for Lustre, designed to meet the needs of these applications and others that you will undoubtedly dream up. Based on the mature and popular Lustre open source project, Amazon FSx for Lustre is a highly parallel file system that supports sub-millisecond access to petabyte-scale file systems. Thousands of simultaneous clients (EC2 instances and on-premises servers) can drive millions of IOPS (Input/Output Operations per Second) and transfer hundreds of gibibytes of data per second.