2020/09/06
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab
Research Engineer
PR-274 | Mixed Precision Training 1
Contents
• Introduction
• Related Work
• Implementation
• Results
• PyTorch 1.6 AMP New features & Experiment
• Conclusion
PR-274 | Mixed Precision Training 2
Introduction
Increasing the size of a neural network typically improves accuracy
• But also increases the memory and compute requirements for training the model.
• Introduce a methodology for training deep neural networks using half-precision floating point numbers,
without losing model accuracy or having to modify hyper-parameters.
• Introduce three techniques to prevent model accuracy loss.
• Using these techniques, demonstrate that a wide variety of network architectures and
applications can be trained to match the accuracy of FP32 training.
PR-274 | Mixed Precision Training 3
Main Contributions
Related Works
Network Compression
PR-274 | Mixed Precision Training 4
• Low-precision Training
• Train networks with low precision weights.
• Quantization
• Quantize a pretrained model, reducing the number of bits.
• Pruning
• Remove connections according to an importance criterion.
• Dedicated architectures
• Design architectures to be memory-efficient, such as SqueezeNet, MobileNet, and ShuffleNet.
Related Works
Network Compression in PR-12 Study
PR-274 | Mixed Precision Training 5
• In total, 23 network compression papers were covered! → 23/274 ≈ 8%!
• But low-precision training is, as far as I know, covered here for the first time.
Related Works
Related Works – Low Precision Training
• “Binaryconnect: Training deep neural networks with binary weights during propagations.”, 2015 NIPS
• Proposes training with binary weights; all other tensors and arithmetic remain in full precision.
• “Binarized neural networks.”, 2016 NIPS
• Also binarizes the activations, but gradients are stored and computed in single precision.
• “Quantized neural networks: Training neural networks with low precision weights and activations.”, 2016 arXiv
• Quantizes weights and activations to 2, 4, and 6 bits, but gradients are real numbers.
• “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”, 2016 ECCV
• Binarizes all tensors, including the gradients, but this leads to a non-trivial loss of accuracy.
PR-274 | Mixed Precision Training 6
Related Works
Main Contributions
• All tensors and arithmetic for forward and backward passes use reduced precision, FP16.
• No hyper-parameters (such as layer width) are adjusted.
• Models trained with these techniques do not incur accuracy loss when compared to FP32 baselines.
• Demonstrate that this technique works across a variety of applications.
PR-274 | Mixed Precision Training 7
Implementation
IEEE 754 Floating Point Representation
• A number is represented as $(-1)^S \times 1.M \times 2^{(E - \mathrm{Bias})}$
PR-274 | Mixed Precision Training 8
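As a quick sanity check of the formula above (my own sketch, not from the slides), the following NumPy snippet decodes an FP16 bit pattern; the exponent bias for FP16 is 15:

```python
import numpy as np

# Decode a normal FP16 value with (-1)^S * 1.M * 2^(E - Bias), Bias = 15.
bits = int(np.array(-1.5, dtype=np.float16).view(np.uint16))
S = bits >> 15              # 1 sign bit
E = (bits >> 10) & 0x1F     # 5 exponent bits
M = bits & 0x3FF            # 10 mantissa (fraction) bits
value = (-1) ** S * (1 + M / 1024) * 2.0 ** (E - 15)
print(value)                # -1.5
```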
Implementation
PR-274 | Mixed Precision Training 9
Bonus) New floating-point formats
• IEEE 754 FP32: 1-bit sign, 8-bit exponent, 23-bit mantissa
• IEEE 754 FP16: 1-bit sign, 5-bit exponent, 10-bit mantissa
• Google bfloat16: 1-bit sign, 8-bit exponent, 7-bit mantissa
• NVIDIA TensorFloat: 1-bit sign, 8-bit exponent, 10-bit mantissa
• AMD FP24: 1-bit sign, 7-bit exponent, 16-bit mantissa
Implementation
PR-274 | Mixed Precision Training 10
1. FP32 Master copy of weights
• In mixed precision training, weights, activations, and gradients are stored as FP16.
• In order to match the accuracy of FP32 networks, an FP32 master copy of the weights is maintained and
updated with the weight gradient during the optimizer step.
Halving the storage and bandwidth
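A minimal sketch of this master-copy pattern in PyTorch (hypothetical names; assumes the FP32 copies master_params_fp32 were made once at initialization and that the model's working parameters are FP16):

```python
import torch

lr = 0.1  # assumed learning rate
with torch.no_grad():
    for master, w in zip(master_params_fp32, model.parameters()):
        master -= lr * w.grad.float()  # apply the update in FP32
        w.copy_(master.half())         # refresh the FP16 working copy
```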
Implementation
PR-274 | Mixed Precision Training 11
1. FP32 Master copy of weights → Why?
• The weight update (weight gradients multiplied by the learning rate) becomes too small to be represented
in FP16 (smaller than $2^{-24}$).

$W_{new} = W_{old} - \eta \cdot \dfrac{\partial E}{\partial W}$
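A quick NumPy illustration of the problem (my own example, not from the slides): a small update vanishes when applied directly to an FP16 weight, and magnitudes below $2^{-24}$ underflow to zero outright:

```python
import numpy as np

# A small update is rounded away when applied to an FP16 weight ...
print(np.float16(1.0) - np.float16(1e-4))      # 1.0 (update lost)
# ... but survives in FP32.
print(np.float32(1.0) - np.float32(1e-4))      # 0.9999

# Magnitudes below 2^-24 are not representable in FP16 at all.
print(np.float16(2.0 ** -24))                  # ~6e-08 (smallest positive FP16)
print(np.float16(2.0 ** -25))                  # 0.0 (underflows)
```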
Implementation
PR-274 | Mixed Precision Training 12
1. FP32 Master copy of weights → Experiments
• Train the Mandarin speech model with and without the FP32 master copy.
• Updating FP16 weights directly results in an 80% relative accuracy loss, worse than with the FP32 master copy.
Implementation
PR-274 | Mixed Precision Training 13
2. Loss Scaling
• Activation gradient values tend to be dominated by small magnitudes.
• Scaling them by a factor of 8 is sufficient to match the accuracy achieved with FP32 training.
• This means activation gradient values below $2^{-27}$ were irrelevant to training.
Implementation
PR-274 | Mixed Precision Training 14
2. Loss Scaling
• One efficient way to shift the gradient values into the FP16-representable range is to scale the loss value
computed in the forward pass, prior to starting back-propagation.
• This can keep the relevant gradient values from becoming zeros.
• Weight gradients must be unscaled before weight update to maintain the update magnitudes.
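A minimal sketch of manual loss scaling in PyTorch (my own illustration; model, optimizer, and loss are assumed to be defined elsewhere):

```python
loss_scale = 1024.0                  # constant scaling factor (assumed value)

optimizer.zero_grad()
(loss * loss_scale).backward()       # scale the loss before back-propagation
for p in model.parameters():         # unscale gradients before the update
    if p.grad is not None:
        p.grad.div_(loss_scale)
optimizer.step()
```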
Implementation
PR-274 | Mixed Precision Training 15
2. Loss Scaling – How to choose the loss scaling factor?
• A simple way is to pick a constant scaling factor empirically.
• Or, if gradient statistics are available, directly choose a factor so that its product with the maximum
absolute gradient value stays below 65,504 (the maximum value representable in FP16).
• There is no downside to choosing a large scaling factor as long as it does not cause overflow during
backpropagation.
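For the statistics-based choice, a hedged sketch (assumes gradient statistics from a representative iteration are available on model):

```python
# Pick a scale so that max|gradient| * scale stays below the FP16 maximum.
FP16_MAX = 65504.0
max_abs_grad = max(p.grad.abs().max().item()
                   for p in model.parameters() if p.grad is not None)
loss_scale = FP16_MAX / max_abs_grad / 2.0  # the /2 adds headroom (my choice)
```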
Implementation
PR-274 | Mixed Precision Training 16
2. Loss Scaling – Automatic Mixed Precision
• A more robust way is to choose the loss scaling factor dynamically (automatically).
• The basic idea is to start with a large scaling factor and then reconsider it in each training iteration.
• If an overflow occurs, skip the weight update and decrease the scaling factor.
• If no overflow occurs for a chosen number of iterations N, increase the scaling factor.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Typical values: N = 2000, increase ×2, decrease ×0.5 (see the sketch below)
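That state machine in plain Python (a sketch using the values above, not the actual framework implementation):

```python
def update_loss_scale(grads_are_finite, scale, good_steps, N=2000):
    """Return (new_scale, new_good_steps, skip_update)."""
    if not grads_are_finite:   # overflow: skip this update, halve the scale
        return scale * 0.5, 0, True
    good_steps += 1
    if good_steps >= N:        # N clean iterations: double the scale
        return scale * 2.0, 0, False
    return scale, good_steps, False
```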
Implementation
PR-274 | Mixed Precision Training 17
3. Arithmetic Precision
• Neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise
operations.
• To maintain model accuracy, the authors found that some networks require FP16 vector dot-products to
accumulate the partial products into an FP32 value, which is converted to FP16 before being written to
memory.
Reference: https://www.quora.com/How-does-Fused-Multiply-Add-FMA-work-and-what-is-its-importance-in-computing
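A small NumPy demonstration of why FP32 accumulation matters (my own example): once an FP16 accumulator grows large enough, further small addends are rounded away:

```python
import numpy as np

x = np.full(4096, 0.01, dtype=np.float16)
acc16 = np.float16(0.0)
for v in x:                         # accumulate partial sums in FP16
    acc16 = np.float16(acc16 + v)
acc32 = x.astype(np.float32).sum()  # accumulate in FP32
print(acc16, acc32)                 # FP16 stalls near 32.0; FP32 gives ~40.97
```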
Implementation
PR-274 | Mixed Precision Training 18
3. Arithmetic Precision
• Large reductions (sums across elements of a vector) should be carried out in FP32.
• Such reductions mostly come up in batch-normalization layers and softmax layers.
• Both layer types in the authors' implementation still read and write FP16 tensors from memory while
performing the arithmetic in FP32 → this did not slow down training.
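A sketch of that read-FP16 / compute-FP32 / write-FP16 pattern (my own illustration, not the authors' code):

```python
import torch

def softmax_fp16_io(x_half: torch.Tensor) -> torch.Tensor:
    # Read an FP16 tensor, perform the reduction-heavy arithmetic in FP32,
    # then write the result back to memory as FP16.
    return torch.softmax(x_half.float(), dim=-1).half()
```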
Results
PR-274 | Mixed Precision Training 19
Comparison of the Baseline (FP32) with Mixed Precision
Results
PR-274 | Mixed Precision Training 20
Comparison of the Baseline (FP32) with Mixed Precision
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 21
Automatic Mixed Precision in PyTorch
• In July 2020, PyTorch released version 1.6, which officially supports Automatic Mixed Precision!
• We can use Automatic Mixed Precision very simply: just add 5 lines.
(NVIDIA Apex AMP: merged into PyTorch / deprecated!)
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 22
Automatic Mixed Precision in PyTorch
• Just add 5 lines, and we can use Automatic Mixed Precision training in PyTorch!
Before
After
Reference: https://github.com/hoya012/automatic-mixed-precision-tutorials-pytorch
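The Before/After code is shown as screenshots on the slide; a minimal sketch of the "After" training loop with torch.cuda.amp (model, optimizer, criterion, and data_loader are assumed to be defined):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                  # dynamic loss scaling
for inputs, targets in data_loader:
    optimizer.zero_grad()
    with autocast():                   # run the forward pass in mixed precision
        outputs = model(inputs.cuda())
        loss = criterion(outputs, targets.cuda())
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales grads; skips step on overflow
    scaler.update()                    # adjust the scale factor
```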
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 23
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• To verify the effect of AMP, perform a simple classification experiment.
• Use the Kaggle Intel Image Classification dataset.
• It contains around 25k images of size 150x150 distributed across 6 categories.
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 24
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• Use an ImageNet-pretrained ResNet-18.
• Use a GTX 1080 Ti (w/o Tensor Cores) and an RTX 2080 Ti (with Tensor Cores).
• Fix the training settings (batch size = 256, epochs = 120, lr, augmentation, optimizer, etc.).
PyTorch 1.6 AMP New features & Experiment
PR-274 | Mixed Precision Training 25
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• We can save almost 30% ~ 40% of GPU memory!
• With a good GPU (with Tensor Cores), we can also save computation time!
• NVIDIA Tensor Cores provide hardware acceleration for mixed precision training.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Conclusion
PR-274 | Mixed Precision Training 28
• Introduce a methodology for training deep neural networks using half-precision floating point.
• Introduce three techniques to prevent model accuracy loss.
• PyTorch officially supports Automatic Mixed Precision training.