Mixed Precision Training Review
1. 2020/09/06
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab
Research Engineer
PR-274 | Mixed Precision Training 1
2. Contents
• Introduction
• Related Work
• Implementation
• Results
• PyTorch 1.6 AMP New features & Experiment
• Conclusion
3. Introduction
Increasing the size of a neural network typically improves accuracy
• But also increases the memory and compute requirements for training the model.
• Introduces a methodology for training deep neural networks using half-precision floating-point numbers, without losing model accuracy or having to modify hyper-parameters.
• Introduces three techniques to prevent model accuracy loss.
• Using these techniques, the authors demonstrate that a wide variety of network architectures and applications can be trained to match the accuracy of FP32 training.
Main Contributions
4. Related Works
Network Compression
• Low-precision Training
• Train networks with low precision weights.
• Quantization
• Quantize a pretrained model by reducing the number of bits.
• Pruning
• Remove connections according to an importance criterion.
• Dedicated architectures
• Design architectures to be memory-efficient, such as SqueezeNet, MobileNet, and ShuffleNet.
5. Related Works
Network Compression in PR-12 Study
• A total of 23 papers on network compression were covered! → 23/274 = almost 8%!
• But low-precision training is, as far as I know, covered here for the first time.
6. Related Works
Related Works – Low Precision Training
• “BinaryConnect: Training deep neural networks with binary weights during propagations.”, 2015 NIPS
• Proposes training with binary weights; all other tensors and arithmetic remain in full precision.
• “Binarized neural networks.”, 2016 NIPS
• Also binarizes the activations, but gradients are stored and computed in single precision.
• “Quantized neural networks: Training neural networks with low precision weights and activations.”, 2016 arXiv
• Quantizes weights and activations to 2, 4, and 6 bits, but gradients remain real numbers.
• “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”, 2016 ECCV
• Binarizes all tensors, including the gradients, but leads to a non-trivial loss of accuracy.
7. Related Works
Main Contributions
• All tensors and arithmetic for forward and backward passes use reduced precision, FP16.
• No hyper-parameters (such as layer width) are adjusted.
• Models trained with these techniques do not incur accuracy loss when compared to FP32 baselines.
• Demonstrate that this technique works across a variety of applications.
8. Implementation
IEEE 754 Floating Point Representation
• A number can be represented as (−1)^S × 1.M × 2^(E − Bias)
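As a quick illustration of this formula (a minimal sketch, not from the paper), the snippet below uses NumPy to pull the sign, exponent, and mantissa bits out of an FP16 value and reconstruct it with the formula above; the value -1.5 is an arbitrary example.

```python
import numpy as np

x = np.array(-1.5, dtype=np.float16)
bits = int(x.view(np.uint16))             # reinterpret the 16 bits of the FP16 value

sign     = bits >> 15                     # 1 sign bit
exponent = (bits >> 10) & 0b11111         # 5 exponent bits (bias = 15 for FP16)
mantissa = bits & 0b1111111111            # 10 mantissa bits

# Apply (-1)^S * 1.M * 2^(E - Bias) for a normal (non-subnormal) number
value = (-1.0) ** sign * (1 + mantissa / 2 ** 10) * 2.0 ** (exponent - 15)
print(sign, exponent, mantissa, value)    # 1 15 512 -1.5
```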
9. Implementation
Bonus) New Floating-Point format
• IEEE 754 FP32: 1-bit sign, 8-bit exponent, 23-bit mantissa
• IEEE 754 FP16: 1-bit sign, 5-bit exponent, 10-bit mantissa
• Google bfloat16: 1-bit sign, 8-bit exponent, 7-bit mantissa
• NVIDIA TensorFloat: 1-bit sign, 8-bit exponent, 10-bit mantissa
• AMD FP24: 1-bit sign, 7-bit exponent, 16-bit mantissa
10. Implementation
1. FP32 Master copy of weights
• In mixed precision training, weights, activations, and gradients are stored as FP16.
• In order to match the accuracy of FP32 networks, an FP32 master copy of the weights is maintained and updated with the weight gradients during the optimizer step.
Halving the storage and bandwidth
11. Implementation
1. FP32 Master copy of weights → Why?
• The weight update (weight gradient multiplied by the learning rate) can become too small to be represented in FP16 (smaller than 2^-24). Even when it is representable, it can still be lost because the weight is much larger than the update, so the FP16 addition rounds it away (see the numeric sketch below).
W_new = W_old − η · ∂E/∂W
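Below is a minimal PyTorch sketch (toy numbers, not the paper's code) of the second effect: a weight update that is representable in FP16 on its own is still rounded away when applied to an FP16 weight, while the FP32 master copy of the same weight keeps it.

```python
import torch

lr = 0.1
grad = torch.tensor(1e-3, dtype=torch.float16)   # small but FP16-representable gradient
update = lr * grad                               # ~1e-4, still representable on its own

w_fp16 = torch.tensor(1.0, dtype=torch.float16)  # FP16 weight, updated directly
w_fp32 = torch.tensor(1.0, dtype=torch.float32)  # FP32 master copy of the same weight

print(w_fp16 - update)           # tensor(1., dtype=torch.float16) -> update lost to rounding
print(w_fp32 - update.float())   # tensor(0.9999)                  -> master copy keeps it
```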
12. Implementation
1. FP32 Master copy of weights → Experiments
• Train the Mandarin speech model with and without an FP32 master copy of the weights.
• Updating the FP16 weights directly results in an 80% relative accuracy loss (worse than with the FP32 master copy).
13. Implementation
2. Loss Scaling
• Activation gradient values tend to be dominated by small magnitudes.
• Scaling them by a factor of 8 is sufficient to match the accuracy achieved with FP32 training.
• This means activation gradient values below 2^-27 were irrelevant to training (scaling by 8 lifts anything at or above 2^-27 past 2^-24, the smallest FP16-representable magnitude).
14. Implementation
2. Loss Scaling
• One efficient way to shift the gradient values into the FP16-representable range is to scale the loss value computed in the forward pass, prior to starting back-propagation.
• This keeps the relevant gradient values from becoming zero.
• Weight gradients must be unscaled before the weight update to maintain the update magnitudes (a minimal sketch follows below).
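A minimal sketch of static loss scaling in PyTorch, using a toy single-parameter "model" and plain SGD (illustrative only, not the paper's code): the loss is multiplied by a constant factor before backward(), and the gradient is divided by the same factor before the optimizer step.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4).half().requires_grad_()   # toy FP16 "model" parameter
x = torch.randn(4).half()                    # toy input
optimizer = torch.optim.SGD([w], lr=1e-3)
scale = 1024.0                               # constant loss-scale factor, chosen empirically

optimizer.zero_grad()
loss = (w * x).sum() ** 2                    # stand-in for the real training loss
(loss * scale).backward()                    # scale the loss so small gradients stay FP16-representable

w.grad.div_(scale)                           # unscale the weight gradient before the update
optimizer.step()
```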
15. Implementation
2. Loss Scaling – How to choose the loss scaling factor?
• A simple way is to pick a constant scaling factor empirically.
• Or, if gradient statistics are available, directly choose a factor so that its product with the maximum absolute gradient value stays below 65,504 (the maximum value representable in FP16); a small sketch of this follows below.
• There is no downside to choosing a large scaling factor as long as it does not cause overflow during
backpropagation.
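A small sketch of that idea (a hypothetical helper, not from the paper): pick a power-of-two factor from the current gradient statistics so that the maximum absolute gradient, once scaled, stays below the FP16 maximum of 65,504.

```python
import math
import torch

FP16_MAX = 65504.0                            # largest finite FP16 value

def pick_loss_scale(parameters, headroom=2.0):
    """Hypothetical helper: largest power-of-two scale keeping max|grad| * scale below FP16_MAX."""
    max_abs = max(p.grad.abs().max().item() for p in parameters if p.grad is not None)
    return 2.0 ** math.floor(math.log2(FP16_MAX / (max_abs * headroom)))

# Tiny usage example with a dummy parameter whose gradients are all 0.5
w = torch.randn(4, requires_grad=True)
(w.sum() * 0.5).backward()
print(pick_loss_scale([w]))                   # 32768.0
```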
16. Implementation
2. Loss Scaling – Automatic Mixed Precision
• A more robust way is to choose the loss scaling factor dynamically (automatically).
• The basic idea is to start with a large scaling factor and then reconsider it in each training iteration.
• If an overflow occurs, skip the weight update and decrease the scaling factor.
• If no overflow occurs for a chosen number of iterations N, increase the scaling factor.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
Use N = 2000, increase ×2, decrease ×0.5 (a minimal sketch of this loop follows below).
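A minimal sketch of such a dynamic loss-scaling loop on a toy FP16 parameter (illustrative only; PyTorch's GradScaler, shown later, implements the same idea): halve the scale and skip the step on overflow, double it after N clean iterations.

```python
import torch

torch.manual_seed(0)
w = torch.randn(8).half().requires_grad_()    # toy FP16 parameter
optimizer = torch.optim.SGD([w], lr=1e-3)

scale = 2.0 ** 15                             # start with a large scaling factor
growth_interval = 2000                        # N from the slide
good_steps = 0

for step in range(10_000):
    x = torch.randn(8).half()
    optimizer.zero_grad()
    loss = (w * x).sum() ** 2                 # stand-in for the real training loss
    (loss * scale).backward()

    if not torch.isfinite(w.grad).all():      # overflow: skip the update, halve the scale
        scale *= 0.5
        good_steps = 0
        continue

    w.grad.div_(scale)                        # unscale, then apply the weight update
    optimizer.step()

    good_steps += 1
    if good_steps == growth_interval:         # no overflow for N iterations: double the scale
        scale *= 2.0
        good_steps = 0
```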
17. Implementation
3. Arithmetic Precision
• Neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise
operations.
• To maintain model accuracy, the authors found that some networks require FP16 vector dot-products to accumulate their partial products into an FP32 value, which is converted to FP16 before being written to memory.
Reference: https://www.quora.com/How-does-Fused-Multiply-Add-FMA-work-and-what-is-its-importance-in-computing
18. Implementation
3. Arithmetic Precision
• Large reductions (sums across elements of a vector) should be carried out in FP32.
• Such reductions mostly come up in batch-normalization layers and softmax layers.
• Both layer types in the authors' implementation still read and write FP16 tensors from memory while performing the arithmetic in FP32 → this did not slow down training (see the sketch below).
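A minimal PyTorch sketch of the reduction case (not the paper's code): the dtype argument of sum() reads FP16 inputs but carries out the accumulation in FP32, which is the pattern recommended here; the same idea applies to the dot-product accumulations of the previous slide.

```python
import torch

x = torch.randn(4096).half()            # FP16 activations stored in memory

s_half = x.sum()                        # FP16 result; accumulation precision is backend-dependent
s_fp32 = x.sum(dtype=torch.float32)     # read FP16 values but accumulate the sum in FP32
print(s_half, s_fp32)
```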
19. Results
Comparison of the baseline (FP32) with mixed precision
20. Results
Comparison of the baseline (FP32) with mixed precision
21. PyTorch 1.6 AMP New features & Experiment
Automatic Mixed Precision in PyTorch
• Last July, PyTorch released version 1.6, which officially supports Automatic Mixed Precision (AMP)!
• We can use Automatic Mixed Precision very simply; just add 5 lines.
The standalone NVIDIA Apex AMP has been merged into PyTorch and is now deprecated!
22. PyTorch 1.6 AMP New features & Experiment
Automatic Mixed Precision in PyTorch
• Just add 5 lines and we can use Automatic Mixed Precision training in PyTorch! (A sketch of the resulting training loop follows below the reference link.)
[Before / After code comparison shown on the slide]
Reference: https://github.com/hoya012/automatic-mixed-precision-tutorials-pytorch
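For reference, here is a minimal sketch of the "after" version using the official torch.cuda.amp API from PyTorch 1.6 (toy model and data, requires a CUDA GPU; see the linked tutorial for the full training code). The five added pieces are the GradScaler, the autocast context, and the scale/step/update calls.

```python
import torch
import torch.nn as nn

device = torch.device("cuda")                    # AMP autocast targets CUDA (Tensor Cores if available)
model = nn.Linear(64, 10).to(device)             # toy model, illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

scaler = torch.cuda.amp.GradScaler()             # (1) create the gradient scaler once

for step in range(100):
    inputs = torch.randn(32, 64, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # (2) run the forward pass under autocast
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()                # (3) scale the loss, then backprop
    scaler.step(optimizer)                       # (4) unscale grads and step (skipped on overflow)
    scaler.update()                              # (5) adjust the loss scale dynamically
```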
23. PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• To verify the effect of AMP, perform a simple classification experiment.
• Use the Kaggle Intel Image Classification dataset.
• It contains around 25k images of size 150x150 distributed across 6 categories.
24. PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• Use an ImageNet-pretrained ResNet-18.
• Use a GTX 1080 Ti (w/o Tensor Cores) and an RTX 2080 Ti (with Tensor Cores).
• Fix the training settings (batch size=256, epochs=120, lr, augmentation, optimizer, etc.).
25. PyTorch 1.6 AMP New features & Experiment
Image Classification with Automatic Mixed-Precision Training PyTorch Tutorial
• We can save almost 30~40% of GPU memory!
• With a GPU that has Tensor Cores, we can also save computation time!
• NVIDIA Tensor Cores provide hardware acceleration for mixed precision training.
Reference: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
26. Conclusion
• The paper introduces a methodology for training deep neural networks using half-precision floating point.
• It introduces three techniques (an FP32 master copy of weights, loss scaling, and FP32 accumulation) to prevent model accuracy loss.
• PyTorch now officially supports Automatic Mixed Precision training.