This document summarizes a presentation given by Chirag Patel and Tijmen Blankevoort of Qualcomm AI Research on model efficiency techniques for edge AI. They discuss why model efficiency is important for on-device AI given constraints like power and thermal limits. They survey techniques such as quantization, conditional compute, neural architecture search, and compilation that shrink AI models and run them efficiently on hardware. Specifically, they find that integer quantization, through techniques such as post-training quantization and quantization-aware training, can achieve accuracy similar to floating-point models while delivering much better performance per watt. Overall, the presentation argues that integer quantization is the best approach for efficient AI inference on edge devices.
The future of model efficiency for edge AI: Overcoming oscillations in quantization-aware training
1. The future of model efficiency for edge AI
September 21, 2022
@QCOMResearch

Chirag Patel
Engineer, Principal/Manager
Qualcomm AI Research

Tijmen Blankevoort
Director of Engineering
Qualcomm AI Research

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
2. Our presenters
Chirag Patel: Engineer, Principal/Manager, Qualcomm AI Research
Tijmen Blankevoort: Director, Engineering, Qualcomm AI Research

Agenda
1. Why model efficiency is important for on-device AI
2. Overview of integer quantization (INT) versus floating point (FP)
3. Improving low-bit quantization
4. Open-source tools: AI Model Efficiency Toolkit (AIMET) and AIMET Model Zoo
5. Questions

AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.
3. AI is being used all around us
Increasing productivity, enhancing collaboration, and transforming industries
Smartphones, smart homes, video conferencing, video monitoring, extended reality, smart cities, smart factories, and autonomous vehicles
4. AI is being powered by the explosive growth of deep neural networks
Deep neural networks are energy hungry and growing fast
Source: Welling

Weight parameter count over time:
• 1943: First NN (~N = 10)
• 1988: NetTalk (~N = 20K)
• 2009: Hinton's Deep Belief Net (~N = 10M)
• 2013: Google/Y! (~N = 1B)
• 2017: Very large neural networks (N = 137B)
• 2021: Extremely large neural networks (N = 1.6T)
• 2025 (projected): N = 100T = 10^14

Will we have reached the capacity of the human brain?
The energy efficiency of the human brain is estimated to be 100,000x better than current hardware.
5. Power and thermal efficiency are essential for on-device AI

The challenge of AI workloads:
• Very compute intensive
• Large, complicated neural network models
• Complex concurrencies
• Always-on
• Real-time

Constrained mobile environment:
• Must be thermally efficient for sleek, ultra-light designs
• Storage/memory bandwidth limitations
• Requires long battery life for all-day use
6. Holistic model efficiency research
Multiple axes to shrink AI models and efficiently run them on hardware:
• Quantization: learning to reduce bit-precision while keeping desired accuracy
• Conditional compute: learning to execute only parts of a large inference model based on the input
• Neural architecture search: learning to design smaller neural networks that are on par with or outperform hand-designed architectures on real hardware
• Compilation: learning to compile AI models for efficient hardware execution
7. Leading AI research and fast commercialization
Driving the industry towards integer inference and power-efficient AI

Quantization research:
• Relaxed Quantization (ICLR 2019)
• Data-free Quantization (ICCV 2019)
• AdaRound (ICML 2020)
• Joint Pruning and Quantization (ECCV 2020)
• Bayesian Bits (NeurIPS 2020)
• Transformer Quantization (EMNLP 2021)
• Overcoming Oscillations (ICML 2022)
• FP8 Quantization (NeurIPS 2022)

Quantization open-sourcing:
• AI Model Efficiency Toolkit (AIMET)
• AIMET Model Zoo

AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.
8. Leading research to efficiently quantize AI models
Promising results show that low-precision integer inference can become widespread:
• Virtually the same accuracy between an FP32 and a quantized AI model through:
  - Automated, data-free, post-training methods
  - Automated training-based mixed-precision method
• Significant performance-per-watt improvements through quantization
• Automated reduction in precision of weights and activations while maintaining accuracy

Models trained at high precision: 32-bit floating point (3452.3194)
Inference at lower precision:
• 16-bit integer (3452): up to 4X increase in performance per watt from savings in memory and compute¹
• 8-bit integer (255): up to 16X increase in performance per watt from savings in memory and compute¹
• 4-bit integer (15): up to 64X increase in performance per watt from savings in memory and compute¹

1: FP32 model compared to quantized model
9. What does it mean to quantize a neural network?
Weight and activation quantization can have different bit-precisions to maintain accuracy
• Simulated quantization ops are added in the neural network after each usage of weights and activations, and after every 'operation'
• Quantization is generally simulated in floating point instead of running in integer math
• Weights and activations can be quantized with the same or different precisions within a model layer
• For example, W8A16 uses quantized 8-bit weights and 16-bit activations; INT8 means quantized 8-bit weights and 8-bit activations

Diagram: Input → [Weights with Wt quant, plus biases] → Conv/FC → ReLU → Act quant → Output, with a quantization bit-width chosen for each quantizer
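To make the "simulated quantization" idea concrete, here is a minimal sketch of a fake-quantization op in PyTorch. The helper name, the symmetric range choice, and the max-based scale are illustrative assumptions for this sketch, not AIMET's implementation.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize x so it only takes values an INT-`bits` symmetric grid can represent."""
    qmax = 2 ** (bits - 1) - 1                             # e.g. 127 for INT8
    scale = (x.abs().max() / qmax).clamp(min=1e-12)        # max-based range setting (one simple choice)
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale                                   # back to floating point ("simulated" quantization)

# Example: a W8A16 layer applies fake_quantize(weights, bits=8) to its weights
# and fake_quantize(activations, bits=16) to its output activations.
```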
10. What algorithm to choose to improve accuracy?

Post-training quantization (PTQ):
• Take a pre-trained FP32 model and convert it directly into a fixed-point network
• No need for the original training pipeline
• Data-free, or a small (unlabeled) calibration set needed
• Simple usage (⇔ single API call)
• Might not reach as high accuracy as QAT

Quantization-aware training (QAT):
• Train/fine-tune the network with the simulated quantization operations in place
• Requires access to the training pipeline and labelled data
• Longer training times
• Hyper-parameter tuning
• Achieves higher accuracy, especially for lower bit-widths
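The practical difference between the two workflows can be sketched in a few lines of PyTorch. `model`, `calib_loader`, and `train_loader` are placeholder names for your own network and data, and the calibration comment assumes range observers have been attached to the model (as in the simulated-quantization sketch above); this is an illustrative outline, not a specific toolkit's API.

```python
import torch

def run_ptq(model, calib_loader):
    """PTQ: weights stay fixed; a small calibration set only sets the quantization ranges."""
    model.eval()
    with torch.no_grad():
        for x, _ in calib_loader:              # a few hundred unlabeled samples typically suffice
            model(x)                           # range observers attached to the model record min/max
    return model                               # no gradient updates, no training pipeline needed

def run_qat(model, train_loader, epochs=3, lr=1e-4):
    """QAT: fine-tune with simulated quantization ops in the forward pass."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:              # requires the labelled training pipeline
            loss = loss_fn(model(x), y)        # forward pass goes through the fake-quant ops
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```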
11. Which is the better format for quantizing neural networks?
Floating point vs integer
12. INT8 and FP8 have the same number of values but different distributions
Multiple FP8 formats exist, and they consume more power than INT8

Formats (s: scale; m: mantissa; e: exponent; S: sign bit):
• INT: z = s · m
• FP: z = s · m · 2^e

Bit layouts (sign bit + mantissa bits / exponent bits): INT8 (S + 7 mantissa bits), FP8 6/1, FP8 5/2, FP8 4/3, FP8 3/4, FP8 2/5.
The FP8 3/4 and 2/5 layouts (i.e., E4M3 and E5M2) are the formats most commonly proposed in the industry.
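As a hedged illustration of the "same number of values, different distributions" point, the sketch below enumerates the value grid of INT8 and of a generic FP8 format. The bias convention, the inclusion of subnormals, and the absence of inf/NaN encodings are simplifying assumptions; real FP8 proposals differ in these details.

```python
import numpy as np

def int8_grid(scale=1.0):
    return scale * np.arange(-128, 128)                       # 256 evenly spaced values

def fp8_grid(mantissa_bits, exponent_bits):
    M, E = mantissa_bits, exponent_bits
    bias = 2 ** (E - 1) - 1                                   # assumed bias convention
    vals = []
    for e in range(2 ** E):
        for m in range(2 ** M):
            if e == 0:                                        # subnormal: no implicit leading 1
                mag = (m / 2 ** M) * 2 ** (1 - bias)
            else:                                             # normal: implicit leading 1
                mag = (1 + m / 2 ** M) * 2 ** (e - bias)
            vals.extend([mag, -mag])
    return np.unique(vals)

print(len(int8_grid()), len(fp8_grid(3, 4)))  # roughly the same number of values,
                                              # but the FP8 grid clusters them near zero
```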
13. FP8 may be useful for model layers with large outliers
Most layers of models do not have large outliers

[Figure: Signal-to-noise ratio (SNR, higher is better) of INT8, 5M2E, 4M3E, 3M4E, and 2M5E quantization for three weight distributions: uniform, normal (some outliers), and an outlier-heavy distribution.]
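The SNR comparison above can be approximated with a few lines of PyTorch. This sketch reuses the fake_quantize helper from the earlier slide; the chosen distributions and outlier magnitude are illustrative assumptions, not the slide's exact setup.

```python
import torch

def snr_db(x: torch.Tensor, x_q: torch.Tensor) -> float:
    """Signal-to-noise ratio of a quantized tensor, in dB (higher is better)."""
    noise = x - x_q
    return 10 * torch.log10(x.pow(2).mean() / noise.pow(2).mean()).item()

normal = torch.randn(100_000)
outlier_heavy = torch.cat([torch.randn(100_000), torch.full((100,), 50.0)])

for name, x in [("normal", normal), ("outlier-heavy", outlier_heavy)]:
    print(name, round(snr_db(x, fake_quantize(x, bits=8)), 1))
# INT8 SNR is high for the well-behaved data and collapses for the outlier-heavy data,
# because the outliers stretch the quantization range.
```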
14. Several FP8 formats are required to get the best PTQ inference results
For different networks, different formats are better; it depends on the amount of outliers
Supporting multiple formats in hardware is expensive

Model                     FP32     Best FP8 result   Worst FP8 result
ResNet18                  69.72%   69.66%            64.92%
MobileNetV2               71.70%   71.06%            49.51%
BERT                      83.06    82.80             71.56
SalsaNext                 55.80    55.67             55.12
HRNet                     81.05    81.04             80.77
DeepLabV3 (MobileNetV2)   72.91    72.58             37.93
ViT                       77.75%   77.71%            76.69%
(The best and worst FP8 formats differ per model, spanning the 5/2, 4/3, 3/4, and 2/5 variants.)

"FP8 Quantization: The Power of the Exponent", NeurIPS 2022
15. INT8 has similar results as FP8 with QAT
Outliers can be suitably trained with QAT

Model         FP32     INT8 (PTQ)   Best FP8* (PTQ)   INT8 (QAT)   Best FP8* (QAT)
ResNet        69.72%   69.55%       69.66%            70.43%       69.82%
MobileNetV2   71.70%   70.94%       71.06%            71.82%       71.54%
BERT          83.06    71.03        82.80             83.26        83.70
DeepLabV3     72.91    71.24        72.58             73.99        72.41

• No PTQ tricks; per-channel quantization for PTQ
• All QAT results use per-tensor quantization
• The FP8 mantissa and exponent format was optimized for this comparison

*: Best FP8 is the best result from testing the different FP8 formats.
"FP8 Quantization: The Power of the Exponent", NeurIPS 2022
16. No real gap between FP8 and INT8
INT W8A16 accuracy is better than FP8 for all models with PTQ

Model         FP32     INT (W8A8)   INT (W8A16)   Best FP8 result
ResNet18      69.72%   69.55%       69.75%        69.66%
HRNet         81.05    80.93        81.08         81.04
BERT          83.06    71.03        82.90         82.80
SalsaNext     55.80    54.22        55.82         55.67
ViT           77.75%   76.39%       77.73%        77.71%
MobileNetV2   69.72%   69.55%       69.75%        69.66%

• Min-max range setting
• Per-channel quantization
17. INT16 performs better than FP16 unless there are large outliers
• 1,000 samples of X ~ Normal(0, 1)
• We add one outlier and vary its value
• INT13 performs comparably to FP16 in terms of MSE

[Figure: MSE of int16 vs. float16 quantization (no activation function, sigma = 1.0, 100 neurons) as the outlier value is varied from 0 to 2000; log-scale MSE between roughly 10^-7 and 10^-4.]
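A hedged re-creation of this experiment in NumPy, following the slide's description (1,000 standard-normal samples plus one outlier); the range-setting choice and the original implementation details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)

for outlier in [0.0, 250.0, 1000.0, 2000.0]:
    data = np.append(x, outlier).astype(np.float64)

    # INT16: uniform grid over the full (outlier-stretched) min-max range
    scale = (data.max() - data.min()) / (2 ** 16 - 1)
    int16_q = np.round((data - data.min()) / scale) * scale + data.min()

    # FP16: simply cast down and back up
    fp16_q = data.astype(np.float16).astype(np.float64)

    print(f"outlier={outlier:6.0f}  MSE int16={np.mean((data - int16_q) ** 2):.2e}  "
          f"MSE fp16={np.mean((data - fp16_q) ** 2):.2e}")
# As on the slide, int16 MSE grows as the outlier stretches the grid,
# while fp16 MSE stays roughly constant.
```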
18. INT16 outperforms FP16 in accuracy and runs faster in hardware

Model             FP32     FP16     INT16
MobileNetV2       71.74%   71.69%   71.74%
EfficientDet-D1   40.08    40.07    40.07
19. Integer quantization is the way to do AI inference
Enabled through PTQ and QAT techniques
Mixed precision gives the best of both worlds, using extra precision only when necessary (see the sketch below)

                               INT4   INT8     INT16
Power efficiency and latency   Best   Better   Good
Accuracy                       Good   Better   Best
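A hedged sketch of the mixed-precision idea (not AIMET's Automatic Mixed Precision algorithm): measure how sensitive each layer is to low-bit quantization and spend extra bits only on the most sensitive layers. `evaluate` and `quantize_layer` are hypothetical helpers assumed to exist in your pipeline.

```python
def assign_bitwidths(model, layers, evaluate, quantize_layer,
                     low_bits=4, high_bits=8, budget=0.5):
    """Give `high_bits` to the most sensitive fraction (`budget`) of layers, `low_bits` to the rest."""
    baseline = evaluate(model)
    sensitivity = {}
    for name in layers:
        with quantize_layer(model, name, bits=low_bits):     # hypothetical: temporarily quantize one layer
            sensitivity[name] = baseline - evaluate(model)   # accuracy drop caused by this layer
    ranked = sorted(layers, key=lambda n: sensitivity[n], reverse=True)
    cutoff = int(budget * len(layers))
    return {name: (high_bits if i < cutoff else low_bits) for i, name in enumerate(ranked)}
```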
20. Overcoming oscillations in quantization-aware training
Improving quantization-aware
training at lower bit-widths
21. Validation accuracy for QAT is typically unstable
Why do we see the validation accuracy drop for 4-bit QAT?
Poor validation accuracy is consistent across various learning rates and epochs during QAT

[Figure: Training accuracy (roughly 62-71%) and validation accuracy (fluctuating roughly between 42% and 70%) over 20 epochs of 4-bit QAT, for several learning rates.]

"Overcoming Oscillations in Quantization-Aware Training" (ICML 2022)
22. Oscillations are present in QAT
Example of MobileNetV2 training (last 1,000 iterations of training)

[Figure: A latent FP weight hovers around a quantization bin edge (zoomed on 0.5, roughly 0.4996-0.5004), so the corresponding quantized weight q(w) keeps flipping the lowest bit of its 4-bit value back and forth over the iterations.]
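A toy sketch of why this happens with the straight-through estimator (STE): when the optimal value of a weight lies between two grid points, the STE gradient keeps pushing the latent weight back and forth across the bin edge. The 1-D setup, learning rate, and target value are illustrative assumptions.

```python
import torch

scale = 1.0
target = 0.5003 * scale                      # optimum sits just above the bin edge at 0.5
w = torch.tensor(0.49, requires_grad=True)   # latent FP weight
opt = torch.optim.SGD([w], lr=0.01)

for step in range(20):
    w_q = torch.round(w / scale).detach() * scale + (w - w.detach())   # STE fake-quantization
    loss = (w_q - target) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step >= 14:                           # near convergence, w oscillates across the edge
        print(step, round(w.item(), 4), int(torch.round(w / scale).item()))
# The latent weight hovers around 0.5 while its integer value keeps flipping 0, 1, 0, 1, ...
```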
23. Oscillating weights are harmful when training a model

Oscillations corrupt batch norm statistics:
• At inference, BN uses running statistics from training
• Oscillations lead to big changes in statistics, so the running statistics are not a good estimate
• Solution: BN re-estimation

Network       Bits   Acc. before BN re-estimation   Acc. after BN re-estimation
MobileNetV2   8      71.79 ± 0.07                   71.89 ± 0.05
MobileNetV2   4      68.99 ± 0.44                   71.01 ± 0.05 (+2.02)
MobileNetV2   3      64.97 ± 1.23                   69.50 ± 0.04 (+4.53)

Oscillations disrupt model convergence:
• At the end of training, oscillating weights may not be on the correct 'side'
• Stochastic rounding (SR) and binary optimization (AdaRound) show that they are indeed not in the best possible state
• Oscillations prevent the network from converging to the best local minimum

Method            Train loss        Val. acc. (%)
Baseline          1.3566            69.50
SR (mean ± std)   1.3547 ± 0.0053   69.58 ± 0.09
SR (best)         1.3391            69.85
AdaRound          1.3070            70.12
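A hedged sketch of the BN re-estimation fix mentioned above: after QAT, re-compute the BatchNorm running statistics by forwarding a few hundred batches through the quantized model without any weight updates. This uses standard PyTorch BatchNorm behaviour; the exact procedure in the paper may differ.

```python
import torch

@torch.no_grad()
def reestimate_bn(model: torch.nn.Module, data_loader, num_batches: int = 200):
    # Reset running stats so they reflect only the post-QAT behaviour of the network
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None            # use a cumulative moving average over the batches seen
    model.train()                        # BN updates running stats only in train mode
    for i, (x, _) in enumerate(data_loader):
        model(x)                         # forward pass only; no optimizer, no backward
        if i + 1 >= num_batches:
            break
    model.eval()
    return model
```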
24. Higher oscillation frequencies during QAT negatively impact accuracy
An oscillation occurs when the integer value of a weight changes AND the direction of the change is opposite to the previous change
EMA = exponential moving average
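Following this definition, the oscillation frequency of each weight can be tracked with an EMA. The sketch below is an illustrative implementation of that bookkeeping, not the exact code from the ICML 2022 paper.

```python
import torch

class OscillationTracker:
    def __init__(self, weight_shape, ema_decay: float = 0.99):
        self.prev_int = None                                    # previous integer weights
        self.prev_dir = torch.zeros(weight_shape)               # direction of the previous integer change
        self.freq = torch.zeros(weight_shape)                   # EMA of oscillation events per weight
        self.decay = ema_decay

    def update(self, w_int: torch.Tensor):
        if self.prev_int is not None:
            delta = (w_int - self.prev_int).sign()
            oscillated = (delta != 0) & (delta == -self.prev_dir)   # changed, in the opposite direction
            self.freq = self.decay * self.freq + (1 - self.decay) * oscillated.float()
            self.prev_dir = torch.where(delta != 0, delta, self.prev_dir)
        self.prev_int = w_int.clone()
        return self.freq   # weights with high freq could be frozen (see the next slide)
```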
25. Oscillation dampening and iterative freezing fix the QAT issue
Dampening takes a regularizing approach: the weights are forced closer to the bin center
Freezing the oscillating weights stabilizes training and mitigates the unwanted effects of oscillations

[Figure: Histograms of the distance of each latent weight to its bin center, w_int − w/s, under dampening (left) and under iterative freezing (right, split into frozen and not-frozen weights).]

"Overcoming Oscillations in Quantization-Aware Training" (ICML 2022)
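A hedged sketch of a dampening-style regularizer in the spirit described above: penalize the distance between each latent weight and its current quantization grid point, so weights are pulled away from bin edges. The exact loss form and weighting in the ICML 2022 method may differ.

```python
import torch

def dampening_loss(latent_w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Penalize the distance between w/s and its nearest integer grid point (the bin center)."""
    w_int = torch.round(latent_w / scale).detach()      # bin each weight currently falls into
    return ((latent_w / scale - w_int) ** 2).sum()      # zero at the bin center, largest at the bin edge

# During QAT, one would add `lambda_damp * dampening_loss(w, s)` to the task loss for every
# quantized weight tensor, with lambda_damp a small tuning constant.
```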
26. We achieve SOTA results for INT4 quantization¹
• Train with learned step-size quantization (LSQ²) and BN re-estimation
• Dampening and freezing perform on par with each other

MobileNetV3
Method                        W/A     Val. acc. (%)
Full-precision                32/32   65.1
LSQ* (Esser et al., 2020)     4/4     61.0 (−4.1)
LSQ + BR (Han et al., 2021)   4/4     61.5 (−3.6)
LSQ + Dampen (ours)           4/4     63.7 (−1.4)
LSQ + Freeze (ours)           4/4     63.6 (−1.5)

MobileNetV2
Method                        W/A     Val. acc. (%)
Full-precision                32/32   71.7
LSQ* (Esser et al., 2020)     4/4     69.5 (−2.3)
LSQ + BR (Han et al., 2021)   4/4     70.4 (−1.4)
LSQ + Dampen (ours)           4/4     70.5 (−1.2)
LSQ + Freeze (ours)           4/4     70.6 (−1.1)

1: "Overcoming Oscillations in Quantization-Aware Training" (ICML 2022)
2: "Learned Step Size Quantization" (ICLR 2020)
28. AIMET makes AI models small
Open-sourced GitHub project that includes state-of-the-art quantization and compression techniques from Qualcomm AI Research

Features:
• State-of-the-art network compression tools
• State-of-the-art quantization tools
• Support for both TensorFlow and PyTorch
• Benchmarks and tests for many models
• Developed by professional software developers

Workflow: Trained AI model (TensorFlow or PyTorch) → AI Model Efficiency Toolkit (AIMET): compression and quantization → Optimized AI model → Deployed AI model

If interested, please join the AIMET GitHub project: https://github.com/quic/aimet
29. AIMET: providing advanced model efficiency features and benefits

Benefits:
• Lower memory bandwidth
• Lower power
• Lower storage
• Higher performance
• Maintains model accuracy
• Simple ease of use

Features:
• Quantization: state-of-the-art INT8 and INT4 performance
  - Quantization simulation
  - Quantization-aware training (QAT)
  - Post-training quantization (PTQ) methods: Data-Free Quantization, Adaptive Rounding (AdaRound), Automatic Mixed Precision (AMP), AutoQuant
• Compression: efficient tensor decomposition and removal of redundant channels in convolution layers
  - Spatial singular value decomposition (SVD)
  - Channel pruning
• Visualization: analysis tools for drawing insights for quantization and compression
  - Weight ranges
  - Per-layer compression sensitivity
30. AIMET features and APIs are easy to use
Designed to fit naturally in the AI model development workflow for researchers, developers, and ISVs

Typical model training workflow: PyTorch model → Train → Evaluate
User-friendly QAT workflow in AIMET: PyTorch model → Train (no change, same API) → Create QuantSim → Evaluate (no change, same API)

• User-friendly APIs invoked directly from the existing model pipeline
• Example Jupyter notebooks on AIMET GitHub
• Architecture: framework-specific AIMET extensions and an algorithm API on top of a model optimization library (techniques to compress and quantize models), with room for other frameworks
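A hedged sketch of the QuantSim workflow above using aimet_torch. The call pattern follows AIMET's documented examples, but argument names and defaults vary between AIMET releases, so treat this as an outline and consult the AIMET notebooks for the exact API.

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# 1. Wrap the trained model with simulated quantization ops (8-bit weights and activations here)
sim = QuantizationSimModel(model, dummy_input=dummy_input,
                           default_param_bw=8, default_output_bw=8)

# 2. Calibrate: run representative data through the model so quantizer ranges can be computed
def forward_pass(model, _):
    with torch.no_grad():
        model(dummy_input)            # in practice, iterate over a small calibration set
sim.compute_encodings(forward_pass, forward_pass_callback_args=None)

# 3. Optional QAT: sim.model is a regular PyTorch module, so the existing training loop
#    ("no change, same API") can fine-tune it with the fake-quant ops in place.

# 4. Export the optimized model plus encodings for deployment
sim.export(path='./output', filename_prefix='quantized_model', dummy_input=dummy_input)
```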
31. First 4K super-resolution demo at 100+ FPS on mobile
• Our new machine-learning-based super-resolution method
• 8-bit quantized model created using AIMET QAT

[Figure: low-resolution input vs. super-resolution output]
32. With better PTQ and QAT techniques, more models will achieve better power efficiency
AIMET enables accurate INT W4A8 for a wide range of use cases

Task               Model                FP32       INT W4A8
Classification     ResNet50             76.10%     75.4%
Classification     ResNet18             69.75%     68.96%
Classification     EfficientNet-Lite    75.31%     74.33%
Classification     Regnext              78.3%      77.2%
Segmentation       DeepLabV3 (RN-50)    76.07%     75.91%
Super-resolution   ABPN                 31.97 dB   31.67 dB
Pose detection     PoseNet (HRNet-32)   0.765      0.763
33. AIMET quantizes transformers with high accuracy, comparable to FP32
Comparison between FP32 model and model quantized with AIMET

Model                       Metric           FP32    Quantized
ViT base                    Top-1 accuracy   81.30   80.88 (INT8/W8A16, PTQ)
RoBERTa base                GLUE score       84.99   84.60 (INT8, QAT)
BERT base (uncased)         GLUE score       82.73   81.95 (INT8, QAT)
DistilBERT base (uncased)   GLUE score       79.21   78.61 (INT8, QAT)
34. AIMET Model Zoo
Accurate pre-trained 8-bit quantized models for image classification, object detection, semantic segmentation, pose estimation, speech recognition, and super resolution
35. AIMET Model Zoo includes popular quantized AI models
Accuracy is maintained for INT8 models: less than 1% loss in accuracy*

Model                  Metric            FP32     INT8
ResNet-50 (v1)         Top-1 accuracy*   75.21%   74.96%
MobileNet-v2-1.4       Top-1 accuracy*   75%      74.21%
EfficientNet Lite      Top-1 accuracy*   74.93%   74.99%
ResNet-50 (v1)         mAP*              0.2469   0.2456
RetinaNet              mAP*              0.35     0.349
Pose estimation        mAP*              0.383    0.379
SRGAN                  PSNR*             25.45    24.78
MobileNetV2            Top-1 accuracy*   71.67%   71.14%
EfficientNet-lite0     Top-1 accuracy*   75.42%   74.44%
DeepLabV3+             mIoU*             72.62%   72.22%
MobileNetV2-SSD-Lite   mAP*              68.7%    68.6%
Pose estimation        mAP*              0.364    0.359
SRGAN                  PSNR              25.51    25.5
DeepSpeech2            WER*              9.92%    10.22%
ABPN                   PSNR              32.75    32.69

*: Comparison between FP32 model and INT8 model quantized with AIMET. For further details, check out: https://github.com/quic/aimet-model-zoo/
36. Super resolution model suite
Wide variety of models, suited for fast, energy-efficient INT8 inference
• Virtually no accuracy loss compared to FP32
• Simple and convenient for developer integration
• Useful across diverse applications, from gaming and photography to XR and autonomous driving

INT8 PSNR and visual quality comparable to FP32*

Model      FP32 PSNR (dB)   INT8 PSNR (dB)
SESR-M7³   32.66            32.58
SESR-M3³   32.41            32.25
ABPN¹      32.71            32.64
XLSR²      32.57            32.30
SESR-XL³   33.03            32.92

1: Anchor-based Plain Net (ABPN)
2: Robust Real-Time Single-Image Super Resolution (XLSR)
3: Super-Efficient Super Resolution (SESR)
*: Comparison between FP32 model and INT8 model quantized with AIMET. For further details, check out: https://github.com/quic/aimet-model-zoo/
37. Explore our open-source projects and tools
• AIMET: state-of-the-art quantization and compression techniques (github.com/quic/aimet)
• AIMET Model Zoo: accurate pre-trained 8-bit quantized models (github.com/quic/aimet-model-zoo)
• Quantization whitepaper: arxiv.org/abs/2201.08442
38. [Diagram: Qualcomm AI software stack]
• AI Frameworks: TF Lite, TF Lite Micro, Direct ML
• AI Runtimes: Qualcomm® Neural Processing SDK, Qualcomm® AI Engine Direct (QNN)
• Tools and infrastructure: Qualcomm AI Model Studio (AIMET, AIMET Model Zoo, NAS, model analyzers), programming languages, virtual platforms, core libraries, math libraries, compilers, profilers & debuggers, emulation support
• System interface: SoC, accelerator drivers
• Platforms: smartphones, auto, XR, robotics, IoT, ACPC, cloud

Qualcomm Neural Processing SDK, Qualcomm AI Model Studio, and Qualcomm AI Engine Direct are products of Qualcomm Technologies, Inc. and/or its subsidiaries
39. Model efficiency is key for enabling on-device AI and accelerating the growth of the connected intelligent edge
• INT8/16 perform better than FP8/16
• Qualcomm AI Research is enabling 4-bit integer models
• AIMET is making fixed-point quantization possible at scale without sacrificing accuracy