Qualcomm Technologies Inc.
April 2020 @qualcomm_tech
Enabling power-efficient
AI through quantization
Will we have reached the capacity of the human brain?
Energy efficiency of a brain is 100x better than current hardware
Source: Welling

[Figure: energy consumption of neural networks from 1940 to 2030, with model size N on a log scale from 10^0 to 10^14]
• 1943: First NN (N ≈ 10)
• 1988: NetTalk (N ≈ 20K)
• 2009: Hinton's Deep Belief Net (N ≈ 10M)
• 2013: Google/Y! (N ≈ 1B)
• 2017: Extremely large neural networks (N = 137B)
• 2025: N = 100T = 10^14

Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks
On-device intelligence is paramount
Process data closest to the source, complement the cloud
• Privacy
• Reliability
• Low latency
• Efficient use of network bandwidth
The AI power and thermal ceiling

The challenge of AI workloads
• Very compute intensive
• Complex concurrencies
• Real-time
• Always-on

Constrained mobile environment
• Must be thermally efficient for sleek, ultra-light designs
• Requires long battery life for all-day use
• Storage/memory bandwidth limitations
Advancing AI research to increase power efficiency

New input data → Trained neural network model → Inference output

Compute operations
• Vector and matrix manipulations
• CPU, GPU, DSP, and AI acceleration

Move data between memory and compute
• Move pieces of input data and AI model from memory to compute
• Send partial results back to memory
Advancing AI research to increase power efficiency

New input data → Trained neural network model → Inference output

Hardware awareness
• AI acceleration (scalar, vector, tensor)
• Acceleration research, such as compute-in-memory

Applying AI to optimize AI models through automated techniques:
• Quantization: learning to reduce bit-precision while keeping desired accuracy
• Compression: learning to prune the model while keeping desired accuracy
• Compilation: learning to compile AI models for efficient hardware execution
What is quantization?
What is neural network quantization?
For any given trained neural network:
• Store weights in n bits
• Compute calculations in n bits

Quantization analogy
Similar to representing the pixels of an image with fewer bits
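To make this concrete, here is a minimal sketch (not from the deck; the name quantize_tensor is illustrative) of uniform n-bit quantization of a weight tensor with a single scale factor, in the spirit of the scheme described above:

```python
import numpy as np

def quantize_tensor(x: np.ndarray, n_bits: int = 8):
    """Uniform symmetric quantization of x to signed n-bit integers.

    Returns the integer tensor and the scale factor s such that
    x ≈ s * x_int.
    """
    q_max = 2 ** (n_bits - 1) - 1          # e.g. 127 for 8 bits
    s = np.abs(x).max() / q_max            # one scale for the whole tensor
    x_int = np.clip(np.round(x / s), -q_max - 1, q_max).astype(np.int8)
    return x_int, s

# Example: weights stored in 8 bits instead of float32
w = np.random.randn(64, 64).astype(np.float32)
w_int, s = quantize_tensor(w, n_bits=8)
w_hat = s * w_int.astype(np.float32)       # dequantized approximation
print("max abs error:", np.abs(w - w_hat).max())
```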
Source: Mark Horowitz (Stanford). Energy based on ASIC | Area based on TSMC 45nm

Quantizing AI models offers significant benefits

Power consumption
Significant reduction in energy for both computations and memory access

Memory usage
8-bit versus 32-bit weights and activations stored in memory:
01010101 (8-bit) versus 01010101 01010101 01010101 01010101 (32-bit)

Latency
With less memory access and simpler computations, latency can be reduced

Silicon area
Integer math or fewer bits require less silicon area compared to floating-point math and more bits

Memory access energy (pJ), 64-bit cache:
  8KB cache    10
  32KB cache   20
  1MB cache    100
  DRAM         1300-2600
Up to 4X energy reduction

Add energy (pJ):   INT8 0.03 | FP32 0.9  → 30X energy reduction
Mult energy (pJ):  INT8 0.2  | FP32 3.7  → 18.5X energy reduction
Add area (μm²):    INT8 36   | FP32 4184 → 116X area reduction
Mult area (μm²):   INT8 282  | FP32 7700 → 27X area reduction
How to most efficiently calculate AB + b?
Matrix math is the primary operation of neural nets
A running example to showcase how to make these operations more efficient

A^T =
  0.97 0.64
  0.58 0.84
  0.74 1.00
  0.84 0.81
  0.00 0.18
  0.57 0.96
  0.90 0.28
  0.80 0.81

B =
  0.41 0.25
  0.00 0.41
  0.73 0.66
  0.41 0.57
  0.42 0.24
  0.39 0.82
  0.71 1.00
  0.17 0.35

b = (0.1, 0.2, 0.3, 0.4)
A schematic MAC array for efficient computation

[Figure: 4×4 MAC array. Weight values W1..W4 enter along the rows, input values I1..I4 along the columns; each cell X_{i,j} multiplies and accumulates, and accumulators A1..A4 collect the row sums]

The array efficiently calculates the dot product between multiple vectors:
A_i = W_i · I_1 + W_i · I_2 + W_i · I_3 + W_i · I_4
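As a sketch of what the array computes, the following plain Python mimics the accumulate behaviour (a simplified software model of the schematic, not the deck's hardware):

```python
# Minimal software model of the 4x4 MAC array above (illustrative only):
# each cell multiplies a weight by an input and adds into its row accumulator.
W = [0.97, 0.58, 0.74, 0.84]        # weight values W1..W4 (one per row)
I = [0.41, 0.00, 0.73, 0.41]        # input values I1..I4 (one per column)

A = [0.0] * 4                        # accumulators A1..A4
for i in range(4):                   # row i holds weight W[i]
    for j in range(4):               # column j carries input I[j]
        A[i] += W[i] * I[j]          # cell X[i][j]: multiply-accumulate

print(A)  # A_i = W_i*I_1 + W_i*I_2 + W_i*I_3 + W_i*I_4
```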
Step-by-step matrix multiply calculation on MAC array

[Figure: the 4×4 MAC array from the previous slide, loaded with weights from A and inputs from B]

Load the full matrix into each individual block. With the running example:

A^T =
  0.97 0.64
  0.58 0.84
  0.74 1.00
  0.84 0.81
  0.00 0.18
  0.57 0.96
  0.90 0.28
  0.80 0.81

B =
  0.41 0.25
  0.00 0.41
  0.73 0.66
  0.41 0.57
  0.42 0.24
  0.39 0.82
  0.71 1.00
  0.17 0.35

Out^T =
  0.62 0.71
  0.94 1.33
  0.99 0.84
  1.40 1.31
  1.00 1.10
  1.17 1.42
  1.66 1.40
  2.15 1.69
Quantization comes at a cost of lost precision

Instead of storing weights as floating point values, store them as integers with a scale factor:

A^T =
  0.97 0.64
  0.58 0.84
  0.74 1.00
  0.84 0.81
  0.00 0.18
  0.57 0.96
  0.90 0.28
  0.80 0.81
≈ (1/255) ·
  247 163
  148 214
  189 255
  214 207
    0  46
  145 245
  229  71
  204 207
= s · Z

This means that for every weight tensor or activation tensor, we only have to store an INT8 weight matrix and one scaling factor, instead of an FP32 weight matrix.

However, quantization is not free. The rounding error is:

A^T − s · Z = (1/255) ·
   0.35  0.20
  −0.10  0.20
  −0.30  0.00
   0.20 −0.45
   0.00 −0.10
   0.35 −0.20
  −0.50  0.40
   0.00 −0.45
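The mapping above is easy to reproduce; here is a small sketch (illustrative, not from the deck) that recovers the 1/255-scaled integer values and the rounding-error terms shown above:

```python
import numpy as np

# First column of A^T from the running example
a = np.array([0.97, 0.58, 0.74, 0.84, 0.00, 0.57, 0.90, 0.80])

s = 1.0 / 255.0                       # scale factor (weights lie in [0, 1])
z = np.round(a / s).astype(np.int32)  # integer representation Z
a_hat = s * z                         # dequantized values

print(z)                  # [247 148 189 214   0 145 229 204]
print((a - a_hat) * 255)  # rounding error in units of 1/255: [0.35 -0.1 -0.3 ...]
```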
Different types of quantization have pros and cons
Symmetric, asymmetric, signed, and unsigned quantization

Asymmetric: x ≈ s · (z + o), with scale factor s and offset o
  The floating point range [min, max] maps onto the fixed point grid 0…255

Symmetric unsigned: x ≈ s · z_uint8
  The range [0, max] maps onto 0…255, with zero fixed at 0

Symmetric signed: x ≈ s · z_int8
  The range [−max, max] maps onto −128…127, with zero fixed at 0
An example calculation using symmetric quantization

[Figure: the 4×4 MAC array, now fed with INT8 weights and INT8 inputs; the accumulators are INT32]

Load the full matrix into each individual block:

A^T = (1/255) ·
  247 163
  148 214
  189 255
  214 207
    0  46
  145 245
  229  71
  204 207

B = (1/255) ·
  105  64
    0 105
  186 168
  105 145
  107  61
   99 209
  181 255
   43  89

The INT8 × INT8 products accumulate into INT32 results with combined scale 1/(255 · 255):

Out^T = 1/(255 · 255) ·
   71403  97747
   58931  88259
  108231 136021
   97633 128887
   31639  33699
   57546  90712
   49513  71639
   98520 130328

Activation quantization then rescales the INT32 accumulators back to INT8 using the observed maximum (136021):

Out^T_int8 = round(255/136021 · Out^T_int32) =
  134 183
  110 165
  203 255
  183 242
   59  63
  108 170
   93 134
  185 244
What type of quantization should you use?
Symmetric weights and asymmetric activations are the best option

Symmetric quantization:
W · X ≈ (s1 · W_int) · (s2 · X_int) = s1 s2 (W_int · X_int)
→ the same integer calculation, just with a combined scale factor

Asymmetric quantization:
W · X ≈ (s1 · (W_int + o1)) · (s2 · (X_int + o2))
      = s1 s2 (W_int · X_int)            ← same calculation as symmetric
      + s1 s2 o1 X_int                   ← input-dependent: unavoidable runtime overhead
      + s1 s2 o2 W_int + s1 s2 o1 o2     ← weight-dependent: precompute, add to layer bias

W is the weight matrix, X is the input of a layer.
Asymmetric weight calculations incur 10-15% extra energy consumption.
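A sketch of the symmetric case in code (illustrative names, not from the deck): the whole matrix product runs in integer arithmetic, and the two scale factors are folded into one float multiply at the end:

```python
import numpy as np

def quantize_sym(x: np.ndarray, n_bits: int = 8):
    """Symmetric quantization: x ≈ s * x_int with x_int in int8."""
    q_max = 2 ** (n_bits - 1) - 1
    s = np.abs(x).max() / q_max
    return np.round(x / s).astype(np.int8), s

W = np.random.randn(16, 16).astype(np.float32)
X = np.random.randn(16, 4).astype(np.float32)

W_int, s1 = quantize_sym(W)
X_int, s2 = quantize_sym(X)

# Integer matmul with int32 accumulators, one float rescale at the end:
acc = W_int.astype(np.int32) @ X_int.astype(np.int32)
Y_hat = (s1 * s2) * acc

print("max abs error vs float matmul:", np.abs(W @ X - Y_hat).max())
```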
Simulating quantization
How to accurately simulate quantization
Quantization is generally simulated in floating point instead of actually running in integer math

Simulated quantization ops are added to the neural network after each usage of weights and after every operation:
• No dedicated kernels are necessary
• Easily allows for flexible bit-widths: 1, 2, 3, 4, …
• Makes GPU speed-up easy
• Biases are not quantized

[Figure: a Conv/FC layer with weight quantization ("Wt quant") inserted after the weights, and activation quantization ("Act quant") inserted after the bias-add and ReLU]
How to simulate asymmetric quantization with b bits

What happens in those simulated quantization blocks? Given a floating point value x, we quantize:

x_int = round((x − o) / s)
x_Q = clamp(x_int, min = 0, max = 2^b − 1)

and dequantize:

x_float = x_Q · s + o

The procedure turns any value into a 'b-bit quantized' value, while all calculations are done in float32. This is definitely slower than training without quantization operations.

These operations are added everywhere in the network. For activations, min and max are set by passing batches of data through the whole network.
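A minimal sketch of such a simulated ("fake") quantization block (illustrative, assuming the asymmetric scheme defined above):

```python
import numpy as np

def fake_quant(x: np.ndarray, s: float, o: float, b: int = 8) -> np.ndarray:
    """Simulated asymmetric b-bit quantization, computed in float32.

    Quantize to the integer grid, clamp, then dequantize; the output
    is a float tensor that only takes 2**b distinct values.
    """
    x_int = np.round((x - o) / s)
    x_q = np.clip(x_int, 0, 2 ** b - 1)
    return (x_q * s + o).astype(np.float32)

# Example: simulate 8-bit quantization of activations in [min, max]
x = np.random.uniform(-1.0, 3.0, size=(4, 4)).astype(np.float32)
x_min, x_max = x.min(), x.max()          # ranges set from observed data
s = (x_max - x_min) / 255.0              # scale
o = x_min                                # offset
print(fake_quant(x, s, o))
```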
How accurate is the quantization simulation?
Very accurate: the rounding errors are tiny, and there is hardly any difference between quantization simulation and real hardware.

Model         Top1 simulated   Top1 on-device
Resnet 50     75.76%           75.67%
MobileNetV2   70.12%           70.01%
How well does quantizing a model work?
Quantizing some computer vision models to 8-bit weights and activations (16-bit fixed-point quantization is always fine):

Model          Top1 accuracy   Top1 quantized
InceptionV3    0.78            0.78
NasnetMobile   0.74            0.722
Resnet 50      0.756           0.75
MobileNetV2    0.749           0.004
DeepLabV3      0.729           0.41
FastDVDNet     0.862           0.696
Quantization-aware training
Overcoming the challenges of quantized training

Quantized training challenges:
• "Round" doesn't have a proper gradient
• "Clamp" kills the gradient

Solution: redefine the gradient op as "straight-through"*, then train with gradient descent as usual.

[Figure: forward pass runs through the weight-quant and act-quant blocks of the Conv/FC layer; the backward pass treats the quantization blocks as the identity]

*Bengio et al. 2013. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
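A common way to implement the straight-through estimator in PyTorch (a sketch using the usual detach trick; not code from the deck):

```python
import torch

def fake_quant_ste(x: torch.Tensor, s: float, b: int = 8) -> torch.Tensor:
    """Symmetric fake-quantization with a straight-through gradient.

    Forward: round and clamp to the b-bit grid.
    Backward: the (x_q - x).detach() + x trick makes the gradient of
    round/clamp pass through as if the block were the identity.
    """
    q_max = 2 ** (b - 1) - 1
    x_q = torch.clamp(torch.round(x / s), -q_max - 1, q_max) * s
    return (x_q - x).detach() + x

x = torch.randn(5, requires_grad=True)
y = fake_quant_ste(x, s=0.1).sum()
y.backward()
print(x.grad)  # all ones: gradients flow straight through the quantizer
```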
Problem: A mismatch between real and quantized values

Each calculated gradient is a little bit 'wrong': the forward pass uses the quantized value (e.g. 0.5) while the gradient update is applied to the underlying real value (e.g. 0.4). This compounds over the whole network and makes training deep networks difficult.
Addressing biased gradients in quantized training
Through stochastic rounding and relaxed quantization

Stochastic rounding¹: round x down or up at random, with probability proportional to the distance to each grid point (grid spacing ε):

q(x) = ⌊x⌋       with probability 1 − (x − ⌊x⌋)/ε
       ⌊x⌋ + ε   with probability (x − ⌊x⌋)/ε

Relaxed quantization² is an alternative. Biased gradients are not a big problem for 8-bit quantization.

1) Gupta et al. 2015. Deep Learning with Limited Numerical Precision
2) Louizos et al. 2019. Relaxed Quantization for Discretized Neural Networks
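A quick sketch of stochastic rounding (illustrative, with grid spacing eps as above); in expectation the rounded value equals x, which removes the bias:

```python
import numpy as np

def stochastic_round(x: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """Round to the grid of spacing eps, up or down at random.

    P(round up) = (x - floor(x)) / eps, so E[q(x)] = x (unbiased).
    """
    lower = np.floor(x / eps) * eps
    p_up = (x - lower) / eps
    return lower + eps * (np.random.rand(*x.shape) < p_up)

x = np.full(100_000, 0.3)
print(stochastic_round(x).mean())  # ≈ 0.3, versus np.round(x).mean() == 0.0
```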
Major improvements from learning the min and max values

Dynamic ranges¹: adjust [min, max] on the fly while training, such as when overflow occurs

Fully trainable min, max²: parametrize min and max, and train them alongside the full network

With learned min and max values, ImageNet models trained to 4-bit weights and 4-bit activations have hardly any loss.

1) Wu et al. 2018. Training and Inference with Integers in Deep Neural Networks
2) Jain et al. 2019. Trained Uniform Quantization for Accurate and Efficient Neural Network Inference on Fixed-Point Hardware
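One way to make the range trainable (a sketch, not the cited papers' exact parametrization): treat min and max as learnable parameters inside the fake-quant block, relying on the straight-through trick for the rounding:

```python
import torch
import torch.nn as nn

class LearnedRangeFakeQuant(nn.Module):
    """Asymmetric b-bit fake-quant with a trainable range (illustrative sketch)."""

    def __init__(self, x_min: float = -1.0, x_max: float = 1.0, b: int = 4):
        super().__init__()
        self.x_min = nn.Parameter(torch.tensor(x_min))
        self.x_max = nn.Parameter(torch.tensor(x_max))
        self.levels = 2 ** b - 1                       # number of grid steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Differentiable clamp: gradients reach x_min/x_max where values saturate
        x_c = torch.min(torch.max(x, self.x_min), self.x_max)
        s = (self.x_max - self.x_min) / self.levels    # grid step
        x_q = torch.round((x_c - self.x_min) / s) * s + self.x_min
        return (x_q - x_c).detach() + x_c              # straight-through round

q = LearnedRangeFakeQuant(b=4)
out = q(2.0 * torch.randn(32))
out.sum().backward()
print(q.x_min.grad, q.x_max.grad)  # nonzero when some inputs saturate
```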
Quantized training solves a lot of accuracy problems
If possible, fine-tune after compression and quantization for optimal performance

Model         Top1 accuracy   Top1 quantized   Fine-tuned
InceptionV3   0.78            0.78             0.78
NasnetMobile  0.74            0.722            0.73
Resnet 50     0.756           0.75             0.752
MobileNetV2   0.749           0.004            0.735
DeeplabV3     0.729           0.414            0.725

Results from Jacob et al. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Making quantization practical for the masses
Our research focuses on quantization techniques that maintain accuracy while reducing engineering effort

Ideal properties for simple quantization:
• No data: data might not be available
• No back propagation: retraining is time intensive and sensitive to hyperparameters
• No architecture changes: these require training from scratch with quantization in mind
DFQ: Data-Free Quantization

Data-Free Quantization through Weight Equalization and Bias Correction
Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling (Qualcomm Technologies Netherlands B.V.)

A better method for no-data quantization:
• The paper introduces a method for quantization without the use of data
• Excellent results without any training

Nagel et al. 2019. Data-Free Quantization Through Weight Equalization and Bias Correction
Visualizing quantization rounding error

[Figure: schematic histogram of weight/activation ranges per output bin of a layer; a single quantization grid with b bits spans [min, max]]

Expected error: (1/2) · (max − min) / (2^b − 1)
Imbalanced weights are a common problem in practice

[Figure: distributions of weights per output channel in the 2nd layer of MobileNetV2 (ImageNet); the ranges differ strongly across channels]
The problem occurs because of mismatched ranges

[Figure: the same schematic histogram; one quantization grid with b bits must cover [min, max] of the whole layer, so channels with small ranges are left with very few grid points]
Per-channel quantization

• Per-channel quantization¹ keeps a scale s_i (and offset o_i) for each output channel i
• Not all hardware supports this

[Figure: schematic histogram of weight ranges for a layer, now with a separate quantization grid per output bin]

[1] Krishnamoorthi 2018. Quantizing deep convolutional networks for efficient inference: A whitepaper
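A sketch of per-channel scales for a conv weight tensor (illustrative; assumes the output-channel axis is dim 0):

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 8):
    """Symmetric per-channel quantization: one scale per output channel."""
    q_max = 2 ** (n_bits - 1) - 1
    # Reduce over all axes except the output-channel axis (dim 0)
    s = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / q_max
    s = s.reshape(-1, *([1] * (w.ndim - 1)))        # broadcastable shape
    w_int = np.round(w / s).astype(np.int8)
    return w_int, s

# Channels with very different magnitudes each get their own scale
w = np.random.randn(32, 16, 3, 3) * np.logspace(-2, 0, 32).reshape(-1, 1, 1, 1)
w_int, s = quantize_per_channel(w)
print(s.squeeze()[:4])   # small channels get small scales: less rounding error
```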
Cross-layer equalization scales weights in neighboring layers for better quantization

Since relu(x) = max(0, x) is positively homogeneous, we have relu(s·x) = s · relu(x) for s > 0.
We can therefore scale two layers joined by a (P)ReLU together to optimize them for quantization.
Finding the scaling factors for cross-layer equalization

Equalize the outputs of layer 1 with the inputs of layer 2 by setting

s_i = (1 / r_i^(2)) · √(r_i^(1) · r_i^(2))

where r_i^(1) and r_i^(2) are the ranges of weight channel i in layer 1 and layer 2. Dividing channel i of layer 1 by s_i and multiplying the matching input channel of layer 2 by s_i leaves the network function unchanged, and makes both equalized ranges equal to √(r_i^(1) · r_i^(2)).

[Figure: schematic weight-range histograms for layer 1 (per output bin) and layer 2 (per input bin); after equalization, channel 2's range r_2^(1) in layer 1 becomes r_2^(1)/s_2 while the corresponding layer-2 range is multiplied by s_2]
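A sketch of cross-layer equalization for two fully-connected layers (illustrative; per-channel ranges as defined above, valid because a ReLU sits between the layers):

```python
import numpy as np

def equalize_pair(w1: np.ndarray, b1: np.ndarray, w2: np.ndarray):
    """Rescale layer 1 (out_dim x in_dim) and layer 2 so channel ranges match.

    Channel i of layer 1 (row i of w1, plus its bias) is divided by s_i;
    the matching input channel of layer 2 (column i of w2) is multiplied
    by s_i. With a ReLU in between, the composed function is unchanged.
    """
    r1 = np.abs(w1).max(axis=1)            # range of output channel i, layer 1
    r2 = np.abs(w2).max(axis=0)            # range of input channel i, layer 2
    s = np.sqrt(r1 * r2) / r2              # s_i = (1/r_i2) * sqrt(r_i1 * r_i2)
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

w1 = np.random.randn(8, 4) * np.logspace(-2, 0, 8)[:, None]
b1 = np.random.randn(8)
w2 = np.random.randn(4, 8)
w1_eq, b1_eq, w2_eq = equalize_pair(w1, b1, w2)
print(np.abs(w1_eq).max(axis=1))           # equalized ranges: sqrt(r1 * r2)
```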
Cross-layer equalization significantly improves accuracy

[Figure: MobileNetV2 per-channel weight ranges, original versus after cross-layer equalization]

MobileNetV2 results for ImageNet:
                   Top-1 (Float32)   Top-1 (INT8, best)   Difference
Asymmetric quant   71.9%             0.1%                 −71.8%
Equalization       71.72%            69.91%               −1.99%
Biased quantization is a result of rounding errors

[Figure: floating point grid [min, max] mapped to the fixed point grid 0…255; a run of weights all land just below their grid points, each rounding up by 0.1]

By sheer coincidence, it could be that w̃ · x ≈ w · x + 0.1x, where w̃ is the quantized w. Since most values are rounded up, the average output of the quantized model is now bigger than the original, i.e. E[y] ≠ E[ỹ].
Biased errors are very detrimental to network performance

This bias is especially strong for networks with depth-wise separable convolutions: each output has only 3x3 associated parameters (a Dk × Dk depth-wise conv), so the rounding errors do not average out.

[Figure: MobileNetV2 layer-2 biased error histogram per output channel]
Calculating bias correction to address biased quantization

Given a weight matrix W and its quantized approximation W̃, write W = W̃ + ε. The bias of an output is then given in closed form:

E[y] − E[ỹ] = E[Wx] − E[W̃x] = W·E[x] − W̃·E[x] = ε·E[x]

Key idea: bias correction
We find ε·E[x] and use it to correct the output after quantization, removing the bias effect.
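A sketch of bias correction for one linear layer (illustrative; E[x] is estimated here from data, though the DFQ paper shows it can also be derived without data from batch-norm statistics):

```python
import numpy as np

def bias_correct(w: np.ndarray, w_q: np.ndarray, bias: np.ndarray,
                 x_mean: np.ndarray) -> np.ndarray:
    """Fold the expected quantization-induced output shift into the bias.

    eps = w - w_q is the weight quantization error, so the quantized layer's
    expected output is off by eps @ E[x]; add it back via the bias term.
    """
    eps = w - w_q
    return bias + eps @ x_mean

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
w_q = np.round(w * 7) / 7                # crude low-bit quantization
bias = np.zeros(8)
x = np.abs(rng.normal(size=(1000, 16)))  # post-ReLU activations: E[x] != 0
bias_new = bias_correct(w, w_q, bias, x.mean(axis=0))

# Expected outputs now match much more closely:
print(np.abs((x @ w.T).mean(0) - (x @ w_q.T + bias_new).mean(0)).max())
```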
Bias correction removes the biased output error

[Figure: MobileNetV2 2nd layer, per-channel output error before and after bias correction]
DFQ offers state-of-the-art results

[Table: ImageNet top-1 accuracy, data-free methods versus methods that require training]
~D: no data needed | ~BP: no backprop needed | ~AC: no architecture changes needed

Per-channel: per-channel quantization (Krishnamoorthi 2018)
QT: quantization-aware training (Jacob et al. CVPR 2018)
SR+DR: stochastic rounding + dynamic ranges (results from Louizos et al. ICLR 2019)
QMN: Quantization-friendly MobileNets (Sheng et al. EMC2 2018)
RQ: Relaxed Quantization (Louizos et al. ICLR 2019)
Data-free quantization recap

• A simple API call results in better quantization performance
• A fully-automatic pipeline gives near-original model performance without fine-tuning
• No (P)ReLU activations? Use smart clipping and bias correction

[Figure: flow diagram of the data-free quantization method]
AdaRound
AdaRound: Adaptive rounding for neural network quantization
Mart van Baalen, Markus Nagel, Rana Amjad, Christos Louizos, Tijmen Blankevoort (Qualcomm Technologies Netherlands B.V.)

• Introduces a new way of rounding
• Achieves excellent 4-bit weight results

Nagel et al. 2020. Up or Down? Adaptive Rounding for Post-Training Quantization
Rounding can be done better for quantization
Introducing AdaRound to optimize the rounding itself

Normally, we round our weights to the nearest grid point, but rounding-to-the-nearest is not optimal. Consider the ImageNet validation accuracy of various rounding schemes for 4-bit quantization of the first layer of Resnet18 (mean and standard deviation over 100 stochastic rounding samples¹, plus the best of those samples):

Rounding scheme     Accuracy (%)
Nearest             52.29
Ceil                0.10
Floor               0.10
Stochastic          52.06 ± 5.52
Stochastic (best)   63.06

Rounding all values up or down destroys the network. Drawing 100 stochastic samples and picking the best one improves accuracy by more than 10 percentage points over rounding-to-nearest.

By optimizing a layer-wise objective, AdaRound optimizes the network weights in minutes without fine-tuning or hyperparameters.

1) Gupta et al. 2015. Deep Learning with Limited Numerical Precision
AdaRound makes 4-bit weights possible
Without any fine-tuning or hyperparameter tweaking

For several models, we can now have 4-bit weights while only dropping 1-2% accuracy.

Compared to networks quantized to 8-bit weights and 8-bit activations, 4-bit weights with 8-bit activations speed up execution by 2x and reduce energy consumption by 2x, with virtually no additional work.
Bayesian Bits
Bayesian Bits: Unifying quantization and pruning
Tijmen Blankevoort, Mart van Baalen, Markus Nagel, Rana Amjad, Christos Louizos, Max Welling (Qualcomm Technologies Netherlands B.V.)

Introduces one training scheme to automatically find mixed-precision quantization networks and pruning

Baalen et al. 2020. Bayesian Bits: Unifying Quantization and Pruning
Bayesian Bits
Automatically learning quantization bit-width and pruning during training

We decompose quantization grids into multiple smaller grids:

x_q = x_2 + z_4·ε_4 + z_8·ε_8 + z_16·ε_16 + z_32·ε_32

where x_2 is the 2-bit quantization of x and each ε_b is the residual correction that doubles the precision to b bits. This allows us to introduce gating variables z that toggle higher-bit-width quantization on/off.
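A sketch of the doubling-grid decomposition (illustrative, forward computation only; in the paper the gates are learned with a variational objective and are nested, so a higher bit width can only be on if all coarser ones are):

```python
import numpy as np

def quantize(x, step):
    """Round to a uniform grid with the given step size."""
    return step * np.round(x / step)

def bayesian_bits_forward(x, x_max=1.0, gates=(1, 1, 0, 0)):
    """Sketch of the Bayesian Bits decomposition (forward pass only).

    x_q = x_2 + z_4*eps_4 + z_8*eps_8 + z_16*eps_16 + z_32*eps_32, where
    eps_b is the residual between the b-bit grid and the previous one.
    Here the gates z_b are fixed inputs instead of learned variables.
    """
    prev = quantize(x, x_max / (2 ** 2 - 1))       # 2-bit baseline x_2
    x_q = prev
    for b, z in zip((4, 8, 16, 32), gates):
        cur = quantize(x, x_max / (2 ** b - 1))    # b-bit quantization of x
        x_q = x_q + z * (cur - prev)               # add gated residual eps_b
        prev = cur
    return x_q

x = np.random.uniform(0, 1, size=5)
print(bayesian_bits_forward(x, gates=(1, 1, 0, 0)))  # effectively 8-bit
print(bayesian_bits_forward(x, gates=(0, 0, 0, 0)))  # effectively 2-bit
```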
State-of-the-art performance for mixed-precision quantization
Systematically selecting the appropriate amount of precision

During training, the network automatically finds the optimal trade-off between network complexity and accuracy. The result: some layers are fine with 8 bits, while others are fine with 2 bits, and some layers are pruned (shown in green in the figure).
Develop: How developers and the research community can take advantage of our quantization tools
Leading quantization research and fast commercialization
Driving the industry towards integer inference and power-efficient AI

Quantization research:
• Relaxed Quantization (ICLR 2019)
• Data-Free Quantization (ICCV 2019)
• AdaRound (ICML 2020)
• Bayesian Bits (ICML 2020)

Quantization commercialization:
• Qualcomm Neural Processing SDK
• Qualcomm AI Model Efficiency Toolkit (AIMET)

Qualcomm Neural Processing SDK and Qualcomm AI Model Efficiency Toolkit are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Qualcomm® Neural Processing SDK
Software-accelerated runtime for the execution of deep neural networks on device
Available at: developer.qualcomm.com

Efficient execution on Qualcomm® Snapdragon™ Mobile Platform
• Takes advantage of Snapdragon heterogeneous computing capabilities
• Runtime and libraries accelerate deep neural net processing on all engines: CPU (Qualcomm® Kryo™), GPU (Qualcomm® Adreno™), and DSP (Qualcomm® Hexagon™) with Hexagon Vector eXtensions (HVX) and Hexagon Tensor Accelerator (HTA)

Model framework/network support
• Convolutional neural networks and Long Short-Term Memory (LSTM) networks
• Support for Caffe/Caffe2, TensorFlow, and user/developer-defined layers

Optimization/debugging tools
• Offline network conversion tools
• Debug and analyze network performance
• API and SDK documentation with sample code
• Ease of integration into customer applications

Qualcomm Neural Processing SDK, Qualcomm Snapdragon, Qualcomm Kryo, Qualcomm Adreno, and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
AIMET plugs in seamlessly to the developer workflow

Trained AI model (TensorFlow or PyTorch)
→ AI Model Efficiency Toolkit (AIMET)
   • Quantization: data-free quantization, quantization simulation, fine-tuning
   • Compression: spatial SVD, channel pruning, visualization
→ Optimized AI model (fine-tune if needed)
→ Deploy at scale
AIMET makes AI models small
Includes state-of-the-art quantization and compression techniques from Qualcomm AI Research

Features
• State-of-the-art network compression tools
• State-of-the-art quantization tools
• Support for both TensorFlow and PyTorch
• Benchmarks and tests for many models
• Developed by professional software developers

If interested, email: aimet.support@qti.qualcomm.com
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Coming soon: AIMET open source
Quantization improves power efficiency, performance, and memory usage

Our research addresses the limitations of traditional quantization approaches

Our user-friendly tools allow developers to develop power-efficient AI apps on Qualcomm Snapdragon

Follow us on:
For more information, visit us at:
www.qualcomm.com & www.qualcomm.com/blog

Thank you
Nothing in these materials is an offer to sell any of the
components or devices referenced herein.
©2018-2020 Qualcomm Technologies, Inc. and/or its
affiliated companies. All Rights Reserved.
Qualcomm and Snapdragon are trademarks of Qualcomm
Incorporated, registered in the United States and other
countries. Other products and brand names may be
trademarks or registered trademarks of their respective
owners.
References in this presentation to “Qualcomm” may mean Qualcomm
Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries
or business units within the Qualcomm corporate structure, as
applicable. Qualcomm Incorporated includes Qualcomm’s licensing
business, QTL, and the vast majority of its patent portfolio. Qualcomm
Technologies, Inc.,
a wholly-owned subsidiary of Qualcomm Incorporated, operates, along
with its subsidiaries, substantially all of Qualcomm’s engineering,
research and development functions, and substantially all of its product
and services businesses, including its semiconductor business, QCT.
Mais conteúdo relacionado

Mais procurados

PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsJinwon Lee
 
NBDT : Neural-backed Decision Tree 2021 ICLR
 NBDT : Neural-backed Decision Tree 2021 ICLR NBDT : Neural-backed Decision Tree 2021 ICLR
NBDT : Neural-backed Decision Tree 2021 ICLRtaeseon ryu
 
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from XailientEdge AI and Vision Alliance
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]Dongmin Choi
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 
Transforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approachTransforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approachFerdin Joe John Joseph PhD
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...changedaeoh
 
Action Recognition (Thesis presentation)
Action Recognition (Thesis presentation)Action Recognition (Thesis presentation)
Action Recognition (Thesis presentation)nikhilus85
 
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis taeseon ryu
 
Object Detection and Recognition
Object Detection and Recognition Object Detection and Recognition
Object Detection and Recognition Intel Nervana
 
Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer VisionDongmin Choi
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSangwoo Mo
 
ConvNeXt: A ConvNet for the 2020s explained
ConvNeXt: A ConvNet for the 2020s explainedConvNeXt: A ConvNet for the 2020s explained
ConvNeXt: A ConvNet for the 2020s explainedSushant Gautam
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processingYu Huang
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersSeunghyun Hwang
 
Variational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationVariational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationJason Anderson
 

Mais procurados (20)

PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
NBDT : Neural-backed Decision Tree 2021 ICLR
 NBDT : Neural-backed Decision Tree 2021 ICLR NBDT : Neural-backed Decision Tree 2021 ICLR
NBDT : Neural-backed Decision Tree 2021 ICLR
 
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
“Introduction to DNN Model Compression Techniques,” a Presentation from Xailient
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Transforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approachTransforming deep into transformers – a computer vision approach
Transforming deep into transformers – a computer vision approach
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
 
Action Recognition (Thesis presentation)
Action Recognition (Thesis presentation)Action Recognition (Thesis presentation)
Action Recognition (Thesis presentation)
 
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
 
Object Detection and Recognition
Object Detection and Recognition Object Detection and Recognition
Object Detection and Recognition
 
Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer Vision
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
 
Swin transformer
Swin transformerSwin transformer
Swin transformer
 
ConvNeXt: A ConvNet for the 2020s explained
ConvNeXt: A ConvNet for the 2020s explainedConvNeXt: A ConvNet for the 2020s explained
ConvNeXt: A ConvNet for the 2020s explained
 
DETR ECCV20
DETR ECCV20DETR ECCV20
DETR ECCV20
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processing
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
 
Variational Autoencoders For Image Generation
Variational Autoencoders For Image GenerationVariational Autoencoders For Image Generation
Variational Autoencoders For Image Generation
 

Semelhante a Enabling Power-Efficient AI Through Quantization

MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMLAI2
 
08 neural networks
08 neural networks08 neural networks
08 neural networksankit_ppt
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Ryo Takahashi
 
A White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationA White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationApril Knyff
 
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Olivier Jeunen
 
Digital Electronics Notes
Digital Electronics Notes Digital Electronics Notes
Digital Electronics Notes Srikrishna Thota
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksIJITE
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networksgerogepatton
 
IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...
IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...
IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...zeeshanshanzy009
 
Introduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep LearningIntroduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep LearningVahid Mirjalili
 
Enhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithmEnhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithmHadi Fadlallah
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Fisnik Kraja
 
Tutorial-on-DNN-07-Co-design-Precision.pdf
Tutorial-on-DNN-07-Co-design-Precision.pdfTutorial-on-DNN-07-Co-design-Precision.pdf
Tutorial-on-DNN-07-Co-design-Precision.pdfDuy-Hieu Bui
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedOmid Vahdaty
 
Conversion of number system with base concept
Conversion of number system with base conceptConversion of number system with base concept
Conversion of number system with base conceptUniversity of Potsdam
 
Survey On Two-Term Dot Product Of Multiplier Using Floating Point
Survey On Two-Term Dot Product Of Multiplier Using Floating PointSurvey On Two-Term Dot Product Of Multiplier Using Floating Point
Survey On Two-Term Dot Product Of Multiplier Using Floating PointIRJET Journal
 
Find nuclei in images with U-net
Find nuclei in images with U-netFind nuclei in images with U-net
Find nuclei in images with U-netDing Li
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 

Semelhante a Enabling Power-Efficient AI Through Quantization (20)

MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
 
08 neural networks
08 neural networks08 neural networks
08 neural networks
 
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic...
 
A White Paper On Neural Network Quantization
A White Paper On Neural Network QuantizationA White Paper On Neural Network Quantization
A White Paper On Neural Network Quantization
 
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
 
Digital Electronics Notes
Digital Electronics Notes Digital Electronics Notes
Digital Electronics Notes
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networks
 
Fixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural NetworksFixed-Point Code Synthesis for Neural Networks
Fixed-Point Code Synthesis for Neural Networks
 
IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...
IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...
IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs...
 
Introduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep LearningIntroduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep Learning
 
Enhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithmEnhancing the performance of kmeans algorithm
Enhancing the performance of kmeans algorithm
 
Chapter 3
Chapter 3Chapter 3
Chapter 3
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Tutorial-on-DNN-07-Co-design-Precision.pdf
Tutorial-on-DNN-07-Co-design-Precision.pdfTutorial-on-DNN-07-Co-design-Precision.pdf
Tutorial-on-DNN-07-Co-design-Precision.pdf
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
 
Conversion of number system with base concept
Conversion of number system with base conceptConversion of number system with base concept
Conversion of number system with base concept
 
Survey On Two-Term Dot Product Of Multiplier Using Floating Point
Survey On Two-Term Dot Product Of Multiplier Using Floating PointSurvey On Two-Term Dot Product Of Multiplier Using Floating Point
Survey On Two-Term Dot Product Of Multiplier Using Floating Point
 
Find nuclei in images with U-net
Find nuclei in images with U-netFind nuclei in images with U-net
Find nuclei in images with U-net
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 

Mais de Qualcomm Research

Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdfQualcomm Research
 
Qualcomm Webinar: Solving Unsolvable Combinatorial Problems with AI
Qualcomm Webinar: Solving Unsolvable Combinatorial Problems with AIQualcomm Webinar: Solving Unsolvable Combinatorial Problems with AI
Qualcomm Webinar: Solving Unsolvable Combinatorial Problems with AIQualcomm Research
 
Why and what you need to know about 6G in 2022
Why and what you need to know about 6G in 2022Why and what you need to know about 6G in 2022
Why and what you need to know about 6G in 2022Qualcomm Research
 
Understanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdfUnderstanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdfQualcomm Research
 
Enabling the metaverse with 5G- web.pdf
Enabling the metaverse with 5G- web.pdfEnabling the metaverse with 5G- web.pdf
Enabling the metaverse with 5G- web.pdfQualcomm Research
 
Bringing AI research to wireless communication and sensing
Bringing AI research to wireless communication and sensingBringing AI research to wireless communication and sensing
Bringing AI research to wireless communication and sensingQualcomm Research
 
How will sidelink bring a new level of 5G versatility.pdf
How will sidelink bring a new level of 5G versatility.pdfHow will sidelink bring a new level of 5G versatility.pdf
How will sidelink bring a new level of 5G versatility.pdfQualcomm Research
 
Scaling 5G to new frontiers with NR-Light (RedCap)
Scaling 5G to new frontiers with NR-Light (RedCap)Scaling 5G to new frontiers with NR-Light (RedCap)
Scaling 5G to new frontiers with NR-Light (RedCap)Qualcomm Research
 
Realizing mission-critical industrial automation with 5G
Realizing mission-critical industrial automation with 5GRealizing mission-critical industrial automation with 5G
Realizing mission-critical industrial automation with 5GQualcomm Research
 
3GPP Release 17: Completing the first phase of 5G evolution
3GPP Release 17: Completing the first phase of 5G evolution3GPP Release 17: Completing the first phase of 5G evolution
3GPP Release 17: Completing the first phase of 5G evolutionQualcomm Research
 
AI firsts: Leading from research to proof-of-concept
AI firsts: Leading from research to proof-of-conceptAI firsts: Leading from research to proof-of-concept
AI firsts: Leading from research to proof-of-conceptQualcomm Research
 
Setting off the 5G Advanced evolution with 3GPP Release 18
Setting off the 5G Advanced evolution with 3GPP Release 18Setting off the 5G Advanced evolution with 3GPP Release 18
Setting off the 5G Advanced evolution with 3GPP Release 18Qualcomm Research
 
5G positioning for the connected intelligent edge
5G positioning for the connected intelligent edge5G positioning for the connected intelligent edge
5G positioning for the connected intelligent edgeQualcomm Research
 
Enabling on-device learning at scale
Enabling on-device learning at scaleEnabling on-device learning at scale
Enabling on-device learning at scaleQualcomm Research
 
The essential role of AI in the 5G future
The essential role of AI in the 5G futureThe essential role of AI in the 5G future
The essential role of AI in the 5G futureQualcomm Research
 
How AI research is enabling next-gen codecs
How AI research is enabling next-gen codecsHow AI research is enabling next-gen codecs
How AI research is enabling next-gen codecsQualcomm Research
 
Role of localization and environment perception in autonomous driving
Role of localization and environment perception in autonomous drivingRole of localization and environment perception in autonomous driving
Role of localization and environment perception in autonomous drivingQualcomm Research
 
Intelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiencyIntelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiencyQualcomm Research
 

Mais de Qualcomm Research (20)

Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdf
 
The future of AI is hybrid
The future of AI is hybridThe future of AI is hybrid
The future of AI is hybrid
 
Qualcomm Webinar: Solving Unsolvable Combinatorial Problems with AI
Qualcomm Webinar: Solving Unsolvable Combinatorial Problems with AIQualcomm Webinar: Solving Unsolvable Combinatorial Problems with AI
Qualcomm Webinar: Solving Unsolvable Combinatorial Problems with AI
 
Why and what you need to know about 6G in 2022
Why and what you need to know about 6G in 2022Why and what you need to know about 6G in 2022
Why and what you need to know about 6G in 2022
 
Understanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdfUnderstanding the world in 3D with AI.pdf
Understanding the world in 3D with AI.pdf
 
Enabling the metaverse with 5G- web.pdf
Enabling the metaverse with 5G- web.pdfEnabling the metaverse with 5G- web.pdf
Enabling the metaverse with 5G- web.pdf
 
Bringing AI research to wireless communication and sensing
Bringing AI research to wireless communication and sensingBringing AI research to wireless communication and sensing
Bringing AI research to wireless communication and sensing
 
How will sidelink bring a new level of 5G versatility.pdf
How will sidelink bring a new level of 5G versatility.pdfHow will sidelink bring a new level of 5G versatility.pdf
How will sidelink bring a new level of 5G versatility.pdf
 
Scaling 5G to new frontiers with NR-Light (RedCap)
Scaling 5G to new frontiers with NR-Light (RedCap)Scaling 5G to new frontiers with NR-Light (RedCap)
Scaling 5G to new frontiers with NR-Light (RedCap)
 
Realizing mission-critical industrial automation with 5G
Realizing mission-critical industrial automation with 5GRealizing mission-critical industrial automation with 5G
Realizing mission-critical industrial automation with 5G
 
3GPP Release 17: Completing the first phase of 5G evolution
3GPP Release 17: Completing the first phase of 5G evolution3GPP Release 17: Completing the first phase of 5G evolution
3GPP Release 17: Completing the first phase of 5G evolution
 
AI firsts: Leading from research to proof-of-concept
AI firsts: Leading from research to proof-of-conceptAI firsts: Leading from research to proof-of-concept
AI firsts: Leading from research to proof-of-concept
 
Setting off the 5G Advanced evolution with 3GPP Release 18
Setting off the 5G Advanced evolution with 3GPP Release 18Setting off the 5G Advanced evolution with 3GPP Release 18
Setting off the 5G Advanced evolution with 3GPP Release 18
 
5G positioning for the connected intelligent edge
5G positioning for the connected intelligent edge5G positioning for the connected intelligent edge
5G positioning for the connected intelligent edge
 
Enabling on-device learning at scale
Enabling on-device learning at scaleEnabling on-device learning at scale
Enabling on-device learning at scale
 
The essential role of AI in the 5G future
The essential role of AI in the 5G futureThe essential role of AI in the 5G future
The essential role of AI in the 5G future
 
How AI research is enabling next-gen codecs
How AI research is enabling next-gen codecsHow AI research is enabling next-gen codecs
How AI research is enabling next-gen codecs
 
Role of localization and environment perception in autonomous driving
Role of localization and environment perception in autonomous drivingRole of localization and environment perception in autonomous driving
Role of localization and environment perception in autonomous driving
 
Pioneering 5G broadcast
Pioneering 5G broadcastPioneering 5G broadcast
Pioneering 5G broadcast
 
Intelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiencyIntelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiency
 

Último

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Último (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Enabling Power-Efficient AI Through Quantization

  • 11. A schematic MAC (multiply-accumulate) array for efficient computation. [Figure: a 4x4 grid of MAC cells X_{i,j}, with weight values W_1..W_4 feeding the rows, input values I_1..I_4 feeding the columns, and accumulators A_1..A_4 collecting the rows.] The array efficiently calculates the dot product between multiple vectors: A_i = W_i·I_1 + W_i·I_2 + W_i·I_3 + W_i·I_4.
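To make the dataflow concrete, here is a minimal Python sketch (an illustration, not Qualcomm hardware) of what the array computes; W[i, j] plays the role of the weight held in cell X_{i,j}:

```python
import numpy as np

# Minimal software model of the 4x4 MAC array above (illustrative only).
# W[i, j] is the weight in cell X_{i,j}; I[j] is the input broadcast down
# column j; A[i] is the accumulator for row i.
W = np.random.rand(4, 4)
I = np.random.rand(4)

A = np.zeros(4)
for i in range(4):              # one accumulator per row
    for j in range(4):          # one multiply-accumulate per cell
        A[i] += W[i, j] * I[j]

assert np.allclose(A, W @ I)    # the array computes a matrix-vector product
```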
  • 12. Step-by-step matrix multiply calculation on the MAC array: load the full weight matrix into the array, one value per individual block, then stream the inputs through. For the running example (values in the order shown on the slide):
A^T = [0.97 0.64 0.58 0.84; 0.74 1.00 0.84 0.81; 0.00 0.18 0.57 0.96; 0.90 0.28 0.80 0.81]
B = [0.41 0.25 0.00 0.41; 0.73 0.66 0.41 0.57; 0.42 0.24 0.39 0.82; 0.71 1.00 0.17 0.35]
Out^T = [0.62 0.71 0.94 1.33; 0.99 0.84 1.40 1.31; 1.00 1.10 1.17 1.42; 1.66 1.40 2.15 1.69]
  • 13. Quantization comes at the cost of lost precision. Instead of storing weights as floating-point values, store them as integers with a scale factor:
A^T ≈ (1/255) · [247 163 148 214; 189 255 214 207; 0 46 145 245; 229 71 204 207] = s · Z
This means that for every weight tensor or activation tensor, we only have to store an INT8 matrix and one scaling factor, instead of an FP32 matrix. However, quantization is not free; the rounding error is
A^T - s·Z = (1/255) · [0.35 0.20 -0.1 0.20; -0.3 0 0.20 -0.45; 0.00 -0.1 0.35 -0.2; -0.5 0.40 0 -0.45]
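A minimal sketch of this storage scheme (illustrative; the scale is fixed to 1/255 because the example weights lie in [0, 1]):

```python
import numpy as np

# Store a float tensor as integers plus one scale factor (illustrative).
W = np.array([[0.97, 0.64, 0.58, 0.84],
              [0.74, 1.00, 0.84, 0.81],
              [0.00, 0.18, 0.57, 0.96],
              [0.90, 0.28, 0.80, 0.81]])

s = 1.0 / 255.0                       # scale factor: [0, 1] in 255 steps
Z = np.round(W / s).astype(np.uint8)  # integer weights, 1 byte each

W_hat = s * Z.astype(np.float64)      # dequantized approximation
err = W - W_hat                       # rounding error, at most s/2 per entry
print(np.abs(err).max() <= s / 2)     # True
```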
  • 14. Different types of quantization have pros and cons: symmetric vs. asymmetric and signed vs. unsigned (s: scale factor, o: offset).
- Asymmetric: x ≈ s·(z + o); the fixed-point grid [0, 255] maps onto the floating-point range [min, max]
- Symmetric unsigned: x ≈ s·z_uint8; the grid [0, 255] covers [0, max]
- Symmetric signed: x ≈ s·z_int8; the grid [-128, 127] covers [-max, max]
[Figure: the three mappings between the fixed-point grid and the floating-point grid.]
  • 15. An example calculation using symmetric quantization. Load the full INT8 weight matrix into the array, then stream the INT8 inputs through:
A^T = (1/255) · [247 163 148 214; 189 255 214 207; 0 46 145 245; 229 71 204 207] (INT8)
B = (1/255) · [105 64 0 105; 186 168 105 145; 107 61 99 209; 181 255 43 89] (INT8)
The INT8 x INT8 products are accumulated in INT32, with overall scale 1/(255·255):
Out^T = (1/(255·255)) · [71403 97747 58931 88259; 108231 136021 97633 128887; 31639 33699 57546 90712; 49513 71639 98520 130328] (INT32)
Activation quantization: rescale the INT32 accumulator back to INT8 by its maximum value, 136021:
Out^T ≈ (136021/(255·255·255)) · [134 183 110 165; 203 255 183 242; 59 63 108 170; 93 134 185 244] (INT8)
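The following numpy sketch (illustrative; `quantize_sym` and the variable names are mine) mirrors this flow: INT8 weights and inputs, INT32 accumulation, and requantization of the accumulator back to INT8 using its observed maximum:

```python
import numpy as np

# Symmetric quantized matmul: INT8 x INT8 -> INT32 accumulate -> INT8 output.
def quantize_sym(x, num_steps=255):
    s = x.max() / num_steps                    # symmetric unsigned scale
    return np.round(x / s).astype(np.int32), s

W = np.random.rand(4, 4)
X = np.random.rand(4, 4)

W_int, s_w = quantize_sym(W)
X_int, s_x = quantize_sym(X)

acc = W_int @ X_int                            # INT32 accumulator, scale s_w*s_x
s_out = acc.max() * s_w * s_x / 255            # requantization scale for output
out_int8 = np.round(acc * s_w * s_x / s_out).astype(np.uint8)

out_float = out_int8 * s_out                   # dequantized result
print(np.abs(out_float - W @ X).max())         # small quantization error
```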
  • 16. What type of quantization should you use? Symmetric weights and asymmetric activations are the best option. (W is the weight matrix, X is the input of a layer.)
Symmetric quantization: W·X ≈ (s1·W_int)·(s2·X_int) = s1·s2·(W_int·X_int)
Asymmetric quantization: W·X ≈ s1·(W_int + o1) · s2·(X_int + o2) = s1·s2·W_int·X_int + s1·s2·o1·X_int + s1·s2·o2·W_int + s1·s2·o1·o2
The first term is the same calculation as in the symmetric case. The terms that depend only on the weights (s1·s2·o2·W_int and the constant s1·s2·o1·o2) can be precomputed and added to the layer bias. The term s1·s2·o1·X_int depends on the input, so it is unavoidable overhead: asymmetric weight calculations incur 10-15% extra energy consumption.
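A small numeric check of this decomposition (illustrative sketch; the scale and offset values are made up, and since the offsets act elementwise they appear in matrix form with an all-ones matrix J that the slide's shorthand drops):

```python
import numpy as np

# Verify the asymmetric x asymmetric expansion term by term.
rng = np.random.default_rng(0)
n = 4
W_int = rng.integers(0, 256, (n, n)).astype(float)
X_int = rng.integers(0, 256, (n, n)).astype(float)
s1, o1 = 0.01, -3.0   # weight scale/offset (illustrative values)
s2, o2 = 0.02, -5.0   # activation scale/offset
J = np.ones((n, n))

W = s1 * (W_int + o1)
X = s2 * (X_int + o2)

expanded = s1 * s2 * (W_int @ X_int      # same calculation as symmetric
                      + o1 * J @ X_int   # input-dependent: unavoidable overhead
                      + o2 * W_int @ J   # weight-only: fold into the layer bias
                      + o1 * o2 * J @ J) # constant: fold into the layer bias
print(np.allclose(W @ X, expanded))      # True
```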
  • 18. How to accurately simulate quantization. Quantization is generally simulated in floating point rather than actually run in integer math. Simulated quantization ops are added to the neural network after each usage of weights and after every operation. [Diagram: Input and (Weights -> Wt quant) feed Conv / FC with Biases, then ReLU, then Act quant, then Output.]
- No dedicated kernels are necessary
- This easily allows flexible bit-widths: 1, 2, 3, 4, ...
- Makes GPU speed-up easy
- Biases are not quantized
  • 19. How to simulate asymmetric quantization with b bits. What happens inside those simulated quantization blocks? Given a floating-point value x, we quantize:
x_int = round((x - o) / s)
x_Q = clamp(x_int, min = 0, max = 2^b - 1)
x_float = x_Q · s + o
This procedure turns any value into a 'b-bit quantized' value (e.g., 8-bit), while all calculations are done in float32. It is definitely slower than training without quantization operations. These operations are added everywhere in the network; min and max for activations are set by passing batches of data through the whole network.
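A minimal sketch of such a simulated ("fake") quantization block, directly following the three formulas above (function and argument names are illustrative):

```python
import numpy as np

# Simulated asymmetric quantization with b bits. All arithmetic stays in
# floating point; values are merely snapped to the b-bit grid (s, o).
def fake_quant(x, x_min, x_max, b=8):
    n_steps = 2**b - 1
    s = (x_max - x_min) / n_steps        # scale factor
    o = x_min                            # offset
    x_int = np.round((x - o) / s)
    x_q = np.clip(x_int, 0, n_steps)     # clamp to the grid
    return x_q * s + o                   # back to float values on the grid

x = np.linspace(-1.0, 2.0, 7)
print(fake_quant(x, x_min=-1.0, x_max=2.0, b=8))
```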
  • 20. How accurate is the quantization simulation? Very accurate: there is hardly any difference between quantization simulation and real hardware, and the rounding errors are tiny.
Model: Top-1 simulated / Top-1 on-device
Resnet 50: 75.76% / 75.67%
MobileNetV2: 70.12% / 70.01%
  • 21. How well does quantizing a model work? 16-bit fixed-point quantization is always fine. Quantizing some computer vision models to 8-bit weights and activations:
Model: Top-1 accuracy / Top-1 quantized
InceptionV3: 0.78 / 0.78
NasnetMobile: 0.74 / 0.722
Resnet 50: 0.756 / 0.75
MobileNetV2: 0.749 / 0.004
DeepLabV3: 0.729 / 0.41
FastDVDNet: 0.862 / 0.696
  • 23. Overcoming the challenges of quantized training. Quantized training challenges:
- "Round" doesn't have a proper gradient (its derivative is zero almost everywhere)
- "Clamp" kills the gradient outside the grid range
Solution: redefine the gradient of these ops as "straight-through"*, then train with gradient descent as usual. [Diagram: the forward pass goes through Wt quant and Act quant; the backward pass treats them as identity.]
*Bengio et al. 2013. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
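A minimal PyTorch sketch of the straight-through estimator for rounding (illustrative; the backward pass simply passes the gradient through unchanged):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output      # straight-through: pretend round() is identity

x = torch.tensor([0.2, 0.7, 1.4], requires_grad=True)
y = RoundSTE.apply(x).sum()
y.backward()
print(x.grad)                   # tensor([1., 1., 1.]): gradient flows through
```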
  • 24. Problem: a mismatch between real and quantized values. The forward pass uses quantized values (e.g., 0.4 snapped to 0.5) while the backward pass differentiates as if the real values were used, so each calculated gradient is a little bit 'wrong'. This compounds over the whole network and makes training deep networks difficult.
  • 25. Addressing biased gradients in quantized training, through stochastic rounding and relaxed quantization (not a big problem for 8-bit quantization).
Stochastic rounding¹: round down or up at random, in proportion to the distance from the grid points. With grid step ε:
q(x) = ⌊x⌋ with probability 1 - (x - ⌊x⌋)/ε, or ⌊x⌋ + ε with probability (x - ⌊x⌋)/ε
so that in expectation E[q(x)] = x and the rounding is unbiased.
Relaxed quantization²: a differentiable relaxation of the rounding operation.
1) Gupta et al. 2015, Deep Learning with Limited Numerical Precision
2) Louizos et al. 2019, Relaxed Quantization for Discretized Neural Networks
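A short illustrative sketch of stochastic rounding (`eps` is the grid step; the unbiasedness shows up as the sample mean matching the input):

```python
import numpy as np

# Stochastic rounding to a grid with step eps (illustrative sketch).
def stochastic_round(x, eps=1.0, rng=np.random.default_rng()):
    low = np.floor(x / eps) * eps            # nearest grid point below
    p_up = (x - low) / eps                   # probability of rounding up
    return low + eps * (rng.random(x.shape) < p_up)

x = np.full(100_000, 0.3)
print(stochastic_round(x).mean())            # ~0.3: unbiased in expectation
```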
  • 26. Major improvements from learning the min and max values.
Dynamic ranges¹: adjust [min, max] on the fly while training, for example when overflow occurs.
Fully trainable min, max²: parametrize min and max, and train them alongside the full network.
With learned min and max values, ImageNet models trained to 4-bit weights and 4-bit activations have hardly any loss.
1) Wu et al. 2018, Training and Inference with Integers in Deep Neural Networks
2) Jain et al. 2019, Trained Uniform Quantization for Accurate and Efficient Neural Network Inference on Fixed-Point Hardware
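A sketch of trainable quantization ranges (illustrative; it combines the fake-quant formulas from slide 19 with the straight-through trick from slide 23 so gradients reach the range parameters; all names are mine):

```python
import torch

# Fake-quantize x with *learnable* range [qmin, qmax] (illustrative sketch).
# Rounding is done straight-through style so gradients flow to qmin/qmax.
def fake_quant_learned(x, qmin, qmax, b=8):
    n = 2**b - 1
    s = (qmax - qmin) / n
    x_int = torch.clamp((x - qmin) / s, 0, n)
    x_int = x_int + (torch.round(x_int) - x_int).detach()   # STE rounding
    return x_int * s + qmin

qmin = torch.tensor(-1.0, requires_grad=True)
qmax = torch.tensor(1.0, requires_grad=True)
x = torch.randn(1024)

loss = (fake_quant_learned(x, qmin, qmax) - x).pow(2).mean()
loss.backward()
print(qmin.grad, qmax.grad)     # the ranges receive gradients and can be trained
```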
  • 27. Quantized training solves a lot of accuracy problems. If possible, fine-tune after compression and quantization for optimal performance.
Model: Top-1 accuracy / Top-1 quantized / Fine-tuned
InceptionV3: 0.78 / 0.78 / 0.78
NasnetMobile: 0.74 / 0.722 / 0.73
Resnet 50: 0.756 / 0.75 / 0.752
MobileNetV2: 0.749 / 0.004 / 0.735
DeeplabV3: 0.729 / 0.414 / 0.725
Results from Jacob et al. 2018, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
  • 28. Making quantization practical for the masses. Our research focuses on quantization techniques that maintain accuracy while reducing engineering effort. Ideal properties for simple quantization:
- No data: data might not be available
- No back-propagation: retraining is time intensive and sensitive to hyperparameters
- No architecture changes: these require training from scratch with quantization in mind
  • 30. Data-Free Quantization Through Weight Equalization and Bias Correction: a better method for no-data quantization. The paper introduces a method for quantization without the use of data, with excellent results without any training. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling (Qualcomm Technologies Netherlands B.V.). Nagel et al. 2019, Data-Free Quantization Through Weight Equalization and Bias Correction
  • 31. Visualizing quantization rounding error. [Figure: schematic histogram of weight/activation ranges per output bin for a layer, with a b-bit quantization grid spanning [min, max].] The expected error is (1/2) · (max - min)/(2^b - 1), i.e., half of one quantization step. For example, with b = 8 and a [0, 1] range, the step is 1/255 ≈ 0.004, so the worst-case rounding error is about 0.002.
  • 32. Imbalanced weights are a common problem in practice. [Figure: distributions of weights in the 2nd layer of MobileNetV2 (ImageNet).]
  • 33. The problem occurs because of mismatched ranges. [Figure: the same schematic histogram of weight/activation ranges as slide 31.] A single [min, max] quantization grid must cover all output bins at once, so channels with small ranges are left with very few effective quantization levels.
  • 34. Per-channel quantization¹ keeps a scale s_i (and offset o_i) for each output channel i, so each channel gets its own grid. Not all hardware supports this. [Figure: schematic histogram of weight ranges with a separate b-bit quantization grid per output bin.]
[1] Krishnamoorthi 2018. Quantizing deep convolutional networks for efficient inference: A whitepaper
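A minimal sketch contrasting one per-tensor scale with per-channel scales (illustrative; the weight rows are deliberately imbalanced to mimic the histograms above):

```python
import numpy as np

# Per-tensor vs per-channel symmetric quantization of a weight matrix
# of shape (out_channels, in_channels) -- illustrative sketch.
def quant_dequant(W, s):
    return np.round(W / s) * s

W = np.random.randn(8, 16) * np.logspace(-2, 0, 8)[:, None]  # imbalanced rows

s_tensor = np.abs(W).max() / 127                 # one scale for the whole tensor
s_channel = np.abs(W).max(axis=1) / 127          # one scale per output channel

err_tensor = np.abs(W - quant_dequant(W, s_tensor)).mean()
err_channel = np.abs(W - quant_dequant(W, s_channel[:, None])).mean()
print(err_tensor, err_channel)                   # per-channel error is far lower
```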
  • 35. Cross-layer equalization scales weights in neighboring layers for better quantization. Since relu(x) = max(0, x), we have relu(s·x) = s·relu(x) for any s > 0. This scaling equivariance means we can scale two layers joined by a (P)ReLU together to optimize them for quantization.
  • 36. Finding the scaling factors for cross-layer equalization. Equalize the outputs of layer 1 with the inputs of layer 2 by setting
s_i = (1/r_i^(2)) · sqrt(r_i^(1) · r_i^(2)),
where r_i^(1) and r_i^(2) are the ranges of channel i in the weights of layer 1 and layer 2. [Figure: schematic histograms of weight ranges for layers 1 and 2; channel i of layer 1 is scaled by 1/s_i and the matching input channel of layer 2 by s_i, so both ranges become sqrt(r_i^(1)·r_i^(2)) and share the quantization grid well.]
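An illustrative numpy sketch of this procedure for two fully-connected layers joined by a ReLU (W1 has shape (hidden, in), W2 has shape (out, hidden); function and variable names are mine):

```python
import numpy as np

# Cross-layer equalization sketch: rescale the channels shared by two layers
# so both have balanced per-channel weight ranges (illustrative).
def equalize(W1, b1, W2):
    r1 = np.abs(W1).max(axis=1)          # range of each output channel of layer 1
    r2 = np.abs(W2).max(axis=0)          # range of each input channel of layer 2
    s = np.sqrt(r1 * r2) / r2            # s_i = (1/r_i2) * sqrt(r_i1 * r_i2)
    # relu(x/s)*s == relu(x) for s > 0, so the network function is unchanged:
    return W1 / s[:, None], b1 / s, W2 * s[None, :]

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((6, 4)), rng.standard_normal(6)
W2 = rng.standard_normal((3, 6))
W1e, b1e, W2e = equalize(W1, b1, W2)

x = rng.standard_normal(4)
out = W2 @ np.maximum(W1 @ x + b1, 0)
out_eq = W2e @ np.maximum(W1e @ x + b1e, 0)
print(np.allclose(out, out_eq))          # True: same function, balanced ranges
```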
  • 37. Cross-layer equalization significantly improves accuracy. [Figure: per-channel weight ranges before and after cross-layer equalization.] MobileNetV2 results for ImageNet:
Method: Top-1 Float32 / Top-1 INT8 (best) / Difference
Asymmetric quant: 71.9% / 0.1% / -71.8%
Equalization: 71.72% / 69.91% / -1.99%
  • 38. Biased quantization is a result of rounding errors. [Figure: a floating-point grid [min, max] mapped to the fixed-point grid [0, 255]; several consecutive weights all land 0.1 above their grid points.] By sheer coincidence, it could be that w̃·x ≈ w·x + 0.1·x, where w̃ is the quantized w. Since most values are rounded up, the average output of the quantized model is now bigger than that of the original, i.e., E[y] ≠ E[ỹ].
  • 39. Biased errors are very detrimental to network performance. The bias is especially strong for networks with depthwise-separable convolutions: each output channel has only 3x3 = 9 associated parameters, so rounding errors do not average out. [Figure: MobileNetV2 layer-2 biased error histogram per output channel, Dk x Dk conv.]
  • 40. Calculating the bias correction to address biased quantization. Given a weight matrix W and its quantized approximation W̃, write W = W̃ + ε. The bias of an output is then given in closed form as:
E[y] - E[ỹ] = E[Wx] - E[W̃x] = W·E[x] - W̃·E[x] = ε·E[x]
Key idea, bias correction: find ε·E[x] and subtract it from the output after quantization to correct for the bias effect.
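A small illustrative sketch of this correction (here E[x] is estimated from sample data; in the data-free setting the paper derives it from batch-norm statistics instead):

```python
import numpy as np

# Bias-correction sketch: subtract eps @ E[x] from the quantized layer output.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
s = np.abs(W).max() / 7                      # very coarse grid, to make bias visible
W_q = np.round(W / s) * s
eps = W - W_q                                # quantization error

X = rng.standard_normal((10_000, 16)) + 0.5  # inputs with non-zero mean
correction = eps @ X.mean(axis=0)            # eps * E[x]

y = X @ W.T
y_q = X @ W_q.T + correction                 # corrected quantized output
print(np.abs((y - X @ W_q.T).mean(axis=0)).max())  # biased error before
print(np.abs((y - y_q).mean(axis=0)).max())        # ~0 after correction
```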
  • 41. Bias correction removes the biased output error. [Figure: biased error histogram for the MobileNetV2 2nd layer.]
  • 42. DFQ offers state-of-the-art results. [Figure: ImageNet top-1 accuracy, data-free methods vs. methods that require training; ~D: no data needed, ~BP: no backprop needed, ~AC: no architecture changes needed.]
- Per-channel: per-channel quantization (Krishnamoorthi 2018)
- QT: quantization-aware training (Jacob et al., CVPR 2018)
- SR+DR: stochastic rounding + dynamic ranges (results from Louizos et al., ICLR 2019)
- QMN: quantization-friendly MobileNets (Sheng et al., EMC2 2018)
- RQ: relaxed quantization (Louizos et al., ICLR 2019)
  • 43. Data-free quantization recap. [Flow diagram of the data-free quantization method.]
- A simple API call results in better quantization performance
- A fully automatic pipeline gives near-original model performance without fine-tuning
- No (P)ReLU activations? Use smart clipping and bias correction
  • 45. AdaRound: adaptive rounding for neural network quantization. Introduces a new way of rounding and achieves excellent 4-bit weight results. Markus Nagel, Rana Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort (Qualcomm Technologies Netherlands B.V.). Nagel et al. 2020, Up or Down? Adaptive Rounding for Post-Training Quantization
  • 46. Rounding can be done better for quantization: introducing AdaRound, which optimizes the rounding itself. Normally we round weights to the nearest grid point, but rounding-to-nearest is not optimal. Consider the ImageNet validation accuracy of various rounding schemes for 4-bit quantization of the first layer of Resnet18 (mean and standard deviation over 100 stochastic rounding samples, plus the best of those samples):
Rounding scheme: Acc (%)
Nearest: 52.29
Ceil: 0.10
Floor: 0.10
Stochastic: 52.06 ± 5.52
Stochastic (best): 63.06
Rounding all values up or all down gives 0 performance; drawing 100 stochastic samples and picking the best one increases performance by 10%. By optimizing a layer-wise objective, AdaRound chooses the rounding of the network weights in minutes, without fine-tuning or hyperparameters.
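The gist of AdaRound as a heavily simplified PyTorch sketch (the actual method uses a rectified-sigmoid parametrization and an annealed regularizer; all names and constants here are illustrative): learn, per weight, whether to round up or down so that the layer's output is preserved.

```python
import torch

# Simplified AdaRound-style layer optimization (illustrative sketch only).
# W: float weights (out, in); X: calibration inputs (N, in); s: weight scale.
def adaround_layer(W, X, s, steps=1000, lr=1e-2, reg=1e-2):
    W_floor = torch.floor(W / s)
    # Soft rounding variable h in (0, 1), initialized from the fractional part.
    v = torch.nn.Parameter((W / s - W_floor).clamp(0.01, 0.99).logit())
    opt = torch.optim.Adam([v], lr=lr)
    Y = X @ W.t()                            # float reference output
    for _ in range(steps):
        h = torch.sigmoid(v)                 # soft up/down decision per weight
        W_soft = (W_floor + h) * s
        mse = ((X @ W_soft.t() - Y) ** 2).mean()
        # Encourage h to commit to 0 or 1 (a fixed-strength stand-in for the
        # paper's annealed regularizer).
        loss = mse + reg * (1 - (2 * h - 1).abs().pow(2)).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    h_hard = (torch.sigmoid(v) > 0.5).float()
    return (W_floor + h_hard) * s            # quantized weights, adapted rounding
```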
  • 47. AdaRound makes 4-bit weights possible, without any fine-tuning or hyperparameter tweaking. For several models, we can now have 4-bit weights while only dropping 1-2% accuracy. Compared to networks quantized to 8-bit weights and 8-bit activations, 4-bit weights with 8-bit activations speed up execution by 2x and reduce energy consumption by 2x, with virtually no additional work.
  • 49. Bayesian Bits: unifying quantization and pruning. Introduces one training scheme that automatically finds mixed-precision quantization and pruning for a network. Mart van Baalen, Christos Louizos, Markus Nagel, Rana Amjad, Tijmen Blankevoort, Max Welling (Qualcomm Technologies Netherlands B.V.). Baalen et al. 2020, Bayesian Bits: Unifying Quantization and Pruning
  • 50. Bayesian Bits: automatically learning quantization bit-width and pruning during training. We decompose quantization grids into multiple, progressively finer grids:
x_q = x_2 + z_4·ε_4 + z_8·ε_8 + z_16·ε_16 + z_32·ε_32
This allows us to introduce gating variables z that toggle the higher-bit-width contributions on or off (x_2 is the 2-bit quantization of x; each ε term is the extra resolution gained by doubling the bit-width).
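A sketch of this decomposition (illustrative; the step-size relation s_2b = s_b/(2^b + 1) for a doubled bit-width follows from both grids covering the same range, and this is my reading of the construction rather than code from the paper):

```python
import numpy as np

def snap(x, s):                          # round-to-nearest on a grid with step s
    return s * np.round(x / s)

# Decompose a quantized value into a 2-bit base plus gated residuals
# (illustrative sketch of the Bayesian Bits idea).
def bayesian_bits(x, x_max, gates):      # gates = (z4, z8, z16, z32), each 0 or 1
    bits = 2
    s = x_max / (2**bits - 1)            # 2-bit step size
    x_q = snap(x, s)                     # x_2: coarse 2-bit quantization
    for z in gates:
        s = s / (2**bits + 1)            # step of the doubled-bit-width grid
        bits *= 2
        eps = snap(x - x_q, s)           # residual quantized on the finer grid
        x_q = x_q + z * eps              # gate toggles the extra precision
    return x_q

x = np.random.rand(5)
print(bayesian_bits(x, 1.0, (1, 1, 0, 0)))   # effectively 8-bit quantization
print(bayesian_bits(x, 1.0, (0, 0, 0, 0)))   # plain 2-bit quantization
```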
  • 51. State-of-the-art performance for mixed-precision quantization, by systematically selecting the appropriate amount of precision. During training, the network automatically finds the optimal trade-off between network complexity and accuracy. The result: some layers are fine with 8 bits, others are fine with 2 bits, and some layers are pruned entirely (shown in green in the figure).
  • 52. Develop: how developers and the research community can take advantage of our quantization tools.
  • 53. Leading quantization research and fast commercialization: driving the industry towards integer inference and power-efficient AI.
Quantization research: Relaxed Quantization (ICLR 2019), Data-Free Quantization (ICCV 2019), AdaRound (ICML 2020), Bayesian Bits (ICML 2020)
Quantization commercialization: Qualcomm Neural Processing SDK, Qualcomm AI Model Efficiency Toolkit (AIMET)
Qualcomm Neural Processing SDK and Qualcomm AI Model Efficiency Toolkit are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
  • 54. Qualcomm® Neural Processing SDK: a software-accelerated runtime for the execution of deep neural networks on device. Available at: developer.qualcomm.com
Efficient execution on the Qualcomm® Snapdragon™ Mobile Platform: takes advantage of Snapdragon heterogeneous computing capabilities; the runtime and libraries accelerate deep neural net processing on all engines, the Qualcomm® Kryo™ CPU, Qualcomm® Adreno™ GPU, and Qualcomm® Hexagon™ DSP with Hexagon Vector eXtensions (HVX) and the Hexagon Tensor Accelerator (HTA).
Model framework/network support: convolutional neural networks and Long Short-Term Memory (LSTM) networks; support for Caffe/Caffe2, TensorFlow, and user/developer-defined layers.
Optimization/debugging tools: offline network conversion tools; debugging and analysis of network performance; API and SDK documentation with sample code; ease of integration into customer applications.
Qualcomm Neural Processing SDK, Qualcomm Snapdragon, Qualcomm Kryo, Qualcomm Adreno, and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
  • 55. AIMET plugs in seamlessly to the developer workflow: start from a trained TensorFlow or PyTorch AI model; apply the AI Model Efficiency Toolkit (AIMET), with quantization (data-free quantization, quantization simulation, fine-tuning) and compression (spatial SVD, channel pruning, visualization); optionally fine-tune; then deploy the optimized AI model at scale.
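For orientation, a sketch of what this workflow can look like with AIMET's PyTorch API (hedged: the module paths and signatures below are from my recollection of AIMET's public documentation and may differ between releases; treat this as pseudocode to adapt):

```python
# Illustrative AIMET workflow sketch -- check the AIMET docs for exact APIs.
import torch
from torchvision.models import mobilenet_v2

from aimet_torch.cross_layer_equalization import equalize_model   # assumed path
from aimet_torch.quantsim import QuantizationSimModel             # assumed path

model = mobilenet_v2(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# 1. Data-free quantization: cross-layer equalization (+ bias correction).
equalize_model(model, input_shapes=(1, 3, 224, 224))

# 2. Simulate quantized inference and compute activation encodings.
sim = QuantizationSimModel(model, dummy_input=dummy_input)
sim.compute_encodings(forward_pass_callback=lambda m, _: m(dummy_input),
                      forward_pass_callback_args=None)

# 3. (Optional) fine-tune sim.model with the usual training loop, then export.
```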
  • 56. AIMET makes AI models small. It includes state-of-the-art quantization and compression techniques from Qualcomm AI Research.
Features: state-of-the-art network compression tools; state-of-the-art quantization tools; support for both TensorFlow and PyTorch; benchmarks and tests for many models; developed by professional software developers.
If interested, email: aimet.support@qti.qualcomm.com
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
  • 58. Quantization improves power efficiency, performance, and memory usage. Our research addresses the limitations of traditional quantization approaches, and our user-friendly tools let developers build power-efficient AI apps on Qualcomm Snapdragon.
  • 59. Thank you. Follow us on social media; for more information, visit us at www.qualcomm.com & www.qualcomm.com/blog.
Nothing in these materials is an offer to sell any of the components or devices referenced herein. ©2018-2020 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners. References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm's licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm's engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.