1. AI Chip Trends and Forecast
Joo-Young Kim
November 6, 2019
ICT Industry Outlook Conference (ICT 산업전망컨퍼런스)
2. Outline
• Introduction
- Brief history & deep neural network models
- AI stack and new computing paradigm
• Trends in AI chips
- Cloud vs. edge DNN chips, the memory bottleneck & processing-in-memory, neuromorphic chips
• Looking forward
- Four predictions: cloud-edge convergence, broader algorithm support, AI security, software
4. Brief History of Neural Networks
F. Rosenblatt B. Widrow – M. Hoff M. Minsky – S. Papert D. Rumelhart – G. Hinton – R. Wiliams G. Hinton – R. Salakhutdinov
• Learnable weights and
Threshold
• XOR problem • Nonlinear problem solved
• High computation
• Local optima and overfitting
• Hierarchical feature
learning
1943
• Adjustable
but not
learnable
weights
W. S. McCulloch - W. Pitts
1958 1960 1969 1986 2006
Deep
Deep
Learning!
First Winter
Second Winter
- ImageNet
- AlphaGo
- Speech
translation
- Video synthesis
- Smart factory
- …
5. Deep Learning ≠ AI
• AI: any technique that enables computers to mimic human behavior (searching, planning, knowledge representation, fuzzy logic, natural language processing, genetic algorithms, …)
• Machine Learning: AI techniques that have computers learn without being explicitly programmed
• Deep Learning: a subset of ML that makes the computation of multi-layer neural networks feasible
6. Deep Learning Revolution
• ImageNet (ILSVRC) top-5 error has dropped below the human level (~5%)
• Deep learning starts to surpass human-level recognition on specific tasks
* F. Veen, The Asimov Institute, 2016
7. What Has Changed?
• Traditional pattern recognition: hand-crafted features (HoG, SIFT, Haar-like) feed simple trainable classifiers (SVM, K-Means) to produce labels ("Dog", "Ship", "Car")
• Deep learning (model + data): a DNN/CNN learns trainable features & classifiers end-to-end
• With growing amounts of data, the performance of traditional algorithms plateaus while deep learning keeps improving (Andrew Ng, Stanford CS 229 class)
A minimal sketch of the two pipelines follows.
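To make the contrast concrete, here is a minimal Python sketch of the two pipelines; the dataset, input size (32×32 grayscale), and hyperparameters are illustrative assumptions, not from the slide:

```python
# Minimal sketch contrasting the two pipelines (illustrative only).
import numpy as np
from skimage.feature import hog          # hand-crafted feature extractor
from sklearn.svm import LinearSVC        # simple trainable classifier
import torch.nn as nn

# Traditional: fixed features + shallow classifier (only the SVM learns)
def traditional_pipeline(images, labels):
    feats = np.stack([hog(img, pixels_per_cell=(8, 8)) for img in images])
    return LinearSVC().fit(feats, labels)

# Deep learning: features and classifier trained end-to-end from data
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 3),  # 3 classes: "Dog", "Ship", "Car"
)
```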
8. Popular Types of DNNs
• MLP (Multi-Layer Perceptron): fully connected; structure: input → hidden → output; major application: speech recognition; 3~10 layers; main computation: matrix-vector multiplication
• CNN (Convolutional): convolutional layers; structure: input → convolution → pooling → fully connected → output; major application: image recognition; up to ~100 layers; main computation: 3D convolution
• RNN (Recurrent): sequential data with a feedback path; structure: input → hidden (feedback) → output; major application: speech / action recognition; 3~5 layers; main computation: matrix-vector multiplication
A sketch of these main computations follows.
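As a concrete reference for the "main computation" of each type, a small NumPy sketch (all shapes are illustrative assumptions):

```python
# Per-layer "main computation" from the list above, in NumPy.
import numpy as np

# MLP / RNN core: matrix-vector multiplication (plus nonlinearity)
W, x = np.random.randn(256, 784), np.random.randn(784)
h = np.maximum(W @ x, 0)                 # fully connected layer + ReLU

# CNN core: 3D convolution over an input volume (C_in x H x W)
inp = np.random.randn(3, 32, 32)         # e.g. an RGB image
kernels = np.random.randn(16, 3, 5, 5)   # 16 output channels
out = np.zeros((16, 28, 28))
for k in range(16):                      # naive sliding-window convolution
    for i in range(28):
        for j in range(28):
            out[k, i, j] = np.sum(kernels[k] * inp[:, i:i+5, j:j+5])
```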
10. DNN Characteristics
• Requires big data & big computation
• Modern hardware enabled the deep learning revolution (e.g. GPU)
• Face recognition example:
- Local-feature-based: ~0.1 billion operations/face, ~10 MB of memory accesses/face
- Deep-learning-based: ~2 billion operations/face, ~1 GB of memory accesses/face
11. AI Stack
Application
• Video/Image: face recognition, image generation, video analysis, …
• Sound and Speech: speech recognition, speech synthesis, music generation, …
• NLP: text analysis, language translation, human-machine communication, …
• Robotics: autopilot, UAV, industrial automation, …
Algorithm
• Neural network topology: MLP, CNN, RNN, LSTM, SNN, …
• Deep neural networks: AlexNet, ResNet, GoogLeNet, …
• Neural network algorithms: reinforcement learning, adversarial learning, …
• Machine learning algorithms: SVM, K-NN, decision tree, Markov chain, …
Chip
• Neuromorphic chip: brain-inspired computing, biological brain simulation, …
• Programmable chip: GPU, ASIC, FPGA, DSP, …
• System-on-Chip: multi-core, many-core, SIMD, systolic array, …
• Development tool-chain: frameworks, compiler, simulator, optimizer, …
Device
• High-bandwidth off-chip memory: HBM, DRAM, GDDR, STT-MRAM, …
• High-speed interface: SerDes, optical communication
• CMOS 3D stacking
• Emerging computing device: analog computing, memristors, …
• Emerging memory device: ReRAM, PCRAM, …
12. New Computational Paradigm
• Must handle big data
- Huge storage capacity, high-bandwidth, low-latency memory access
- "Memory wall" problem
• Large amount of computation
- Mainly linear algebraic operations, while control flow is relatively simple
- Large number of model parameters
• Training vs. inference
- Training: accuracy, data capacity (~10^18 bytes), weight synchronization
- Inference: speed, energy, hardware cost, efficient reading of weights
• Data precision / model compression / pruning
- Inference does not always require high precision (a quantization sketch follows this list)
• High configurability
- Tradeoff between energy efficiency and adaptability to new algorithms
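To illustrate why reduced precision works for inference, a minimal post-training 8-bit quantization sketch (the per-tensor scheme and all values are assumptions, not from the slide):

```python
# Post-training 8-bit quantization: 4x smaller weights, small error.
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(1024).astype(np.float32)   # pretrained weights
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale           # dequantized approximation

print(w.nbytes, q.nbytes, np.abs(w - w_hat).max())  # 4096, 1024, small
```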
14. DNN Hardware
• Mobile/edge based
- Specific AI tasks
- Real-time operation
- Limited resources
- Low power
• Cloud based
- General AI
- High computing power
- Huge memory
- Fast & accurate learning
• Cloud servers and edge terminals exchange control, data, and learned models: cloud servers rank high in global data sharing but low in real-time operation, while edge terminals are the reverse
15. Cloud-based AI Computing
• Both learning (on a training dataset) and inference run on the cloud/server, producing and serving a pre-trained network
• Example: a voice assistant device sends the question to the cloud, which returns the answer
16. DNN Chips for Cloud Server
• Nvidia (GPU)
• Google (TPU)
• Microsoft (BrainWave)
• Amazon (Inferentia)
• Facebook
• Alibaba, Baidu
Cloud servers provide stand-alone AI with high global data sharing: control based on overall conditions, and learning with data collected from edge devices (e.g. NVIDIA Volta, Google Cloud TPU)
17. Mobile/Edge-based AI Inference
• Self-driving vehicles, intelligent cameras/speakers, IoT devices
• Learning still runs on the cloud/server with the training dataset; the device/edge loads the pretrained model and runs inference locally using it
• The on-device platform connects the user interface & apps with sensors (camera, mic, GPS, gyro, touch) and local data
18. Mobile/Edge DNN Applications
• Apple
• Huawei
• Qualcomm
• ARM
• CEVA
• Cambricon
• Horizon Robotics
• Mobileye
• Tesla
Device classes span a power-consumption vs. inference-speed spectrum: IoT and wearables (low power, slow) through smartphones and drones to mobile robots and automotive (high power, fast)
19. Cloud vs Edge Summary
High Performance
High Precision
High Flexibility
Distributed
Scalable
Diverse Requirements
(Car, Wearable, IoT)
Low-Moderate Throughput
Low Latency
Power Efficiency
Low Cost
High Throughput
Low Latency
Power Efficiency
Distributed
Scalable
?
Cloud / Datacenter Edge / Mobile
InferenceTraining
20. Functional Integration
• Classic (early stage): Intel CPU, Nvidia GPU, Xilinx FPGA; domain: cloud; target workload: training oriented
• Domain specific (1st stage): MIT Eyeriss, KAIST LNPU, Google TPU, Microsoft BrainWave, …; domain: cloud/edge; target workload: inference
• Reconfigurable (2nd stage): Wave DPU, Tsinghua Thinker, …; domain: cloud/edge; target workload: inference & training
• Next stage: ?
Courtesy of GTIC 2019
21. Two Different Directions
• Be more flexible
• Be more compact
Dedicated
Diannao
2014
RS Dataflow
MIT Eyeriss
Systolic Array
Google TPU
Sparse-aware
Nvidia SCNN
Flexible Bitwidth
KAIST UNPU …
2016 2017.6 20182017.1
Compression
Pruning
EIE
2016.2
BWN TWN Low-bit Training
DoReFa-Net
Low-bit Quantization
LQ-Nets …
2016.8 2018.2 2018.92016.11
Courtesy of GTIC 2019
22. Von Neumann Bottleneck for AI
• The von Neumann architecture serially fetches data from storage
• AI applications need to access a tremendous amount of data
• The bus between the AI processor and memory becomes the bottleneck ("memory wall")
23. Increasing Memory Bandwidth
• Today's hierarchy, processor ← SRAM (cache) ← DRAM ← NVM, is constrained by the von Neumann bottleneck
• How can we increase the bandwidth between processor and memory?
27. Toward In-Memory Computing
• Traditional: processor ← SRAM (cache) ← DRAM ← NVM (von Neumann bottleneck)
• Near-memory / emerging memory: processing placed next to DRAM/NVM
• In-memory / memory-centric: many small processors embedded inside the memory arrays themselves
29. PIM Chip
Renesas’s ternary SRAM PIM for AI inference
S. Okumura, et al., “A Ternary Based Bit Scalable, 8.80 TOPS/W CNN accelerator with Many-core Processing-in-memory Architecture with 896K
synapses/mm2”, Symposium on VLSI Technology 2019
30. AI Framework
• Provides a higher-level abstraction to developers/users, e.g. (see the sketch below):
- Convolution on volumes (1 line)
- Max pooling (1 line)
- Non-linear ReLU (1 line)
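For instance, in PyTorch (one possible framework; the slide names none, and the layer sizes below are assumptions), each of those operations really is one line:

```python
# One line per layer: the abstraction an AI framework provides.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # convolution on volumes (1 line)
    nn.MaxPool2d(kernel_size=2),                 # max pooling (1 line)
    nn.ReLU(),                                   # non-linear ReLU (1 line)
)
```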
31. Hyper-Scale AI Accelerators
TPU v3 (2018)
Cerebras Wafer Scale Engine (2019)
Usually hundreds of processing units
in array structure..
How do we program this?
1.2T transistors
46,225 mm2
400,000 cores
18GB SRAM
100 Pb/s interconnect
34. Problem: No De Facto SW Tool & Hardware!
C / Java Compiler toolchain CPU
Software Hardware
OpenGL /
CUDA
Compiler toolchain GPU
Verilog / VHDL Synthesis toolchain FPGA
?
35. Neuromorphic Chip
• 1st generation
- Perceptron based
- No non-linear functions
- Binary output
• 2nd generation
- Non-linear activation functions
- Continuous output
- Functional modeling of our brain
- Working real-life applications
- We are here (FF, CNN, RNN, …)
• 3rd generation: "spiking neuron"
- Closely models a biological neuron's activity
- Incorporates the concept of time: integrate and fire (see the sketch after this list)
- Computationally expensive
- Difficult to train → not practical at the moment
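To make "integrate and fire" concrete, a minimal leaky integrate-and-fire neuron in Python (the threshold and leak constants are arbitrary assumptions):

```python
# Leaky integrate-and-fire (LIF): potential integrates input over time,
# leaks, and emits a spike whenever it crosses the threshold.
import numpy as np

def lif(input_current, threshold=1.0, leak=0.9):
    v, spikes = 0.0, []
    for i in input_current:
        v = leak * v + i        # leaky integration over time
        if v >= threshold:      # fire ...
            spikes.append(1)
            v = 0.0             # ... and reset
        else:
            spikes.append(0)
    return spikes

print(lif(np.full(20, 0.3)))    # constant input -> regular spike train
```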
36. IBM TrueNorth
• 5.4 billion transistors in a 28nm CMOS process
• 64 x 64 neurosynaptic cores, 256 neurons each
Paul A. Merolla, et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, 2014
37. IBM TrueNorth
• Mimicking synapse with SRAM
• However, SRAM is not made for this (large area, cost).
Pre-Neuron (Tx)
Post-Neuron (Rx)
Synapse is a structure that
permits a neuron to pass an
electrical signal to another.
Input Spike
1 0
0 0
1 1
8T SRAM cell
as synapse
Output Spike (Voltage)
WL
BLT
BLT
BLBLWLT
Voltage Σ ΣΣ
1
0
1
SRAM Synapse Array
38. Neuromorphic Chip with Emerging Device
• New models require devices with new physics
• FeFET: better at storing and transferring analog signals
M. Jerry, et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training," IEEE IEDM, 2017
39. Neuromorphic Chip with Emerging NV RAM
• ReRAM (memristor)
Z. Wang, et al., "Fully memristive neural networks for pattern classification with unsupervised learning," Nature Electronics, 2018
40. 1. Cloud and Edge Will Be Closer
• Edge inference & learning will be more important due to privacy concerns, real-time operation, and power constraints
• Federated learning: leverage the cloud's big-data advantage on edge devices (a sketch follows this list)
- Cloud servers broadcast a shared model to mobile devices
- Each device performs local learning, producing custom weights
- Devices send encrypted & compressed updates back; the cloud aggregates them into an updated model
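A minimal FedAvg-style sketch of that broadcast / local-learning / aggregate loop (toy NumPy weight vectors; encryption, compression, and real local objectives are omitted and the gradient is a placeholder):

```python
# Federated averaging: raw data never leaves the devices.
import numpy as np

def local_learning(shared_model, local_data, lr=0.1):
    """One on-device update step (assumed toy objective)."""
    grad = shared_model - local_data.mean(axis=0)
    return shared_model - lr * grad                  # custom weights

shared_model = np.zeros(8)                           # cloud's shared model
devices = [np.random.randn(32, 8) for _ in range(4)]

for rnd in range(10):
    # cloud broadcasts the shared model; devices learn locally
    custom_weights = [local_learning(shared_model, d) for d in devices]
    # cloud aggregates the updates into an updated model
    shared_model = np.mean(custom_weights, axis=0)
```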
41. 2. AI Chips Will Support More Algorithms
• State-of-the-art algorithms are moving from traditional MLP, CNN, and RNN to GAN, reinforcement learning, and unsupervised learning
• Chip support is evolving accordingly: inference only (MLP/RNN or CNN) → inference only (MLP/CNN/RNN) → inference + training (MLP/CNN/RNN) → inference + training (GAN/RL/unsupervised/MLP/CNN/RNN)
42. 3. AI Security Will Be Essential
• DNN-based recognition is easy to break
• New cyberattack: imperceptible noise injection (sketch below)
• Demonstrated: breaking state-of-the-art face recognition, and physical attacks on autonomous vehicles
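As one concrete example of such noise injection, a minimal FGSM-style sketch in PyTorch (the slide does not name FGSM; the stand-in model, input shape, and epsilon are all assumptions):

```python
# Fast gradient sign method: add a perturbation too small for humans to
# notice, crafted to increase the classifier's loss.
import torch
import torch.nn as nn

def fgsm_attack(model, x, label, eps=0.01):
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()   # looks identical, misleads the DNN

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
x, y = torch.randn(1, 3, 32, 32), torch.tensor([3])
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())                  # perturbation bounded by eps
```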
43. 4. For the Success of AI Chips, SW Is the Key
• How did ARM dominate the mobile processor market?
- Low power consumption with reasonable performance
- ARM's competent compiler toolchain & licensing strategy
• Why did the GPU have big success in the early DNN revolution?
- Because of CUDA, a generic programming language for data-intensive workloads like matrix-vector multiplication
- CUDA matured for several years before developers actually adopted it
44. AI Chip Researches at KAIST
Multi-core OR
Processor
Dual
Layered
3-stage
Pipeline
Simultaneous
Multi-threading
Multi-classifier
System
Multi-core
MIMD
2008 2009 2010 2012 2013
Visual
Attent
ion
Tomato
Sauce
$2.60
Heterogeneous
Many-SIMD
20142011 2015 2016 2017
Multi-Modal UI/UX
Deep Learning Core
Tan
k
Rob
ot
Recogni
tion
Result
Sen
sing
Convolution
Cluster 0
FC LSTM
Processor
Ext. Gateway
Convolution
Cluster 3
Convolution
Cluster 1
Convolution
Cluster 2
CNN
Ctrlr.
Aggregation
Core
Top
Ctrlr.
Ext.Gateway
Stereo Matching
Processor
Face
Recognition
& CNN–RNN
2018 2019
Core
#1
Core
#2
Core
#3
Ext.
IF#0
Aggregation Core
1-DSIMDCoreTopCtrlr.
4000mm
WMEM
Ext.
IF#1
AFL
LBPE#0
LBPE#1
LBPE#2
LBPE#3
LBPE#4
LBPE#5
Matching
Core
Pipelined CNN PE
FMEM2
FMEM0
FWD/BWD Unit
CNN
Core1
Custom
RISC
WMEM
FMEM1
LocalDMA
Ext. I/F Ext. I/FTop Controller
ICP-PSO Engine
NN
PIM 0
NN
PIM 1
NN
PIM 2
NN
PIM 3
NN
PIM 4
NN
PIM 5
NN
PIM 6
NN
PIM 7
NN
PIM 8
NN
PIM 9
NN
PIM 10
NN
PIM 11
NN
PIM 12
NN
PIM 13
NN
PIM 14
NN
PIM 15
Variable Bit
DNN
& 3D HGR
Core Cluster 3Core Cluster 2
Core Cluster 1
Core1
Core3Core2
DMEM
PEL
PEL
PEL
PEL
ILB
Central Core
I/F
1
fp-unitSIMDCoreTopCtrlr.RISC
I/F
0
Process 65nm 1P8M Logic CMOS
Area 4mm × 4mm
SRAM 448 KB
Supply 0.67V – 1.1V
Power
196 mW @ 200MHz, 1.1V
2.4 mW @ 10MHz, 0.67V
Precision
Feature – bfloat16
Weight – 16/8/4'b FXP
Peak
Performance
204 GFLOPS @ 16b Weight
Ext.
IF 0
Core 1
Core 2 Core 3
Top Ctrlr.
Ext.
IF 1
UMEM
UMEMBMEM
BMEM
PE Arrays
Exp. Compressor
1-D SIMD
Supervised &
Reinforcement
Learning
Input Image
Hand Depth
Tracking
Results
-1.5cm
10cm
0cm
5cm
-5cm
7.5cm
0cm
5cm
40cm
20cm
25cm
30cm
35cm
-5cm
10cm
0cm
5cm
-5cm
10cm
0cm
5cm
40cm
20cm
25cm
30cm
35cm
X
Y
-5cm
10cm
0cm
5cm
-5cm
10cm
0cm
5cm
40cm
20cm
25cm
30cm
35cm
X
Y
X
Y
Hand
Tracking
Accuracy
2.6mm@20cm
4.6mm@30cm
3.4mm@40cm
5cm
Seperated
VGA
Cameras
22.5cm
40.5cm