We update the DLA system introduction here, covering design, add-on functions, and applications. During 2018~2019, we developed the tools needed for IC simulation and verification, constructed a quantization-aware and HW-aware training flow, and improved the automation of verification. We have verified this system on FPGA and in a solid-state SoC.
1. Copyright 2020 ITRI 工業技術研究院
ITRI DLA Accelerating System
design, system, tools, and applications
工業技術研究院 Industrial Technology Research Institute (ITRI)
資訊與通訊研究所 Information and Communication Research Lab (ICL)
2. Copyright 2020 ITRI 工業技術研究院
CNN Models Advance Fast
Source: Alfredo Canziani, 2017
We need high accuracy with low computation. There are many computer vision tasks and DNN models; classification is the basic one.
[Chart: accuracy vs. computation of different DNN models for the same classification task on the ImageNet database.]
3. Copyright 2020 ITRI 工業技術研究院
Three Steps Toward a Highly Efficient Accelerator
1. Increase MAC PEs with high parallelism
2. Ensure the data supply to those PEs
3. Improve energy efficiency, adapting to the models
[Figure: computation power and throughput curves illustrating steps 1~3, taking AlexNet as the example, given various DRAM bandwidths.]
Convolution contains many independent MACs with overlapping data.
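The throughput-vs-bandwidth trade-off behind steps 1~3 can be sketched as a simple roofline model (a generic sketch, not ITRI's actual profiler): attainable MAC throughput is capped either by the PE array's peak rate (step 1) or by how fast DRAM can feed data (step 2).

```c
#include <assert.h>

/* Attainable throughput (MAC/s) under a simple roofline model:
 * capped by the PE array's peak rate or by DRAM bandwidth times
 * the workload's operational intensity (MACs per byte moved). */
double attainable_mops(double peak_mops, double dram_bw_bytes_per_s,
                       double macs_per_byte) {
    double memory_bound = dram_bw_bytes_per_s * macs_per_byte;
    return memory_bound < peak_mops ? memory_bound : peak_mops;
}
```

For example, a 256-MAC array at 1 GHz peaks at 256e9 MAC/s; a layer that performs 8 MACs per byte fetched saturates the array only when DRAM supplies at least 32 GB/s, which is why the throughput curves flatten at different DRAM bandwidths per model.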
4. Copyright 2020 ITRI 工業技術研究院
FPS/Throughput of Various Models
-- profiled using a 256-MAC, 128 KB-buffer, INT8 DLA inference configuration
5. Copyright 2020 ITRI 工業技術研究院
C2C (Computation-to-Communication) Ratio Preference in Classification Models
AlexNet prefers memory bandwidth, due to its heavy-weight FC layers (heavy parameters in the last 3 FC layers).
Inception prefers more computation power, because of its many branches of CNN computation; the concat layer, which concatenates small CNN layers, is memory-BW free.
ResNet prefers evenly balanced memory bandwidth and computation power; the element-wise add of two activations needs bandwidth.
MobileNet prefers more memory bandwidth: its DW-CONV layers (depth-wise and point-wise CONV replacing conventional CONV) reduce computation but increase intermediate activations.
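These preferences follow from each layer's computation-to-communication ratio: MACs performed per byte of weights and activations moved. A rough per-layer estimate (a sketch only; a real profiler also models tiling, reuse, and buffer capacity):

```c
#include <assert.h>

/* Rough compute-to-communication ratio of a CONV layer, in MACs per
 * byte, assuming INT8 (1 byte per weight/activation) and a single
 * pass: every input, weight, and output byte moves exactly once. */
double conv_c2c(int out_h, int out_w, int out_c,
                int in_h, int in_w, int in_c,
                int k_h, int k_w) {
    double macs = (double)out_h * out_w * out_c * in_c * k_h * k_w;
    double weight_bytes = (double)out_c * in_c * k_h * k_w;
    double act_bytes = (double)in_h * in_w * in_c
                     + (double)out_h * out_w * out_c;
    return macs / (weight_bytes + act_bytes);
}
```

A fully connected layer is the degenerate case with all spatial and kernel dimensions equal to 1: MACs and weight bytes are both in_c * out_c, so the ratio approaches 1 MAC/byte, i.e. bandwidth-bound, matching AlexNet's FC-heavy profile, while a 3x3 CONV over a 56x56x64 tensor lands in the hundreds of MACs per byte.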
7. Copyright 2020 ITRI 工業技術研究院
Customization Flow for an Accelerator
[Flow diagram, from NN analysis and synthesis to inference:
Analysis - the user's AI framework & models and the user's PPA SPEC feed a Model Parser and a Coarse Compile & PPA Profiler, producing a candidate HW SPEC.
Synthesis - a Framework Converter, an 8-bit Retrain Framework, a Network Compiler, and a HW Assembler build the NV-DLA-based Accelerator (HW).
Inference - the APP calls the API; the API and Driver (FW), backed by the HW Library, drive the Accelerator (HW).]
8. Copyright 2020 ITRI 工業技術研究院 8
DLA Architecture Customizable and Configurable
1. Variable CONV MAC resources
• 64-MAC to 2048-MAC for the convolution processor
• Variable size of convolutional buffer
2. Configurable NN operator processors
• Options for batch normalization, PReLU, scale, bias, quantization, and element-wise operators
• Options for down-sample (e.g., pooling) operators
• Options for nonlinear LUTs
• Options for users to add new processors
3. Custom memories and host CPUs
• Can be driven by the user's MCU or CPU
• Options for shared or private DRAM / SRAM / NVM
[Block diagram: the DLA IP contains a Convolutional Processor, Element-Wise Processor, Pool Processor, Nonlinear Processor, and a slot for the user's new processor, plus an Interface Unit, Configuration Unit, and Flow Controller. AXI and APB bridges connect it to the bus, DRAM IF, custom SRAM IF, high-speed IO, and peripherals. Option A: SoC integration; option B: board integration with a custom host system (CPU, DSP).]
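The customization points above amount to a build-time parameter set. A hypothetical sketch of such a record follows; the field names are illustrative, not ITRI's actual configuration interface:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical compile-time configuration for one DLA instance;
 * field names are illustrative, not the actual ITRI interface. */
typedef struct {
    int mac_count;        /* 64 .. 2048 convolution MACs          */
    int cbuf_kb;          /* convolutional buffer size in KB       */
    bool has_pool;        /* down-sample (pooling) processor       */
    bool has_lut;         /* nonlinear LUT processor               */
    bool has_elementwise; /* element-wise operator processor       */
    bool private_sram;    /* private SRAM vs. shared system memory */
} dla_config;

/* Valid MAC counts in the reference SPECs are powers of two
 * between 64 and 2048, with a non-empty convolutional buffer. */
bool dla_config_valid(const dla_config *c) {
    int m = c->mac_count;
    if (m < 64 || m > 2048) return false;
    if (m & (m - 1)) return false;   /* not a power of two */
    return c->cbuf_kb > 0;
}
```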
9. Copyright 2020 ITRI 工業技術研究院
DLA Reference SPECs
1. Atomic operation size (atomic-C and atomic-K) of convolution
2. Convolutional buffer structure
3. Optional: nonlinear LUT, data reshape, weight decompression
Original NVDLA        64-MAC   256-MAC  512-MAC  1024-MAC  2048-MAC
Data type             INT8     INT8     INT8     INT8      INT8
MAC for channel #     8        32       32       32        64
MAC for kernel #      8        8        16       32        32
Internal buffer size  128 KB   128 KB   512 KB   512 KB    512 KB
AXI (DBB) width       64       64       128      256       256
AXI (DBB) burst       1        1        4        4         4
CONV SRAM width       X        X        X        256       256
CONV SRAM burst       X        X        X        4         4
Status                OK       Not complete; bugs when generated

ITRI version improvements  64-MAC   256-MAC  512-MAC  1024-MAC  2048-MAC
AXI (DBB) burst            up to 8  up to 8  up to 8  up to 8   up to 8
CONV SRAM width            64       64       128      256       256
CONV SRAM burst            up to 8  up to 8  up to 8  up to 8   up to 8
Status                     OK       OK       OK       TBA       OK
Additional functions       Depth-wise convolution, up-sampling
DEV tools                  Bare-metal compiler, performance profiler, golden pattern generator & simulator
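In these SPECs the total MAC count is the product of the atomic channel size (atomic-C) and atomic kernel size (atomic-K): e.g., the 256-MAC build computes 32 channels x 8 kernels per atomic operation, and the 2048-MAC build 64 x 32.

```c
#include <assert.h>

/* Total convolution MACs implied by the atomic operation size:
 * atomic-C input channels times atomic-K kernels per operation. */
int dla_mac_count(int atomic_c, int atomic_k) {
    return atomic_c * atomic_k;
}
```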
10. Copyright 2020 ITRI 工業技術研究院 10
Features of DLA Hardware
[Figure: 3D CONV example with stride 1 and no padding, showing input width/height/channels, kernels, and output; channel-first vs. plane-first ordering.]
1. Variable HW resources
• Search for an efficient resource allocation matching the models
• Adaptive performance & power consumption
2. Suited for long-channel convolution
• Output pixel first, sharing inputs and avoiding partial-sum storage
• Supports any kernel size (n x m) with the same data flow
3. Revision for depth-wise convolution
• Output pixel first, channel = 1 convolution
• Supports any kernel size (n x m) with the same data flow
4. Data reuse and hetero-layer fusion
• Input reuse or weight reuse, selected at setup
• Fuses popular layer chains [CONV(BN)-Quantize-PReLU-Pooling]
5. Program-time hiding
• Configures layers N and N+1 simultaneously
• Hides the configuration time during layer changes
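The benefit of program-time hiding (point 5) can be quantified with a simple timing model, a sketch of my own rather than the hardware's exact accounting: without hiding, every layer's configuration time sits on the critical path; with hiding, configuring layer N+1 overlaps layer N's execution, so only the first configuration is exposed.

```c
#include <assert.h>

/* Total cycles without program-time hiding: each layer's
 * configuration time is on the critical path. */
long total_cycles_naive(const long *exec, const long *cfg, int n) {
    long t = 0;
    for (int i = 0; i < n; i++) t += cfg[i] + exec[i];
    return t;
}

/* With hiding: layer i+1 is configured while layer i executes,
 * so a configuration only stalls the pipeline when it outlasts
 * the previous layer's execution. */
long total_cycles_hidden(const long *exec, const long *cfg, int n) {
    if (n == 0) return 0;
    long t = cfg[0];                        /* first config exposed */
    for (int i = 0; i < n; i++) {
        long next_cfg = (i + 1 < n) ? cfg[i + 1] : 0;
        t += exec[i] > next_cfg ? exec[i] : next_cfg;
    }
    return t;
}
```

With three layers of 1000 execution cycles and 100 configuration cycles each, hiding removes two of the three configuration slots from the total.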
11. Copyright 2020 ITRI 工業技術研究院
Exclusive HW View of Depth-wise CONV
[Block diagram, based on the NVDLA pipeline: the convolutional processor comprises the Convolution DMA (CDMA), Convolution Buffer (CBUF), DW + original CSC, Convolution MAC Array (CMAC), and DW + original CACC. The non-convolutional processors comprise the DW + original SDP, Planar Data Processor (PDP), Cross-Channel Data Processor (CDP), RUBIK engine, and BDMA. A Global Unit (GLB, interrupt/fault), CSB master, CSB-to-APB bridge, and MCIF/CVIF to DRAM and SRAM over AXI sit around the controller.]
Fused depth-wise convolution engine:
• DW data-flow controller on CSC, CACC, and SDP
• Fuses DW-CONV with BN and PReLU
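Functionally, the depth-wise engine computes a per-channel convolution (one kernel per channel, channel multiplier 1). A minimal plain-C reference for the stride-1, no-padding case, the kind of golden model usable in pattern checks (this is a generic reference, not ITRI's RTL data flow):

```c
#include <assert.h>

/* Reference depth-wise convolution: each input channel is convolved
 * with its own k_h x k_w kernel, stride 1, no padding.
 * Layouts: in[c][h][w], kern[c][kh][kw], out[c][oh][ow]. */
void dw_conv(const int *in, const int *kern, int *out,
             int ch, int h, int w, int k_h, int k_w) {
    int oh = h - k_h + 1, ow = w - k_w + 1;
    for (int c = 0; c < ch; c++)
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                int acc = 0;
                for (int ky = 0; ky < k_h; ky++)
                    for (int kx = 0; kx < k_w; kx++)
                        acc += in[(c * h + y + ky) * w + x + kx]
                             * kern[(c * k_h + ky) * k_w + kx];
                out[(c * oh + y) * ow + x] = acc;
            }
}
```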
12. Copyright 2020 ITRI 工業技術研究院
NN-to-DLA Translator Flow and Verification
[Flow: the Model Graph goes through Model Parse, Layer Fuse, and Layer Partition into a layer queue of CFGs, emitted as API calls in C. HW-aware quantization is inserted, using direct quantization or re-training (TensorFlow); the Model Weights are converted and partitioned into Quantized Weights.]
Bare-metal Inference Example
1. Allocate free memory space
2. Capture an image
3. Call the coarse object-detection API
4. Draw bounding boxes; capture each ROI
5. Call the detailed classification API
6. Post-process, then loop back
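Steps 2~6 can be sketched as a C driver loop. All API names here (stub_capture_image, stub_detect_coarse, stub_classify) are hypothetical stand-ins for the bare-metal library, shown only to illustrate the call sequence, with stubs so the sketch is self-contained:

```c
#include <assert.h>

/* Hypothetical bounding box returned by the coarse detector. */
typedef struct { int x, y, w, h; } roi_t;

/* Stubbed stand-ins for the bare-metal APIs (illustrative names,
 * not the real library): a real port calls the DLA driver here. */
static int stub_capture_image(unsigned char *buf) { (void)buf; return 0; }
static int stub_detect_coarse(const unsigned char *img, roi_t *rois,
                              int max) {
    (void)img;
    if (max < 1) return 0;
    rois[0] = (roi_t){ 0, 0, 32, 32 };   /* pretend one object found */
    return 1;
}
static int stub_classify(const unsigned char *img, const roi_t *r) {
    (void)img; (void)r;
    return 7;                            /* pretend class id 7 */
}

/* One pass of the bare-metal inference loop (steps 2-6): capture,
 * coarse detection, then detailed classification per ROI.
 * Returns the number of ROIs classified. */
int inference_pass(unsigned char *img, int *classes, int max_rois) {
    roi_t rois[8];
    if (max_rois > 8) max_rois = 8;
    stub_capture_image(img);                          /* step 2 */
    int n = stub_detect_coarse(img, rois, max_rois);  /* step 3 */
    for (int i = 0; i < n; i++)                       /* steps 4-5 */
        classes[i] = stub_classify(img, &rois[i]);
    return n;                                         /* step 6: loop */
}
```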
17. Copyright 2020 ITRI 工業技術研究院
NN-to-DLA Model Translation Tools
for Profile and Bare-metal Compile
Netron supports:
ONNX (.onnx, .pb, .pbtxt), Keras (.h5, .keras), Core ML (.mlmodel), Caffe (.caffemodel, .prototxt), Caffe2 (predict_net.pb, predict_net.pbtxt), MXNet (.model, -symbol.json), TorchScript (.pt, .pth), NCNN (.param), and TensorFlow Lite (.tflite).
Intermediate format: Caffe-based, extended for
• asymmetric padding
• quantized layers
[Flow: NN models are compiled/translated into the NN graph and real parameters; a Pattern Generator, Parameter Formatter, and HW Config Generator feed, through a MUX, either the DLA system or the GUI profiler.]
18. Copyright 2020 ITRI 工業技術研究院
Integrated Netron Executable Version
[GUI screenshot: DNN model and DLA configuration panes.]
https://github.com/SCLUO/Open-DLA-Performance-Profiler
1. MAC Utilization: average MAC utilization under the aggressive FPS
2. Roofline Factor: the ratio of memory-access cycles to total cycles
3. Conservative FPS: assumes memory access and computation are fully overlapped
4. Aggressive FPS: assumes memory access and computation are fully interleaved
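A sketch of how metrics 2~4 can be derived from per-frame compute and memory cycle counts. One interpretation hedge: I read "fully overlapped" as memory cycles stacking on top of compute cycles (so the conservative figure is the lower FPS) and "fully interleaved" as the longer of the two dominating; the profiler's exact formulas may differ.

```c
#include <assert.h>

/* Per-frame cycle counts at a given clock, all assumed known. */
typedef struct { double compute_cycles, mem_cycles, clock_hz; } frame_model;

/* Conservative FPS: memory access stacks on computation (cycles add). */
double fps_conservative(frame_model m) {
    return m.clock_hz / (m.compute_cycles + m.mem_cycles);
}

/* Aggressive FPS: memory access hides behind computation
 * (the longer of the two dominates the frame time). */
double fps_aggressive(frame_model m) {
    double longest = m.compute_cycles > m.mem_cycles
                   ? m.compute_cycles : m.mem_cycles;
    return m.clock_hz / longest;
}

/* Roofline factor: share of total cycles spent on memory access. */
double roofline_factor(frame_model m) {
    return m.mem_cycles / (m.compute_cycles + m.mem_cycles);
}
```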
23. Copyright 2020 ITRI 工業技術研究院 23
Implementation of USB Accelerator
Host Linux machine
1. Load RISC-V INIT, NN CFGs, Weights
2. Capture an image + preprocessing
3. Call object detection (YOLO) Start
4. Return output; send next image
5. Draw bounding boxes, display
Handshake: INIT → Ready → Send Image + Start → Done → Read output → Ready for next image + Start
[Address map of the USB stick, from 0x0: RV INIT, NN CFGs, image, weights, swap, output.]
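The host/stick handshake above can be modeled as a small state machine. The state names mirror the labels in the diagram; the transition function itself is my own framing, not the firmware's actual implementation:

```c
#include <assert.h>

/* Handshake states as labeled in the flow: the stick signals Ready,
 * the host sends an image with Start, the stick raises Done, and the
 * host reads the output before re-arming with the next image. */
typedef enum { ST_INIT, ST_READY, ST_RUNNING, ST_DONE } usb_state;
typedef enum { EV_INIT_LOADED, EV_SEND_IMAGE_START,
               EV_INFER_FINISHED, EV_OUTPUT_READ } usb_event;

/* Returns the next state; an out-of-sequence event is ignored. */
usb_state usb_step(usb_state s, usb_event e) {
    switch (s) {
    case ST_INIT:    return e == EV_INIT_LOADED      ? ST_READY   : s;
    case ST_READY:   return e == EV_SEND_IMAGE_START ? ST_RUNNING : s;
    case ST_RUNNING: return e == EV_INFER_FINISHED   ? ST_DONE    : s;
    case ST_DONE:    return e == EV_OUTPUT_READ      ? ST_READY   : s;
    }
    return s;
}
```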
24. Copyright 2020 ITRI 工業技術研究院
USB Acceleration System
• RV32-IM RISC-V & 64-MAC DLA on CESYS EFM-03 (Xilinx Artix-7) @100 MHz, achieving 3 inferences per second (3 fps) for Tiny YOLO v1
• RV32-IM RISC-V & 256-MAC DLA on Xilinx ZCU102 @150 MHz, achieving 9 inferences per second (9 fps) for Tiny YOLO v1
• RV32-IM RISC-V & 2048-MAC DLA on Xilinx VCU118 @150 MHz, achieving 21 inferences per second (21 fps) for Tiny YOLO v1
[Photos: a Linux mini PC with the USB FPGA accelerator and a live USB camera, with test figures shown from a Windows PC; the VCU118 FPGA prototype of the USB accelerator connected to a notebook host over the USB interface.]
25. Copyright 2020 ITRI 工業技術研究院
ASIC Implementation
[Layout view of the chip, with input/output pads.]
• RV32-IM RISC-V & 64-MAC DLA
• Clock network optimization
• Register reduction
• Data path pipeline retiming
• Coarse & fine-grained clock gating
[Block view: USB GPIF, DRAM IF, RISC-V with cache, DLA, AXI and APB buses, peripherals, and PLL; photo of the SoC EVB.]
Demo video
https://www.youtube.com/watch?v=qKF82386Wf4
26. Copyright 2020 ITRI 工業技術研究院
ZCU102 FPGA Object Detection Setup
[Setup: the ARM CPU (processing system) and the DLA (FPGA fabric) share the 1 GB DRAM through the DRAM controller, alongside DP and USB peripherals. Most of the DRAM is OS-controlled; roughly 64 MB is reserved for the DLA, holding the input image, model weights, temporary activations, and output data.]
ARM-side flow: Program INIT → Set parameters → Load weights → Image capture (YUV) → Re-format to RGB → Activate DLA → (DLA finished) → Post-processing → Display
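The "Re-format to RGB" step converts each captured YUV pixel before the DLA sees it. A common fixed-point BT.601 full-range conversion is sketched below; the slide does not say which YUV variant the camera delivers, so the coefficient choice is an assumption:

```c
#include <assert.h>

static unsigned char clamp_u8(int v) {
    return v < 0 ? 0 : v > 255 ? 255 : (unsigned char)v;
}

/* One-pixel YUV-to-RGB conversion, BT.601 full range, fixed point
 * (coefficients scaled by 256). Assumes the camera delivers
 * full-range YUV; adjust coefficients for studio-range BT.601
 * or BT.709 sources. */
void yuv_to_rgb(unsigned char y, unsigned char u, unsigned char v,
                unsigned char *r, unsigned char *g, unsigned char *b) {
    int c = y, d = u - 128, e = v - 128;
    *r = clamp_u8(c + (359 * e >> 8));             /* + 1.402 * V'        */
    *g = clamp_u8(c - ((88 * d + 183 * e) >> 8));  /* - 0.344 U' - 0.714 V' */
    *b = clamp_u8(c + (454 * d >> 8));             /* + 1.772 * U'        */
}
```

Neutral chroma (U = V = 128) must map to gray, which makes a convenient sanity check for any re-format routine.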
29. Copyright 2020 ITRI 工業技術研究院 29
Features of ITRI’s Solutions
• Support from profiling to implementation
▪ Profiler, NN-to-DLA translator, SoC/FPGA references
• Support complete inference on RTL simulation
▪ Accurate, straightforward for conventional IC design
• Support of various DLA SPECs, from 64 to 2048 MAC cores
▪ Successful ASIC and FPGA implementation references
▪ Exclusive operator support (DW CONV, up-sample)
• Collaboration with our compiler and software partner, Skymizer
• Complete HW-aware integer training flow
▪ Transparent model compression and quantization
30. Copyright 2020 ITRI 工業技術研究院
Our Services
Design Reference / License
DLA series with verification tool kits
Exclusive architecture of NN operator (DW-CONV, up-sample…)
Design Consulting / Service
System performance analysis and consulting
Customization of efficient & exclusive HW
HW-aware model compression
Design & Application Service
DNN model profiling, analysis, and NN-to-DLA translation
HW-aware quantization and re-training