1. Pedro Trancoso
Chalmers University of Technology,
Gothenburg, Sweden
VEDLIoT Cognitive IoT
Hardware Platform,
Accelerators and Co-Design
26. September 2023
F. Qararyah, S. Zouzoula, M. Waqar, P. Trancoso, M.
Rothmann, M. Tassemeier, M. Porrmann, D. Ödman,
H. Salomonsson, F. Porrmann, R. Griessl, N. Kucza, K.
Mika, C. Stollenwerk, M. Kaiser, L. Tigges, J.
Hagemeyer
5. 5
▪ Heterogeneous hardware platform
▪ Resource Efficient Cluster Server (RECS)
platform: cloud to edge
▪ u.RECS for far edge with 3 slots:
▪ NVIDIA NX – embedded GPU
▪ SMARC 2.1 – FPGA, CPU
▪ M.2 – dedicated accelerator
Cognitive IoT HW Platform
6. 6
▪ Accelerators “landscape”
▪ Evaluated a multitude of accelerators (CPUs, GPUs, FPGAs, ASICs)
▪ Same model (YOLOv4), different batch sizes (1, 4, 8)
▪ Efficiency from 100 GOPS/W to 1250 GOPS/W
Accelerators
7. 7
Accelerators
▪ Accelerators “landscape”
▪ Evaluated a multitude of accelerators (CPUs, GPUs, FPGAs, ASICs)
▪ Same model (YOLOv4), different batch sizes (1, 4, 8)
▪ Efficiency from 100 GOPS/W to 1250 GOPS/W (worked example below)
▪ Identified different categories in the landscape: high-performance, low power, efficiency
FPGA-based accelerators:
• Flexibility
• Reconfigurability
• Efficiency
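For context, the efficiency figures above are simply sustained throughput divided by power draw. A minimal sketch of the arithmetic, using made-up throughput/power numbers rather than the measured VEDLIoT results:

```cpp
#include <cstdio>

// Hedged sketch: how GOPS/W efficiency figures are obtained.  The
// throughput/power values below are illustrative placeholders, not
// measurements from the VEDLIoT evaluation.
int main() {
    struct Device { const char* name; double gops; double watts; };
    Device devices[] = {
        {"embedded CPU (illustrative)",      200.0,  2.0},  // ~100 GOPS/W
        {"embedded GPU (illustrative)",     5000.0, 10.0},  // ~500 GOPS/W
        {"ASIC accelerator (illustrative)", 5000.0,  4.0},  // ~1250 GOPS/W
    };
    for (const Device& d : devices)
        printf("%-32s %8.0f GOPS / %4.1f W = %6.1f GOPS/W\n",
               d.name, d.gops, d.watts, d.gops / d.watts);
    return 0;
}
```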
8. 8
Accelerators – Xilinx DPU
● Baseline for evaluation of FPGA accelerators developed in VEDLIoT
● Xilinx Deep Learning Processor Unit (DPU)
○ Programmable engine for convolutional neural networks
○ Easy integration as an IP core in
Xilinx UltraScale+ and Versal MPSoCs
○ Configurable hardware architecture
(e.g., parallelism, memory/DSP usage)
● Large design space
○ Goal: Find the best-suited implementation for your requirements (see the configuration sweep below)
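As a rough illustration of what exploring that design space can look like, the sketch below sweeps a handful of DPU-like configuration knobs against an assumed DSP budget. The parameter names echo typical DPU options (pixel/channel parallelism, RAM usage), but the resource and performance model is a made-up placeholder, not Xilinx's estimator.

```cpp
#include <cstdio>
#include <vector>

// Hedged sketch of design-space exploration over DPU-like configurations.
struct DpuConfig {
    int pixel_parallel;    // pixels processed per cycle
    int channel_parallel;  // channels processed per cycle
    bool high_ram_usage;   // trade BRAM for fewer external accesses
};

int main() {
    std::vector<DpuConfig> candidates;
    for (int pp : {2, 4, 8})
        for (int cp : {8, 16})
            for (bool ram : {false, true})
                candidates.push_back({pp, cp, ram});

    const double dsp_budget = 1000.0;  // assumed budget of the target FPGA
    const DpuConfig* best = nullptr;
    double best_perf = 0.0;

    for (const DpuConfig& c : candidates) {
        double dsps = 10.0 * c.pixel_parallel * c.channel_parallel;  // placeholder cost model
        double perf = c.pixel_parallel * c.channel_parallel * (c.high_ram_usage ? 1.2 : 1.0);
        if (dsps <= dsp_budget && perf > best_perf) { best_perf = perf; best = &c; }
    }
    if (best)
        printf("best fit: pp=%d cp=%d high_ram=%d (score %.1f)\n",
               best->pixel_parallel, best->channel_parallel,
               best->high_ram_usage, best_perf);
    return 0;
}
```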
9. 9
Dynamic reconfiguration of Xilinx DPU
● Change the characteristics of the DL accelerator at run-time, as sketched below
(e.g., change the performance-power or performance-accuracy trade-off)
Different modes of operation:
• High-performance versus low-power
• City versus highway driving
• …
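A minimal sketch of what such run-time mode switching could look like from software, assuming a hypothetical load_partial_bitstream() helper and illustrative bitstream file names; the actual VEDLIoT reconfiguration mechanism may differ.

```cpp
#include <cstdio>
#include <string>

// Hedged sketch of run-time mode selection for a reconfigurable DL accelerator.
enum class Mode { HighPerformance, LowPower };

static void load_partial_bitstream(const std::string& path) {
    // Placeholder: a real system would hand the file to its reconfiguration
    // interface, e.g. the Linux fpga_manager framework.
    printf("loading %s\n", path.c_str());
}

static void switch_mode(Mode m) {
    switch (m) {
        case Mode::HighPerformance:
            load_partial_bitstream("dpu_highperf.bit");  // assumed file name
            break;
        case Mode::LowPower:
            load_partial_bitstream("dpu_lowpower.bit");  // assumed file name
            break;
    }
}

int main() {
    switch_mode(Mode::HighPerformance);  // e.g. demanding driving scenario
    switch_mode(Mode::LowPower);         // e.g. steady highway driving
    return 0;
}
```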
10. 10
STANN – Synthesis Templates for ANNs (1/2)
● Library for simple yet efficient generation of DL accelerators on FPGAs (see the layer-template sketch below)
● Templates for common layers
○ Network architecture parameterizable
e.g., number of neurons
○ Hardware implementation parameterizable
e.g., parallelism of processing units
● Resource efficiency by flexible quantization
○ Floating-point and integer types from 32-bit down to 8-bit
● High-level synthesis enables fast design space exploration
○ Automatic code generation based on ONNX description
○ Highly parameterizable:
Reuse of hardware blocks vs. parallel execution
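To give a flavour of the template-based approach, below is a hedged sketch of a fully connected layer in the spirit of STANN (not the actual STANN code): layer size, data type, accumulator type, and degree of parallelism are all template parameters, so the same source can be synthesized into very different hardware.

```cpp
#include <cstdint>

// Hedged sketch of a parameterizable dense-layer template (illustrative only).
template <typename T, typename ACC, int IN, int OUT, int PAR>
void dense_layer(const T in[IN], const T weights[OUT][IN],
                 const T bias[OUT], T out[OUT]) {
    for (int o = 0; o < OUT; o += PAR) {
        // Under HLS, a pragma such as "#pragma HLS unroll factor=PAR" here
        // would map the PAR iterations onto parallel processing units.
        for (int p = 0; p < PAR; ++p) {
            ACC acc = bias[o + p];
            for (int i = 0; i < IN; ++i)
                acc += ACC(weights[o + p][i]) * ACC(in[i]);
            out[o + p] = T(acc);
        }
    }
}

// Example instantiations: 8-bit integers with 32-bit accumulation vs. plain
// float, and different degrees of parallelism.
template void dense_layer<int8_t, int32_t, 64, 32, 4>(
    const int8_t[64], const int8_t[32][64], const int8_t[32], int8_t[32]);
template void dense_layer<float, float, 64, 32, 1>(
    const float[64], const float[32][64], const float[32], float[32]);
```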
11. 11
STANN – Synthesis Templates for ANNs (2/2)
● STANN enables inference and training on FPGAs
● Training with Dataflow Architecture
○ Forward path similar to inference,
but needs to store more intermediate values
○ Backpropagation and weight-update module for each layer (sketched below)
● Fast, but uses a lot of resources
● Well suited for small networks,
used, e.g., in deep reinforcement learning
● Application example: Motor control
○ DQN (Deep Q-Network)
replaces manual parameter tuning
○ Used in OC Project Power Edge RL
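A hedged companion to the dense-layer sketch above: the per-layer backward/weight-update module that such a dataflow training architecture instantiates. This is illustrative only, not STANN's template; activation functions are omitted.

```cpp
// Hedged sketch of a per-layer backpropagation and weight-update module.
template <typename T, int IN, int OUT>
void dense_layer_backward(const T in[IN], const T grad_out[OUT],
                          T weights[OUT][IN], T grad_in[IN], T lr) {
    for (int i = 0; i < IN; ++i) grad_in[i] = T(0);
    for (int o = 0; o < OUT; ++o) {
        for (int i = 0; i < IN; ++i) {
            grad_in[i] += weights[o][i] * grad_out[o];   // propagate gradient
            weights[o][i] -= lr * grad_out[o] * in[i];   // update weight
        }
    }
}

// Example instantiation matching the forward sketch above.
template void dense_layer_backward<float, 64, 32>(
    const float[64], const float[32], float[32][64], float[64], float);
```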
Rothmann, M.; Porrmann, M.: STANN – Synthesis Templates for Artificial Neural Network Inference and Training. In: 17th International Work-Conference on Artificial Neural Networks (IWANN 2023), Ponta Delgada, Azores, Portugal, June 19-21, 2023
12. 12
Accelerators - FiBHA (1/3)
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5 2022
Accelerator design spectrum: Generic ↔ Dedicated
13. 13
Accelerators - FiBHA (2/3)
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5 2022
Monolithic design
● One engine computes all the core layers
● E.g., TPU
SEML
● One engine computes all layers of the same type
● PW (pointwise) engine, DW (depthwise) engine
SESL
● One engine per layer
● E.g., FINN
FiBHA
● SESL + SEML (see the budget-split sketch below)
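The sketch below illustrates the hybrid idea under a fixed PE budget: the first, more heterogeneous layers each get a dedicated SESL-style engine, while the remaining layers share one SEML-style engine. The layer data and the proportional split heuristic are illustrative placeholders, not the algorithm published in the FiBHA paper.

```cpp
#include <cstdio>
#include <vector>

// Hedged sketch of splitting a fixed PE budget between a SESL part and a SEML part.
struct Layer { const char* name; int macs_millions; };

int main() {
    std::vector<Layer> net = {
        {"conv1", 20}, {"dw_conv2", 5}, {"pw_conv2", 15},
        {"conv3", 60}, {"conv4", 80}, {"conv5", 80},
    };
    const int pe_budget = 4096;
    const int dedicated_layers = 3;  // assumed cut point between SESL and SEML parts

    // Give each dedicated layer PEs in proportion to its work; the rest is shared.
    int sesl_work = 0, total_work = 0;
    for (size_t i = 0; i < net.size(); ++i) {
        total_work += net[i].macs_millions;
        if (i < (size_t)dedicated_layers) sesl_work += net[i].macs_millions;
    }
    int sesl_pes = pe_budget * sesl_work / total_work;
    int seml_pes = pe_budget - sesl_pes;

    printf("SESL part: %d layers, %d PEs (one engine per layer)\n",
           dedicated_layers, sesl_pes);
    printf("SEML part: %zu layers, %d PEs (shared engine)\n",
           net.size() - dedicated_layers, seml_pes);
    return 0;
}
```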
14. 14
Accelerators - FiBHA (3/3)
● FiBHA compared to both alternatives
○ Up to 4x throughput improvement compared
to SESL (FINN)
■ Better use of the resource budget
○ Up to 1.7X throughput improvement
compared to SEML
■ Capturing more heterogeneity
● FiBHA compared to SEML
○ Representative set of heterogeneous CNNs
○ Various resource budgets
■ 1024 PEs - 4096 PEs
○ FiBHA consistently outperforms SEML
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5 2022
15. 15
Memory Analysis/Recommendation Tool - Rainbow
● Set of different analyses for on-chip memory and off-chip data transfers (see the sketch below)
● Optimizers
○ Optimal data reuse
○ Co-design for multi-precision model quantization
○ Data reuse with batch
execution
● Heterogeneous execution
plans
"RAINBOW: Multi-Dimensional Hardware-Software Co-Design for DL Accelerator On-Chip Memory", S. Zouzoula, M.
W. Azhar, P. Trancoso, 2023 IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS-2023), pp. 1-3, April 2023
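As a rough illustration of the kind of analysis such a tool performs, the sketch below estimates on-chip buffer sizes and off-chip traffic for one tiled convolution layer when weights are kept resident on-chip. The layer shape, tiling, and reuse assumptions are illustrative, not taken from the RAINBOW paper.

```cpp
#include <cstdio>

// Hedged sketch of on-chip memory and off-chip traffic analysis for one tiled
// convolution layer (illustrative parameters only).
int main() {
    const int H = 56, W = 56, C = 64, K = 64, R = 3, S = 3;  // layer shape
    const int Th = 14, Tw = 14;                              // output tile size
    const int bytes = 1;                                      // int8 operands

    long input_tile  = (long)(Th + R - 1) * (Tw + S - 1) * C * bytes;
    long weight_buf  = (long)K * C * R * S * bytes;           // all weights kept on-chip
    long output_tile = (long)Th * Tw * K * bytes;
    long on_chip     = input_tile + weight_buf + output_tile;

    // With weights resident on-chip, off-chip traffic is roughly one pass over
    // inputs and outputs plus a single weight load.
    long tiles    = (long)(H / Th) * (W / Tw);
    long off_chip = tiles * (input_tile + output_tile) + weight_buf;

    printf("on-chip buffers : %ld bytes\n", on_chip);
    printf("off-chip traffic: %ld bytes per layer execution\n", off_chip);
    return 0;
}
```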
16. 16
▪ Optimizing DL models
▪ Hardware-aware optimizations
▪ Model compression without loss of accuracy
Model-Accelerator Co-Design
17. 17
▪ Optimizing DL models
▪ Hardware-aware optimizations
▪ Model compression without loss of accuracy
▪ Hardware-software co-design (see the co-design loop sketch below)
▪ Reconfigurable (FPGA) accelerators
▪ Template-based description
▪ Heterogeneous engines
Model-Accelerator Co-Design
Co-design
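A minimal sketch of such a co-design loop: jointly sweep a model knob (weight bit-width) and a hardware knob (engine parallelism) and keep the most efficient point that still meets an accuracy target. The accuracy, performance, and power models below are placeholders standing in for real profiling and synthesis runs.

```cpp
#include <cstdio>

// Hedged sketch of a model-accelerator co-design loop (placeholder models only).
int main() {
    const double accuracy_target = 0.74;
    double best_eff = 0.0;
    int best_bits = 0, best_par = 0;

    for (int bits : {8, 16, 32}) {
        for (int par : {8, 16, 32}) {
            double accuracy = 0.76 - (bits == 8 ? 0.01 : 0.0);  // placeholder accuracy model
            double gops     = par * (32.0 / bits) * 10.0;       // placeholder performance model
            double watts    = 1.0 + 0.05 * par + 0.02 * bits;   // placeholder power model
            double eff      = gops / watts;                      // GOPS/W
            if (accuracy >= accuracy_target && eff > best_eff) {
                best_eff = eff; best_bits = bits; best_par = par;
            }
        }
    }
    printf("chosen co-design point: %d-bit weights, parallelism %d (%.1f GOPS/W)\n",
           best_bits, best_par, best_eff);
    return 0;
}
```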
18. 18
Integration of Deep Learning into IoT devices with restricted computing capabilities and
requirements for minimal power consumption – energy-efficient computing
▪ Cognitive IoT hardware platform with tailored hardware components and accelerators:
from embedded systems to edge computing and cloud platforms
▪ Wide range of accelerator designs from off-the-shelf to FPGA-based generic and
dedicated engines
▪ Dynamic reconfiguration for increased efficiency
▪ Memory analysis and recommendation for design space exploration, configuration, and
execution plans
▪ Model-Hardware co-design loop for optimized solutions
Summary
Co-design with a wide range of options for accelerator designs
Most energy-efficient solution for a particular application and its constraints