1. Pedro Trancoso
Chalmers University of Technology,
Gothenburg, Sweden
VEDLIoT Cognitive IoT
Hardware Platform,
Accelerators and Co-Design
26. September 2023
F. Qararyah, S. Zouzoula, M. Waqar, P. Trancoso, M.
Rothmann, M. Tassemeier, M. Porrmann, D. Ödman,
H. Salomonsson, F. Porrmann, R. Griessl, N. Kucza, K.
Mika, C. Stollenwerk, M. Kaiser, L. Tigges, J.
Hagemeyer
5. 5
▪ Heterogeneous hardware platform
▪ Resource Efficient Cluster Server (RECS)
platform: cloud to edge
▪ u.RECS for far edge with 3 slots:
▪ NVIDIA NX – embedded GPU
▪ SMARC 2.1 – FPGA, CPU
▪ M.2 – dedicated accelerator
Cognitive IoT HW Platform
6. 6
▪ Accelerators “landscape”
▪ Evaluated a multitude of accelerators (CPUs, GPUs, FPGAs, ASICs)
▪ Same model (YOLOv4), different batch sizes (1, 4, 8)
▪ Efficiency from 100 GOPS/W to 1250 GOPS/W
Accelerators
7. 7
Accelerators
▪ Accelerators “landscape”
▪ Evaluated a multitude of accelerators (CPUs, GPUs, FPGAs, ASICs)
▪ Same model (YOLOv4), different batch sizes (1, 4, 8)
▪ Efficiency from 100 GOPS/W to 1250 GOPS/W (worked example below)
▪ Identified different categories in the landscape: high-performance, low power, efficiency
FPGA-based accelerators:
• Flexibility
• Reconfigurability
• Efficiency
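For context, the efficiency figures above are simply sustained throughput divided by power draw. A minimal sketch of the arithmetic, using made-up throughput/power numbers rather than the measured VEDLIoT results:

```cpp
#include <cstdio>

// Hedged sketch: how GOPS/W efficiency figures are obtained.  The
// throughput/power values below are illustrative placeholders, not
// measurements from the VEDLIoT evaluation.
int main() {
    struct Device { const char* name; double gops; double watts; };
    Device devices[] = {
        {"embedded CPU (illustrative)",      200.0,  2.0},  // ~100 GOPS/W
        {"embedded GPU (illustrative)",     5000.0, 10.0},  // ~500 GOPS/W
        {"ASIC accelerator (illustrative)", 5000.0,  4.0},  // ~1250 GOPS/W
    };
    for (const Device& d : devices)
        printf("%-32s %8.0f GOPS / %4.1f W = %6.1f GOPS/W\n",
               d.name, d.gops, d.watts, d.gops / d.watts);
    return 0;
}
```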
8. 8
Accelerators – Xilinx DPU
● Baseline for evaluation of FPGA accelerators developed in VEDLIoT
● Xilinx Deep Learning Processor Unit (DPU)
○ Programmable engine for convolutional neural networks
○ Easy integration as an IP core in
Xilinx UltraScale+ and Versal MPSoCs
○ Configurable hardware architecture
(e.g., parallelism, memory/DSP usage)
● Large design space
○ Goal: Find the best-suited implementation for your requirements (see the configuration sweep below)
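As a rough illustration of what exploring that design space can look like, the sketch below sweeps a handful of DPU-like configuration knobs against an assumed DSP budget. The parameter names echo typical DPU options (pixel/channel parallelism, RAM usage), but the resource and performance model is a made-up placeholder, not Xilinx's estimator.

```cpp
#include <cstdio>
#include <vector>

// Hedged sketch of design-space exploration over DPU-like configurations.
struct DpuConfig {
    int pixel_parallel;    // pixels processed per cycle
    int channel_parallel;  // channels processed per cycle
    bool high_ram_usage;   // trade BRAM for fewer external accesses
};

int main() {
    std::vector<DpuConfig> candidates;
    for (int pp : {2, 4, 8})
        for (int cp : {8, 16})
            for (bool ram : {false, true})
                candidates.push_back({pp, cp, ram});

    const double dsp_budget = 1000.0;  // assumed budget of the target FPGA
    const DpuConfig* best = nullptr;
    double best_perf = 0.0;

    for (const DpuConfig& c : candidates) {
        double dsps = 10.0 * c.pixel_parallel * c.channel_parallel;  // placeholder cost model
        double perf = c.pixel_parallel * c.channel_parallel * (c.high_ram_usage ? 1.2 : 1.0);
        if (dsps <= dsp_budget && perf > best_perf) { best_perf = perf; best = &c; }
    }
    if (best)
        printf("best fit: pp=%d cp=%d high_ram=%d (score %.1f)\n",
               best->pixel_parallel, best->channel_parallel,
               best->high_ram_usage, best_perf);
    return 0;
}
```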
9. 9
Dynamic reconfiguration of Xilinx DPU
● Change the characteristics of the DL accelerator at run-time, as sketched below
(e.g., change the performance-power or performance-accuracy trade-off)
Different modes of operation:
• High-performance versus low-power
• City versus highway driving
• …
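A minimal sketch of what such run-time mode switching could look like from software, assuming a hypothetical load_partial_bitstream() helper and illustrative bitstream file names; the actual VEDLIoT reconfiguration mechanism may differ.

```cpp
#include <cstdio>
#include <string>

// Hedged sketch of run-time mode selection for a reconfigurable DL accelerator.
enum class Mode { HighPerformance, LowPower };

static void load_partial_bitstream(const std::string& path) {
    // Placeholder: a real system would hand the file to its reconfiguration
    // interface, e.g. the Linux fpga_manager framework.
    printf("loading %s\n", path.c_str());
}

static void switch_mode(Mode m) {
    switch (m) {
        case Mode::HighPerformance:
            load_partial_bitstream("dpu_highperf.bit");  // assumed file name
            break;
        case Mode::LowPower:
            load_partial_bitstream("dpu_lowpower.bit");  // assumed file name
            break;
    }
}

int main() {
    switch_mode(Mode::HighPerformance);  // e.g. demanding driving scenario
    switch_mode(Mode::LowPower);         // e.g. steady highway driving
    return 0;
}
```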
10. 10
STANN – Synthesis Templates for ANNs (1/2)
● Library for simple yet efficient generation of DL accelerators on FPGAs (see the layer-template sketch below)
● Templates for common layers
○ Network architecture parameterizable
e.g., number of neurons
○ Hardware implementation parameterizable
e.g., parallelism of processing units
● Resource efficiency by flexible quantization
○ Floating-point and integer types from 32-bit down to 8-bit
● High-level synthesis enables fast design space exploration
○ Automatic code generation based on ONNX description
○ Highly parameterizable:
Reuse of hardware blocks vs. parallel execution
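To give a flavour of the template-based approach, below is a hedged sketch of a fully connected layer in the spirit of STANN (not the actual STANN code): layer size, data type, accumulator type, and degree of parallelism are all template parameters, so the same source can be synthesized into very different hardware.

```cpp
#include <cstdint>

// Hedged sketch of a parameterizable dense-layer template (illustrative only).
template <typename T, typename ACC, int IN, int OUT, int PAR>
void dense_layer(const T in[IN], const T weights[OUT][IN],
                 const T bias[OUT], T out[OUT]) {
    for (int o = 0; o < OUT; o += PAR) {
        // Under HLS, a pragma such as "#pragma HLS unroll factor=PAR" here
        // would map the PAR iterations onto parallel processing units.
        for (int p = 0; p < PAR; ++p) {
            ACC acc = bias[o + p];
            for (int i = 0; i < IN; ++i)
                acc += ACC(weights[o + p][i]) * ACC(in[i]);
            out[o + p] = T(acc);
        }
    }
}

// Example instantiations: 8-bit integers with 32-bit accumulation vs. plain
// float, and different degrees of parallelism.
template void dense_layer<int8_t, int32_t, 64, 32, 4>(
    const int8_t[64], const int8_t[32][64], const int8_t[32], int8_t[32]);
template void dense_layer<float, float, 64, 32, 1>(
    const float[64], const float[32][64], const float[32], float[32]);
```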
11. 11
STANN – Synthesis Templates for ANNs (2/2)
● STANN enables inference and training on FPGAs
● Training with Dataflow Architecture
○ Forward path similar to inference,
but needs to store more intermediate values
○ Backpropagation and weight-update module for each layer (sketched below)
● Fast, but uses a lot of resources
● Well suited for small networks,
used, e.g., in deep reinforcement learning
● Application example: Motor control
○ DQN (Deep Q-Network)
replaces manual parameter tuning
○ Used in OC Project Power Edge RL
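A hedged companion to the dense-layer sketch above: the per-layer backward/weight-update module that such a dataflow training architecture instantiates. This is illustrative only, not STANN's template; activation functions are omitted.

```cpp
// Hedged sketch of a per-layer backpropagation and weight-update module.
template <typename T, int IN, int OUT>
void dense_layer_backward(const T in[IN], const T grad_out[OUT],
                          T weights[OUT][IN], T grad_in[IN], T lr) {
    for (int i = 0; i < IN; ++i) grad_in[i] = T(0);
    for (int o = 0; o < OUT; ++o) {
        for (int i = 0; i < IN; ++i) {
            grad_in[i] += weights[o][i] * grad_out[o];   // propagate gradient
            weights[o][i] -= lr * grad_out[o] * in[i];   // update weight
        }
    }
}

// Example instantiation matching the forward sketch above.
template void dense_layer_backward<float, 64, 32>(
    const float[64], const float[32], float[32][64], float[64], float);
```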
Rothmann, M.; Porrmann, M.: STANN – Synthesis Templates for Artificial Neural Network Inference and Training. In: 17th International Work-Conference on Artificial Neural Networks (IWANN 2023), Ponta Delgada, Azores, Portugal, June 19-21, 2023
12. 12
Accelerators - FiBHA (1/3)
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5 2022
Accelerator design spectrum: Generic ↔ Dedicated
13. 13
Accelerators - FiBHA (2/3)
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5 2022
Monolithic design
● One engine computes all the core layers
● E.g., TPU
SEML
● One engine computes all layers of the same type
● PW (pointwise) engine, DW (depthwise) engine
SESL
● One engine per layer
● E.g., FINN
FiBHA
● SESL + SEML (see the budget-split sketch below)
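The sketch below illustrates the hybrid idea under a fixed PE budget: the first, more heterogeneous layers each get a dedicated SESL-style engine, while the remaining layers share one SEML-style engine. The layer data and the proportional split heuristic are illustrative placeholders, not the algorithm published in the FiBHA paper.

```cpp
#include <cstdio>
#include <vector>

// Hedged sketch of splitting a fixed PE budget between a SESL part and a SEML part.
struct Layer { const char* name; int macs_millions; };

int main() {
    std::vector<Layer> net = {
        {"conv1", 20}, {"dw_conv2", 5}, {"pw_conv2", 15},
        {"conv3", 60}, {"conv4", 80}, {"conv5", 80},
    };
    const int pe_budget = 4096;
    const int dedicated_layers = 3;  // assumed cut point between SESL and SEML parts

    // Give each dedicated layer PEs in proportion to its work; the rest is shared.
    int sesl_work = 0, total_work = 0;
    for (size_t i = 0; i < net.size(); ++i) {
        total_work += net[i].macs_millions;
        if (i < (size_t)dedicated_layers) sesl_work += net[i].macs_millions;
    }
    int sesl_pes = pe_budget * sesl_work / total_work;
    int seml_pes = pe_budget - sesl_pes;

    printf("SESL part: %d layers, %d PEs (one engine per layer)\n",
           dedicated_layers, sesl_pes);
    printf("SEML part: %zu layers, %d PEs (shared engine)\n",
           net.size() - dedicated_layers, seml_pes);
    return 0;
}
```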
14. 14
Accelerators - FiBHA (3/3)
● FiBHA compared to both alternatives
○ Up to 4x throughput improvement compared
to SESL (FINN)
■ Better use of the resource budget
○ Up to 1.7X throughput improvement
compared to SEML
■ Capturing more heterogeneity
● FiBHA compared to SEML
○ Representative set of heterogeneous CNNs
○ Various resource budgets
■ 1024 PEs - 4096 PEs
○ FiBHA consistently outperforms SEML
"FiBHA: Fixed Budget Hybrid CNN Accelerator", Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso, IEEE 34th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD 2022), Bordeaux, France, November 2–5 2022
15. 15
Memory Analysis/Recommendation Tool - Rainbow
● Set of different analyses for on-chip memory and off-chip data transfers (see the sketch below)
● Optimizers
○ Optimal data reuse
○ Co-design for multi-precision model quantization
○ Data reuse with batch
execution
● Heterogeneous execution
plans
"RAINBOW: Multi-Dimensional Hardware-Software Co-Design for DL Accelerator On-Chip Memory", S. Zouzoula, M.
W. Azhar, P. Trancoso, 2023 IEEE International Symposium on Performance Analysis of Systems and Software
(ISPASS-2023), pp. 1-3, April 2023
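As a rough illustration of the kind of analysis such a tool performs, the sketch below estimates on-chip buffer sizes and off-chip traffic for one tiled convolution layer when weights are kept resident on-chip. The layer shape, tiling, and reuse assumptions are illustrative, not taken from the RAINBOW paper.

```cpp
#include <cstdio>

// Hedged sketch of on-chip memory and off-chip traffic analysis for one tiled
// convolution layer (illustrative parameters only).
int main() {
    const int H = 56, W = 56, C = 64, K = 64, R = 3, S = 3;  // layer shape
    const int Th = 14, Tw = 14;                              // output tile size
    const int bytes = 1;                                      // int8 operands

    long input_tile  = (long)(Th + R - 1) * (Tw + S - 1) * C * bytes;
    long weight_buf  = (long)K * C * R * S * bytes;           // all weights kept on-chip
    long output_tile = (long)Th * Tw * K * bytes;
    long on_chip     = input_tile + weight_buf + output_tile;

    // With weights resident on-chip, off-chip traffic is roughly one pass over
    // inputs and outputs plus a single weight load.
    long tiles    = (long)(H / Th) * (W / Tw);
    long off_chip = tiles * (input_tile + output_tile) + weight_buf;

    printf("on-chip buffers : %ld bytes\n", on_chip);
    printf("off-chip traffic: %ld bytes per layer execution\n", off_chip);
    return 0;
}
```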
16. 16
▪ Optimizing DL models
▪ Hardware-aware optimizations
▪ Model compression without loss of accuracy
Model-Accelerator Co-Design
17. 17
▪ Optimizing DL models
▪ Hardware-aware optimizations
▪ Model compression without loss of accuracy
▪ Hardware-software co-design (see the co-design loop sketch below)
▪ Reconfigurable (FPGA) accelerators
▪ Template-based description
▪ Heterogeneous engines
Model-Accelerator Co-Design
Co-design
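A minimal sketch of such a co-design loop: jointly sweep a model knob (weight bit-width) and a hardware knob (engine parallelism) and keep the most efficient point that still meets an accuracy target. The accuracy, performance, and power models below are placeholders standing in for real profiling and synthesis runs.

```cpp
#include <cstdio>

// Hedged sketch of a model-accelerator co-design loop (placeholder models only).
int main() {
    const double accuracy_target = 0.74;
    double best_eff = 0.0;
    int best_bits = 0, best_par = 0;

    for (int bits : {8, 16, 32}) {
        for (int par : {8, 16, 32}) {
            double accuracy = 0.76 - (bits == 8 ? 0.01 : 0.0);  // placeholder accuracy model
            double gops     = par * (32.0 / bits) * 10.0;       // placeholder performance model
            double watts    = 1.0 + 0.05 * par + 0.02 * bits;   // placeholder power model
            double eff      = gops / watts;                      // GOPS/W
            if (accuracy >= accuracy_target && eff > best_eff) {
                best_eff = eff; best_bits = bits; best_par = par;
            }
        }
    }
    printf("chosen co-design point: %d-bit weights, parallelism %d (%.1f GOPS/W)\n",
           best_bits, best_par, best_eff);
    return 0;
}
```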
18. 18
Integration of Deep Learning into IoT devices with restricted computing capabilities and
requirements for minimal power consumption – energy-efficient computing
▪ Cognitive IoT hardware platform with tailored hardware components and accelerators:
from embedded systems to edge computing and cloud platforms
▪ Wide range of accelerator designs from off-the-shelf to FPGA-based generic and
dedicated engines
▪ Dynamic reconfiguration for increased efficiency
▪ Memory analysis and recommendation for design space exploration, configuration, and
execution plans
▪ Model-Hardware co-design loop for optimized solutions
Summary
Co-design with a wide range of options for accelerator designs
Most energy-efficient solution for a particular application and its constraints