Slide 2: Big Picture
[Diagram: VEDLIoT project overview. Applications (Smart Home, Industrial IoT, Automotive AI, Open Call) define the requirements for security & safety (safety & robustness, modelling & verification, monitoring, trusted execution & communication, RISC-V extensions). The middleware spans embedded/far edge, near edge, and cloud. The hardware platforms combine microservers & accelerators (Jetson AGX, NVIDIA Xavier, COM-HPC and SMARC modules with Xilinx Zynq UltraScale+, Coral SoM, Xilinx Kria, RPi CM4, ARVSOM) in the uRECS, t.RECS, and RECS|Box systems. Toolchain: optimizer, emulation, benchmarking & deployment.]
Slide 3: Big Picture — Hardware Platforms
[Diagram: the same overview as on the previous slide, highlighting the hardware platforms: microservers & accelerators (Jetson AGX, NVIDIA Xavier, COM-HPC and SMARC modules with Xilinx Zynq UltraScale+, Coral SoM, Xilinx Kria, RPi CM4, ARVSOM) covering embedded/far edge, near edge, and cloud on the uRECS, t.RECS, and RECS|Box platforms.]
Outline:
• FPGA-based Accelerators in VEDLIoT
• Dynamic Reconfiguration of Accelerators
• First Results on Performance and Energy Efficiency
• Workflow for Configurable Soft SoCs
Slide 4: FPGA Infrastructure
• FPGA base architecture
• Integration of the required interfaces and accelerators
• Support for dynamic run-time reconfiguration
• Exchange accelerators on the FPGA at run time to increase resource efficiency and flexibility
• FPGA task deployment mechanism
• Migration of a task from one FPGA to another
[Chart: logic cell counts of the targeted FPGA devices — 85k, 2.8M, 25.2M, 75.6M]
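Exchanging an accelerator at run time relies on partial reconfiguration. As a minimal sketch of how such an exchange can be triggered from Linux on the processing system, assuming a Xilinx/PetaLinux kernel that exposes the `firmware` and `flags` attributes of the FPGA manager under sysfs (these attribute names come from the common Xilinx flow, not from the slides):

```python
from pathlib import Path

def load_partial_bitstream(bitstream, sysfs="/sys/class/fpga_manager/fpga0"):
    """Trigger partial reconfiguration through the Linux FPGA manager.

    `bitstream` must already be placed under /lib/firmware; writing its
    name to the `firmware` attribute starts programming. Setting `flags`
    to 1 selects partial (instead of full) reconfiguration.
    """
    mgr = Path(sysfs)
    (mgr / "flags").write_text("1\n")           # 1 = partial reconfiguration
    (mgr / "firmware").write_text(bitstream + "\n")
    return (mgr / "state").read_text().strip()  # e.g., "operating" on success
```

On mainline kernels without these attributes, the same effect is achieved through FPGA regions and device tree overlays; the sketch only illustrates the sysfs-based variant.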
Slide 5: Basic FPGA Infrastructure
• FPGA base architecture for the µ.RECS
• Block-based design enabling easy customization of the FPGA platform in the µ.RECS
• Front-end based on Xilinx Vitis with additional (optional) IP cores from LiteX
• Scripting approach for complete system design
• Easy porting to new FPGAs and FPGA platforms, esp. µ.RECS, t.RECS, RECS|Box
• Flexible integration of accelerators
• Integration of the required interfaces for communication (Ethernet, PCIe, etc.) as well as sensors and actuators targeted in the use cases
• PetaLinux enables easy access to the system and to integrated accelerators for software developers
• µ.RECS testbed for early evaluation
[Diagram: FPGA base architecture on the SMARC module. The Zynq UltraScale+ processing system (dual/quad-core Arm Cortex-A53, dual-core Arm Cortex-R5, memory subsystem with PS-side DDR, interrupt controller, and I/O interfaces: HDMI, CSI, PCIe x4, GigE, USB, SATA, eMMC/Flash/SD, GPIO, UART) connects via AXI and AXI-Lite to the FPGA fabric. The fabric hosts the accelerator(s), a Xilinx/LiteX memory controller with PL-side DDR, I/O controllers, clocking, and platform management, system functions & configuration.]
Slide 6: FPGA Base Architecture for µ.RECS
[Diagram: the FPGA base architecture block diagram from the previous slide, shown full-size.]
Slide 7: First Reference Design Based on Xilinx DPU
• Baseline for the evaluation of FPGA accelerators developed in VEDLIoT
• Xilinx Deep Learning Processor Unit (DPU)
• Programmable engine for convolutional neural networks
• Easy integration as an IP core in Xilinx UltraScale+ MPSoCs
• Configurable hardware architecture (e.g., parallelism, memory/DSP usage)
• Evaluation on various platforms with Xilinx UltraScale+ MPSoCs
• ZU3EG on Avnet Ultra96-v2 (154k logic cells)
• ZU4EG in the µ.RECS testbed (192k logic cells)
• ZU15EG on Trenz TE0808 MPSoC module (747k logic cells)
• ZU19EG on Trenz COM-HPC module in t.RECS (1,143k logic cells)

DPU     Peak ops/clock   Peak perf. @300 MHz [GOPS]   Peak perf. @200 MHz [GOPS]
B512    512              153.6                        102.4
B2304   2304             691.2                        460.8
B4096   4096             1228.8                       819.2
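The peak numbers in the table follow directly from ops per clock times clock frequency; a quick sanity check:

```python
# Peak performance of a DPU configuration: (ops per clock) x (clock frequency).
def dpu_peak_gops(ops_per_clock, freq_mhz):
    """Peak throughput in GOPS for a DPU running at freq_mhz."""
    return ops_per_clock * freq_mhz * 1e6 / 1e9

print(dpu_peak_gops(512, 300))   # → 153.6
print(dpu_peak_gops(2304, 300))  # → 691.2
print(dpu_peak_gops(4096, 200))  # → 819.2
```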
Slide 9: Efficient Utilization of the Xilinx DPU
• Multithreading is crucial for high performance
• Environment supporting semi-automatic realization and evaluation of multithreading during application development
[Diagram: execution timelines from t0 to ttotal. Single-threaded: read data → preprocessing → DPU processing → postprocessing run strictly sequentially. Multi-threaded: read/preprocessing, DPU processing, and postprocessing of consecutive frames (t1…t4) overlap, keeping the DPU busy.]
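The overlapped execution above can be sketched with a thread-per-stage pipeline. The stage names below (e.g., a preprocess or DPU-inference callable) are placeholders; the actual environment builds on the Vitis AI runtime:

```python
import queue
import threading

def pipeline(items, stages, depth=4):
    """Run `stages` (a list of functions) as a thread-per-stage pipeline.

    While the DPU stage processes frame n, the preprocessing stage already
    works on frame n+1 and postprocessing finishes frame n-1, hiding CPU
    time behind accelerator time. Bounded queues provide backpressure.
    """
    qs = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    done = object()  # sentinel marking end of the stream

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is done:
                q_out.put(done)   # forward sentinel to the next stage
                return
            q_out.put(stage(item))

    def feeder():
        for item in items:
            qs[0].put(item)
        qs[0].put(done)

    threads = [threading.Thread(target=feeder)]
    threads += [threading.Thread(target=worker, args=(s, qs[i], qs[i + 1]))
                for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    results = []
    while (out := qs[-1].get()) is not done:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

With one thread per stage and FIFO queues, frame order is preserved; throughput is limited by the slowest stage rather than by the sum of all stages.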
Slide 10: Efficient Utilization of the Xilinx DPU
• Performance and power monitoring for single- and multi-threaded implementations
• Detailed power measurements on RECS platforms
• Power-aware profiling and optimization
Slide 15: Dynamic Reconfiguration of DL Accelerators
• Change the characteristics of the DL accelerator at run time (e.g., the performance-power or performance-accuracy trade-off)
[Diagram: the FPGA base architecture extended for dynamic function exchange (DFX). The AXI and AXI-Lite connections to the accelerator pass through CB blocks and a disconnect stage, so the partially reconfigurable region (PR region) can be swapped at run time between Accelerator A and Accelerator B while the rest of the system keeps running.]
Slide 16: Dynamic Reconfiguration of DL Accelerators (cont.)
• Change the characteristics of the DL accelerator at run time (e.g., the performance-power or performance-accuracy trade-off)
[Diagram: the same DFX architecture with two PR regions, each with its own disconnect stage, so two accelerators can be exchanged independently at run time.]
Slide 17: Reconfigurable DL Accelerators
• Accelerator to be used for the co-design approach: generation of dataflow architectures based on C++ templates
• Support for inference and training
• Targeting CNNs, deep reinforcement learning, and federated learning
• Definition of parameterizable layer templates in C++ (e.g., convolution, fully connected, pooling, activation functions, …)
• Parameterizable, e.g., in quantization (from low-bit-width INT to float)
• Optimized for high-level synthesis
• All layers integrate three functions (if required): inference/forward propagation, backpropagation, and update
• Inference utilizes only the forward path
• Learning (deep RL) utilizes the full functionality of the layer templates
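The three-function layer interface can be illustrated with a toy software model. The real templates are C++ classes synthesized to hardware via HLS; this Python sketch only mirrors their interface (a fully connected layer with forward, backward, and update; all names are illustrative):

```python
import random

class FullyConnected:
    """Toy software model of a parameterizable layer template.

    Every layer exposes forward (inference), backward (backpropagation),
    and update — inference-only deployments use just the forward path.
    """

    def __init__(self, n_in, n_out, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.lr = lr

    def forward(self, x):
        self.x = x  # cache the input for backpropagation
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in self.w]

    def backward(self, grad_out):
        # Gradient w.r.t. the weights; returns the gradient w.r.t. the input.
        self.grad_w = [[g * xj for xj in self.x] for g in grad_out]
        return [sum(self.w[i][j] * grad_out[i] for i in range(len(self.w)))
                for j in range(len(self.x))]

    def update(self):
        # Plain SGD step using the gradients from the last backward pass.
        self.w = [[wij - self.lr * gij for wij, gij in zip(wr, gr)]
                  for wr, gr in zip(self.w, self.grad_w)]
```

In the hardware templates, the quantization (low-bit-width INT vs. float) and layer dimensions would be template parameters resolved at synthesis time rather than constructor arguments.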
Slide 18: Soft SoC Platform
• Generation of soft SoC platforms
• Utilize RISC-V soft cores
• Generic interface to AI accelerators
• Modelled in an open-source emulation environment
• Utilize the LiteX SoC generator
• Run-time reconfiguration of accelerators and processor cores
[Diagram: soft SoC inside the FPGA base architecture, with an AI accelerator attached through a generic interface and exchanged by run-time reconfiguration.]
Slide 19: Configurable SoC for ML Workflows
• The configurable soft SoC generator provides a platform for low-power AI accelerator exploration
• The generator can produce a system with the set of peripherals required for a specific task
• Scalable from MCU-class to Linux-capable platforms
• Support for a generic, vendor-independent accelerator integration interface makes it a perfect AI research platform
• Portable across different hardware, based on open-source tooling
• CFUs (Custom Function Units): custom accelerators designed for specific workflows, tightly coupled with the CPU
• Accessed via custom RISC-V instructions
• Can be implemented in high-level hardware description languages such as the Python-based Amaranth
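A CFU operation reaches the core as an instruction on one of the RISC-V "custom" major opcodes, which the ISA reserves for vendor extensions. A sketch of encoding such an R-type instruction on custom-0 (the opcode value 0b0001011 is from the RISC-V spec; the funct7/funct3 assignment below is illustrative, not VEDLIoT's actual encoding):

```python
CUSTOM0 = 0b0001011  # RISC-V "custom-0" major opcode, reserved for extensions

def encode_cfu_insn(funct7, funct3, rd, rs1, rs2):
    """Encode an R-type instruction on the custom-0 opcode.

    funct7/funct3 select the CFU operation, rs1/rs2 carry the operands,
    and rd receives the result — the standard R-type field layout:
    funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0].
    """
    assert 0 <= funct7 < 128 and 0 <= funct3 < 8
    assert all(0 <= r < 32 for r in (rd, rs1, rs2))
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
            (funct3 << 12) | (rd << 7) | CUSTOM0)
```

Because the operands arrive through the ordinary register file, a CFU call costs no more than a regular ALU instruction, which is what makes the tight CPU coupling attractive for small ML kernels.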
Slide 20: Configurable SoC for ML Workflows (cont.)
• CFUs offer great flexibility
• Test various dedicated accelerators for specific workflows
• Renode simulation framework extended with CFU support
• Co-simulates functional models of the SoC with verilated, cycle-accurate CFUs
• An invaluable tool for development
• Massive continuous-integration testing
• Different CFU implementations, different inputs
• Allows automatic result comparison and analysis
• Everything open-sourced
Slide 21: Very Efficient Deep Learning for IoT – VEDLIoT
• Platform
• Hardware: scalable, heterogeneous, distributed
• Accelerators: efficiency boost through FPGA and ASIC technology
• Toolchain: optimizing deep learning for IoT
• Use cases: Industrial IoT, Automotive, Smart Home
• Open call
• Open for submissions until 8 May
• Early use and evaluation of VEDLIoT technology
Slide 22: Follow our work
⇒ https://twitter.com/VEDLIoT
⇒ https://www.linkedin.com/company/vedliot/
⇒ https://vedliot.eu
Be part of it
⇒ Open call NOW!
⇒ Allows early use and evaluation of VEDLIoT technology
Slide 24: Deep Learning Accelerators
[Diagram: four classes of DL accelerators. (1) Heterogeneous: a compiler maps the DL model onto fixed CPU/GPU/TPU targets. (2) Reconfigurable: the compiler additionally consumes a hardware specification and targets an FPGA. (3) Dynamically reconfigurable: multiple hardware specifications are compiled so the FPGA accelerator can be exchanged at run time. (4) Co-designed: the DL model and the hardware specification are derived together in a co-design loop before compilation for the FPGA.]
Slide 25: Dynamic Reconfiguration of DL Accelerators
• Utilize dynamic reconfiguration to:
• Change the complete DL model and the corresponding accelerator at run time, depending on application requirements
• Change the characteristics of the DL accelerator at run time (e.g., the performance-power or performance-accuracy trade-off)
• Partially reconfigure the accelerator for different phases of the application
Slide 26: First Reference Design Based on Xilinx DPU — Example
• Performance and power evaluation for YOLOv4
• Trade-off: latency vs. performance

Platform: SM-B71 on SOM-DB2500 carrier

DPU configuration                B3136 x1, 300 MHz           B4096 x1, 300 MHz
Number of threads                1       2       4           1       2       4
Latency [ms]                     120.34  198.62  383.72      93.42   144.51  276.33
Achieved perf. [inferences/s]    8.28    10.76   10.76       10.66   15.12   15.12
Achieved performance [GOPS]      500.11  649.90  649.90      643.86  913.25  913.25
Peak performance [GOPS]          940.8   940.8   940.8       1228.8  1228.8  1228.8
Performance ratio                53.16%  69.08%  69.08%      52.40%  74.32%  74.32%
Cost metrics:
Power [W]                        11.20   12.49   12.51       13.14   15.42   15.44
Idle power [W]                   0.07/7.09 (all)             0.07/7.56 (all)
Energy/inference [J]             1.352   1.161   1.173       1.233   1.020   1.021
Power efficiency [GOPS/W]        44.65   52.03   51.95       49.00   59.23   59.15
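The derived rows of the table follow from the measured ones: power efficiency is achieved GOPS divided by power, and the performance ratio is achieved over peak GOPS. A quick check against two table columns:

```python
def power_efficiency(gops, watts):
    """Power efficiency [GOPS/W]: achieved throughput per watt."""
    return gops / watts

def performance_ratio(achieved_gops, peak_gops):
    """Fraction of the DPU's peak throughput actually achieved."""
    return achieved_gops / peak_gops

# B3136, single thread:
print(round(power_efficiency(500.11, 11.20), 2))         # → 44.65
print(round(100 * performance_ratio(500.11, 940.8), 2))  # → 53.16
# B4096, two threads:
print(round(power_efficiency(913.25, 15.42), 2))         # → 59.23
print(round(100 * performance_ratio(913.25, 1228.8), 2)) # → 74.32
```

The multithreaded columns show the point of slide 9: going from one to two threads raises both the performance ratio and the energy efficiency, because the DPU idles less while the CPU handles pre- and postprocessing.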