Kevin Mika
Bielefeld University
27 April 2022
VEDLIoT Hardware Platforms and Accelerators
2
Big Picture
3
VEDLIoT Hardware Platform
 Heterogeneous, modular, scalable microserver system
 Supporting the full spectrum of IoT from embedded over the edge towards the cloud
 Different technology concepts for improving
 Performance
 Cost-effectiveness
 Maintainability
 Reliability
 Energy-Efficiency
 Safety
[Diagram: VEDLIoT Cognitive IoT Platform built from x86, ARM v8, RISC-V, GPU, GPU SoC, FPGA, FPGA SoC and ML-ASIC components]
4
RECS|BOX Overview
RECS Server Backplane (up to 15 carriers)
Carriers
• High-Performance Carrier (up to 3 microservers): High-Performance Microserver (COM Express) with x86, ARM v8 or FPGA SoC
• Low-Power Carrier (up to 16 microservers): Low-Power Microserver (Apalis/Jetson) with GPU SoC, FPGA SoC or ARM SoC
• PCIe Expansion Carrier, e.g. GPU accelerator
Networks
• High-Speed Low-Latency Network (PCIe, High-Speed Serial)
• Compute Network (up to 40 GbE)
• Management Network (KVM, Monitoring, …)
External connectors
• HDMI/USB, iPass+ HD, QSFP+, RJ45
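The capacity limits on this slide multiply out to the maximum number of microservers per server. A minimal Python sketch of that arithmetic follows; the helper name `max_microservers` and the carrier mixes are hypothetical, and only the limits (15 carriers, 3 or 16 microservers per carrier) come from the slide.

```python
# Limits taken from the slide: up to 15 carriers per backplane,
# up to 3 microservers per high-performance carrier,
# up to 16 microservers per low-power carrier.
MAX_CARRIERS = 15
SLOTS_PER_CARRIER = {"high_performance": 3, "low_power": 16}


def max_microservers(carriers: dict[str, int]) -> int:
    """Return the microserver capacity of a hypothetical carrier mix.

    `carriers` maps a carrier type to the number of carriers of that type.
    """
    total_carriers = sum(carriers.values())
    if total_carriers > MAX_CARRIERS:
        raise ValueError(
            f"{total_carriers} carriers exceed the limit of {MAX_CARRIERS}"
        )
    return sum(SLOTS_PER_CARRIER[kind] * count for kind, count in carriers.items())


if __name__ == "__main__":
    # Illustrative mixes only; real configurations may also use
    # PCIe expansion carriers, which are not modelled here.
    print(max_microservers({"low_power": 15}))
    print(max_microservers({"high_performance": 15}))
    print(max_microservers({"high_performance": 5, "low_power": 10}))
```

Under these assumptions, a server filled only with low-power carriers holds the most microservers, while a high-performance configuration holds fewer but larger modules.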
5
Server Architecture
• Microserver modules based on
established Computer on Module
standards
• COM Express
• Nvidia Jetson
• Toradex Apalis
[Diagram: a backplane carries up to 15 baseboards; each baseboard hosts 3 or 16 microserver modules (CPU, memory, I/O). The backplane provides KVM and monitoring, storage and I/O extension, Ethernet (10/40 GbE) and high-speed low-latency communication (>60 Gbit/s).]
6
Server Architecture
• Dedicated monitoring and control
network
• iKVM to every microserver
• Fine-grained monitoring of power,
voltage and temperature
• Distributed network of
microcontrollers for data-
aggregation and pre-processing
• High-speed monitoring
[Diagram: server architecture (backplane with up to 15 baseboards, 3 or 16 microserver modules per baseboard) with the distributed monitoring and KVM blocks added.]
7
Server Architecture
• Multiple 1Gb/10Gb Ethernet links
per Microserver
• 40 Gb Ethernet from Baseboard
to Backplane
• Internally switched on Baseboard
and Backplane
[Diagram: server architecture (backplane with up to 15 baseboards, 3 or 16 microserver modules per baseboard) with the distributed monitoring and KVM blocks and the Ethernet communication infrastructure added.]
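Because the microserver links are switched onto a single 40 Gb Ethernet uplink per baseboard, it can be useful to estimate how heavily that uplink is shared. The following Python sketch is an illustration only: the function name and the link counts per microserver are assumptions (the slide says only "multiple" links), while the 40 Gbit/s uplink and the 3 or 16 microservers per baseboard come from the slides.

```python
UPLINK_GBPS = 40.0  # from the slide: 40 Gb Ethernet from baseboard to backplane


def oversubscription(microservers: int, links_per_server: int, link_gbps: float) -> float:
    """Ratio of aggregate microserver bandwidth to the baseboard uplink.

    A value above 1.0 means the uplink can become the bottleneck if all
    microservers send off-board traffic at line rate at the same time.
    """
    aggregate_gbps = microservers * links_per_server * link_gbps
    return aggregate_gbps / UPLINK_GBPS


if __name__ == "__main__":
    # Hypothetical configurations; the number of links per microserver is assumed.
    print(oversubscription(microservers=16, links_per_server=1, link_gbps=1.0))
    print(oversubscription(microservers=3, links_per_server=2, link_gbps=10.0))
```

Traffic between microservers on the same baseboard is switched locally and does not load the uplink, so the ratio is an upper bound for off-board traffic only.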
8
Server Architecture
[Diagram: server architecture (backplane with up to 15 baseboards, 3 or 16 microserver modules per baseboard) with distributed monitoring and KVM, the Ethernet communication infrastructure, and the high-speed low-latency communication blocks added.]
• Multiple 1Gb/10Gb Ethernet links
per Microserver
• 40 Gb Ethernet from Baseboard
to Backplane
• Internally switched on Baseboard
and Backplane
9
Server Architecture
[Diagram: server architecture (backplane with up to 15 baseboards, 3 or 16 microserver modules per baseboard) with distributed monitoring and KVM, the Ethernet communication infrastructure, high-speed low-latency communication, and the storage / I/O extension block added.]
• Connection to storage
and I/O extensions
• Easy integration of
PCIe-based extension cards
and storage subsystems
11
t.RECS
 Optimized platform for
local / edge applications
 Provide interfaces for
 Video
 Camera
 Peripheral input (USB)
 Combine FPGA and
GPU acceleration
 Compact dimensions
1 RU, E-ATX form factor
(2 RU/ 3 RU for special cases)
t.RECS Overview
[Diagram: three microservers (#1 and #3: COM-HPC Client, #2: COM-HPC Server) connected by switched PCIe (host to host), Ethernet (up to 10 GbE) and a management network (KVM, monitoring, …), with external interfaces, PCIe expansion and I/O (camera, display, radar/lidar, audio).]
12
t.RECS Architecture
Modular architecture
 1x Large Form Factor (LFF)
 2x Small Form Factor (SFF)
Communication infrastructure
 High-speed, low-latency via PCIe
 Switched and ring topology
 Support for cache-coherent accelerators (CCIX)
 Switched Ethernet for data (10 GbE) and management (1 GbE)
 PCIe expansion slot for additional accelerators (GPU or FPGA)
[Diagram: a PCIe switch connects Microserver Server 1 (COM-HPC Server, Type D), Microserver Client 1 (COM-HPC Client, Type A or B) and Microserver Client 2 (COM-HPC Client, Type A, B or C); link widths shown are x8 and x16, and one link leads to the PCIe slot.]
13
t.RECS Reconfigurable Communication
Infrastructure
a) A classical CPU-based clustering with PCIe host-2-host communication
b) A CPU-centric approach including two accelerators connected via PCIe as
endpoints
c) Ring topology using PCIe
d) Ring topology using Xilinx Aurora
15
uRECS
uRECS AIoT Server
 Supports ML acceleration
 FPGA
 ASIC
 Communication interfaces
 Wired (CAN, Ethernet, CSI)
 Wireless (WLAN, LoRa, 5G)
 Sensors
 Camera
 Environment (Temp./Hum.)
 Housekeeping
 Embedded Device
(~ 20x20x6 cm)
u.RECS Overview
[Diagram: Microserver #1 (SMARC 2.1), Microserver #2 (Jetson NX) and an ML accelerator (M.2) connected via PCIe, Ethernet (1 GbE and SPE), management and monitoring, and I/O (camera, WiFi, LoRa, 4G/5G); front panel with 2x HDMI, RJ45/SPE and 4x USB 3.1.]
16
u.RECS Architecture
• Two module slots
• Two accelerator slots
• PCIe x4 Gen3
• 1 Gbit Ethernet
• USB 3.0
• Battery powered
• Advanced power measurement
• Board management controller (BMC) with WiFi, BLE and LoRa
[Diagram: module slot 1 (SMARC 2.1: FPGA, x86 or ARM) and module slot 2 (NVIDIA Jetson Xavier NX), an M.2 M-Key slot (accelerator / storage) and an mPCIe slot (accelerator / communication), linked by PCIe x4 and x1 connections and a COM Brick; a GigE switch with 2x RJ45 (PoE) and a Single Pair Ethernet (SPE) PHY; per-module GPIO, CSI, HDMI, USB 2 and USB 3 with a USB 3 mux; power input via USB-C or barrel plug with power sensing; an ESP32-based BMC with LoRa.]
17
RECS Power Measurement
• Power measurement for all microservers at a 1 Hz sampling rate, accessible via
Grafana or the web GUI
• Oscilloscope mode available at a 1 kS/s sampling rate
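Power samples taken at a fixed rate can be integrated into energy, which is what energy-aware benchmarking ultimately compares. The Python sketch below assumes the samples are already available as a list of watt values; how they are exported from the RECS monitoring is not described on the slide, and `energy_joules` is a hypothetical helper.

```python
def energy_joules(power_watts: list[float], sample_rate_hz: float) -> float:
    """Integrate equally spaced power samples with the trapezoidal rule.

    `sample_rate_hz` would be 1.0 for the regular monitoring and 1000.0 for
    the oscilloscope mode mentioned on the slide.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    if len(power_watts) < 2:
        return 0.0  # a single sample spans no time interval
    dt = 1.0 / sample_rate_hz
    inner = sum(power_watts[1:-1])
    return dt * (0.5 * (power_watts[0] + power_watts[-1]) + inner)


if __name__ == "__main__":
    # Made-up samples for illustration, not measured values.
    print(energy_joules([10.0, 12.0, 11.0, 10.0], sample_rate_hz=1.0))
    print(energy_joules([5.0] * 1001, sample_rate_hz=1000.0))
```

The higher sampling rate matters for short workloads: at 1 Hz, an inference burst lasting a few hundred milliseconds can fall between two samples.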
18
Big Picture
19
D4.1 RECS|Box microserver & architectures
RECS|Box
Jetson TX2
NVIDIA
Tegra X2
COM Express
Intel Core i7
8th Gen
COM Express
ARM v8 Server
SoC Hi1616
COM Express
Intel Stratix 10
Jetson nano
NVIDIA
Xavier NX
COM Express
Xilinx Zynq 7045
Apalis
Exynos (2xARM
Cortex-A15)
Apalis
Xilinx Zynq 7020
COM Express
AMD Ryzen
V1807B
COM Express
AMD EPYC
3451
CPU
FPGA
SoC
GPU
SoC
Deneb
Durin
20
D4.1 RECS|Box microserver & architectures
Jetson TX2
NVIDIA
Tegra X2
COM Express
Intel Core i7
8th Gen
COM Express
ARM v8 Server
SoC Hi1616
COM Express
Intel Stratix 10
Jetson nano
NVIDIA
Xavier NX
COM Express
Xilinx Zynq 7045
Apalis
Exynos (2xARM
Cortex-A15)
Apalis
Xilinx Zynq 7020
COM Express
AMD Ryzen
V1807B
COM Express
AMD EPYC
3451
CPU
FPGA
SoC
GPU
SoC
t.RECS
Supports
COM Express
microserver
via adaptor
COM-HPC client
size B to
NVIDIA Xavier AGX
COM-HPC client
size B to
NVIDIA Orin AGX
TBD
COM-HPC
client size B
Xilinx Zynq
UltraScale+
COM-HPC
server size D
Intel Agilex
COM-HPC
client size A
Intel Core i7
11th Gen
COM-HPC
client size C
Intel Core i9
12th Gen
21
D4.1 RECS|Box microserver & architectures
uRECS
CPU
FPGA
SoC
GPU
SoC
Jetson nano
NVIDIA
Xavier NX
ML
Accel.
M.2 PCIe/ USB
Intel Myriad X
SMARC
Xilinx Zynq
UltraScale+
SMARC 2.1
NXP i.MX 8M
(4x Cortex-A53)
SMARC 2.1
Intel Atom
Raspberry Pi
Compute
Module 4
Xilinx Kria
K26
M.2
PCIe
Hailo-8
M.2 PCIe
Google Coral
TPU
Dual chip
SMARC 2.1
Coherent Logix
HX40416
RPi CM4
ARVSOM
22
VEDLIoT Deep Learning Platforms
Supported Computer-On-Module form factors
Raspberry Pi Compute
Module 4
Jetson Xavier NX
SMARC
Xilinx Kria
Jetson AGX Xavier
COM Express
(Type 6/7)
COM-HPC
Client (Type A-C)
COM-HPC
Server (Type D/E)
Size
(higher distance
is smaller)
I/O
Flexibility
Performance
Supported
Architectures
Market
Share
uRECS
RECS|Box
&
t.RECS
23
Peak performance values of specialized accelerators, provided by the
vendors (precisions varying from INT8 to FP32)
Peak Performance of DL Accelerators
[Figure: log-log scatter plot of performance [GOPS] (1 to 1,000,000) versus power [Watt] (0.001 to 1000), with separate series for devices and IP cores. The only surviving data label is Kendryte K210. A reference line marks an average efficiency of 1000 GOPS/W.]
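The reference line in this chart corresponds to 1000 GOPS/W, i.e. 1 TOPS/W. A small Python sketch of how a vendor-stated peak value can be compared against that line is shown below; the function names and the two example accelerators are hypothetical and are not values read from the chart.

```python
REFERENCE_GOPS_PER_WATT = 1000.0  # reference line from the chart (1 TOPS/W)


def gops_per_watt(peak_gops: float, power_watts: float) -> float:
    """Peak efficiency from vendor-stated peak performance and power."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return peak_gops / power_watts


def above_reference(peak_gops: float, power_watts: float) -> bool:
    """True if the point lies on or above the 1000 GOPS/W line."""
    return gops_per_watt(peak_gops, power_watts) >= REFERENCE_GOPS_PER_WATT


if __name__ == "__main__":
    # Hypothetical accelerators, NOT data points from the chart.
    for name, gops, watts in [("accel_a", 4000.0, 2.0), ("accel_b", 500.0, 5.0)]:
        print(name, gops_per_watt(gops, watts), above_reference(gops, watts))
```

Since the vendor figures mix precisions from INT8 to FP32, such a comparison is only a rough ordering, not a like-for-like ranking.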
24
Benchmark performance of DL accelerators
[Figure: scatter plot of YoloV4 benchmark performance [GOPS] (10 to 10,000, logarithmic) versus power [Watt] (2 to 128, logarithmic), with series for INT8, FP16 and FP32. Data labels are not recoverable.]
25
Benchmark performance of DL accelerators
 Comparison based on currently available architectures
 VEDLIoT will include new specialized accelerators
[Figure: bar chart of energy efficiency [GOPS/W] (axis 0 to 350) for Coral (M.2), Coral (Dev.), Xavier AGX (LP), Xavier AGX (HP), Xavier NX, TX2, Nano, GTX1660, ZU15, ZU3, Xeon-D1577, Epyc3451, Myriad and GAP8, running ResNet50, YoloV4 and MobileNet at INT8, FP16 and FP32.]
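Unlike the vendor peak values, a benchmark efficiency in GOPS/W is derived from measured throughput and measured power. The Python sketch below shows one plausible way to compute it; the slides do not state the exact method used in VEDLIoT, so the formula, the function names and all numbers are assumptions for illustration.

```python
def measured_gops(ops_per_inference: float, inferences_per_second: float) -> float:
    """Achieved throughput in GOPS from the model's operation count and frame rate."""
    return ops_per_inference * inferences_per_second / 1e9


def measured_efficiency(
    ops_per_inference: float,
    inferences_per_second: float,
    avg_power_watts: float,
) -> float:
    """Achieved energy efficiency in GOPS/W."""
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return measured_gops(ops_per_inference, inferences_per_second) / avg_power_watts


if __name__ == "__main__":
    # Hypothetical workload: 8e9 operations per inference, 50 inferences/s, 20 W.
    print(measured_gops(8e9, 50.0))
    print(measured_efficiency(8e9, 50.0, 20.0))
```

Efficiency computed this way depends on the model as well as the device, which is why the chart reports each network and precision separately.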
26
Summary
• VEDLIoT provides a scalable, modular and heterogeneous hardware platform for
next-generation AIoT applications
• Wide variety of available microservers with industry-proven form factors
• The integrated, flexible and reconfigurable communication infrastructure enables
tight coupling between microservers, resulting in high energy efficiency and
performance
• Integrated management and monitoring enable comprehensive application
benchmarking and characterization
27
Agenda
▪ 11:30 – 12:00 EEST (10:30 – 11:00 CEST)
Introduction to VEDLIoT
Pedro Trancoso (Chalmers University of Technology)
▪ 12:00 – 12:25 EEST (11:00 – 11:25 CEST)
VEDLIoT Hardware Platforms
Kevin Mika (Bielefeld University)
▪ 12:25 – 12:45 EEST (11:25 – 11:45 CEST)
Performance Evaluation and Benchmarking
in VEDLIoT
Mario Porrmann (Osnabrück University)
HAccIoT: Heterogeneous Hardware
Acceleration for Edge and IoT
28
Thank you for your
attention.
Contact
Kevin Mika
Bielefeld University, Germany
kmika@cit-ec.uni-bielefeld.de
34
Flexible Accelerators for Deep Learning
DL
Model
DL Model
CPU, GPU-
SoC,
ML-SoC
FPGA-SoC
 End of Moore’s law & dark silicon
=> Domain Specific Architectures (DSA)
 Efficient, flexible, scalable accelerators
for the compute continuum
 Algotecture
 Optimized DL algorithms
 Optimized toolchain
 Optimized computer architecture
Heterogeneous DL
Accelerator
Algotecture/
Co-Designed DL
Accelerator
Compiler
Co-Design
35
VEDLIoT's Deep Learning Toolchain
• Image
Classification
• Object Detection
• Semantic
Segmentation
• Instance
Segmentation
• Extractive
Question
Answering
Model Zoo Optimization
Engine
Compilers &
Runtime APIs
Heterogeneous
Hardware
Platforms
36
 Platform
 Hardware: Scalable, heterogeneous, distributed
 Accelerators: Efficiency boost by FPGA and ASIC technology
 Toolchain: Optimizing Deep Learning for IoT
 Use cases
 Industrial IoT
 Automotive
 Smart Home
 Open call
 At project mid-term
 Early use and evaluation of VEDLIoT technology
Very Efficient Deep Learning for IoT –
VEDLIoT
 Call: H2020-ICT2020-1
 Topic: ICT-56-2020 Next Generation Internet of Things
 Duration: 1 November 2020 – 31 October 2023
 Coordinator: Bielefeld University (Germany)
 Overall budget: 7 996 646.25 €
 Consortium: 12 partners from 4 EU countries (Germany,
Poland, Portugal and Sweden) and one associated
country (Switzerland).
More info:
 https://www.vedliot.eu/
 https://twitter.com/VEDLIoT
 https://www.linkedin.com/company/vedliot/
37
 Bielefeld University (UNIBI) - Coordinator
 Christmann (CHR)
 University of Osnabrück (UOS)
 Siemens (SIEMENS)
 University of Neuchâtel (UNINE)
 University of Lisbon (FC.ID)
 Chalmers (CHALMERS)
 University of Gothenburg (UGOT)
 RISE (RISE)
 EmbeDL (EMBEDL)
 Veoneer (VEONEER)
 Antmicro (ANT)
Partners
38
Big Picture
39
 Increase safety, health and well-being of residents – acceleration of AI
methods for demand-oriented user-home interaction
 Smart Mirror as central user interface
 Own mirror image can be seen normally
 Intuitive control via gesture and voice
 Shows personalized information
 Data Privacy as the highest priority
 Edge computation of many neural networks
Use case: Smart Home / Assisted Living
40
 Face recognition
 MobileNet SSD trained on the WIDER FACE dataset
 Object detection
 YoloV3, EfficientNet, YoloV4-tiny
 Gesture detection
 YoloV4-tiny with 3 Yolo layers (usually: 2 layers)
 Speech recognition
 Mozilla DeepSpeech
 AI art: StyleGAN trained on works of art
 Collect usage data in situation memory
Use case: Smart Mirror – Neural Networks
41
Use case: Industrial IoT – drive condition
classification
 Control applications need DL-based condition classification
 On the edge device for low power consumption
 Suggestions for control and maintenance
 DL methods on all communication layers
 DL in a distributed architecture
 Dynamically configured systems
 Sensor-equipped test bench with 2 motors
 Acceleration, magnetic field, temperature,
IR camera (temperature), current sensors, torque
 On / Off detection without
motor current or voltage
 Cooling fault detection
 Bearing fault detection
42
Use case: Industrial IoT – Arc detection
 AI based pattern recognition for different local sensor data
 current, magnetic field, vibration, temperature, low resolution infrared picture
 Safety critical nature
 response time should be <10ms
 AI based or AI supported decision made by the sensor node itself or by a local part of the sensor
network
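The response-time requirement can be treated as a latency budget shared by every stage between sensing and reaction. The Python sketch below sums assumed per-stage latencies and checks them against the 10 ms limit from the slide; the stage names and their values are hypothetical, not measurements from the use case.

```python
from dataclasses import dataclass

BUDGET_MS = 10.0  # from the slide: response time should be < 10 ms


@dataclass
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages: list[Stage]) -> float:
    """End-to-end latency of a strictly sequential pipeline."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages: list[Stage], budget_ms: float = BUDGET_MS) -> bool:
    """True if the pipeline responds strictly faster than the budget."""
    return total_latency_ms(stages) < budget_ms


if __name__ == "__main__":
    # Hypothetical stage latencies for a decision made on the sensor node itself.
    pipeline = [
        Stage("sensor acquisition", 2.0),
        Stage("pre-processing", 1.5),
        Stage("DL inference", 4.0),
        Stage("trip signal", 1.0),
    ]
    print(total_latency_ms(pipeline), within_budget(pipeline))
```

Adding a network hop to a remote node would add a further stage to the list, which is one reason the decision is made on the sensor node or in a local part of the sensor network.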
43
 Focus on collision detection/avoidance scenario
 Improve performance/cost ratio – AI processing hardware
distributed over the entire chain
Use case: Automotive
44
Follow our work
https://twitter.com/VEDLIoT
https://www.linkedin.com/company/vedliot/
https://vedliot.eu
Be part of it
Open call NOW!
Allow early use and evaluation of VEDLIoT
technology
45
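[Diagram, u.RECS architecture build-up: module slot 1 (SMARC 2.1 or RaspPi CM4; FPGA, x86 or ARM), module slot 2 (Jetson NX, Xilinx Kria, RaspPi CM4 or ARVSOM; GPU SoC or FPGA), an M.2 M-Key slot (accelerator / storage) and an mPCIe slot (accelerator / communication); module I/O labels: HDMI, USB 3, USB 2, CSI, GPIO.]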
46
[Diagram, u.RECS architecture build-up: module and accelerator slots with HDMI, USB 3, USB 2, CSI and GPIO, plus an ESP32 BMC with LoRa, a power connector and power sub-system, and ADCs for power measurement on the individual power rails.]
47
[Diagram, u.RECS architecture build-up: module and accelerator slots with HDMI, USB 3, USB 2, CSI and GPIO, the ESP32 BMC with LoRa, and the power sub-system with ADCs for power measurement, plus PCIe x1 and x4 links between modules and accelerator slots and a Com Brick.]
48
[Diagram, u.RECS architecture build-up: module and accelerator slots with HDMI, USB 2, CSI and GPIO, the ESP32 BMC with LoRa, the power sub-system with ADCs for power measurement, PCIe x1 and x4 links and a Com Brick, plus the USB 3 mux, a USB3 hub and a WiFi/BLE connector.]

HiPEAC-CSW 2022_Kevin Mika presentation


Editor's Notes

  • #17 Energy-aware: benchmark everything for energy in addition to performance.
  • #23 Much module, so wow: you can build the system you need!
  • #24 Extensive benchmarking campaigns. This chart is based on the peak performance given by the manufacturers, with logarithmic scaling. The dotted line shows an average of 1 TOPS/W.
  • #25 Real-world benchmarks using our own dataset and power measurements. The best performers are currently GPU-based accelerators from NVIDIA. This is an ongoing activity, and we are looking at promising new architectures. ResNet50, MobileNet v3.
  • #26 Real-world benchmarks using our own dataset and power measurements, far away from 1 TOPS/W. The best performers are currently GPU-based accelerators from NVIDIA. This is an ongoing activity, and we are looking at promising new architectures. ResNet50, MobileNet v3.
  • #27 Since we have all these FPGA accelerators