Information Classification: General
December 8-10 | Virtual Event
Klessydra-T: Designing Vector Coprocessors for
Multi-Threaded Edge-Computing Cores
Mauro Olivieri
Professor
Sapienza University of Rome
#RISCVSUMMIT
DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME
• Francesco Lannutti, collaborator @ Synopsys
• Marcello Barbirotta, PhD candidate
• Mauro Olivieri, Associate Professor
• Francesco Menichelli, Assistant Professor
• Antonio Mastrandrea, Research Fellow
• Abdallah Cheikh, Research Fellow
• Luigi Blasi, PhD candidate @ DSI GmbH
• Francesco Vigli, PhD candidate @ ELT SpA
• Stefano Sordillo, PhD candidate
OUTLINE
INTRODUCTION & MOTIVATION
THE KLESSYDRA-T ARCHITECTURE
• Interleaved Multi-Threading baseline
• Parameterized vector acceleration schemes
• Klessydra vector intrinsic functions
BENCHMARK WORKLOADS
• Convolution, Matmul, FFT
• Homogeneous and composite workload
RESULTS
• Cycle count and absolute execution time
• Maximum clock frequency and hardware resource utilization
• Energy efficiency
CONCLUSIONS
APPLICATION CONTEXT AND MOTIVATION
• There are recognized drivers toward (extreme) edge computing: availability, energy saving, security, etc., with implications for both SW design and HW design
• HW design challenges of extreme edge computing devices:
  • Local energy budget
  • Cost & size
  • Computing power
• General setting:
  • Possibly taking advantage of inherently multi-threaded application routines
  • Inevitability of hardware acceleration support
THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
• PULPino feat. Klessydra S0 core (starting point)
  • M mode v1.10
  • RV32I user ISA
  • single hart
• PULPino feat. Klessydra T0 cores
  • M mode v1.10
  • RV32I user ISA
  • Atomic ext. (partial)
  • multiple PC & CSR
  • multiple interleaved harts
• PULPino feat. Klessydra F0 cores ("space-qualified" core)
  • T0 microarchitecture
  • + configurable HW/SW fault-tolerance support
• PULPino feat. Klessydra T1 cores ("edge computing" core)
  • extends T0 microarchitecture
  • RV32IM
  • + configurable multiple scratchpad memories
  • + configurable vector unit
  • extended ISA
THE KLESSYDRA IMT MICROARCHITECTURE
• Baseline Klessydra-T03 core features:
  • Thread context switch at each clock cycle
  • In-order, single-issue instruction execution
  • Feed-forward pipeline structure (no hardware support for pipeline hazard handling)
  • Bare-metal execution (RISC-V M mode)
• The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture.
[Figure: Klessydra-T03 pipeline block diagram. Labels: Fetch, Decode, Execute, WB; Prg Mem (program memory), Data Mem (data memory); Regfile; one PC and CSR set per hart (hart a, hart b, hart c); harc Updater; Debug.]
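The per-cycle thread switch can be pictured with a small model. The sketch below is a hypothetical illustration, not code from the Klessydra sources: it assumes three harts selected in plain round-robin order starting from hart 0, and the names `NUM_HARTS`, `fetch_hart` and `result_ready_in_time` are invented for this example. Under that assumption, two consecutive instructions of the same hart enter the pipeline `NUM_HARTS` cycles apart, which is one way to see why a pipeline without hazard-handling hardware can still be usable.

```c
#include <stdint.h>

/* Hypothetical model: NUM_HARTS hardware threads share one pipeline and
 * the fetch stage serves a different hart on every clock cycle. */
#define NUM_HARTS 3u

/* Hart served by the fetch stage at a given cycle, assuming plain
 * round-robin order starting from hart 0 (the order is an assumption). */
static inline uint32_t fetch_hart(uint64_t cycle)
{
    return (uint32_t)(cycle % NUM_HARTS);
}

/* In this simplified model, a result that becomes available `latency`
 * cycles after its instruction was fetched can be consumed by the next
 * instruction of the same hart only if latency <= NUM_HARTS. */
static inline int result_ready_in_time(uint32_t latency)
{
    return latency <= NUM_HARTS;
}
```

For example, `fetch_hart` cycles through harts 0, 1, 2 and returns to hart 0 at cycle 3, so each hart sees the other two harts' instructions between any two of its own.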
THE KLESSYDRA-T1 MICROARCHITECTURE FAMILY
Klessydra-T13 core features
• Multiple units in the execution stage:
  • scalar execution unit (EXEC)
  • vector-oriented multi-purpose functional unit (MFU) with scratchpad memory (SPM) support
  • load/store unit (LSU)
• Possible concurrent execution of instructions of different types
HARDWARE ACCELERATION PARAMETRIC SCHEMES
The parametric coprocessor architecture in T13 cores, comprising the MFU and the scratchpad memory interfaces (SPMIs), can be configured at synthesis time through the following parameters:
• the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
• the number of MFUs F
• the SPM bank capacity B
• the number of SPMs N
• the number of SPMIs M
• the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
Resulting schemes:
• M=1, F=1, D=1: SISD
• M=1, F=1, D=2,4,8: Pure SIMD
• M=3, F=3, D=1: Symmetric MIMD
• M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
• M=3, F=1, D=1: Heterogeneous MIMD
• M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
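The six parameter combinations listed above can be read as a small lookup from (M, F, D) to a scheme name. The following is only a sketch of that mapping: the `coproc_cfg` struct and `scheme_name` function are hypothetical names introduced here, and any combination not shown on the slide is reported as "unlisted" rather than guessed.

```c
/* Synthesis-time parameters named on the slide. The struct is a
 * hypothetical illustration, not part of the Klessydra sources. */
typedef struct {
    unsigned M; /* number of SPMIs */
    unsigned F; /* number of MFUs */
    unsigned D; /* parallel lanes per MFU (DLP degree) */
} coproc_cfg;

/* Map a configuration to the scheme name used on the slide.
 * Only the combinations listed there are recognized. */
static const char *scheme_name(coproc_cfg c)
{
    int simd = (c.D == 2 || c.D == 4 || c.D == 8);

    if (c.D != 1 && !simd)
        return "unlisted";
    if (c.M == 1 && c.F == 1)
        return simd ? "Pure SIMD" : "SISD";
    if (c.M == 3 && c.F == 3)
        return simd ? "Symmetric MIMD + SIMD" : "Symmetric MIMD";
    if (c.M == 3 && c.F == 1)
        return simd ? "Heterogeneous MIMD + SIMD" : "Heterogeneous MIMD";
    return "unlisted";
}
```

In this reading, F equal to M means every hart gets its own MFU (symmetric), while F=1 with M=3 means three SPMIs share a single MFU (heterogeneous).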
KLESSYDRA VECTOR EXTENSION AND INTRINSIC FUNCTIONS
In the assembly syntax, (r) denotes memory addressing via register r.
Assembly syntax             Short description
kmemld    (rd),(rs1),(rs2)  load vector into scratchpad region
kmemstr   (rd),(rs1),(rs2)  store vector into main memory
kaddv     (rd),(rs1),(rs2)  add vectors in scratchpad region
ksubv     (rd),(rs1),(rs2)  subtract vectors in scratchpad region
kvmul     (rd),(rs1),(rs2)  multiply vectors in scratchpad region
kvred     (rd),(rs1),(rs2)  reduce vector by addition
kdotp     (rd),(rs1),(rs2)  vector dot product into register
ksvaddsc  (rd),(rs1),(rs2)  add vector + scalar into scratchpad
ksvaddrf  (rd),(rs1),rs2    add vector + scalar into register
ksvmulsc  (rd),(rs1),(rs2)  multiply vector by scalar into scratchpad
ksvmulrf  (rd),(rs1),rs2    multiply vector by scalar into register
kdotpps   (rd),(rs1),(rs2)  vector dot product and post scaling
ksrlv     (rd),(rs1),rs2    vector logic shift within scratchpad
ksrav     (rd),(rs1),rs2    vector arithmetic shift within scratchpad
krelu     (rd),(rs1)        vector ReLU within scratchpad
kvslt     (rd),(rs1),(rs2)  compare vectors and create mask vector
ksvslt    (rd),(rs1),rs2    compare vector-scalar and create mask
kvcp      (rd),(rs1)        copy vector within scratchpad region
The instructions supported by the coprocessor sub-system are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V gcc compiler toolchain.
CSR_MVSIZE(Row_size); // set vector length
for (i = Zeropad_offset; i < Row_size - Zeropad_offset; i++) { // scan the Output Matrix rows
    k_element = 0;
    for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
        for (column_offset = 0; column_offset < kernel_size; column_offset++) {
            FM_offset = (i + FM_row_pointer) * Row_size + column_offset; // set pointer in SPM space
            ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
            ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed-point alignment
            OM_offset = (Row_size * i) + Zeropad_offset; // set pointer in SPM space
            kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
        }
    }
}
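To make the inner loop body easier to follow, here is a scalar reference model of the three intrinsics it uses. This is a sketch based only on the short descriptions in the instruction table: the `ref_` function names are hypothetical, the explicit length `n` stands in for the value set through `CSR_MVSIZE` and is assumed to count 32-bit elements, and products are assumed to fit in 32 bits. It is not the Klessydra implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[i] * (*scalar); products are assumed to fit in 32 bits. */
static void ref_ksvmulsc(int32_t *dst, const int32_t *src,
                         const int32_t *scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (int32_t)((int64_t)src[i] * (int64_t)*scalar);
}

/* dst[i] = src[i] >> shift as an arithmetic shift. Negative values are
 * handled through complements so the result does not depend on how the
 * compiler shifts negative integers. shift is assumed to be below 32. */
static void ref_ksrav(int32_t *dst, const int32_t *src,
                      unsigned shift, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] < 0 ? ~(~src[i] >> shift) : src[i] >> shift;
}

/* dst[i] = a[i] + b[i]; dst may alias a or b. */
static void ref_kaddv(int32_t *dst, const int32_t *a,
                      const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* One multiply, scale, accumulate step, mirroring the inner loop body:
 * tmp = row * coeff; tmp >>= shift; acc += tmp. */
static void ref_mac_step(int32_t *acc, const int32_t *row,
                         const int32_t *coeff, unsigned shift,
                         int32_t *tmp, size_t n)
{
    ref_ksvmulsc(tmp, row, coeff, n);
    ref_ksrav(tmp, tmp, shift, n);
    ref_kaddv(acc, acc, tmp, n);
}
```

Each call to `ref_mac_step` corresponds to one kernel coefficient being applied to a whole feature-map row, which is why the convolution needs no per-element loop in the intrinsic version.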
BENCHMARK WORKLOADS AND EVALUATION SETUP
• 2D convolution
  • 32-bit data elements in fixed-point representation
  • 3x3 filter size
  • matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
  • additional analysis of filter sizes larger than 3x3 on 32x32 matrices
• FFT
  • 256 complex samples
• Matmul
  • square matrices of 64x64 elements
• Homogeneous workload (3 harts running the same program)
• Composite workload (3 harts running different programs)
ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
• Average total cycle count per hart
• Maximum clock frequency
• Absolute execution time
• Hardware resource utilization
• Average energy per algorithmic operation
SUMMARY OF PERFORMANCE RESULTS
• 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
• 2X cycle count speed-up compared to the single-threaded, DSP-extended RI5CY core

MICROARCHITECTURE SYNTHESIS RESULTS (FPGA element utilization and maximum frequency)
Core / configuration  DLP     FF    LUT  B-RAM  DSP  LUT-RAM  Max freq MHz
T13 SISD                1   2488   6982      6   11      264         144.4
T13 SIMD                2   2627   8400      6   15      264         146.0
T13 SIMD                4   3301  11366      6   23      264         137.2
T13 SIMD                8   4800  17331     12   39      264         137.7
T13 Sym. MIMD           1   3512  10458     18   19      264         148.2
T13 Sym. MIMD + SIMD    2   4712  15943     18   31      264         131.7
T13 Sym. MIMD + SIMD    4   6753  25089     18   55      264         120.0
T13 Sym. MIMD + SIMD    8  10854  43419     36  103      264         105.1
T13 Het. MIMD           1   3012  10182     18   11      264         117.2
T13 Het. MIMD + SIMD    2   3871  15577     18   15      264         128.9
T13 Het. MIMD + SIMD    4   5015  23282     18   23      264         122.0
T13 Het. MIMD + SIMD    8   7325  42944     36   39      264         108.6
Klessydra T03           -   1418   4281      0    7      176         221.1
RI5CY                   -   2527   7674      0    6        0          91.4
ZeroRiscy               -   1933   5275      0    1        0         117.2

AVERAGE CYCLE COUNT PER COMPUTATION KERNEL (homogeneous workload)
Core / configuration  DLP  Conv 4x4  Conv 8x8  Conv 16x16  Conv 32x32  FFT 256  MatMul 64x64
T13 SISD                1      1105      3060        9727       34201    33033        728187
T13 SIMD                2       895      2245        6261       20374    25647        602458
T13 SIMD                4       824      1768        4607       13444    22812        543164
T13 SIMD                8       824      1613        3692       10069    21555        484436
T13 Sym. MIMD           1       626      1493        3887       13536    18726        462066
T13 Sym. MIMD + SIMD    2       629      1190        3123        8681    16827        378748
T13 Sym. MIMD + SIMD    4       560      1190        2543        7148    15993        328962
T13 Sym. MIMD + SIMD    8       560      1152        2543        6006    15726        316270
T13 Het. MIMD           1       663      1521        4153       13565    22839        556463
T13 Het. MIMD + SIMD    2       638      1274        3280        9167    18468        425978
T13 Het. MIMD + SIMD    4       573      1213        2688        7473    16887        360863
T13 Het. MIMD + SIMD    8       573      1079        2580        6285    17604        328178
Klessydra T03           -      1819      5737       20714       79230    47256       2679304
RI5CY                   -      1377      4247       15088       57020    37344       1360854
ZeroRiscy               -      2510      8111       29583      113793    61158       4006241

AVERAGE CYCLE COUNT PER COMPUTATION KERNEL (composite workload)
Core / configuration  DLP  Conv 32x32  FFT 256  MatMul 64x64
T13 SISD                1       66043    80874        476771
T13 SIMD                2       21976    60019        645705
T13 SIMD                4       16850    29144        431773
T13 SIMD                8       11324    22482        414420
T13 Sym. MIMD           1       20953    17824        292564
T13 Sym. MIMD + SIMD    2       16144    15839        222370
T13 Sym. MIMD + SIMD    4       15868    14942        182580
T13 Sym. MIMD + SIMD    8       15581    14613        168031
T13 Het. MIMD           1       27155    37111        265567
T13 Het. MIMD + SIMD    2       15973    24611        251201
T13 Het. MIMD + SIMD    4       16042    19175        181290
T13 Het. MIMD + SIMD    8       13921    17298        187877
Klessydra T03           -      138959    46733       2775779
RI5CY                   -       81534    37350       1369572
ZeroRiscy               -      197010    61163       4043376
• The clock speed exhibited the sharpest drops as the DLP grew larger.
• In the symmetric MIMD scheme, the large HW overhead forced FPGA slices on the same critical path to be placed far from each other, thus increasing interconnect delay.
• Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.
SUMMARY OF PERFORMANCE RESULTS
• Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
• Large matrix convolutions and MatMul benefit from vector acceleration, reaching 9X cycle count reduction relative to RI5CY.
• Assuming maximum clock frequency for each core
• ZeroRiscy core taken as common reference
• In pure SIMD configurations, the speed-up grows linearly with the DLP
• Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
• The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over ZeroRiscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
• Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
• The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy
ABSOLUTE EXECUTION TIME SPEED-UP
[Figure: chart of absolute execution time speed-up (vertical axis 0 to 18) per configuration: SISD (DLP 1); pure SIMD (DLP 2, 4, 8); Sym. MIMD (DLP 1); Sym. MIMD+SIMD (DLP 2, 4, 8); Het. MIMD (DLP 1); Het. MIMD+SIMD (DLP 2, 4, 8); Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
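Absolute execution time follows from the cycle counts and maximum clock frequencies in the results table. A minimal sketch of that arithmetic is shown below; the function names `exec_time_us` and `time_speedup` are hypothetical and introduced only for this example.

```c
/* Execution time in microseconds from a cycle count and a clock in MHz
 * (cycles divided by cycles-per-microsecond). */
static double exec_time_us(double cycles, double freq_mhz)
{
    return cycles / freq_mhz;
}

/* Speed-up of a core over a reference core, each running at its own
 * maximum clock frequency. */
static double time_speedup(double ref_cycles, double ref_mhz,
                           double cycles, double mhz)
{
    return exec_time_us(ref_cycles, ref_mhz) / exec_time_us(cycles, mhz);
}
```

Taking the homogeneous Conv 32x32 figures from the table, ZeroRiscy needs 113793 cycles at 117.2 MHz (about 971 us) and the symmetric MIMD+SIMD configuration with DLP 8 needs 6006 cycles at 105.1 MHz (about 57 us), so `time_speedup(113793, 117.2, 6006, 105.1)` comes out close to 17, consistent with the 17X figure quoted above.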
ENERGY EFFICIENCY
• The result of this analysis is expressed as energy per algorithmic operation, for the FPGA soft-core implementations, normalized to ZeroRiscy, taken as reference.
• The most energy-efficient designs were the T13 symmetric MIMD configurations.
• The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD.
• The pure SIMD schemes resulted in a larger energy consumption than the other schemes, due to the impossibility of efficiently exploiting TLP.
[Figure: chart of energy per algorithmic operation normalized to ZeroRiscy (vertical axis 0 to 1.3) per configuration: SISD (DLP 1); pure SIMD (DLP 2, 4, 8); Sym. MIMD (DLP 1); Sym. MIMD+SIMD (DLP 2, 4, 8); Het. MIMD (DLP 1); Het. MIMD+SIMD (DLP 2, 4, 8); Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
LARGER CONVOLUTION FILTERS
Core             DLP  Filter 5x5                    Filter 7x7
                      Cycles x1000  T (us)  E (uJ)  Cycles x1000  T (us)  E (uJ)
T13 SIMD           2          52.7     362    50.6         101.2     694    97.1
T13 SIMD           8          24.6     179    34.4          46.1     335    64.5
T13 Sym MIMD       2          19.5     148    26.9          35.8     272    49.4
T13 Sym MIMD       8          11.8     113    28.9          19.2     183    46.9
T13 Het MIMD       2          20.5     159    28.3          37.5     291    51.8
T03 (no accel.)    -           247    1120   215.5         514.8    2328   447.9
RI5CY              -           180    1971   252.0         385.3    4218   539.4
ZeroRiscy          -         318.9    2721   226.4         674.5    5754   478.9

Core             DLP  Filter 9x9                    Filter 11x11
                      Cycles x1000  T (us)  E (uJ)  Cycles x1000  T (us)  E (uJ)
T13 SIMD           2         165.8    1136   159.1         246.5    1689   236.6
T13 SIMD           8          74.7     543   104.7         110.6     803   154.8
T13 Sym MIMD       2          57.4     436    79.2          84.4     641   116.5
T13 Sym MIMD       8          29.8     284    72.7          42.9     408   104.7
T13 Het MIMD       2          60.2     467    83.1          88.5     687   122.1
T03 (no accel.)    -         881.2    3985   766.6        1369.1    6191  1191.1
RI5CY              -         662.5    7252   927.5        1000.2   10949  1400.3
ZeroRiscy          -        1129.7    9637   802.1        1697.8   14482  1205.4
• The matrix being convolved is 32x32 elements
• The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching 35X speed-up over the ZeroRiscy reference
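The time and energy columns of the table can be combined with a few lines of arithmetic. The sketch below uses hypothetical helper names (`speedup`, `energy_saving_pct`, `avg_power_w`) and assumes that the T and E values in a table row refer to the same run.

```c
/* Speed-up in absolute time: reference time over measured time. */
static double speedup(double ref_time_us, double time_us)
{
    return ref_time_us / time_us;
}

/* Energy saving in percent relative to a reference energy. */
static double energy_saving_pct(double ref_energy_uj, double energy_uj)
{
    return 100.0 * (1.0 - energy_uj / ref_energy_uj);
}

/* Average power in watts: microjoules per microsecond equals watts. */
static double avg_power_w(double energy_uj, double time_us)
{
    return energy_uj / time_us;
}
```

With the 11x11 filter values, ZeroRiscy takes 14482 us and 1205.4 uJ while T13 Sym MIMD with DLP 8 takes 408 us and 104.7 uJ. That gives a speed-up of about 35.5 and an energy saving of about 91%. The average power works out higher for the accelerated core (about 0.26 W against about 0.08 W), so under this reading the energy gain comes from finishing much sooner rather than from drawing less power.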
CONCLUSIONS
• The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP
  • >15X absolute time speed-up, -85% energy per operation
• Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core
  • 2X-3X speed-up
• Fully symmetric MIMD and heterogeneous MIMD give very similar results
  • functional unit contention has less impact than SPM contention
  • coprocessor contention can be effectively mitigated by functional unit heterogeneity
• Pure DLP acceleration always gives inferior results compared to a balanced TLP/DLP acceleration
  • the IMT microarchitecture benefits from TLP and DLP acceleration in a single core
• In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution
  • simplified hardware structure philosophy
Thank you for joining
Contribute to the RISC-V conversation on social!
#RISCVSUMMIT #KLESSYDRA @mauro_olivieri_
https://github.com/klessydra
Mauro.Olivieri@uniroma1.it

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded Edge-Computing Soft-Cores

  • 1. Information Classification: General. December 8-10 | Virtual Event. Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores. Mauro Olivieri, Professor, Sapienza University of Rome. #RISCVSUMMIT
  • 2. DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME
      • Francesco Lannutti, collaborator @Synopsys
      • Marcello Barbirotta, PhD candidate
      • Mauro Olivieri, Associate Professor
      • Francesco Menichelli, Assistant Professor
      • Antonio Mastrandrea, Research Fellow
      • Abdallah Cheikh, Research Fellow
      • Luigi Blasi, PhD cand. @DSI Gmbh
      • Francesco Vigli, PhD cand. @ ELT Spa
      • Stefano Sordillo, PhD candidate
  • 3. OUTLINE
      • INTRODUCTION & MOTIVATION
      • THE KLESSYDRA-T ARCHITECTURE
          • Interleaved Multi-Threading baseline
          • Parameterized vector acceleration schemes
          • Klessydra vector intrinsic functions
      • BENCHMARK WORKLOADS
          • Convolution, Matmul, FFT
          • Homogeneous and composite workload
      • RESULTS
          • Cycle count and absolute execution time
          • Maximum clock frequency and hardware resource utilization
          • Energy efficiency
      • CONCLUSIONS
  • 4. APPLICATION CONTEXT AND MOTIVATION
      • There are recognized drivers towards (extreme) edge computing: availability, energy saving, security, etc., with implications for both SW design and HW design
      • HW design challenges of extreme edge computing devices:
          • Local energy budget
          • Cost & size
          • Computing power
      • General setting:
          • Possibly taking advantage of inherently multi-threaded application routines
          • Inevitability of hardware acceleration support
  • 5. THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
      • PULPino feat. Klessydra S0 core: starting point
          • M mode v1.10
          • RV32I user ISA
          • single hart
      • PULPino feat. Klessydra T0 cores
          • M mode v1.10
          • RV32I user ISA
          • Atomic ext. (partial)
          • multiple PC & CSR
          • multiple interleaved harts
      • PULPino feat. Klessydra F0 cores: “space-qualified” core
          • T0 microarchitecture
          • + configurable HW/SW fault-tolerance support
      • PULPino feat. Klessydra T1 cores: “edge computing” core
          • extends T0 microarchitecture
          • RV32IM
          • + configurable multiple scratchpad memories
          • + configurable vector unit
          • extended ISA
  • 6. THE KLESSYDRA IMT MICROARCHITECTURE
      • Baseline Klessydra T03 core features:
          • Thread context switch at each clock cycle
          • in-order, single issue instruction execution
          • feed-forward pipeline structure (no hardware support for pipeline hazard handling)
          • bare metal execution (RISC-V M mode)
      • The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture.
      [Figure: pipeline block diagram. Block labels: Fetch, Decode, Execute, WB, Regfile, PC, CSR, Debug, harc Updater, Program memory, Data memory, harts a, b, c]
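The two baseline features "thread context switch at each clock cycle" and "no hardware support for pipeline hazard handling" are related, and a toy model may help show how. The sketch below assumes a strict round-robin over three harts (the slide shows harts a, b, c); the function names and the `latency` parameter are illustrative and are not taken from the Klessydra sources.

```c
#include <assert.h>

/* Assumption: three harts, as suggested by "hart a, hart b, hart c". */
enum { NUM_HARTS = 3 };

/* Hart whose instruction is fetched in a given clock cycle, assuming a
   strict round-robin context switch at every cycle. */
unsigned hart_at_cycle(unsigned long cycle)
{
    return (unsigned)(cycle % NUM_HARTS);
}

/* Number of cycles between two consecutive fetches of the same hart. */
unsigned same_hart_fetch_distance(void)
{
    return NUM_HARTS;
}

/* Nonzero when an instruction that needs `latency` cycles to produce its
   result has finished before the next instruction of the same hart is
   fetched, i.e. when no interlock or forwarding would be needed. */
int result_ready_before_next_fetch(unsigned latency)
{
    assert(latency > 0);
    return latency <= same_hart_fetch_distance();
}
```

Under these assumptions, `hart_at_cycle` yields 0, 1, 2, 0, 1, 2, and so on, so two instructions of the same hart are never in adjacent pipeline slots. This is the general barrel-style interleaving argument, not a statement about the exact pipeline depth of the T03 core.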
  • 7. THE KLESSYDRA-T1 MICROARCHITECTURE FAMILY
      Klessydra T13 core features
      • multiple units in the execution stage
          • scalar execution unit (EXEC)
          • vector-oriented multi-purpose functional unit (MFU) with Scratchpad Memory support
          • Load/Store unit (LSU)
      • possible concurrent execution of instructions of different types
  • 8. HARDWARE ACCELERATION PARAMETRIC SCHEMES
      The parametric coprocessor architecture in T13 cores, comprising the MFU and the SPMIs, can be configured at synthesis level according to the following values:
      • the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
      • the number of MFUs F
      • the SPM bank capacity B
      • the number of SPMs N
      • the number of SPMIs M
      • the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
      Resulting schemes:
      • M=1, F=1, D=1: SISD
      • M=1, F=1, D=2,4,8: Pure SIMD
      • M=3, F=3, D=1: Symmetric MIMD
      • M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
      • M=3, F=1, D=1: Heterogeneous MIMD
      • M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
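The six (M, F, D) combinations listed on this slide can be written as a small lookup. This is a minimal sketch: the enum and function names are hypothetical, and any combination not listed on the slide is reported as unlisted rather than guessed.

```c
#include <assert.h>

/* Hypothetical names for the schemes listed on the slide. */
typedef enum {
    SCHEME_SISD,
    SCHEME_PURE_SIMD,
    SCHEME_SYM_MIMD,
    SCHEME_SYM_MIMD_SIMD,
    SCHEME_HET_MIMD,
    SCHEME_HET_MIMD_SIMD,
    SCHEME_UNLISTED          /* combination not shown on the slide */
} scheme_t;

/* m = number of SPMIs, f = number of MFUs, d = number of MFU lanes (DLP). */
scheme_t classify_scheme(int m, int f, int d)
{
    assert(m > 0 && f > 0 && d > 0);

    int simd = (d == 2 || d == 4 || d == 8);
    if (d != 1 && !simd)
        return SCHEME_UNLISTED;

    if (m == 1 && f == 1)
        return simd ? SCHEME_PURE_SIMD : SCHEME_SISD;
    if (m == 3 && f == 3)
        return simd ? SCHEME_SYM_MIMD_SIMD : SCHEME_SYM_MIMD;
    if (m == 3 && f == 1)
        return simd ? SCHEME_HET_MIMD_SIMD : SCHEME_HET_MIMD;

    return SCHEME_UNLISTED;
}
```

For example, `classify_scheme(3, 1, 4)` returns `SCHEME_HET_MIMD_SIMD`: three SPMIs share a single four-lane MFU.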
  • 9. KLESSYDRA VECTOR EXTENSION AND INTRINSIC FUNCTIONS
      Assembly syntax: (r) denotes memory addressing via register r

      Assembly syntax              Short description
      kmemld (rd),(rs1),(rs2)      load vector into scratchpad region
      kmemstr (rd),(rs1),(rs2)     store vector into main memory
      kaddv (rd),(rs1),(rs2)       add vectors in scratchpad region
      ksubv (rd),(rs1),(rs2)       subtract vectors in scratchpad region
      kvmul (rd),(rs1),(rs2)       multiply vectors in scratchpad region
      kvred (rd),(rs1),(rs2)       reduce vector by addition
      kdotp (rd),(rs1),(rs2)       vector dot product into register
      ksvaddsc (rd),(rs1),(rs2)    add vector + scalar into scratchpad
      ksvaddrf (rd),(rs1),rs2      add vector + scalar into register
      ksvmulsc (rd),(rs1),(rs2)    multiply vector + scalar into scratchpad
      ksvmulrf (rd),(rs1),rs2      multiply vector + scalar into register
      kdotpps (rd),(rs1),(rs2)     vector dot product and post scaling
      ksrlv (rd),(rs1),rs2         vector logic shift within scratchpad
      ksrav (rd),(rs1),rs2         vector arithmetic shift within scratchpad
      krelu (rd),(rs1)             vector ReLU within scratchpad
      kvslt (rd),(rs1),(rs2)       compare vectors and create mask vector
      ksvslt (rd),(rs1),rs2        compare vector-scalar and create mask
      kvcp (rd),(rs1)              copy vector within scratchpad region

      The instructions supported by the coprocessor sub-system are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V gcc compiler toolchain.
      CSR_MVSIZE(Row_size); // set vector length
      for (i = Zeropad_offset; i < Row_size - Zeropad_offset; i++) { // scan the Output Matrix rows
          k_element = 0;
          for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
              for (column_offset = 0; column_offset < kernel_size; column_offset++) {
                  FM_offset = (i + FM_row_pointer) * Row_size + column_offset; // set pointer in SPM space
                  ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
                  ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed point alignment
                  OM_offset = (Row_size * i) + Zeropad_offset; // set pointer in SPM space
                  kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
              }
          }
      }
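The inner loop of the convolution uses three intrinsics in sequence: `ksvmulsc`, `ksrav` and `kaddv`. The following host-side reference model is a sketch of what that sequence is assumed to compute, useful for checking results on a PC. The `ref_` functions are hypothetical stand-ins, not part of the Klessydra toolchain; the vector length is passed explicitly as an element count (the real unit of `CSR_MVSIZE` is not stated on the slide), and the per-element semantics are inferred from the short descriptions in the instruction table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable arithmetic right shift (rounds toward minus infinity). */
int32_t asr32(int32_t x, unsigned shift)
{
    assert(shift < 32);
    return x < 0 ? ~(~x >> shift) : x >> shift;
}

/* Assumed meaning of ksvmulsc: dst[i] = src[i] * (*scalar). */
void ref_ksvmulsc(int32_t *dst, const int32_t *src,
                  const int32_t *scalar, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (int32_t)((int64_t)src[i] * (int64_t)*scalar);
}

/* Assumed meaning of ksrav: dst[i] = src[i] >> shift (arithmetic). */
void ref_ksrav(int32_t *dst, const int32_t *src, unsigned shift, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = asr32(src[i], shift);
}

/* Assumed meaning of kaddv: dst[i] = a[i] + b[i]. */
void ref_kaddv(int32_t *dst, const int32_t *a, const int32_t *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = a[i] + b[i];
}

/* One inner-loop step of the slide's code:
   acc += (row * coeff) >> shift, with tmp playing the role of SPM_D. */
void ref_conv_step(int32_t *acc, const int32_t *row, const int32_t *coeff,
                   unsigned shift, int32_t *tmp, size_t len)
{
    ref_ksvmulsc(tmp, row, coeff, len);
    ref_ksrav(tmp, tmp, shift, len);
    ref_kaddv(acc, acc, tmp, len);
}
```

With `row = {4, -8, 12}`, `coeff = 3`, `shift = 2` and an accumulator initialised to ones, one step leaves `{4, -5, 10}` in the accumulator. Each intrinsic processes a whole row segment per call, which is where the vector unit replaces an element-wise scalar loop.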
  • 10. BENCHMARK WORKLOADS AND EVALUATION SETUP
      • 2D convolution
          • 32-bit data elements in fixed-point representation
          • 3x3 filter size
          • matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
          • additional analysis of larger than 3x3 filter sizes on 32x32 matrices
      • FFT
          • 256 complex samples
      • Matmul
          • Square matrices of 64x64 elements
          • Homogeneous workload (3 harts running same program)
          • Composite workload (3 harts running different programs)
      ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
      • Average total cycle count per hart
      • Maximum clock frequency
      • Absolute execution time
      • Hardware Resource Utilization
      • Average energy per algorithmic operation
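  • 11. SUMMARY OF PERFORMANCE RESULTS
      • 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
      • 2X cycle count speed-up when compared to the single-threaded, DSP-extended RI5CY core

      MICROARCHITECTURE SYNTHESIS RESULTS (FPGA element utilization)

      Core           Configuration     DLP     FF    LUT  B-RAM  DSP  LUT-RAM  Max freq (MHz)
      Klessydra T13  SISD                1   2488   6982      6   11      264           144.4
      Klessydra T13  SIMD                2   2627   8400      6   15      264           146.0
      Klessydra T13  SIMD                4   3301  11366      6   23      264           137.2
      Klessydra T13  SIMD                8   4800  17331     12   39      264           137.7
      Klessydra T13  Sym. MIMD           1   3512  10458     18   19      264           148.2
      Klessydra T13  Sym. MIMD + SIMD    2   4712  15943     18   31      264           131.7
      Klessydra T13  Sym. MIMD + SIMD    4   6753  25089     18   55      264           120.0
      Klessydra T13  Sym. MIMD + SIMD    8  10854  43419     36  103      264           105.1
      Klessydra T13  Het. MIMD           1   3012  10182     18   11      264           117.2
      Klessydra T13  Het. MIMD + SIMD    2   3871  15577     18   15      264           128.9
      Klessydra T13  Het. MIMD + SIMD    4   5015  23282     18   23      264           122.0
      Klessydra T13  Het. MIMD + SIMD    8   7325  42944     36   39      264           108.6
      Klessydra T03  -                   -   1418   4281      0    7      176           221.1
      RI5CY          -                   -   2527   7674      0    6        0            91.4
      ZeroRiscy      -                   -   1933   5275      0    1        0           117.2

      AVERAGE CYCLE COUNT PER COMPUTATION KERNEL (homogeneous workload)

      Core           Configuration     DLP  Conv 4x4  Conv 8x8  Conv 16x16  Conv 32x32  FFT 256  MatMul 64x64
      Klessydra T13  SISD                1      1105      3060        9727       34201    33033        728187
      Klessydra T13  SIMD                2       895      2245        6261       20374    25647        602458
      Klessydra T13  SIMD                4       824      1768        4607       13444    22812        543164
      Klessydra T13  SIMD                8       824      1613        3692       10069    21555        484436
      Klessydra T13  Sym. MIMD           1       626      1493        3887       13536    18726        462066
      Klessydra T13  Sym. MIMD + SIMD    2       629      1190        3123        8681    16827        378748
      Klessydra T13  Sym. MIMD + SIMD    4       560      1190        2543        7148    15993        328962
      Klessydra T13  Sym. MIMD + SIMD    8       560      1152        2543        6006    15726        316270
      Klessydra T13  Het. MIMD           1       663      1521        4153       13565    22839        556463
      Klessydra T13  Het. MIMD + SIMD    2       638      1274        3280        9167    18468        425978
      Klessydra T13  Het. MIMD + SIMD    4       573      1213        2688        7473    16887        360863
      Klessydra T13  Het. MIMD + SIMD    8       573      1079        2580        6285    17604        328178
      Klessydra T03  -                   -      1819      5737       20714       79230    47256       2679304
      RI5CY          -                   -      1377      4247       15088       57020    37344       1360854
      ZeroRiscy      -                   -      2510      8111       29583      113793    61158       4006241

      AVERAGE CYCLE COUNT PER COMPUTATION KERNEL (composite workload)

      Core           Configuration     DLP  Conv 32x32  FFT 256  MatMul 64x64
      Klessydra T13  SISD                1       66043    80874        476771
      Klessydra T13  SIMD                2       21976    60019        645705
      Klessydra T13  SIMD                4       16850    29144        431773
      Klessydra T13  SIMD                8       11324    22482        414420
      Klessydra T13  Sym. MIMD           1       20953    17824        292564
      Klessydra T13  Sym. MIMD + SIMD    2       16144    15839        222370
      Klessydra T13  Sym. MIMD + SIMD    4       15868    14942        182580
      Klessydra T13  Sym. MIMD + SIMD    8       15581    14613        168031
      Klessydra T13  Het. MIMD           1       27155    37111        265567
      Klessydra T13  Het. MIMD + SIMD    2       15973    24611        251201
      Klessydra T13  Het. MIMD + SIMD    4       16042    19175        181290
      Klessydra T13  Het. MIMD + SIMD    8       13921    17298        187877
      Klessydra T03  -                   -      138959    46733       2775779
      RI5CY          -                   -       81534    37350       1369572
      ZeroRiscy      -                   -      197010    61163       4043376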
  • 12. SUMMARY OF PERFORMANCE RESULTS: MICROARCHITECTURE SYNTHESIS RESULTS
      • The clock speed exhibited the sharpest drops as the DLP grew larger.
      • In the symmetric MIMD scheme, the large HW overhead forced FPGA slices on the same critical path to be placed far from each other, thus increasing interconnect delay.
      • Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.
  • 13. SUMMARY OF PERFORMANCE RESULTS: AVERAGE CYCLE COUNT PER COMPUTATION KERNEL
      • Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
      • Large matrix convolutions and MatMul benefit from vector acceleration, reaching 9X cycle count reduction relative to RI5CY.
  • 14. ABSOLUTE EXECUTION TIME SPEED-UP
      • Assuming maximum clock frequency for each core
      • Zeroriscy core taken as common reference
      • In pure SIMD configurations, the speed-up grows linearly with the DLP
      • Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
      • The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over Zeroriscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
      • Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
      • The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy
      [Figure: bar chart of absolute execution time speed-up (axis 0 to 18) for SISD, pure SIMD, Sym. MIMD(+SIMD) and Het. MIMD(+SIMD) at DLP 1, 2, 4, 8, plus Klessydra T03 (no accel.), RI5CY (DSP extension) and ZeroRiscy (no accel.); series: Conv.2D 4x4, 8x8, 16x16, 32x32, FFT 256, MatMul 64x64, Composite]
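The absolute-time comparison combines cycle count and maximum clock frequency. As a sketch of that arithmetic (the function names are illustrative; the inputs are the Conv 32x32 homogeneous-workload figures and Max freq values from the results table of slide 11):

```c
#include <assert.h>

/* Execution time in microseconds: cycles / (f in MHz). */
double exec_time_us(double cycles, double fmax_mhz)
{
    assert(fmax_mhz > 0.0);
    return cycles / fmax_mhz;
}

/* Speed-up of a core over a reference core, each at its own maximum clock. */
double time_speedup(double ref_cycles, double ref_mhz,
                    double cycles, double mhz)
{
    return exec_time_us(ref_cycles, ref_mhz) / exec_time_us(cycles, mhz);
}
```

Calling `time_speedup(113793, 117.2, 6006, 105.1)` compares ZeroRiscy (113793 cycles at 117.2 MHz, about 971 us) with the symmetric MIMD+SIMD, DLP 8 configuration (6006 cycles at 105.1 MHz, about 57 us) and gives roughly 17, which appears consistent with the "up to 17X" figure quoted for Convolution 32x32. The pure cycle-count ratio is about 19, so the lower clock frequency of the MIMD hardware costs part of the gain.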
  • 15. ENERGY EFFICIENCY
      • The result of this analysis is expressed as energy per algorithmic operation for the FPGA soft-core implementations, normalized to Zeroriscy, taken as reference.
      • The most energy-efficient designs proved to be the T13 symmetric MIMD configurations
      • The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD
      • The pure SIMD schemes resulted in a larger energy consumption than other schemes, due to the impossibility of efficiently exploiting TLP.
      [Figure: bar chart of energy per algorithmic operation normalized to Zeroriscy (axis 0 to 1.3) for SISD, pure SIMD, Sym. MIMD(+SIMD) and Het. MIMD(+SIMD) at DLP 1, 2, 4, 8, plus Klessydra T03 (no accel.), RI5CY (DSP extension) and ZeroRiscy (no accel.); series: Conv.2D 4x4, 8x8, 16x16, 32x32, FFT 256, MatMul 64x64, Composite]
  • 16. LARGER CONVOLUTION FILTERS
      (Cycles = cycle count x1000; T = execution time in us; E = energy in uJ)

                             Filter 5x5              Filter 7x7              Filter 9x9              Filter 11x11
      Core             DLP   Cycles  T (us)  E (uJ)  Cycles  T (us)  E (uJ)  Cycles  T (us)  E (uJ)  Cycles  T (us)  E (uJ)
      T13 SIMD           2     52.7     362    50.6   101.2     694    97.1   165.8    1136   159.1   246.5    1689   236.6
      T13 SIMD           8     24.6     179    34.4    46.1     335    64.5    74.7     543   104.7   110.6     803   154.8
      T13 Sym MIMD       2     19.5     148    26.9    35.8     272    49.4    57.4     436    79.2    84.4     641   116.5
      T13 Sym MIMD       8     11.8     113    28.9    19.2     183    46.9    29.8     284    72.7    42.9     408   104.7
      T13 Het MIMD       2     20.5     159    28.3    37.5     291    51.8    60.2     467    83.1    88.5     687   122.1
      T03 (no accel.)    -      247    1120   215.5   514.8    2328   447.9   881.2    3985   766.6  1369.1    6191  1191.1
      RI5CY              -      180    1971   252.0   385.3    4218   539.4   662.5    7252   927.5  1000.2   10949  1400.3
      ZeroRiscy          -    318.9    2721   226.4   674.5    5754   478.9  1129.7    9637   802.1  1697.8   14482  1205.4

      • The matrix being convolved is 32x32 elements
      • The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching 35X speed-up over the Zeroriscy reference
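The 35X figure and the matching energy reduction can be recomputed from the table's own T and E columns. The struct and function names below are hypothetical helpers for that arithmetic; the sample values are the 11x11 filter entries for ZeroRiscy and for T13 Sym MIMD, DLP 8.

```c
#include <assert.h>

/* One table cell group: cycle count (x1000), time (us), energy (uJ). */
typedef struct {
    double kcycles;
    double time_us;
    double energy_uj;
} conv_result_t;

/* Execution-time speed-up of x over the reference core. */
double speedup_vs(conv_result_t ref, conv_result_t x)
{
    assert(x.time_us > 0.0);
    return ref.time_us / x.time_us;
}

/* Energy saved by x relative to the reference core, in percent. */
double energy_saving_pct(conv_result_t ref, conv_result_t x)
{
    assert(ref.energy_uj > 0.0);
    return 100.0 * (1.0 - x.energy_uj / ref.energy_uj);
}
```

With `ref = {1697.8, 14482, 1205.4}` and `x = {42.9, 408, 104.7}`, `speedup_vs` gives about 35.5 and `energy_saving_pct` about 91, so for the largest filter the symmetric MIMD configuration is both the fastest entry in the table and roughly an order of magnitude cheaper in energy than ZeroRiscy.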
  • 17. CONCLUSIONS
      • The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP
          • >15X absolute time speed-up, -85% energy per operation.
      • Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core
          • 2X-3X speed-up.
      • Fully symmetric MIMD and heterogeneous MIMD give very similar results
          • functional unit contention has less impact than SPM contention.
          • coprocessor contention can be effectively mitigated by functional unit heterogeneity
      • Pure DLP acceleration always gives worse results than a balanced TLP/DLP acceleration.
          • The IMT microarchitecture benefits from TLP and DLP acceleration in a single core.
      • In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution
          • Simplified hardware structure philosophy
  • 18. Thank you for joining. Contribute to the RISC-V conversation on social! #RISCVSUMMIT #KLESSYDRA @mauro_olivieri_ https://github.com/klessydra Mauro.Olivieri@uniroma1.it