LEGaTO Integration
The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780681.
LEGaTO thematic session – HiPEAC CSW
Autumn 2020
Xavier Martorell
Outline
• OmpSs integration with XiTAO
• OmpSs support for CUDA and OpenCL environments
• OmpSs with support for Xilinx FPGAs (integrated and discrete)
• OmpSs integration with DFiant
• OmpSs integration with Maxeler
• OmpSs integration with SGX
• Linter tool
• Eclipse plugins
• Conclusions and Future Work
OmpSs integration with XiTAO
• Targeting SMP and big.LITTLE environments
• Nanos6 and XiTAO runtimes coexist, each executing its own tasks
• Use taskset to separate core resources between the two models
[Figure: OmpSs@XiTAO flow. The programmer splits the application into Nanos6 tasks and XiTAO tasks; GCC builds a single OmpSs@XiTAO.elf that runs on SMP, with the main app on top of the coexisting Nanos6 and XiTAO runtimes.]
OmpSs integration with XiTAO
• Source code including OmpSs and XiTAO tasks

OmpSs task:
#pragma oss task for shared(A_omp, C_omp, B, openmp_work_size, N)
for(size_t i = 0; i < openmp_work_size; ++i) {
for(size_t k = 0; k < N; ++k) {
for(size_t j = 0; j < N; ++j) {
C_omp[i * N + j] += A_omp[i * N + k] * B[k * N + j];
}
}
}
XiTAO task:
/*! This TAO takes two matrices and multiplies them.
    It implements internal dynamic scheduling. */
class MatVecTAO : public AssemblyTask
{
public:
//! Inherited pure virtual function that is called by the runtime upon executing the TAO.
void execute(int threadid)
{
// int tid = threadid - leader;
size_t li = i++;
while(li < nrows){
for (size_t j = 0; j < N; ++j) {
for(size_t k = 0; k < N; ++k) {
C[li*N + j] += A[li*N + k] * B[k*N + j];
}
}
li = i++;
}
  }
  // Member fields (A, B, C, N, nrows, atomic counter i) omitted on the slide
};
OmpSs support for OpenCL and CUDA kernels
• AMD GPUs
• Intel/Altera FPGAs
• Using existing kernels
• "Implements" allows executing kernels on the 3 architectures

OmpSs@OpenCL running on AMD GPUs and FPGA Terasic DE5net_a7 / Attila / Stratix 10 boards
[Figure: compilation flow. Mercurium splits the OmpSs application into an OmpSs phase (host code with Nanos calls, compiled by GCC with Extrae instrumentation into object files), an OpenCL phase (.cl files passed to the OpenCL compiler) and a CUDA phase (.cu files passed to the Nvidia CUDA compiler); the resulting OmpSs.elf plus accelerator code runs on SMP and accelerators.]
"Implements" approach
• Allow the runtime system to execute the same functionality on diverse resources at the same time
  o CPU: optimized implementation (MKL, OpenBLAS, …)
  o GPU: optimized kernel / cuBLAS
  o FPGA: synthesized kernel (OpenCL, HLS, [Maxeler, DFiant], …)

SMP version:
#pragma omp target device(smp) copy_deps
#pragma omp task in([bsize*bsize]A, [bsize*bsize]B) inout([bsize*bsize]C)
void matrixMult(REAL *C, REAL *A, REAL * B, int wa, int bsize)
{
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, bsize, bsize, bsize,
1.0f, A, bsize, B, bsize, 1.0f, C, bsize);
}
"Implements" approach
• Adding versions for the same task
  o OpenCL (for FPGA)
  o CUDA (for GPGPU)
#pragma omp target device(opencl) ndrange(2,NB,NB,BL_SIZE,BL_SIZE) copy_deps implements(matrixMult)
#pragma omp task inout([NB*NB]C) in([NB*NB]A,[NB*NB]B)
__kernel void matrixMult_opencl(__global REAL* C,__global REAL* A, __global REAL* B,int wA, int wB);
#pragma omp target device(cuda) ndrange(2,NB,NB,BL_SIZE,BL_SIZE) copy_deps implements(matrixMult)
#pragma omp task inout([NB*NB]C) in([NB*NB]A,[NB*NB]B)
__global__ void matrixMult_cuda(REAL* C, REAL* A, REAL * B, int wA, int wB);
"Implements" approach
• Matrix multiplication (blocked), using versions
void matmul( int m, int l, int n, int mDIM, int lDIM, int nDIM, REAL **tileA, REAL **tileB, REAL **tileC )
{
int i, j, k;
for(i = 0;i < mDIM; i++){
for (j = 0; j < nDIM; j++){
for (k = 0; k < lDIM; k++){
//Kernel call
matrixMult(tileC[i*nDIM+j], tileA[i*lDIM+k], tileB[k*nDIM+j],NB,NB);
}
}
}
#pragma omp taskwait
}
(Each matrixMult call is a task the runtime may run on SMP, FPGA, or GPGPU.)
"Implements" approach
• OmpSs with OpenCL on FPGA and GPU
• Platforms:
  o Intel 4-core i7-7700 @ 3.6 GHz with 2 hyperthreads/core
  o Nvidia GeForce GTX TITAN X
  o Intel Arria 10 (de5net_a7)
[Figure: bar chart "Performance of Matrix Multiplication (2048x2048)", showing achieved Gflop/s against maximum Gflop/s for each platform; vertical axis 0 to 400 Gflop/s.]
OmpSs support for HLS kernels
• Targeting Xilinx FPGAs
• List of supported FPGAs:
  o Zynq-7000, 32 bits (Xilinx ZC702, ZC706, Digilent Zedboard, Zybo)
  o Zynq UltraScale+, 64 bits (AXIOM board, Trenz board, Xilinx ZCU102)
  o Ported to the COM Express board from LEGaTO
  o Alpha Data (discrete)
  o Xilinx Alveo U200 (discrete)
• Similar implementation for the Maxeler target (discrete)
• OmpSs@FPGA compilation environment: improvement of autoVivado
OmpSs@FPGA: "Implements" approach
• Single source parallel programming
• FPGA and cores used at the same time
#pragma omp target device(fpga) implements(matrix_multiply) num_instances(3)
#pragma omp task in(a,b) inout(c)
void matrix_multiply_fpga(float a[BS][BS], float b[BS][BS],
float c[BS][BS]);
#pragma omp target device(smp) copy_deps
#pragma omp task depend(in:a,b) depend (inout:c)
void matrix_multiply(float a[BS][BS], float b[BS][BS],
float c[BS][BS]);
• implements() indicates equivalent functions in SMP and FPGA
• num_instances() generates the indicated number of IP accelerators
Evaluation
• Environment: Xilinx ZCU102
  o 4 Arm Cortex-A53 cores
  o 1 UltraScale+ FPGA
• Increasing data bus width: 32 to 128 bits
• "Implements" allows cores to contribute to performance
• 130-150 Gflop/s sustained, depending on data bus width
…
for (i=0; i<NB; i++)
for (j=0; j<NB; j++)
for (k=0; k<NB; k++)
matrix_multiply(A[i][k], B[k][j], C[i][j]);
…
Running Maxeler kernels
• Nanos runtime has been integrated with Maxeler using the SLiC interface
• Maxeler kernels compiled offline and invoked at runtime [similar to CUDA/OpenCL]
[Figure: OmpSs@Maxeler flow. The programmer splits the code into source with calls to the Nanos++ runtime and a rewritten MaxJ version; GCC links OmpSs.elf against Nanos and libmaxeleros, while the MaxJ compiler produces the bitstream and the SLiC object file; execution runs on SMP plus Maxeler, on Linux with the MaxelerOS driver.]
Integration with SGX tasks
• Nanos can invoke secure tasks through SGX
[Figure: OmpSs@SGX flow. The programmer adapts tasks to SGX; Mercurium and GCC build OmpSs.elf with Nanos enclave support, while the enclave kernels are compiled for SGX; execution runs on SMP, on Linux with the iSGX driver.]
OmpSs Linter
Improving debugging of OmpSs applications
• Profile support for OmpSs apps
• Collecting application memory accesses
• Based on the Pin tool
• Goals
  o Identify issues with data annotations
  o Generate a report with the problems found
[Figure: Linter architecture. The OmpSs-2 program runs under the Pin VM; a trace generator records memory-access events, task primitives (through the instrument API) and task state transitions from Nanos; a trace processor combines the trace with debug information to produce the report.]
IDE Plugin Support in Eclipse
• Support for OpenMP and OmpSs development in Eclipse
• Plugins developed
  o Support for most of the programming models' directives and clauses
  o Including small help descriptions
  o Context-based, with autocompletion
• Integration of the compilation environment
• Eclipse Che
Conclusions and Future Work
• LEGaTO integrations
• Tools to support GPUs, and HLS and OpenCL FPGAs
  o For task offloading and application acceleration
• Task scheduling with OmpSs and XiTAO
• Prototypes supporting DFiant, Maxeler, and secure tasks
• Linter support for detecting data races
• Eclipse plug-in for programming assistance
• Now preparing the code distribution with all these contributions
• Further experimentation with the LEGaTO benchmarks