LEGaTO Integration
The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780681.
LEGaTO thematic session – HiPEAC CSW
Autumn 2020
Xavier Martorell
Outline
• OmpSs integration with XiTAO
• OmpSs support for CUDA and OpenCL environments
• OmpSs with support for Xilinx FPGAs (integrated and discrete)
• OmpSs integration with DFiant
• OmpSs integration with Maxeler
• OmpSs integration with SGX
• Linter tool
• Eclipse plugins
• Conclusions and Future Work
OmpSs integration with XiTAO
• Targeting SMP and big.LITTLE environments
• Nanos6 and XiTAO runtimes coexist, each executing its own tasks
• Use taskset to separate core resources between the two models
[Figure: OmpSs@XiTAO flow. The programmer splits the application into Nanos6 tasks and XiTAO tasks; GCC builds a single OmpSs@XiTAO.elf that runs on SMP, with the main app on top of the coexisting Nanos6 and XiTAO runtimes.]
OmpSs integration with XiTAO
• Source code including OmpSs and XiTAO tasks

OmpSs task:
#pragma oss task for shared(A_omp, C_omp, B, openmp_work_size, N)
for(size_t i = 0; i < openmp_work_size; ++i) {
for(size_t k = 0; k < N; ++k) {
for(size_t j = 0; j < N; ++j) {
C_omp[i * N + j] += A_omp[i * N + k] * B[k * N + j];
}
}
}
XiTAO task:
/*! This TAO takes two matrices and multiplies them.
    It implements internal dynamic scheduling. */
class MatVecTAO : public AssemblyTask
{
public:
//! Inherited pure virtual function that is called by the runtime upon executing the TAO.
void execute(int threadid)
{
// int tid = threadid - leader;
size_t li = i++;
while(li < nrows){
for (size_t j = 0; j < N; ++j) {
for(size_t k = 0; k < N; ++k) {
C[li*N + j] += A[li*N + k] * B[k*N + j];
}
}
li = i++;
}
  }
  // Member fields (A, B, C, N, nrows, atomic counter i) omitted on the slide
};
OmpSs support for OpenCL and CUDA kernels
• AMD GPUs
• Intel/Altera FPGAs
• Using existing kernels
• "Implements" allows executing kernels on the 3 architectures

OmpSs@OpenCL running on AMD GPUs and FPGA Terasic DE5net_a7 / Attila / Stratix 10 boards
[Figure: compilation flow. Mercurium splits the OmpSs application into an OmpSs phase (host code with Nanos calls, compiled by GCC with Extrae instrumentation into object files), an OpenCL phase (.cl files passed to the OpenCL compiler) and a CUDA phase (.cu files passed to the Nvidia CUDA compiler); the resulting OmpSs.elf plus accelerator code runs on SMP and accelerators.]
"Implements" approach
• Allow the runtime system to execute the same functionality on diverse resources at the same time
  o CPU: optimized implementation (MKL, OpenBLAS, …)
  o GPU: optimized kernel / cuBLAS
  o FPGA: synthesized kernel (OpenCL, HLS, [Maxeler, DFiant], …)

SMP version:
#pragma omp target device(smp) copy_deps
#pragma omp task in([bsize*bsize]A, [bsize*bsize]B) inout([bsize*bsize]C)
void matrixMult(REAL *C, REAL *A, REAL * B, int wa, int bsize)
{
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, bsize, bsize, bsize,
1.0f, A, bsize, B, bsize, 1.0f, C, bsize);
}
"Implements" approach
• Adding versions for the same task
  o OpenCL (for FPGA)
  o CUDA (for GPGPU)
#pragma omp target device(opencl) ndrange(2,NB,NB,BL_SIZE,BL_SIZE) copy_deps implements(matrixMult)
#pragma omp task inout([NB*NB]C) in([NB*NB]A,[NB*NB]B)
__kernel void matrixMult_opencl(__global REAL* C,__global REAL* A, __global REAL* B,int wA, int wB);
#pragma omp target device(cuda) ndrange(2,NB,NB,BL_SIZE,BL_SIZE) copy_deps implements(matrixMult)
#pragma omp task inout([NB*NB]C) in([NB*NB]A,[NB*NB]B)
__global__ void matrixMult_cuda(REAL* C, REAL* A, REAL * B, int wA, int wB);
"Implements" approach
• Matrix multiplication (blocked), using versions
void matmul( int m, int l, int n, int mDIM, int lDIM, int nDIM, REAL **tileA, REAL **tileB, REAL **tileC )
{
int i, j, k;
for(i = 0;i < mDIM; i++){
for (j = 0; j < nDIM; j++){
for (k = 0; k < lDIM; k++){
//Kernel call
matrixMult(tileC[i*nDIM+j], tileA[i*lDIM+k], tileB[k*nDIM+j],NB,NB);
}
}
}
#pragma omp taskwait
}
(Each matrixMult call is a task the runtime may run on SMP, FPGA, or GPGPU.)
"Implements" approach
• OmpSs with OpenCL on FPGA and GPU
• Platforms:
  o Intel 4-core i7-7700 @ 3.6 GHz with 2 hyperthreads/core
  o Nvidia GeForce GTX TITAN X
  o Intel Arria 10 (de5net_a7)
[Figure: bar chart "Performance of Matrix Multiplication (2048x2048)", showing achieved Gflop/s against maximum Gflop/s for each platform; vertical axis 0 to 400 Gflop/s.]
OmpSs support for HLS kernels
• Targeting Xilinx FPGAs
• List of supported FPGAs:
  o Zynq-7000, 32 bits (Xilinx ZC702, ZC706, Digilent Zedboard, Zybo)
  o Zynq UltraScale+, 64 bits (AXIOM board, Trenz board, Xilinx ZCU102)
  o Ported to the COM Express board from LEGaTO
  o Alpha Data (discrete)
  o Xilinx Alveo U200 (discrete)
• Similar implementation for the Maxeler target (discrete)
• OmpSs@FPGA compilation environment: improvement of autoVivado
OmpSs@FPGA: "Implements" approach
• Single source parallel programming
• FPGA and cores used at the same time
#pragma omp target device(fpga) implements(matrix_multiply) num_instances(3)
#pragma omp task in(a,b) inout(c)
void matrix_multiply_fpga(float a[BS][BS], float b[BS][BS],
float c[BS][BS]);
#pragma omp target device(smp) copy_deps
#pragma omp task depend(in:a,b) depend (inout:c)
void matrix_multiply(float a[BS][BS], float b[BS][BS],
float c[BS][BS]);
• implements() indicates equivalent functions in SMP and FPGA
• num_instances() generates the indicated number of IP accelerators
Evaluation
• Environment: Xilinx ZCU102
  o 4 Arm Cortex-A53 cores
  o 1 UltraScale+ FPGA
• Increasing data bus width: 32 to 128 bits
• "Implements" allows cores to contribute to performance
• 130-150 Gflop/s sustained, depending on data bus width
…
for (i=0; i<NB; i++)
for (j=0; j<NB; j++)
for (k=0; k<NB; k++)
matrix_multiply(A[i][k], B[k][j], C[i][j]);
…
Running Maxeler kernels
• Nanos runtime has been integrated with Maxeler using the SLiC interface
• Maxeler kernels compiled offline and invoked at runtime [similar to CUDA/OpenCL]
[Figure: OmpSs@Maxeler flow. The programmer splits the code into source with calls to the Nanos++ runtime and a rewritten MaxJ version; GCC links OmpSs.elf against Nanos and libmaxeleros, while the MaxJ compiler produces the bitstream and the SLiC object file; execution runs on SMP plus Maxeler, on Linux with the MaxelerOS driver.]
Integration with SGX tasks
• Nanos can invoke secure tasks through SGX
[Figure: OmpSs@SGX flow. The programmer adapts tasks to SGX; Mercurium and GCC build OmpSs.elf with Nanos enclave support, while the enclave kernels are compiled for SGX; execution runs on SMP, on Linux with the iSGX driver.]
OmpSs Linter
Improving debugging of OmpSs applications
• Profile support for OmpSs apps
• Collecting application memory accesses
• Based on the Pin tool
• Goals
  o Identify issues with data annotations
  o Generate a report with the problems found
[Figure: Linter architecture. The OmpSs-2 program runs under the Pin VM; a trace generator records memory-access events, task primitives (through the instrument API) and task state transitions from Nanos; a trace processor combines the trace with debug information to produce the report.]
IDE Plugin Support in Eclipse
• Support for OpenMP and OmpSs development in Eclipse
• Plugins developed
  o Support for most of the programming models' directives and clauses
  o Including small help descriptions
  o Context-based, with autocompletion
• Integration of the compilation environment
• Eclipse Che
Conclusions and Future Work
• LEGaTO integrations
• Tools to support GPUs, and HLS and OpenCL FPGAs
  o For task offloading and application acceleration
• Task scheduling with OmpSs and XiTAO
• Prototypes supporting DFiant, Maxeler, and secure tasks
• Linter support for detecting data races
• Eclipse plug-in for programming assistance
• Now preparing the code distribution with all these contributions
• Further experimentation with the LEGaTO benchmarks