SlideShare uma empresa Scribd logo
1 de 24
Baixar para ler offline
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Exploring Programming Models for LUMI Supercomputer
ECMWF HPC Workshop on Meteorology, September 2021
George S. Markomanolis
Lead HPC Scientist, CSC – IT Center for Science Ltd. 1
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
AMD GPUs (MI100 example)
2
LUMI will have the next
generation of AMD
Instinct GPU
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Introduction to HIP
• Radeon Open Compute Platform (ROCm)
• HIP: Heterogeneous Interface for Portability is developed by AMD to
program on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA platforms
• HIP is similar to CUDA and there is no performance overhead on NVIDIA
GPUs
• Many well-known libraries have been ported on HIP
• New projects or porting from CUDA, could be developed directly in HIP
• The supported CUDAAPI is called with HIP prefix (cudamalloc ->
hipmalloc)
https://github.com/ROCm-Developer-Tools/HIP
3
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
ECMWF Atlas
• Atlas is an ECMWF library for parallel data-structures supporting unstructured grids and function spaces
cd src
hipconvertinplace-perl.sh .
…
info: TOTAL-converted 92 CUDA->HIP refs ( error:14 init:0 version:0 device:12 context:0 module:0
memory:17 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0
execution:0 graph:0
…
kernels (5 total) : unpack_kernel(1) kernel_multiblock(1) kernel_ex(1) loop_kernel_ex(1) kernel_exe(1)
hipSuccess 15
hipDeviceSynchronize 12
hipLaunchKernelGGL 10
hipGetLastError 8
…
hipMemcpyDeviceToHost 1
• The code is hipified and now it is only C++ with HIP and Fortran with OpenACC.
4
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Benchmark MatMul cuBLAS, hipBLAS
• Use the benchmark https://github.com/pc2/OMP-Offloading
• Matrix multiplication of 2048 x 2048, single precision
• All the CUDA calls were converted and it was linked with hipBlas
5
0
5000
10000
15000
20000
25000
V100 MI100
GFLOP/s
GPU
Matrix Multiplication (SP)
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
BabelStream
6
• A memory bound benchmark from the university of Bristol
• Five kernels
o add (a[i]=b[i]+c[i])
o multiply (a[i]=b*c[i])
o copy (a[i]=b[i])
o triad (a[i]=b[i]+d*c[i])
o dot (sum = sum+d*c[i])
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Improving OpenMP performance on BabelStream for MI100
• Original call:
#pragma omp target teams distribute parallel for simd
• Optimized call
#pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240)
• For the dot kernel we used 720 teams
7
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Fortran
• First Scenario: Fortran + CUDA C/C++
oAssuming there is no CUDA code in the Fortran files.
oHipify CUDA
oCompile and link with hipcc
• Second Scenario: CUDA Fortran
oThere is no hipify equivalent but there is another approach…
oHIP functions are callable from C, using `extern C`
oSee hipfort
8
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Hipfort
• We need to use hipfort, a Fortran interface library for GPU kernel *
• Steps:
1) We write the kernels in a new C++ file
2) Wrap the kernel launch in a C function
3) Use Fortran 2003 C binding to call the function
4) Things could change in the future (see GPUFORT)
• Example of Fortran with HIP:
https://github.com/cschpc/lumi/tree/main/hipfort
• Use OpenMP offload to GPUs
* https://github.com/ROCmSoftwarePlatform/hipfort
9
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc 10
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
GPUFort – Fortran with OpenACC (1/2)
11
Ifdef original file
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
GPUFort – Fortran with OpenACC (2/2)
12
Extern C routine Kernel
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
OpenACC
• GCC will provide OpenACC (Mentor Graphics contract, now called
Siemens EDA). Checking functionality
• HPE is supporting for OpenACC v2.0 for Fortran. This is quite old
OpenACC version. HPE announced support for OpenACC, newer
versions for all the main programming languages (Fortran/C/C++)
• Clacc from ORNL: https://github.com/llvm-doe-org/llvm-
project/tree/clacc/master OpenACC from LLVM only for C (Fortran and
C++ in the future)
oTranslate OpenACC to OpenMP Offloading
• If the code is in Fortran, we could use GPUFort
13
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Programming Models
• We have utilized with success at least the following programming
models/interfaces on AMD MI100 GPU:
oHIP
oOpenMP Offloading
ohipSYCL
oKokkos
oAlpaka
14
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
SYCL (hipSYCL)
• C++ Single-source Heterogeneous Programming for Acceleration Offload
• Generic programming with templates and lambda functions
• Big momentum currently, NERSC, ALCF, Codeplay partnership
• SYCL 2020 specification was announced early 2021
• Terminology: Unified Shared Memory (USM), buffer, accessor, data
movement, queue
• hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental)
15
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Kokkos
• Kokkos Core implements a programming model in C++ for writing
performance portable applications targeting all major HPC platforms.
It provides abstractions for both parallel execution of code and data
management. (ECP/NNSA)
• Terminology: view, execution space (serial, threads, OpenMP,
GPU,…), memory space (DRAM, NVRAM, …), pattern, policy
• Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc.
16
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Alpaka
• Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is
a header-only C++14 abstraction library for accelerator development.
Developed by HZDR.
• Similar to CUDA terminology, grid/block/thread plus element
• Platform decided at the compile time, single source interface
• Easy to port CUDA codes through CUPLA
• Terminology: queue (non/blocking), buffers, work division
• Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc.
17
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
BabelStream Results
18
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Tuning
• Multiple wavefronts per compute unit (CU) is important to hide
latency and instruction throughput
• Tune number of threads per block, number of teams for OpenMP
offloading etc.
• Memory coalescing increases bandwidth
• Unrolling loops allow compiler to prefetch data
• Small kernels can cause latency overhead, adjust the workload
• Use of Local Data Share (LDS) memory
19
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Conclusion/Future work
• A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP
offloading compared to other approaches.
• The hipSYCL, Kokos, and Alpaka could be a good option considering that the code
is in C++.
• There can be challenges, depending on the code and what GPU functionalities are
integrated to an application
• It will be required to tune the code for high occupancy
• Track historical performance among new compilers
• GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved
with GCC 12.x and LLVM 13.x)
• Tracking how profiling tools work on AMD GPUs (rocprof, TAU, HPCToolkit)
• We have trained more than 80 people on HIP porting: http://github.com/csc-
training/hip
20
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
www.lumi-supercomputer.eu
contact@lumi-supercomputer.eu
Follow us
Twitter: @LUMIhpc
LinkedIn: LUMI supercomputer
YouTube: LUMI supercomputer
georgios.markomanolis@csc.fi
CSC – IT Center for Science Ltd.
Lead HPC Scientist
George Markomanolis
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Porting Codes to LUMI (I)
22
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
Porting Codes to LUMI (II)
23
www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc
N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• 32768 number of small particles, 2000 time steps
• Tune the number of threads equal to 256 than 1024 default at ROCm 4.1
24
0
20
40
60
80
100
120
V100 MI100 MI100*
Seconds
GPU
N-Body Simulation

Mais conteúdo relacionado

Mais procurados

P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadOpen-NFP
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVELinaro
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorLarry Lang
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 
Involvement in OpenHPC
Involvement in OpenHPC	Involvement in OpenHPC
Involvement in OpenHPC Linaro
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Cheng-Chun William Tu
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...inside-BigData.com
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_mapslcplcp1
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVELinaro
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Netronome
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesKernel TLV
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem	Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem Linaro
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFoholiab
 

Mais procurados (20)

Introduction to GPUs in HPC
Introduction to GPUs in HPCIntroduction to GPUs in HPC
Introduction to GPUs in HPC
 
P4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC OffloadP4, EPBF, and Linux TC Offload
P4, EPBF, and Linux TC Offload
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVE
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO Visor
 
Hands on OpenCL
Hands on OpenCLHands on OpenCL
Hands on OpenCL
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
Involvement in OpenHPC
Involvement in OpenHPC	Involvement in OpenHPC
Involvement in OpenHPC
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
 
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator S...
 
Xdp and ebpf_maps
Xdp and ebpf_mapsXdp and ebpf_maps
Xdp and ebpf_maps
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Linux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use CasesLinux Kernel Cryptographic API and Use Cases
Linux Kernel Cryptographic API and Use Cases
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem	Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
 
ARM and Machine Learning
ARM and Machine LearningARM and Machine Learning
ARM and Machine Learning
 

Semelhante a Exploring the Programming Models for the LUMI Supercomputer

LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...Rogue Wave Software
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureAntoine Poliakov
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfssuser866937
 
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...ryancox
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADDesign World
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...CEE-SEC(R)
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18Yaoqing Gao
 
Porting To Symbian
Porting To SymbianPorting To Symbian
Porting To SymbianMark Wilcox
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLinaro
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Embarcados
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++JetBrains
 

Semelhante a Exploring the Programming Models for the LUMI Supercomputer (20)

AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
Feedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows AzureFeedback on Big Compute & HPC on Windows Azure
Feedback on Big Compute & HPC on Windows Azure
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdfOpenMP-OpenACC-Offload-Cauldron2022-1.pdf
OpenMP-OpenACC-Offload-Cauldron2022-1.pdf
 
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
OpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CADOpenCL & the Future of Desktop High Performance Computing in CAD
OpenCL & the Future of Desktop High Performance Computing in CAD
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
Массовый параллелизм для гетерогенных вычислений на C++ для беспилотных автом...
 
IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18IBM XL Compilers Performance Tuning 2016-11-18
IBM XL Compilers Performance Tuning 2016-11-18
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
Porting To Symbian
Porting To SymbianPorting To Symbian
Porting To Symbian
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
High-Performance Computing with C++
High-Performance Computing with C++High-Performance Computing with C++
High-Performance Computing with C++
 

Mais de George Markomanolis

Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

Mais de George Markomanolis (13)

Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Último

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Último (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Exploring the Programming Models for the LUMI Supercomputer

  • 1. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Exploring Programming Models for LUMI Supercomputer ECMWF HPC Workshop on Meteorology, September 2021 George S. Markomanolis Lead HPC Scientist, CSC – IT Center for Science Ltd. 1
  • 2. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc AMD GPUs (MI100 example) 2 LUMI will have the next generation of AMD Instinct GPU
  • 3. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Introduction to HIP • Radeon Open Compute Platform (ROCm) • HIP: Heterogeneous Interface for Portability is developed by AMD to program on AMD GPUs • It is a C++ runtime API and it supports both AMD and NVIDIA platforms • HIP is similar to CUDA and there is no performance overhead on NVIDIA GPUs • Many well-known libraries have been ported on HIP • New projects or porting from CUDA, could be developed directly in HIP • The supported CUDAAPI is called with HIP prefix (cudamalloc -> hipmalloc) https://github.com/ROCm-Developer-Tools/HIP 3
  • 4. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc ECMWF Atlas • Atlas is an ECMWF library for parallel data-structures supporting unstructured grids and function spaces cd src hipconvertinplace-perl.sh . … info: TOTAL-converted 92 CUDA->HIP refs ( error:14 init:0 version:0 device:12 context:0 module:0 memory:17 virtual_memory:0 addressing:0 stream:0 event:0 external_resource_interop:0 stream_memory:0 execution:0 graph:0 … kernels (5 total) : unpack_kernel(1) kernel_multiblock(1) kernel_ex(1) loop_kernel_ex(1) kernel_exe(1) hipSuccess 15 hipDeviceSynchronize 12 hipLaunchKernelGGL 10 hipGetLastError 8 … hipMemcpyDeviceToHost 1 • The code is hipified and now it is only C++ with HIP and Fortran with OpenACC. 4
  • 5. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Benchmark MatMul cuBLAS, hipBLAS • Use the benchmark https://github.com/pc2/OMP-Offloading • Matrix multiplication of 2048 x 2048, single precision • All the CUDA calls were converted and it was linked with hipBlas 5 0 5000 10000 15000 20000 25000 V100 MI100 GFLOP/s GPU Matrix Multiplication (SP)
  • 6. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc BabelStream 6 • A memory bound benchmark from the university of Bristol • Five kernels o add (a[i]=b[i]+c[i]) o multiply (a[i]=b*c[i]) o copy (a[i]=b[i]) o triad (a[i]=b[i]+d*c[i]) o dot (sum = sum+d*c[i])
  • 7. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Improving OpenMP performance on BabelStream for MI100 • Original call: #pragma omp target teams distribute parallel for simd • Optimized call #pragma omp target teams distribute parallel for simd thread_limit(256) num_teams(240) • For the dot kernel we used 720 teams 7
  • 8. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Fortran • First Scenario: Fortran + CUDA C/C++ oAssuming there is no CUDA code in the Fortran files. oHipify CUDA oCompile and link with hipcc • Second Scenario: CUDA Fortran oThere is no hipify equivalent but there is another approach… oHIP functions are callable from C, using `extern C` oSee hipfort 8
  • 9. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Hipfort • We need to use hipfort, a Fortran interface library for GPU kernel * • Steps: 1) We write the kernels in a new C++ file 2) Wrap the kernel launch in a C function 3) Use Fortran 2003 C binding to call the function 4) Things could change in the future (see GPUFORT) • Example of Fortran with HIP: https://github.com/cschpc/lumi/tree/main/hipfort • Use OpenMP offload to GPUs * https://github.com/ROCmSoftwarePlatform/hipfort 9
  • 11. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc GPUFort – Fortran with OpenACC (1/2) 11 Ifdef original file
  • 12. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc GPUFort – Fortran with OpenACC (2/2) 12 Extern C routine Kernel
  • 13. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc OpenACC • GCC will provide OpenACC (Mentor Graphics contract, now called Siemens EDA). Checking functionality • HPE is supporting for OpenACC v2.0 for Fortran. This is quite old OpenACC version. HPE announced support for OpenACC, newer versions for all the main programming languages (Fortran/C/C++) • Clacc from ORNL: https://github.com/llvm-doe-org/llvm- project/tree/clacc/master OpenACC from LLVM only for C (Fortran and C++ in the future) oTranslate OpenACC to OpenMP Offloading • If the code is in Fortran, we could use GPUFort 13
  • 14. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Programming Models • We have utilized with success at least the following programming models/interfaces on AMD MI100 GPU: oHIP oOpenMP Offloading ohipSYCL oKokkos oAlpaka 14
  • 15. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc SYCL (hipSYCL) • C++ Single-source Heterogeneous Programming for Acceleration Offload • Generic programming with templates and lambda functions • Big momentum currently, NERSC, ALCF, Codeplay partnership • SYCL 2020 specification was announced early 2021 • Terminology: Unified Shared Memory (USM), buffer, accessor, data movement, queue • hipSYCL supports CPU, AMD/NVIDIA GPUs, Intel GPU (experimental) 15
  • 16. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Kokkos • Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. It provides abstractions for both parallel execution of code and data management. (ECP/NNSA) • Terminology: view, execution space (serial, threads, OpenMP, GPU,…), memory space (DRAM, NVRAM, …), pattern, policy • Supports: CPU, AMD/NVIDIA GPUs, Intel KNL etc. 16
  • 17. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Alpaka • Abstraction Library for Parallel Kernel Acceleration (Alpaka) library is a header-only C++14 abstraction library for accelerator development. Developed by HZDR. • Similar to CUDA terminology, grid/block/thread plus element • Platform decided at the compile time, single source interface • Easy to port CUDA codes through CUPLA • Terminology: queue (non/blocking), buffers, work division • Supports: HIP, CUDA, TBB, OpenMP (CPU and GPU) etc. 17
  • 19. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Tuning • Multiple wavefronts per compute unit (CU) is important to hide latency and instruction throughput • Tune number of threads per block, number of teams for OpenMP offloading etc. • Memory coalescing increases bandwidth • Unrolling loops allow compiler to prefetch data • Small kernels can cause latency overhead, adjust the workload • Use of Local Data Share (LDS) memory 19
  • 20. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc Conclusion/Future work • A code written in C/C++ and MPI+OpenMP is a bit easier to be ported to OpenMP offloading compared to other approaches. • The hipSYCL, Kokos, and Alpaka could be a good option considering that the code is in C++. • There can be challenges, depending on the code and what GPU functionalities are integrated to an application • It will be required to tune the code for high occupancy • Track historical performance among new compilers • GCC for OpenACC and OpenMP Offloading for AMD GPUs (issues will be solved with GCC 12.x and LLVM 13.x) • Tracking how profiling tools work on AMD GPUs (rocprof, TAU, HPCToolkit) • We have trained more than 80 people on HIP porting: http://github.com/csc- training/hip 20
  • 21. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc www.lumi-supercomputer.eu contact@lumi-supercomputer.eu Follow us Twitter: @LUMIhpc LinkedIn: LUMI supercomputer YouTube: LUMI supercomputer georgios.markomanolis@csc.fi CSC – IT Center for Science Ltd. Lead HPC Scientist George Markomanolis
  • 24. www.lumi-supercomputer.eu #lumisupercomputer #lumieurohpc N-BODY SIMULATION • N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA) AllPairs_N2 • 171 CUDA calls converted to HIP without issues, close to 1000 lines of code • 32768 number of small particles, 2000 time steps • Tune the number of threads equal to 256 than 1024 default at ROCm 4.1 24 0 20 40 60 80 100 120 V100 MI100 MI100* Seconds GPU N-Body Simulation