Getting started with AMD GPUs
FOSDEM’21 HPC, Big Data and Data Science devroom
February 7th, 2021
George S. Markomanolis
Lead HPC Scientist, CSC – IT For Science Ltd.
Outline
• Motivation
• LUMI
• ROCm
• Introduction and porting codes to HIP
• Benchmarking
• Fortran and HIP
• Tuning
2
Disclaimer
• The AMD ecosystem is under heavy development; many things can change without notice
• All the experiments took place on an NVIDIA V100 GPU (Puhti cluster at CSC)
• Trying to use the latest versions of ROCm
• Some results are very fresh and we are still investigating the outcome
3
LUMI
4
Motivation/Challenges
• LUMI will have AMD GPUs
• Need to learn how to program and port codes to the AMD ecosystem
• Provide training to LUMI users
• Investigate possible problems ahead of time
• No access to AMD GPUs yet
5
AMD GPUs (MI100 example)
6
AMD MI100 (note: LUMI will have a different GPU)
Differences between HIP and CUDA
• The AMD GCN hardware wavefront size is 64 (analogous to the 32-thread CUDA warp); a runtime query sketch follows below
• Some CUDA library functions do not have AMD equivalents
• Shared memory and registers per thread can differ between AMD and NVIDIA hardware
7
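A minimal sketch (not from the slides) of querying the wavefront/warp size at runtime instead of hard-coding 32; hipDeviceProp_t and its warpSize field are part of the HIP runtime API:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  hipDeviceProp_t prop;
  hipGetDeviceProperties(&prop, 0);   // query device 0
  // 64 on AMD GCN/CDNA hardware, 32 on NVIDIA GPUs
  printf("wavefront/warp size: %d\n", prop.warpSize);
  return 0;
}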
ROCm
8
• Open software platform for GPU-accelerated computing by AMD
ROCm installation
• Many components need to be installed
• Rocm-cmake
• ROCT Thunk Interface
• HSA Runtime API and runtime for ROCm
• ROCM LLVM / Clang
• Rocminfo (only for AMD HW)
• ROCm-Device-Libs
• ROCm-CompilerSupport
• ROCclr - Radeon Open Compute Common Language Runtime
• HIP
Instructions: https://github.com/cschpc/lumi/blob/main/rocm/rocm_install.md
Script: https://github.com/cschpc/lumi/blob/main/rocm/rocm_installation.sh
Repo: https://github.com/cschpc/lumi
9
Introduction to HIP
• HIP, the Heterogeneous Interface for Portability, is developed by
AMD for programming AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA
platforms
• HIP is similar to CUDA and there is no performance overhead
on NVIDIA GPUs
• Many well-known libraries have been ported to HIP
• New projects, or ports from CUDA, can be developed
directly in HIP
https://github.com/ROCm-Developer-Tools/HIP
10
Differences between the CUDA and HIP APIs
#include "cuda.h"
cudaMalloc(&d_x, N*sizeof(double));
cudaMemcpy(d_x,x,N*sizeof(double),
cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
11
#include "hip/hip_runtime.h"
hipMalloc(&d_x, N*sizeof(double));
hipMemcpy(d_x,x,N*sizeof(double),
hipMemcpyHostToDevice);
hipDeviceSynchronize();
CUDA HIP
Launching kernel with CUDA and HIP
kernel_name <<<gridsize, blocksize,
shared_mem_size,
stream>>>
(arg0, arg1, ...);
12
hipLaunchKernelGGL(kernel_name,
gridsize,
blocksize,
shared_mem_size,
stream,
arg0, arg1, ... );
CUDA HIP
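To make the launch syntax above concrete, here is a minimal, self-contained HIP sketch (the saxpy kernel, block size, and N are illustrative and not from the slides):

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
  const int N = 1 << 20;
  float *x, *y;
  hipMalloc(&x, N * sizeof(float));
  hipMalloc(&y, N * sizeof(float));
  // ... initialize host data and copy it with hipMemcpy ...
  dim3 block(256);
  dim3 grid((N + block.x - 1) / block.x);
  // equivalent to saxpy<<<grid, block>>>(N, 2.0f, x, y); in CUDA
  hipLaunchKernelGGL(saxpy, grid, block, 0, 0, N, 2.0f, x, y);
  hipDeviceSynchronize();
  hipFree(x); hipFree(y);
  return 0;
}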
HIP API
• Device Management:
o hipSetDevice(), hipGetDevice(), hipGetDeviceProperties(), hipDeviceSynchronize()
• Memory Management
o hipMalloc(), hipMemcpy(), hipMemcpyAsync(), hipFree(), hipHostMalloc()
• Streams
o hipStreamCreate(), hipStreamSynchronize(), hipStreamDestroy()
• Events
o hipEventCreate(), hipEventRecord(), hipStreamWaitEvent(), hipEventElapsedTime()
• Device Kernels
o __global__, __device__, hipLaunchKernelGGL()
• Device code
o threadIdx, blockIdx, blockDim, __shared__
o Hundreds of math functions covering the entire CUDA math library
• Error handling
o hipGetLastError(), hipGetErrorString() (see the event-timing sketch below)
https://rocmdocs.amd.com/en/latest/ROCm_API_References/HIP-API.html
13
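A short sketch (not from the slides) combining the event and error-handling calls above to time a host-to-device copy; the buffer size is arbitrary:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t nBytes = 1 << 26;
  float *d; hipMalloc(&d, nBytes);
  float *h = (float*)malloc(nBytes);   // contents irrelevant for timing

  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);

  hipEventRecord(start, 0);
  hipMemcpy(d, h, nBytes, hipMemcpyHostToDevice);
  hipEventRecord(stop, 0);
  hipEventSynchronize(stop);

  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);
  printf("H2D copy: %.3f ms (%s)\n", ms, hipGetErrorString(hipGetLastError()));

  hipEventDestroy(start); hipEventDestroy(stop);
  hipFree(d); free(h);
  return 0;
}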
Hipify
• The hipify tools convert CUDA code automatically
• Not all of the code may be converted; the remainder
needs to be ported by the developer
• Hipify-perl: text-based search and replace
• Hipify-clang: source-to-source translator that uses the clang
compiler
• Porting guide: https://github.com/ROCm-Developer-Tools/HIP/blob/main/docs/markdown/hip_porting_guide.md
14
Hipify-perl
• It can scan directories and convert CUDA code by replacing cuda with hip
(essentially sed -e 's/cuda/hip/g')
$ hipify-perl --inplace filename
It modifies the input file in place, replacing the input with hipified output and saving a backup
in a .prehip file.
$ hipconvertinplace-perl.sh directory
It converts all the related files located inside the directory
15
Hipify-perl (cont).
1) $ ls src/
Makefile.am matMulAB.c matMulAB.h matMul.c
2) $ hipconvertinplace-perl.sh src
3) $ ls src/
Makefile.am matMulAB.c matMulAB.c.prehip matMulAB.h matMulAB.h.prehip
matMul.c matMul.c.prehip
No compilation took place, just conversion.
16
Hipify-perl (cont).
• hipify-perl returns a report for each file; it looks like this:
info: TOTAL-converted 53 CUDA->HIP refs ( error:0 init:0 version:0 device:1 ... library:16
... numeric_literal:12 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:888
kernels (0 total) :
hipFree 18
HIPBLAS_STATUS_SUCCESS 6
hipSuccess 4
hipMalloc 3
HIPBLAS_OP_N 2
hipDeviceSynchronize 1
hip_runtime 1
17
Hipify-perl (cont).
#include <cuda_runtime.h>
#include "cublas_v2.h"
if (cudaSuccess != cudaMalloc((void **) &a_dev,
sizeof(*a) * n * n) ||
cudaSuccess != cudaMalloc((void **) &b_dev,
sizeof(*b) * n * n) ||
cudaSuccess != cudaMalloc((void **) &c_dev,
sizeof(*c) * n * n)) {
printf("error: memory allocation (CUDA)n");
cudaFree(a_dev); cudaFree(b_dev);
cudaFree(c_dev);
cudaDestroy(handle);
exit(EXIT_FAILURE);
}
18
#include <hip/hip_runtime.h>
#include "hipblas.h"
if (hipSuccess != hipMalloc((void **) &a_dev,
sizeof(*a) * n * n) ||
hipSuccess != hipMalloc((void **) &b_dev,
sizeof(*b) * n * n) ||
hipSuccess != hipMalloc((void **) &c_dev,
sizeof(*c) * n * n)) {
printf("error: memory allocation (CUDA)n");
hipFree(a_dev); hipFree(b_dev); hipFree(c_dev);
hipblasDestroy(handle);
exit(EXIT_FAILURE);
}
CUDA HIP
Compilation
1) Compilation with CC=hipcc
matMulAB.c:21:10: fatal error: hipblas.h: No such file or directory 21 | #include
"hipblas.h"
2) Install HipBLAS library *
3) Compile again and the binary is ready. When HIP is used on NVIDIA hardware, the .cpp
file should be compiled with the option "hipcc -x cu …" (an illustrative command line follows below).
• hipcc uses nvcc on NVIDIA GPUs and hcc on AMD GPUs
* https://github.com/cschpc/lumi/blob/main/hip/hipblas.md
19
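As an illustration only (flags depend on the local setup, and $HIPBLAS_PATH is a placeholder for wherever hipBLAS was installed), the compilation on an NVIDIA system could look roughly like:

$ hipcc -x cu -I$HIPBLAS_PATH/include matMul.c matMulAB.c -L$HIPBLAS_PATH/lib -lhipblas -o matMul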
Hipify-clang
• Build from source
• Sometimes the headers need to be included manually with -I/...
$ hipify-clang --print-stats -o matMul.o matMul.c
[HIPIFY] info: file 'matMul.c' statistics:
CONVERTED refs count: 0
UNCONVERTED refs count: 0
CONVERSION %: 0
REPLACED bytes: 0
TOTAL bytes: 4662
CHANGED lines of code: 1
TOTAL lines of code: 155
CODE CHANGED (in bytes) %: 0
CODE CHANGED (in lines) %: 1
TIME ELAPSED s: 22.94
20
Benchmark: MatMul with OpenMP offload
• Use the benchmark https://github.com/pc2/OMP-Offloading for testing purposes,
matrix multiplication of 2048 x 2048
• All the CUDA calls were converted, and the code was linked with hipBLAS as well as
OpenMP offload
• CUDA
matMulAB (11) : 1001.2 GFLOPS 11990.1 GFLOPS maxabserr = 0.0
• HIP
matMulAB (11) : 978.8 GFLOPS 12302.4 GFLOPS maxabserr = 0.0
• For most executions, the HIP version was equal to or slightly better than the CUDA version; for the total
execution, there is ~2.23% overhead for HIP on NVIDIA GPUs
21
N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA)
AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• HIP calls: hipMemcpy, hipMalloc, hipMemcpyHostToDevice,
hipMemcpyDeviceToHost, hipLaunchKernelGGL, hipDeviceSynchronize, hip_runtime,
hipSuccess, hipGetErrorString, hipGetLastError, hipError_t, HIP_DYNAMIC_SHARED
• 32768 small particles, 2000 time steps
• CUDA execution time: 68.5 seconds
• HIP execution time: 70.1 seconds, ~2.33% overhead
22
Fortran
• First Scenario: Fortran + CUDA C/C++
o Assuming there is no CUDA code in the Fortran files.
o Hipify CUDA
o Compile and link with hipcc
• Second Scenario: CUDA Fortran
o There is no HIP equivalent
o HIP functions are callable from C, using `extern C`
o See hipfort
23
Hipfort
• The approach to porting Fortran codes to AMD GPUs is different; the hipify tools do
not support Fortran.
• We need to use hipfort, a Fortran interface library for GPU kernels * (see the sketch after the steps below)
• Steps:
1) We write the kernels in a new C++ file
2) Wrap the kernel launch in a C function
3) Use Fortran 2003 C binding to call the function
4) Things could change in the future
• Alternatively, use OpenMP offload to GPUs
* https://github.com/ROCmSoftwarePlatform/hipfort
24
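A minimal sketch of steps 1–2 above, showing only the C++ side (the kernel and launcher names are illustrative). The Fortran side would declare the launcher with bind(C) via the Fortran 2003 C binding and call it:

// saxpy_hip.cpp — HIP kernel plus a C-callable launcher for Fortran
#include <hip/hip_runtime.h>

__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

extern "C" void launch_saxpy(int n, float a, const float *x, float *y) {
  dim3 block(256);
  dim3 grid((n + block.x - 1) / block.x);
  hipLaunchKernelGGL(saxpy_kernel, grid, block, 0, 0, n, a, x, y);
  hipDeviceSynchronize();
}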
Fortran CUDA example
• Saxpy example
• Fortran CUDA, 29 lines of code
• Ported to HIP manually, two files of 52 lines, with more than 20 new lines.
• Quite a lot of changes for such a small code.
• Should we try to use OpenMP offload before we try to HIP the code?
• Need to adjust Makefile to compile the multiple files
• The HIP version is up to 30% faster; this seems to be a comparison between nvcc and
pgf90, and we are still checking to verify some results
• Example of Fortran with HIP: https://github.com/cschpc/lumi/tree/main/hipfort
25
Fortran CUDA example (cont.)
26
[Side-by-side code listings: original Fortran CUDA, new Fortran 2003 with HIP, C++ with HIP and extern C]
AMD OpenMP (AOMP)
• We have tested the LLVM-provided OpenMP offload and it keeps
improving over time
• AOMP is under heavy development and we have started testing it.
• AOMP still has some performance issues according to some
public results, but we expect it to improve significantly
by the time LUMI is delivered
• https://github.com/ROCm-Developer-Tools/aomp
27
OpenMP or HIP?
• Some users will be wondering which approach to take
• OpenMP can provide a quick port, but HIP is expected
to give better performance as it avoids some extra layers.
• For complicated codes and programming languages such as Fortran,
OpenMP could provide a benefit. Always profile your
code to investigate the performance (a minimal OpenMP offload sketch follows below).
28
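For comparison, a minimal OpenMP offload sketch of a saxpy operation in C++ (a quick-port style example; whether it matches HIP performance must be checked by profiling):

#include <cstdio>

void saxpy_omp(int n, float a, const float *x, float *y) {
  // map clauses move the data; the loop is distributed over GPU teams/threads
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

int main() {
  const int n = 1 << 20;
  float *x = new float[n], *y = new float[n];
  for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
  saxpy_omp(n, 2.0f, x, y);
  printf("y[0] = %f\n", y[0]);   // expect 4.0
  delete[] x; delete[] y;
  return 0;
}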
Porting code to LUMI (not official)
29
Profiling/Debugging
• AMD will provide APIs for profiling and debugging
• Cray will support the profiling API through CrayPat
• Some well-known tools are collaborating with AMD and preparing their tools for
profiling and debugging
• Some simple environment variables such as
AMD_LOG_LEVEL=4 will provide some information.
• More information about a hipMemcpy error (an error-check macro sketch follows below):
hipError_t err = hipMemcpy(c,c_d,nBytes,hipMemcpyDeviceToHost);
printf("%s ",hipGetErrorString(err));
30
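A common pattern (a sketch, not an official API) is to wrap every HIP call in a small error-checking macro built on hipGetErrorString:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with file/line information if a HIP call fails
#define HIP_CHECK(call)                                              \
  do {                                                               \
    hipError_t err_ = (call);                                        \
    if (err_ != hipSuccess) {                                        \
      fprintf(stderr, "HIP error %s at %s:%d\n",                     \
              hipGetErrorString(err_), __FILE__, __LINE__);          \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

int main() {
  float *d = nullptr;
  HIP_CHECK(hipMalloc(&d, 1024 * sizeof(float)));
  HIP_CHECK(hipFree(d));
  return 0;
}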
Tuning
• Running multiple wavefronts per compute unit (CU) is important to hide
latency and sustain instruction throughput
• Memory coalescing increases bandwidth
• Unrolling loops allows the compiler to prefetch data
• Small kernels can cause latency overhead; adjust the workload
• Use the Local Data Share (LDS) (see the sketch below)
31
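As an illustration of the LDS point above, a hedged sketch of a block-wide reduction that stages partial sums in __shared__ memory, which maps to LDS on AMD hardware (the kernel name and block size are illustrative):

#include <hip/hip_runtime.h>

#define BLOCK 256

__global__ void block_sum(const float *in, float *out, int n) {
  __shared__ float lds[BLOCK];                 // lives in LDS on AMD GPUs
  int tid = threadIdx.x;
  int i   = blockIdx.x * blockDim.x + tid;
  lds[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // tree reduction within the block
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) lds[tid] += lds[tid + s];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = lds[0];
}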
Programming models
• OpenACC will be available through GCC, as Mentor
Graphics (now called Siemens EDA) is developing the
OpenACC integration
• Kokkos, RAJA, Alpaka, and SYCL should be usable on
LUMI, but they do not support all programming languages
32
Conclusion
• Depending on the code, porting to HIP can be fairly straightforward
• There can be challenges, depending on the code and which GPU
functionalities are integrated into the application
• There are many approaches to port a code; select the one
that you are most familiar with and that provides the best possible performance
• It will be required to tune the code for high occupancy
• Profiling can help to investigate data transfer issues
• It is probably more productive to try OpenMP offload to GPUs
first for Fortran codes
33
facebook.com/CSCfi
twitter.com/CSCfi
youtube.com/CSCfi
linkedin.com/company/csc---it-center-for-science
Images: CSC archive, Adobe Stock and Thinkstock
github.com/CSCfi
Thank you!
georgios.markomanolis@csc.fi
