Getting started with AMD GPUs

Getting started with AMD GPUs
FOSDEM’21 HPC, Big Data and Data Science devroom
February 7th, 2021
George S. Markomanolis
Lead HPC Scientist, CSC – IT For Science Ltd.

Outline
• Motivation
• LUMI
• ROCm
• Introduction and porting codes to HIP
• Benchmarking
• Fortran and HIP
• Tuning
2

Disclaimer
• AMD ecosystem is under heavy development, many things can change without notice
• All the experiments took place on NVIDIA V100 GPU (Puhti cluster at CSC)
• Trying to use the latest versions of ROCm
• Some results are really fresh and investigating the outcome
3

Motivation/Challenges
• LUMI will have AMD GPUs
• Need to learn how to program and port codes on AMD ecosystem
• Provide training to LUMI users
• Investigate in the future about possible problems
• Not yet access to AMD GPUs
5

AMD GPUs (MI100 example)
6
AMD MI100
LUMI will have a
different GPU

Differences between HIP and CUDA
• AMD GCN hardware wavefronts size is 64 (like warp for CUDA)
• Some CUDA library functions do not have AMD equivalents
• Shared memory and registers per thrad can differ between AMD and
NVIDIA hardware
7

ROCm
8
• Open Software Platform for GPU-
accelerated Computing by AMD

ROCm installation
• Many components need to be installed
• Rocm-cmake
• ROCT Thunk Interface
• HSA Runtime API and runtime for ROCm
• ROCM LLVM / Clang
• Rocminfo (only for AMD HW)
• ROCm-Device-Libs
• ROCm-CompilerSupport
• ROCclr - Radeon Open Compute Common Language Runtime
• HIP
Instructions: https://github.com/cschpc/lumi/blob/main/rocm/rocm_install.md
Script: https://github.com/cschpc/lumi/blob/main/rocm/rocm_installation.sh
Repo: https://github.com/cschpc/lumi
9

Introduction to HIP
• HIP: Heterogeneous Interface for Portability is developed by
AMD to program on AMD GPUs
• It is a C++ runtime API and it supports both AMD and NVIDIA
platforms
• HIP is similar to CUDA and there is no performance overhead
on NVIDIA GPUs
• Many well-known libraries have been ported on HIP
• New projects or porting from CUDA, could be developed
directly in HIP
https://github.com/ROCm-Developer-Tools/HIP
10

Differences between CUDA and HIPAPI
#include “cuda.h”
cudaMalloc(&d_x, N*sizeof(double));
cudaMemcpy(d_x,x,N*sizeof(double),
cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
11
#include “hip/hip_runtime.h”
hipMalloc(&d_x, N*sizeof(double));
hipMemcpy(d_x,x,N*sizeof(double),
hipMemcpyHostToDevice);
hipDeviceSynchronize();
CUDA HIP

Launching kernel with CUDA and HIP
kernel_name <<<gridsize, blocksize,
shared_mem_size,
stream>>>
(arg0, arg1, ...);
12
hipLaunchKernelGGL(kernel_name,
gridsize,
blocksize,
shared_mem_size,
stream,
arg0, arg1, ... );
CUDA HIP

HIP API
• Device Management:
o hipSetDevice(), hipGetDevice(), hipGetDeviceProperties(), hipDeviceSynchronize()
• Memory Management
o hipMalloc(), hipMemcpy(), hipMemcpyAsync(), hipFree(), hipHostMalloc
• Streams
o hipStreamCreate(), hipStreamSynchronize(), hipStreamDestroy()
• Events
o hipEventCreate(), hipEventRecord(), hipStreamWaitEvent(), hipEventElapsedTime()
• Device Kernels
o __global__, __device__, hipLaunchKernelGGL()
• Device code
o threadIdx, blockIdx, blockDim, __shared__
o Hundreds math functions covering entire CUDA math library
• Error handling
o hipGetLastError(), hipGetErrorString()
https://rocmdocs.amd.com/en/latest/ROCm_API_References/HIP-API.html
13

Hipify
• Hipify tools convert automatically CUDA codes
• It is possible that not all the code is converted, the remaining
needs the implementation of the developer
• Hipify-perl: text-based search and replace
• Hipify-clang: source-to-source translator that uses clang
compiler
• Porting guide: https://github.com/ROCm-Developer-
Tools/HIP/blob/main/docs/markdown/hip_porting_guide.md
14

Hipify-perl
• It can scan directories and converts CUDA codes with replacement of the cuda to hip
(sed –e ’s/cuda/hip/g’)
$ hipify-perl --inplace filename
It modifies the filename input inplace, replacing input with hipified output, save backup
in .prehip file.
$ hipconvertinplace-perl.sh directory
It converts all the related files that are located inside the directory
15

Hipify-perl (cont).
1) $ ls src/
Makefile.am matMulAB.c matMulAB.h matMul.c
2) $ hipconvertinplace-perl.sh src
3) $ ls src/
Makefile.am matMulAB.c matMulAB.c.prehip matMulAB.h matMulAB.h.prehip
matMul.c matMul.c.prehip
No compilation took place, just convertion.
16

Hipify-perl (cont).
• The hipify-perl will return a report for each file, and it looks like this:
info: TOTAL-converted 53 CUDA->HIP refs ( error:0 init:0 version:0 device:1 ... library:16
... numeric_literal:12 define:0 extern_shared:0 kernel_launch:0 )
warn:0 LOC:888
kernels (0 total) :
hipFree 18
HIPBLAS_STATUS_SUCCESS 6
hipSuccess 4
hipMalloc 3
HIPBLAS_OP_N 2
hipDeviceSynchronize 1
hip_runtime 1
17

Hipify-perl (cont).
#include <cuda_runtime.h>
#include "cublas_v2.h"
if (cudaSuccess != cudaMalloc((void **) &a_dev,
sizeof(*a) * n * n) ||
cudaSuccess != cudaMalloc((void **) &b_dev,
sizeof(*b) * n * n) ||
cudaSuccess != cudaMalloc((void **) &c_dev,
sizeof(*c) * n * n)) {
printf("error: memory allocation (CUDA)n");
cudaFree(a_dev); cudaFree(b_dev);
cudaFree(c_dev);
cudaDestroy(handle);
exit(EXIT_FAILURE);
}
18
#include <hip/hip_runtime.h>
#include "hipblas.h”
if (hipSuccess != hipMalloc((void **) &a_dev,
sizeof(*a) * n * n) ||
hipSuccess != hipMalloc((void **) &b_dev,
sizeof(*b) * n * n) ||
hipSuccess != hipMalloc((void **) &c_dev,
sizeof(*c) * n * n)) {
printf("error: memory allocation (CUDA)n");
hipFree(a_dev); hipFree(b_dev); hipFree(c_dev);
hipblasDestroy(handle);
exit(EXIT_FAILURE);
}
CUDA HIP

Compilation
1) Compilation with CC=hipcc
matMulAB.c:21:10: fatal error: hipblas.h: No such file or directory 21 | #include
"hipblas.h"
2) Install HipBLAS library *
3) Compile again and the binary is ready. When the HIP is on NVIDIA hardware, the .cpp
file should be compiled with the option “hipcc -x cu …”.
• The hipcc is using nvcc on NVIDIA GPUs and hcc for AMD GPUs
* https://github.com/cschpc/lumi/blob/main/hip/hipblas.md
19

Hipify-clang
• Build from source
• Some times needs to include manually the headers -I/...
$ hipify-clang --print-stats -o matMul.o matMul.c
[HIPIFY] info: file 'matMul.c' statistics:
CONVERTED refs count: 0
UNCONVERTED refs count: 0
CONVERSION %: 0
REPLACED bytes: 0
TOTAL bytes: 4662
CHANGED lines of code: 1
TOTAL lines of code: 155
CODE CHANGED (in bytes) %: 0
CODE CHANGED (in lines) %: 1
TIME ELAPSED s: 22.94
20

Benchmark MatMul OpenMP oflload
• Use the benchmark https://github.com/pc2/OMP-Offloading for testing purposes,
matrix multiplication of 2048 x 2048
• All the CUDA calls were converted and it was linked with hipBlas among also
OpenMP offload
• CUDA
matMulAB (11) : 1001.2 GFLOPS 11990.1 GFLOPS maxabserr = 0.0
• HIP
matMulAB (11) : 978.8 GFLOPS 12302.4 GFLOPS maxabserr = 0.0
• For the most executions, HIP version was equal or a bit better than CUDA version, for total
execution, there is ~2.23% overhead for HIP using NVIDIA GPUs
21

N-BODY SIMULATION
• N-Body Simulation (https://github.com/themathgeek13/N-Body-Simulations-CUDA)
AllPairs_N2
• 171 CUDA calls converted to HIP without issues, close to 1000 lines of code
• HIP calls: hipMemcpy, hipMalloc, hipMemcpyHostToDevice,
hipMemcpyDeviceToHost, hipLaunchKernelGGL, hipDeviceSynchronize, hip_runtime,
hipSuccess, hipGetErrorString, hipGetLastError, hipError_t, HIP_DYNAMIC_SHARED
• 32768 number of small particles, 2000 time steps
• CUDA execution time: 68.5 seconds
• HIP execution time: 70.1 seconds, ~2.33% overhead
22

Fortran
• First Scenario: Fortran + CUDA C/C++
oAssuming there is no CUDA code in the Fortran files.
oHipify CUDA
oCompile and link with hipcc
• Second Scenario: CUDA Fortran
oThere is no HIP equivalent
oHIP functions are callable from C, using `extern C`
oSee hipfort
23

Hipfort
• The approach to port Fortran codes on AMD GPUs is different, the hipify tool does
not support it.
• We need to use hipfort, a Fortran interface library for GPU kernel *
• Steps:
1) We write the kernels in a new C++ file
2) Wrap the kernel launch in a C function
3) Use Fortran 2003 C binding to call the function
4) Things could change in the future
• Use OpenMP offload to GPUs
* https://github.com/ROCmSoftwarePlatform/hipfort
24

Fortran CUDA example
• Saxpy example
• Fortran CUDA, 29 lines of code
• Ported to HIP manually, two files of 52 lines, with more than 20 new lines.
• Quite a lot of changes for such a small code.
• Should we try to use OpenMP offload before we try to HIP the code?
• Need to adjust Makefile to compile the multiple files
• The HIP version is up to 30% faster, seems to be a comparison between nvcc and
pgf90, still checking to verify some results
• Example of Fortran with HIP: https://github.com/cschpc/lumi/tree/main/hipfort
25

Fortran CUDA example (cont.)
26
Original Fortran CUDA New Fortran 2003 with HIP C++ with HIP and extern C

AMD OpenMP (AOMP)
• We have tested the LLVM provided OpenMP offload and gets
improved by the time
• AOMP is under heavy development and we started testing it.
• AOMP has still some performance issues according to some
public results but we expect to be also improved significanlty
by the time LUMI is delivered
• https://github.com/ROCm-Developer-Tools/aomp
27

OpenMP or HIP?
• Some users will be questioning about the approach
• OpenMP can provide a quick porting but it is expected with HIP
to have better performance as we avoid some layers like that.
• For complicated codes and programming languages as Fortran,
probably OpenMP could provide a benefit. Always profile your
code to investigate the performance.
28

Porting code to LUMI (not official)
29

Profiling/Debugging
• AMD will provide APIs for profiling and debugging
• Cray will support the profiling API through CrayPat
• Some well known tools are collaborating with AMD and preparing their tools for
profiling and debugging
• Some simple environment variables such as
AMD_LOG_LEVEL=4 will provide some information.
• More information about a hipMemcpy error:
hipError_t err = hipMemcpy(c,c_d,nBytes,hipMemcpyDeviceToHost);
printf("%s ",hipGetErrorString(err));
30

Tuning
• Multiple wavefronts per compute unit (CU) is important to hide
latency and instruction throughput
• Memory coalescing increases bandwidth
• Unrolling loops allow compiler to prefetch data
• Small kernels can cause latency overhead, adjust the workload
• Use of Local Data Share (LDS)
31

Programming models
• OpenACC will be available through the GCC as Mentor
Graphics (now called Siemens EDA) is developing the
OpenACC integration
• Kokkos, Raja, Alpaka, and SYCL should be able to be used on
LUMI but they do not support all the programming languages
32

Conclusion
• Depending on the code the porting to HIP can be more straight forward
• There can be challenges, depending on the code and what GPU
functionalities are integrated to an application
• There are many approaches to port a code and you should select the one
that you are more familiar and provides as possible as good performance
• It will be required to tune the code for high occupancy
• Profiling can help to investigate data transfer issues
• Probably is more productive to try OpenMP with offload to GPUs
initially with Fortran codes
33

facebook.com/CSCfi
twitter.com/CSCfi
youtube.com/CSCfi
linkedin.com/company/csc---it-center-for-science
Kuvat CSC:n arkisto, Adobe Stock ja Thinkstock
github.com/CSCfi
Thank you!
georgios.markomanolis@csc.fi

Getting started with AMD GPUs

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Getting started with AMD GPUs

Semelhante a Getting started with AMD GPUs (20)

Mais de George Markomanolis

Mais de George Markomanolis (15)

Último

Último (20)

Getting started with AMD GPUs