In this talk, we present a short introduction to CUDA and GPU computing to help readers get started with this technology.
First, we introduce the GPU from the hardware point of view: what is it? How is it built? Why use it for general-purpose computing (GPGPU)? How does it differ from the CPU?
The second part of the presentation deals with the software abstraction and the use of CUDA to implement parallel computing. The software architecture, kernels and the different types of memory are covered in this part.
Finally, to illustrate what has been presented, code examples are given. These examples also highlight the issues that may occur when using parallel computing.
4. What are GPUs?
‣ Processors designed to handle graphics computation and scene generation
- Optimized for parallel computation
‣ GPGPU: the use of GPUs for general-purpose computing instead of graphics operations like shading, texture mapping, etc.
5. Why use GPUs for general purposes?
‣ CPUs suffer from:
- Slowing performance growth
- Limits to exploiting instruction-level parallelism
- Power and thermal limitations
‣ GPUs are found in all PCs
‣ GPUs are energy efficient
- High performance per watt
6. Why use GPUs for general purposes?
‣ Modern GPUs provide extensive resources
- Massive parallelism and many processing cores
- Flexible and increased programmability
- High floating-point precision
- High arithmetic intensity
- High memory bandwidth
- Inherent parallelism
8. How do CPUs and GPUs differ?
‣ Latency: delay between a request and the first data returned
- e.g. the delay between a texture-read request and the return of the texture data
‣ Throughput: amount of work / amount of time
‣ CPU: low latency, low throughput
‣ GPU: high latency, high throughput
- Processes millions of pixels in a single frame
- Little cache: more transistors dedicated to raw compute horsepower
9. How do CPUs and GPUs differ?
Task Parallelism for CPU
‣ multiple tasks map to multiple threads
‣ tasks run different instructions
‣ 10s of heavyweight threads on 10s of cores
‣ each thread managed and scheduled explicitly
Data Parallelism for GPU
‣ SIMD model
‣ same instruction on different data
‣ 10,000s of lightweight threads working on 100s of cores
‣ threads managed and scheduled by hardware
11. Host and Device
‣ CUDA assumes a distinction between Host and Device
‣ Terminology
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
12. Threads, blocks and grid
‣ Threads are independent sequences of execution that run concurrently
‣ Threads are organized in blocks, which are organized in a grid
‣ Blocks and threads can be addressed using 3D coordinates (see the sketch below)
‣ Threads in the same block share fast memory with each other
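As a minimal sketch of these 3D coordinates (the kernel name and the dimensions below are arbitrary placeholders), a launch configuration is expressed with CUDA's dim3 type:

dim3 blocksPerGrid(4, 4, 2);    // 4 x 4 x 2 = 32 blocks in the grid
dim3 threadsPerBlock(8, 8, 4);  // 8 x 8 x 4 = 256 threads per block

// hypothetical kernel launch; inside the kernel each thread reads
// its position from blockIdx.{x,y,z} and threadIdx.{x,y,z}
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data);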
13. Blocks
‣ The number of threads in a block is limited and depends on the graphics card (this limit can be queried at run time, as shown below)
‣ Threads in a block are divided into groups of 32 threads called warps
- Threads in the same warp are executed in parallel
‣ Blocks enable automatic scalability: they can be scheduled on however many multiprocessors the card has
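As a small illustration (assuming a single CUDA-capable device 0), these per-card limits can be queried at run time with the CUDA runtime API:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}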
14. Kernels
‣ A kernel is the code that each thread executes
‣ Threads can be thought of as entities mapped to the elements of a certain data structure
‣ Kernels are launched by the Host, and in recent CUDA versions can also be launched by other kernels
15. How to use kernels?
‣ A kernel can only be a void function
‣ The CUDA __global__ qualifier means the kernel can be called from the Host (and, in recent CUDA versions, from the Device), but it always runs on the Device
‣ Inside a kernel, each thread can read its thread and block position to compute a unique identifier (as sketched below)
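As a minimal sketch (the kernel name and its work are placeholder assumptions), a __global__ kernel that derives a unique per-thread index might look like this:

// A __global__ kernel must return void.
// Each thread combines its block and thread indices into a unique
// global index, then works on "its" element of the array.
__global__ void add_one(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)   // guard: the grid may cover more threads than elements
        data[idx] += 1;
}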
16. How to use kernels?
‣ Kernel call: kernelName<<<blocksPerGrid, threadsPerBlock>>>(arguments), as sketched below
‣ If you want to call an ordinary function from your kernel, you must declare it with the CUDA __device__ qualifier
‣ A __device__ function can only be called from the Device and is inlined by default
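A minimal sketch putting both together (square and square_all are hypothetical names; d_data is assumed to be an already-allocated device array of n integers):

// __device__ helper: callable only from device code
__device__ int square(int x)
{
    return x * x;
}

__global__ void square_all(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = square(data[idx]);
}

// Host-side kernel call: <<<...>>> sets the grid configuration
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
square_all<<<blocks, threadsPerBlock>>>(d_data, n);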
19. Global Memory
‣ Host and Device global memory are separate entities
- Device pointers point to GPU memory and may not be dereferenced in Host code
- Host pointers point to CPU memory and may not be dereferenced in Device code
‣ Slowest memory on the Device
‣ Easy to use
‣ ~1.5 GB on a typical GPU
C              C for CUDA
int *h_T;      int *d_T;
malloc()       cudaMalloc()
free()         cudaFree()
memcpy()       cudaMemcpy()
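Putting the table above together, a minimal host-side round trip might look like this (error checking omitted for brevity):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int N = 1024;
    size_t size = N * sizeof(int);

    int *h_T = (int *)malloc(size);   // host (CPU) memory
    int *d_T;
    cudaMalloc((void **)&d_T, size);  // device (GPU) global memory

    // copy host -> device, launch kernels here, then copy back
    cudaMemcpy(d_T, h_T, size, cudaMemcpyHostToDevice);
    cudaMemcpy(h_T, d_T, size, cudaMemcpyDeviceToHost);

    cudaFree(d_T);
    free(h_T);
    return 0;
}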
21. Constant Memory
‣ Constant memory is a read-only memory located in Global memory that can be accessed by every thread
‣ Two reasons to use Constant memory:
- A single read can be broadcast to up to 15 other threads (a half-warp)
- Constant memory is cached on the GPU
‣ Drawback:
- The half-warp broadcast can degrade performance when the 16 threads read different addresses
22. How to use constant memory?
‣ The qualifier that defines constant memory is __constant__
‣ It must be declared outside any function body, and cudaMemcpyToSymbol is used to copy values from the Host to the Device
‣ Constant memory variables don't need to be passed as arguments in the kernel invocation; kernels access them directly
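A minimal sketch (the coefficient table and the scale kernel are hypothetical):

#include <cuda_runtime.h>

// declared at file scope, outside any function body
__constant__ float c_coeffs[4];

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // all threads read the same constant address -> half-warp broadcast
    if (i < n)
        data[i] *= c_coeffs[0];
}

int main(void)
{
    float h_coeffs[4] = {2.0f, 0.5f, 1.0f, 3.0f};
    // copy from the Host to the __constant__ symbol on the Device
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
    // ... allocate data, launch scale<<<...>>>(...), etc.
    return 0;
}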
23. Texture Memory
‣ Texture memory is located in Global memory and can be accessed by every thread
‣ Accessed through a dedicated read-only cache
‣ The cache includes hardware filtering, which can perform linear floating-point interpolation as part of the read process
‣ The cache is optimised for spatial locality in the coordinate system of the texture, not in memory
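As a sketch using the texture object API of more recent CUDA versions (the names are placeholders; d_in and d_out are assumed to be device buffers of n floats):

#include <cuda_runtime.h>
#include <string.h>

__global__ void copy_through_tex(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);  // read via the texture cache
}

void launch_with_texture(float *d_in, float *d_out, int n)
{
    // describe the existing linear device buffer as a texture resource
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    copy_through_tex<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDestroyTextureObject(tex);
}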
24. Shared Memory
‣ 16-64 KB of memory per block
‣ Extremely fast on-chip memory, user managed
‣ Declared using __shared__, allocated per block
‣ Data is not visible to threads in other blocks
‣ Beware of bank conflicts!
‣ When to use it? When threads would otherwise access the same global memory locations many times
25. Shared Memory - Example
‣ 1D stencil: each output element is the SUM of the input elements within a fixed radius
‣ How many times is each input element read? With a radius of 3, up to 7 times, which is why caching it in shared memory pays off
26. Shared Memory - Example
__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory, including the halo
    // (assumes the caller padded `in` with RADIUS elements at each end)
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    __syncthreads(); // wait until the whole block has filled temp[]

    // Sum the 2*RADIUS + 1 neighbouring elements
    int res = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        res += temp[lindex + offset];
    out[gindex] = res;
}
‣ What was missing in the first, naive version of this kernel? __syncthreads(): without this barrier, a thread could read temp[] before neighbouring threads had written their elements into shared memory
31. Conclusion
‣ GPUs are designed for parallel computing
‣ CUDA's software abstraction is adapted to the GPU architecture with grids, blocks and threads
‣ Managing which functions access which type of memory is very important
- Be careful of bank conflicts!
‣ Data transfer between host and device is slow (roughly 5 GB/s for device-to-host and host-to-device transfers, versus roughly 16 GB/s for device-to-device and host-to-host copies)
32. Resources
‣ We skipped some details; you can learn more with:
- CUDA programming guide
- CUDA Zone – tools, training, webinars and more
- http://developer.nvidia.com/cuda
‣ Install from
- https://developer.nvidia.com/category/zone/cuda-zone and learn from the provided examples