This presentation by Stanislav Donets (Lead Software Engineer, Consultant, GlobalLogic, Kharkiv) was delivered at GlobalLogic Kharkiv C++ Workshop #1 on September 14, 2019.
The talk covered:
- Graphics Processing Units: Architecture and Programming (theory).
- Scratch Example: Barnes Hut n-Body Algorithm (practice).
Conference materials: https://www.globallogic.com/ua/events/kharkiv-cpp-workshop/
3. 3
Agenda
1. Graphics Processing Units (GPUs): Architecture and Programming (theory):
• An Introduction to OpenCL;
• Host programs;
• Kernel programs;
• Writing Kernel Programs;
• Working with the OpenCL memory model;
4. 4
Agenda
2. Example: Barnes Hut n-Body Algorithm (practice):
• Introduction, Problem Statement, and Context;
• Core Methods;
• Algorithms and Implementations;
7. 7
Amdahl's Law vs Gustafson – Barsis's Law
• Amdahl's law (the workload remains constant):
$$S(p) = \frac{1}{\alpha + \frac{1 - \alpha}{p}}$$
• Gustafson-Barsis's law (when the workload increases with the number of processors, more speedup is obtained):
$$S(p) = p + \alpha (1 - p)$$
Where:
$\alpha$ is the fraction of strictly serial (non-parallelizable) code;
$p$ is the number of threads.
[Figure: two panels, "Amdahl's Law" and "Gustafson-Barsis's Law"]
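The two laws can be compared numerically. The following is a minimal C++ sketch, not taken from the talk; the function names are illustrative, and alpha is assumed to be the serial fraction of the work (between 0 and 1).

```cpp
#include <cassert>

// Amdahl's law: fixed workload.
// alpha: serial fraction of the work (0..1), p: number of threads (>= 1).
double amdahl_speedup(double alpha, double p) {
    assert(alpha >= 0.0 && alpha <= 1.0 && p >= 1.0);
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

// Gustafson-Barsis's law: workload grows with the number of threads.
double gustafson_speedup(double alpha, double p) {
    assert(alpha >= 0.0 && alpha <= 1.0 && p >= 1.0);
    return p + alpha * (1.0 - p);
}
```

For example, with 10% serial code and 8 threads, amdahl_speedup(0.1, 8) is about 4.7, while gustafson_speedup(0.1, 8) is 7.3.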
8. 8
Parallel Programming Techniques
OpenMP: OpenMP is an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. It applies only to multi-core platforms with a shared memory subsystem.
MPI: The Message Passing Interface (MPI) has an advantage over OpenMP in that it can run on either shared or distributed memory architectures. Distributed memory computers are less expensive than large shared memory computers. One major disadvantage of the MPI parallel framework is that performance is limited by the communication network between the nodes.
OpenACC: The OpenACC Application Program Interface (API) describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators.
CUDA: Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA for graphics processing and GPGPU (General-Purpose GPU) programming. The CUDA software framework has a fairly large developer community.
11. 11
Conventional CPU Architecture
• Space devoted to control logic
instead of ALU
• CPUs are optimized to minimize
the latency of a single thread
- Can efficiently handle control
flow intensive workloads
• Multi level caches used to hide
latency
• Limited number of registers due
to smaller number of active
threads
• Control logic to reorder
execution, provide ILP and
minimize pipeline stalls
[Figure: Conventional CPU Block Diagram: Control Logic, ALU, L1 Cache, L2 Cache, L3 Cache, and a ~25 GB/s link to System Memory]
A present-day multicore CPU could have more than one ALU (typically < 32), and some of the cache hierarchy is usually shared across cores.
12. 12
Conventional GPU Architecture
• Generic many core GPU
- Less space devoted to control
logic and caches
- Large register files to support
multiple thread contexts
• Low latency hardware managed
thread switching
• Large number of ALU per “core”
with small user managed cache
per core
• Memory bus optimized for
bandwidth
- ~150 GB/s bandwidth allows us to service a large number of ALUs simultaneously
[Figure: GPU block diagram: simple ALUs with a cache, connected by a high-bandwidth bus to on-board system memory]
13. 13
The Heterogeneous System
CPUs (x86, x86_64, ARM): multicore architecture, programmed using OpenMP, POSIX Threads, etc.
GPUs (AMD, NVIDIA, Mali): programmed using proprietary tools.
TI DSPs, FPGAs, hardware accelerators: programmed using proprietary tools only.
OpenCL
15. 15
Platform Model
• One Host and one or more
OpenCL Devices
• Each OpenCL Device is composed of one or more Compute Units
• Each Compute Unit is divided
into one or more Processing
Elements
• Memory divided into host
memory and device memory
16. 16
Execution Model
• Host defines a command queue and
associates it with a context (devices,
kernels, memory, etc).
• Host enqueues commands to the
command queue
• Kernel execution commands launch work-items, i.e. a kernel instance for each point in an abstract Index Space
• Work-item: a single copy of the compute kernel, running on one data element
- In Data Parallel mode, kernel execution contains multiple work-items
- In Task Parallel mode, kernel execution contains a single work-item
• Work-items execute together as a work-group.
17. 17
Execution Model (NDRange)
• Synchronization between
work-items possible only
within work-groups:
barriers and memory
fences
• Cannot synchronize
between work-groups
within a kernel
• Choose the dimensions
(1, 2, or 3) that are “best”
for your algorithm
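The relation between global, group, and local indices in a 1-D NDRange can be sketched on the host in plain C++. This is an illustration under two assumptions that are not stated on the slide: the global offset is zero and the global size is a multiple of the work-group size. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Indices a work-item would see in a 1-D NDRange (zero global offset assumed).
struct WorkItemIds {
    std::size_t global_id;  // what get_global_id(0) would return
    std::size_t group_id;   // what get_group_id(0) would return
    std::size_t local_id;   // what get_local_id(0) would return
};

// Splits a global index into its work-group index and its index inside the group.
WorkItemIds decompose(std::size_t global_id, std::size_t local_size) {
    assert(local_size > 0);
    return {global_id, global_id / local_size, global_id % local_size};
}

// Inverse mapping: rebuilds the global index from group and local indices.
std::size_t recompose(std::size_t group_id, std::size_t local_id,
                      std::size_t local_size) {
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

With a work-group size of 4, global index 10 belongs to work-group 2 and has local index 2.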
18. 18
OpenCL Memory Model
• Four Memory Types:
Global : default for images/buffers
Constant : global const variables
Local : shared between work-items
Private : kernel internal variables
• Global and Constant memory are visible to all work-items. They are the largest and slowest memory types used by devices. Constant memory is read-only for the device but can be read and written by the host.
• Local memory is sharable within a
workgroup. It is smaller but faster than
global memory.
• Private memory is available to individual
work-items.
19. 19
OpenCL Memory Consistency
• OpenCL uses a “relaxed consistency memory model”
o State of memory visible to a work-item not guaranteed to be consistent across the
collection of work-items at all times
• Memory has load/store consistency within a work-item
• Local memory has consistency across work-items within a work-group at a barrier
• Global memory is consistent within a work-group at a barrier, but not guaranteed across
different work-groups
• Memory consistency for objects shared between commands enforced at synchronization
points
20. 20
The Memory Hierarchy
Sizes:
- Private memory: O(10) words/WI
- Local memory: O(1-10) KBytes/WG
- Global memory: O(1-16) GBytes
- Host memory: above O(100) GBytes
Bandwidths:
- Private memory: O(2-3) words/cycle/WI
- Local memory: O(10) words/cycle/WG
- Global memory: O(100-300) GBytes/s
- Host memory: above O(100) GBytes/s
Speeds and feeds are approximate, for a high-end discrete GPU.
21. 21
Why Using Too Much Private Memory Can Be A Good Thing
• In reality private memory is just hardware registers, so only dozens of these are
available per work-item
• Many kernels will allocate too many variables to private memory
• So the compiler already has to be able to deal with this
• It does so by spilling excess private variables to (global) memory
• You still told the compiler something useful – that the data will only be accessed by a
single work-item
• This lets the compiler allocate the data in such a way as to enable more efficient memory access
22. 22
OpenCL Programming Model
• Data Parallel, SPMD
Traditionally, this term refers to a programming model where concurrency is expressed as
instructions from a single program applied to multiple elements within a set of data
structures. The term has been generalized in OpenCL to refer to a model wherein a set of
instructions from a single program is applied concurrently to each point within an abstract
domain of indices.
• Task Parallel
A programming model in which computations are expressed in terms of multiple concurrent tasks, where a task is a kernel executing in a single work-group of size one. The concurrent tasks can be running different kernels.
23. 23
OpenCL Compilation Model
OpenCL uses a dynamic (runtime) compilation model (like DirectX and OpenGL)
• Static compilation:
–The code is compiled from source to machine code at a specific point in the past (when the developer compiled it using the IDE)
• Dynamic compilation:
–Also known as runtime compilation
–Step 1: The code is compiled to an Intermediate Representation (IR), which is usually the assembly language of a virtual machine. This step is known as offline compilation, and it is done by the front-end compiler
–Step 2: The IR is compiled to machine code for execution. This step is much shorter. It is known as online compilation, and it is done by the back-end compiler
• In dynamic compilation, step 1 is usually done only once, and the IR is stored. The app loads the IR and performs step 2 at runtime (hence the term)
24. 24
OpenCL Terms (1/4)
• Application - The combination of the program running on the host and the
OpenCL devices.
• Buffer Object - A memory object that stores a linear collection of bytes. Buffer
objects are accessible using a pointer in a kernel executing on a device. Buffer
objects can be manipulated by the host, using OpenCL API calls.
• Platform - The host plus a collection of devices managed by the OpenCL
framework that allow an application to share resources and execute kernels on
devices in the platform.
25. 25
OpenCL Terms (2/4)
• Context - The environment within which the kernels execute and the domain in
which synchronization and memory management are defined. The context
includes a set of devices, the memory accessible to those devices, the
corresponding memory properties and one or more command-queues used to
schedule execution of a kernel(s) or operations on memory objects.
• Command - OpenCL operation that is submitted to a command-queue for
execution. For example, OpenCL commands issue kernels for execution on a
compute device, manipulate memory objects, etc.
26. 26
OpenCL Terms (3/4)
• Command-Queue - An object that holds commands that will be executed on a
specific device. The command-queue is created on a specific device in a
context. Commands to a command queue are queued in order but may be
executed in order or out of order.
• Kernel - A kernel is a function declared in a program and executed on an
OpenCL device. A kernel is identified by the __kernel or kernel qualifier applied
to any function defined in a program.
27. 27
OpenCL Terms (4/4)
• Kernel Object - A kernel object encapsulates a specific kernel function
declared in a program and the argument values to be used when executing
this kernel function.
• Program - An OpenCL program consists of a set of kernels. Programs may
also contain auxiliary functions called by the kernel functions and constant
data.
• Program Object - A program object encapsulates the following information:
A reference to an associated context.
A program source or binary.
The latest successfully built program executable, the list of devices for which the
program executable is built, the build options used and a build log.
The number of kernel objects currently attached.
28. 28
OpenCL Runtime
[Figure: a Context spanning two compute devices (CPU and GPU) and containing four groups of objects:
- Programs: the dp_mul source below, built into a CPU program binary and a GPU program binary
- Kernels: dp_mul kernel objects, each holding arg[0], arg[1], and arg[2] values
- Memory Objects: Buffers and Images
- Command Queues: an in-order queue and an out-of-order queue]
__kernel void
dp_mul(global const float *a,
       global const float *b,
       global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
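To make the data-parallel meaning of dp_mul concrete, here is a host-only C++ reference that computes the same result serially. It is an illustrative sketch, not OpenCL host code: the loop index plays the role of get_global_id(0), and the function name dp_mul_reference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for the dp_mul kernel: one loop iteration per work-item.
std::vector<float> dp_mul_reference(const std::vector<float>& a,
                                    const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t id = 0; id < a.size(); ++id) {
        c[id] = a[id] * b[id];  // same body as the kernel
    }
    return c;
}
```

A reference like this is handy for checking the values read back from the device.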
30. 30
Structure Of OpenCL Main Program
The structure of an OpenCL host program is as follows:
1. Get information about platform and devices available on system
2. Select devices to use
3. Create an OpenCL command queue
4. Create memory buffers on device
5. Create kernel program object
6. Build (compile) kernel in-line (or load precompiled binary)
7. Create OpenCL kernel object
8. Set kernel arguments
9. Execute kernel
10. Read kernel memory and copy to host memory.
31. 31
Setting Up The Host Program
• Khronos has defined a common C++ header file containing a high level interface to OpenCL,
cl.hpp
• This interface is dramatically easier to work with
• Include key header files … both standard and custom
#include <CL/cl.hpp> // Khronos C++ Wrapper API
#include <cstdio> // For C style
#include <iostream> // For C++ style IO
#include <vector> // For C++ vector types
32. 32
Host Program Initialization
1. Get the list of available platforms. It can be obtained with the cl::Platform::get method.
2. Set the properties that will be used to create a new context.
3. Create the context with the cl::Context constructor. The constructor takes the type of device to include in the context and a list of properties.
4. Create a command queue. The host code cannot call kernels directly; it has to put them in a queue.
33. 33
Preparation of OpenCL Programs
1. The compilation process consists of four steps:
1. Load the sources into a list of sources (cl::Program::Sources).
2. Create a program using the cl::Program constructor.
3. Build the program using cl::Program::build.
4. Create a kernel using cl::Kernel.
2. As an alternative, the program can be cached and loaded from a binary
34. 34
Context and Command-Queues
• Context:
- The environment within which kernels execute and in which
synchronization and memory management is defined.
• The context includes:
- One or more devices
- Device memory
- One or more command-queues
• All commands for a device (kernel execution,
synchronization, and memory transfer operations) are
submitted through a command-queue.
• Each command-queue points to a single device within
a context.
[Figure: a Context containing a Queue that feeds a Device and its Device Memory]
35. 35
Command-Queues
• Commands include:
- Kernel executions
- Memory object management
- Synchronization
• The only way to submit commands to a device
is through a command-queue.
• Each command-queue points to a single device
within a context.
• Multiple command-queues can feed a single
device.
- Used to define independent streams of
commands that don’t require synchronization
[Figure: a Context with two Queues, one feeding the GPU and one feeding the CPU]
36. 36
Command-queue Execution Details
• Command queues can be configured in different
ways to control how commands execute
• In-order queues:
- Commands are enqueued and complete in the order they appear
in the program (program-order)
• Out-of-order queues:
- Commands are enqueued in program-order but can execute (and
hence complete) in any order.
• Execution of commands in the command-queue is guaranteed to be complete at synchronization points
- Discussed later
[Figure: a Context with two Queues, one feeding the GPU and one feeding the CPU]
37. 37
OpenCL Synchronization: Queues & Events
• Events connect command invocations. They can be used to synchronize executions inside out-of-order queues or between queues
• Example: 2 queues with 2 devices (GPU and CPU); Kernel 1 and Kernel 2 are enqueued one after the other
- Without an event: Kernel 2 starts before the results from Kernel 1 are ready
- With an event: Kernel 2 waits for an event from Kernel 1 and does not start until the results are ready
[Figure: two timelines over the GPU and CPU, showing the two cases above]
38. 38
Why Events? Won’t A Barrier Do?
• A barrier defines a synchronization point … commands
following a barrier wait to execute until all prior
enqueued commands complete
- cl_int clEnqueueBarrier(cl_command_queue
queue)
• Events provide fine grained control … this can really
matter with an out-of-order queue.
• Events work between commands in the different queues
… as long as they share a context
• Events convey more information than a barrier … provide
info on state of a command, not just whether it’s
complete or not.
[Figure: a Context with two Queues (GPU and CPU) connected by an Event]
39. 39
Release of Resources
With the C++ wrapper, resources are released automatically when the wrapper objects go out of scope (for example, on function exit), because the destructors of the OpenCL objects call the API functions that release resources. Explicit release is necessary only when OpenCL objects are allocated dynamically and held through pointers.
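The scope-based release described here is the usual C++ RAII pattern. The sketch below shows it with a hypothetical stand-in class instead of a real OpenCL object, so it compiles without OpenCL headers; ScopedHandle, g_release_calls, and use_handle are illustrative names only. The real wrapper classes are reference-counted and copyable, which this simplified move-only class does not model.

```cpp
#include <cassert>

// Counts how many times the (pretend) release function has run.
static int g_release_calls = 0;

// Hypothetical stand-in for a wrapper around an OpenCL handle.
class ScopedHandle {
public:
    explicit ScopedHandle(int id) : id_(id), owns_(true) {}
    ~ScopedHandle() {
        if (owns_) ++g_release_calls;  // a real wrapper would call a release API here
    }
    ScopedHandle(const ScopedHandle&) = delete;
    ScopedHandle& operator=(const ScopedHandle&) = delete;
    ScopedHandle(ScopedHandle&& other) noexcept
        : id_(other.id_), owns_(other.owns_) {
        other.owns_ = false;  // the moved-from object must not release
    }
    int id() const { return id_; }

private:
    int id_;
    bool owns_;
};

// The handle is released automatically when the function returns.
int use_handle() {
    ScopedHandle handle(42);
    return handle.id();
}
```

A ScopedHandle created with new is released only when delete is called, which mirrors the dynamically allocated case mentioned above.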
41. 41
Working with Kernel
• The kernels are where all the action is in an OpenCL program.
• Steps to using kernels:
1. Load kernel source code into a program object from a file
2. Make a kernel functor from a function within the program
3. Initialize device memory
4. Call the kernel functor, specifying memory objects and global/local sizes
5. Read results back from the device
• Note that the argument list passed from the host must match the kernel function's definition.
42. 42
OpenCL C for Compute Kernels
• Derived from ISO C99
- A few restrictions: no recursion, function pointers, functions in C99 standard headers ...
- Preprocessing directives defined by C99 are supported (#include etc.)
• Built-in data types
- Scalar and vector data types, pointers
- Data-type conversion functions:
• convert_type<_sat><_roundingmode>
- Image types:
• image2d_t, image3d_t and sampler_t
43. 43
OpenCL C for Compute Kernels
• Built-in functions — mandatory
- Work-Item functions, math.h, read and write image
- Relational, geometric functions, synchronization functions
- printf
• Built-in functions — optional (called “extensions”)
- Double precision, atomics to global and local memory
- Selection of rounding mode, writes to image3d_t surface
44. 44
OpenCL C Language Highlights
• Function qualifiers
- __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued
- Kernels can call other kernel-side functions
• Address space qualifiers
- __global, __local, __constant, __private
- Pointer kernel arguments must be declared with an address space qualifier
• Work-item functions
- get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
- Barriers - all work-items within a work-group must execute the barrier function before
any work-item can continue
- Memory fences - provides ordering between memory operations
45. 45
OpenCL C Language Restrictions
• Pointers to functions are not allowed
• Pointers to pointers allowed within a kernel, but not as an argument to a kernel
invocation
• Bit-fields are not supported
• Variable length arrays and structures are not supported
• Recursion is not supported
• The return type for a kernel function must be void.
• Arguments to kernel functions that are declared to be a struct cannot pass OpenCL
objects (such as buffers, images) as elements of the struct.
• The extern, static, auto, and register storage class specifiers are not supported.
48. 48
Kernel Vector Data Types 2/2
• The forms of the function that are available are the set of possible argument lists for
which all arguments have the same element type as the result vector, and the total
number of elements is equal to the number of elements in the result vector.
• For vector float4 following combinations are allowed:
(float4)( float, float, float, float )
(float4)( float2, float, float )
(float4)( float, float2, float )
(float4)( float, float, float2 )
(float4)( float2, float2 )
(float4)( float3, float )
(float4)( float, float3 )
(float4)( float )
49. 49
Endianness And Memory Access
• The OpenCL standard does not specify how bytes are ordered in memory, because different devices and operating systems order bytes differently.
• There are two ways to determine whether a device is little-endian or big-endian:
- From the host, call clGetDeviceInfo with CL_DEVICE_ENDIAN_LITTLE as the parameter. If it returns CL_TRUE, the device is little-endian; if it returns CL_FALSE, the device is big-endian.
- Within the kernel, use #ifdef to check whether the __ENDIAN_LITTLE__ macro is defined. If it is defined, the device is little-endian; if not, the device is big-endian.
uint4 vec = (uint4)(0x00010203, 0x04050607,
                    0x08090A0B, 0x0C0D0E0F);
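What byte order means for the first component of the vector above can be checked on the host in plain C++. This sketch inspects the host CPU only, not an OpenCL device; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the byte stored at the lowest address of a 32-bit value.
unsigned char first_byte(std::uint32_t value) {
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);  // well-defined way to view the bytes
    return bytes[0];
}

// On a little-endian host the least significant byte (0x03) comes first;
// on a big-endian host the most significant byte (0x00) comes first.
bool host_is_little_endian() {
    return first_byte(0x00010203u) == 0x03;
}
```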
50. 50
Alignment Of Data Types
• Every built-in and vector data type in OpenCL is aligned to the size of the data type
itself.
• A built-in data type that is not a power of two bytes in size must be aligned to the
next larger power of two. This rule applies to built-in types only, not structs or unions.
For example, a float3 contains 12 bytes. This vector will be stored on a 16-byte
boundary because 16 is the smallest power of 2 greater than or equal to 12.
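The alignment rule can be written as a small helper. This is a sketch of the rule as stated on the slide, for built-in types only; opencl_alignment is a hypothetical name, not an OpenCL API function.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is greater than or equal to the type size.
std::size_t opencl_alignment(std::size_t size_in_bytes) {
    assert(size_in_bytes > 0);
    std::size_t alignment = 1;
    while (alignment < size_in_bytes) {
        alignment *= 2;
    }
    return alignment;
}
```

For a 12-byte float3 the helper returns 16, and for a 16-byte float4 it returns 16.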
52. 52
Kernel Synchronization Functions
void barrier ( cl_mem_fence_flags flags ) - All work-items in a work-group executing the kernel on
a processor must execute this function before any are allowed to continue execution beyond the
barrier.
void mem_fence ( cl_mem_fence_flags flags ) - Orders loads and stores of a work-item executing
a kernel.
void read_mem_fence ( cl_mem_fence_flags flags ) - Read memory barrier that orders only loads.
void write_mem_fence ( cl_mem_fence_flags flags ) - Write memory barrier that orders only stores.
Available options for flags are:
• CLK_GLOBAL_MEM_FENCE
• CLK_LOCAL_MEM_FENCE
• CLK_IMAGE_MEM_FENCE
55. 55
1. The host distinguishes between memory located on the host and memory located in the context
2. The host allocates memory in both regions
3. The context's memory can be located in host memory or in device memory
4. The host program is responsible for memory transfers
5. The host program cannot access local memory
6. An OpenCL program operates with four memory pools: constant, global, local, and private. Data can be explicitly copied between these regions
56. 56
Relaxed Memory Consistency
• The different memory pools provide different memory consistency models.
• Local and global memory pools provide a relaxed consistency model.
• Relaxed Consistency - A memory consistency model in which the contents of
memory visible to different work-items or commands may be different except at a
barrier or other explicit synchronization point.
58. 58
• It is important to be aware that synchronization is possible only per workgroup.
• It is impossible to synchronize different work-groups.
• In the case of global memory usage, the work-groups should write to different
regions of it.
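A common way to follow this advice is to let each work-group write one partial result into its own slot and combine the slots afterwards. The host-only C++ sketch below emulates that pattern with ordinary loops; it is not OpenCL code, and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates work-groups: group g reads its own slice of the input
// and writes only to partial[g], so no two groups touch the same element.
std::vector<long> group_partial_sums(const std::vector<int>& data,
                                     std::size_t group_size) {
    assert(group_size > 0);
    const std::size_t groups = (data.size() + group_size - 1) / group_size;
    std::vector<long> partial(groups, 0);
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t begin = g * group_size;
        const std::size_t end = std::min(begin + group_size, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[g] += data[i];
        }
    }
    return partial;
}

// Final step, done after all groups have finished (for example, on the host).
long combine(const std::vector<long>& partial) {
    long total = 0;
    for (long value : partial) total += value;
    return total;
}
```

Because the groups never write to the same location, no synchronization between work-groups is needed until the final combine step.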
60. 60
Barnes Hut Algorithm
• Set bodies’ initial position and velocity
• Iterate over time steps
1. Subdivide space until at most one body per cell
• Record this spatial hierarchy in an octree
2. Compute mass and center of mass of each cell
3. Compute force on bodies by traversing octree
• Stop traversal path when encountering a leaf (body) or an internal node (cell) that is far enough
away
4. Update each body’s position and velocity
61. 61
Build Tree (Level 1)
[Figure: bodies (*) scattered in space, and the tree after the first level: a single root cell (o)]
Subdivide space until at most one body per cell
62. 62
Build Tree (Level 2)
[Figure: bodies (*) in space, and the tree (o) after the second level of subdivision]
Subdivide space until at most one body per cell
63. 63
Build Tree (Level 3)
[Figure: bodies (*) in space, and the tree (o) after the third level of subdivision]
Subdivide space until at most one body per cell
64. 64
Build Tree (Level 4)
[Figure: bodies (*) in space, and the tree (o) after the fourth level of subdivision]
Subdivide space until at most one body per cell
65. 65
Build Tree (Level 5)
[Figure: bodies (*) in space, and the complete tree (o) after the fifth level of subdivision]
Subdivide space until at most one body per cell
66. 66
Compute Cells’ Center of Mass
[Figure: bodies (*) in space with two cell centers of mass (o) marked, and the complete tree]
For each internal cell, compute sum of mass and weighted average
of position of all bodies in subtree; example shows two cells only
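The summary of a cell can be sketched in C++ as follows. This is an illustrative version that works on a flat list of bodies rather than on a real octree; the Body struct and the summarize function are assumptions made for the example.

```cpp
#include <cassert>
#include <vector>

struct Body {
    double x, y, z;  // position
    double mass;
};

// Returns a pseudo-body holding the total mass and the
// mass-weighted average position (center of mass) of the given bodies.
Body summarize(const std::vector<Body>& bodies) {
    Body cell{0.0, 0.0, 0.0, 0.0};
    for (const Body& b : bodies) {
        cell.x += b.mass * b.x;
        cell.y += b.mass * b.y;
        cell.z += b.mass * b.z;
        cell.mass += b.mass;
    }
    if (cell.mass > 0.0) {  // avoid dividing by zero for an empty cell
        cell.x /= cell.mass;
        cell.y /= cell.mass;
        cell.z /= cell.mass;
    }
    return cell;
}
```

In a tree, the same function can be applied bottom-up, using the summaries of the child cells as the input bodies.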
67. 67
Compute Forces
[Figure: bodies (*) in space with cell centers of mass (o), and the complete tree; one body is highlighted]
Compute force, for example, acting upon green body
68. 68
Compute Force (Short Distance)
[Figure: the bodies and the tree during force computation, with the already traversed part of the tree highlighted]
Scan tree depth first from left to right; green portion already completed
69. 69
Compute Force (Down One Level)
[Figure: the bodies and the tree during force computation, with a nearby cell's center of mass highlighted]
Red center of mass is too close, need to go down one level
70. 70
Compute Force (Long Distance)
[Figure: the bodies and the tree during force computation, with a distant cell's center of mass highlighted]
Yellow center of mass is far enough away
71. 71
Compute Force (Skip Subtree)
[Figure: the bodies and the tree during force computation, with the skipped subtree highlighted]
Therefore, entire subtree rooted in the yellow cell can be skipped
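The traversal rule from the last few slides (descend when a cell is too close, use its center of mass and skip the subtree when it is far enough away) can be sketched in C++. The slides do not say how "far enough" is decided; the sketch assumes the commonly used test cell size / distance < theta, with theta, the Cell layout, and G = 1 all being assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

struct Cell {
    double size;                 // edge length of the cell (0 for a single body)
    double mass;                 // total mass in the subtree
    double x, y, z;              // center of mass
    std::vector<Cell> children;  // empty for a leaf (a body)
};

// Adds to acc the acceleration at point p caused by cell (G = 1).
void accumulate(const Cell& cell, const Vec3& p, double theta, Vec3& acc) {
    const double dx = cell.x - p.x;
    const double dy = cell.y - p.y;
    const double dz = cell.z - p.z;
    const double d2 = dx * dx + dy * dy + dz * dz;
    const double d = std::sqrt(d2);
    const bool leaf = cell.children.empty();
    if (leaf || (d > 0.0 && cell.size / d < theta)) {
        if (d == 0.0 || cell.mass == 0.0) return;  // the body itself or an empty cell
        const double k = cell.mass / (d2 * d);
        acc.x += k * dx;
        acc.y += k * dy;
        acc.z += k * dz;
        return;  // far enough (or a leaf): the subtree is not visited
    }
    for (const Cell& child : cell.children) {
        accumulate(child, p, theta, acc);  // too close: go down one level
    }
}
```

A larger theta skips more subtrees and is faster but less accurate; a theta close to zero visits every body and approaches the exact pairwise sum.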
72. 72
Pseudocode
Set bodySet = ...
foreach timestep do {
    Octree octree = new Octree();
    foreach Body b in bodySet {
        octree.Insert(b);
    }
    OrderedList cellList = octree.CellsByLevel();
    foreach Cell c in cellList {
        c.Summarize();
    }
    foreach Body b in bodySet {
        b.ComputeForce(octree);
    }
    foreach Body b in bodySet {
        b.Advance();
    }
}
73. 73
Complexity
Set bodySet = ...
foreach timestep do {                              // O(n log n)
    Octree octree = new Octree();
    foreach Body b in bodySet {                    // O(n log n)
        octree.Insert(b);
    }
    OrderedList cellList = octree.CellsByLevel();
    foreach Cell c in cellList {                   // O(n)
        c.Summarize();
    }
    foreach Body b in bodySet {                    // O(n log n)
        b.ComputeForce(octree);
    }
    foreach Body b in bodySet {                    // O(n)
        b.Advance();
    }
}
74. 74
Parallelism
Set bodySet = ...
foreach timestep do {                              // sequential
    Octree octree = new Octree();
    foreach Body b in bodySet {                    // tree building
        octree.Insert(b);
    }
    OrderedList cellList = octree.CellsByLevel();
    foreach Cell c in cellList {                   // tree traversal
        c.Summarize();
    }
    foreach Body b in bodySet {                    // fully parallel
        b.ComputeForce(octree);
    }
    foreach Body b in bodySet {                    // fully parallel
        b.Advance();
    }
}
75. 75
Resources: https://www.khronos.org/opencl/
The OpenCL specification
Surprisingly approachable for a spec!
https://www.khronos.org/registry/cl/
OpenCL reference card
Useful to have on your desk(top)
Available on the same page as the spec.