2
GPGPU computing.
Stanislav Donets
September 2019
3
Agenda
1. Graphics Processing Units (GPUs): Architecture and
Programming (theory):
• An Introduction to OpenCL;
• Host programs;
• Kernel programs;
• Writing Kernel Programs;
• Working with the OpenCL memory model;
4
Agenda
2. Example: Barnes Hut n-Body Algorithm (practice):
• Introduction, Problem Statement, and Context;
• Core Methods;
• Algorithms and Implementations;
5
Graphics Processing Units (GPUs):
Architecture and Programming (theory)
6
An Introduction to OpenCL
7
Amdahl's Law vs Gustafson – Barsis's Law
• Amdahl's law (workload remains constant):
$$S(p) = \frac{1}{\alpha + \frac{1 - \alpha}{p}}$$
• Gustafson – Barsis's law (when workload increases with the number of processors, more speedup is obtained):
$$S(p) = p + \alpha (1 - p)$$
Where:
$\alpha$ is the fraction of strictly serial (non-parallelizable) code;
$p$ is the number of threads.
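To make the difference between the two laws concrete, here is a minimal sketch that evaluates both formulas. The function names are illustrative and not part of the slides; `alpha` is the serial fraction and `p` the number of threads, as defined above.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: fixed workload, serial fraction alpha, p threads.
double amdahl_speedup(double alpha, double p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

// Gustafson-Barsis's law: the workload grows with the number of threads.
double gustafson_speedup(double alpha, double p) {
    return p + alpha * (1.0 - p);
}
```

With alpha = 0.1 and p = 10, Amdahl's law gives 1 / 0.19 (about 5.26), while Gustafson – Barsis's law gives 9.1.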
8
Parallel Programming Techniques
• OpenMP: an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. It applies only to multi-core computer platforms with a shared memory subsystem.
• MPI: the Message Passing Interface (MPI) has an advantage over OpenMP in that it can run on either shared or distributed memory architectures. Distributed memory computers are less expensive than large shared memory computers. One major disadvantage of the MPI parallel framework is that performance is limited by the communication network between the nodes.
• OpenACC: the OpenACC Application Program Interface (API) describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators.
• CUDA: Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA for graphics processing and GPGPU (General-Purpose GPU) programming. The CUDA software framework has a fairly strong developer community following.
9
OpenCL Ecosystem
10
Typical System
[Figure: a host connected to GPUs over PCI-Express]
• Host-initiated memory transfers
• Host-initiated computations on the GPU (kernels)
11
Conventional CPU Architecture
• Space devoted to control logic instead of ALUs
• CPUs are optimized to minimize the latency of a single thread
- Can efficiently handle control-flow-intensive workloads
• Multi-level caches used to hide latency
• Limited number of registers due to the smaller number of active threads
• Control logic to reorder execution, provide ILP and minimize pipeline stalls
[Figure: Conventional CPU Block Diagram: control logic, ALU, L1/L2/L3 caches, ~25 GB/s bus to system memory]
A present-day multicore CPU could have more than one ALU (typically < 32), and some of the cache hierarchy is usually shared across cores.
12
Conventional GPU Architecture
• Generic many-core GPU
- Less space devoted to control logic and caches
- Large register files to support multiple thread contexts
• Low-latency, hardware-managed thread switching
• Large number of ALUs per “core”, with a small user-managed cache per core
• Memory bus optimized for bandwidth
- ~150 GB/s bandwidth allows us to service a large number of ALUs simultaneously
[Figure: GPU block diagram: simple ALUs with caches, high-bandwidth bus to ALUs, on-board system memory]
13
The Heterogeneous System
• TI DSPs, FPGAs, hardware accelerators: programming using proprietary tools only.
• CPUs (x86, x86_64, ARM): multicore architecture; programming using OpenMP, POSIX Threads, etc.
• GPUs (AMD, NVIDIA, Mali): programming using proprietary tools.
• OpenCL: a single programming framework across all of these device classes.
14
Components in OpenCL
15
Platform Model
• One Host and one or more OpenCL Devices
• Each OpenCL Device is composed of one or more Compute Units
• Each Compute Unit is divided into one or more Processing Elements
• Memory is divided into host memory and device memory
16
Execution Model
• Host defines a command queue and associates it with a context (devices, kernels, memory, etc.).
• Host enqueues commands to the command queue.
• Kernel execution commands launch work-items, i.e. an instance of the kernel for each point in an abstract index space.
- Work-item: a single copy of the compute kernel, running on one data element
- In Data Parallel mode, kernel execution contains multiple work-items
- In Task Parallel mode, kernel execution contains a single work-item
• Work-items execute together as a work-group.
17
Execution Model (NDRange)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize between work-groups within a kernel
• Choose the dimensions (1, 2, or 3) that are “best” for your algorithm
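The relationship between work-groups and work-items in one NDRange dimension can be sketched with plain index arithmetic. This is a host-side illustration only: the helper names are hypothetical, a zero global offset is assumed, and the global size is assumed to be a multiple of the work-group size.

```cpp
#include <cassert>
#include <cstddef>

// Global ID of a work-item in one dimension, assuming a zero global offset.
// Mirrors how get_global_id(0) relates to get_group_id(0), the work-group
// size and get_local_id(0).
std::size_t global_id(std::size_t group_id, std::size_t local_size, std::size_t local_id) {
    return group_id * local_size + local_id;
}

// Number of work-groups in one dimension; assumes global_size is a
// multiple of local_size.
std::size_t num_groups(std::size_t global_size, std::size_t local_size) {
    return global_size / local_size;
}
```

For example, with 1024 work-items in groups of 64 there are 16 work-groups, and local item 5 of group 2 has global ID 133.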
18
OpenCL Memory Model
• Four memory types:
- Global: default for images/buffers
- Constant: global const variables
- Local: shared between work-items
- Private: kernel internal variables
• Global and Constant memory are visible to all work-items. They are the largest and slowest memory types used by devices. Constant memory is read-only for the device but readable and writable by the host.
• Local memory is shared within a work-group. It is smaller but faster than global memory.
• Private memory is available to individual work-items.
19
OpenCL Memory Consistency
• OpenCL uses a “relaxed consistency memory model”
- The state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times
• Memory has load/store consistency within a work-item
• Local memory has consistency across work-items within a work-group at a barrier
• Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups
• Memory consistency for objects shared between commands is enforced at synchronization points
20
The Memory Hierarchy
Speeds and feeds (approximate) for a high-end discrete GPU:
Memory    Sizes                  Bandwidths
Private   O(10) words/WI         O(2-3) words/cycle/WI
Local     O(1-10) KBytes/WG      O(10) words/cycle/WG
Global    O(1-16) GBytes         O(100-300) GBytes/s
Host      Above O(100) GBytes    Above O(100) GBytes/s
21
Why Using Too Much Private Memory Can Be A Good Thing
• In reality private memory is just hardware registers, so only dozens of these are available per work-item
• Many kernels will allocate too many variables to private memory
• So the compiler already has to be able to deal with this
• It does so by spilling excess private variables to (global) memory
• You still told the compiler something useful: that the data will only be accessed by a single work-item
• This lets the compiler allocate the data in such a way as to enable more efficient memory access
22
OpenCL Programming Model
• Data Parallel, SPMD
Traditionally, this term refers to a programming model where concurrency is expressed as instructions from a single program applied to multiple elements within a set of data structures. The term has been generalized in OpenCL to refer to a model wherein a set of instructions from a single program is applied concurrently to each point within an abstract domain of indices.
• Task Parallel
A programming model in which computations are expressed in terms of multiple concurrent tasks, where a task is a kernel executing in a single work-group of size one. The concurrent tasks can be running different kernels.
23
OpenCL Compilation Model
OpenCL uses a dynamic (runtime) compilation model (like DirectX and OpenGL)
• Static compilation:
– The code is compiled from source to machine code at a specific point in the past (when the developer compiled it using the IDE)
• Dynamic compilation:
– Also known as runtime compilation
– Step 1: The code is compiled to an Intermediate Representation (IR), which is usually an assembler of a virtual machine. This step is known as offline compilation, and it's done by the front-end compiler
– Step 2: The IR is compiled to machine code for execution. This step is much shorter. It is known as online compilation, and it's done by the back-end compiler
• In dynamic compilation, step 1 is usually done only once, and the IR is stored. The app loads the IR and performs step 2 during the app's runtime (hence the term)
24
OpenCL Terms (1/4)
• Application - The combination of the program running on the host and the
OpenCL devices.
• Buffer Object - A memory object that stores a linear collection of bytes. Buffer
objects are accessible using a pointer in a kernel executing on a device. Buffer
objects can be manipulated by the host, using OpenCL API calls.
• Platform - The host plus a collection of devices managed by the OpenCL
framework that allow an application to share resources and execute kernels on
devices in the platform.
25
OpenCL Terms (2/4)
• Context - The environment within which the kernels execute and the domain in
which synchronization and memory management are defined. The context
includes a set of devices, the memory accessible to those devices, the
corresponding memory properties and one or more command-queues used to
schedule execution of a kernel(s) or operations on memory objects.
• Command - OpenCL operation that is submitted to a command-queue for
execution. For example, OpenCL commands issue kernels for execution on a
compute device, manipulate memory objects, etc.
26
OpenCL Terms (3/4)
• Command-Queue - An object that holds commands that will be executed on a
specific device. The command-queue is created on a specific device in a
context. Commands to a command queue are queued in order but may be
executed in order or out of order.
• Kernel - A kernel is a function declared in a program and executed on an
OpenCL device. A kernel is identified by the __kernel or kernel qualifier applied
to any function defined in a program.
27
OpenCL Terms (4/4)
• Kernel Object - A kernel object encapsulates a specific kernel function
declared in a program and the argument values to be used when executing
this kernel function.
• Program - An OpenCL program consists of a set of kernels. Programs may
also contain auxiliary functions called by the kernel functions and constant
data.
• Program Object - A program object encapsulates the following information:
- A reference to an associated context.
- A program source or binary.
- The latest successfully built program executable, the list of devices for which the program executable is built, the build options used and a build log.
- The number of kernel objects currently attached.
28
OpenCL Runtime
[Figure: OpenCL runtime objects inside a context that spans a CPU and a GPU compute device: programs, kernels, memory objects and command queues]
Programs: the dp_mul kernel source is built into a CPU program binary and a GPU program binary:
__kernel void
dp_mul(global const float *a,
       global const float *b,
       global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
Kernels: dp_mul, with arg[0], arg[1] and arg[2] values
Memory Objects: buffers, images
Command Queues: in-order queue, out-of-order queue
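To show what launching the dp_mul kernel over a one-dimensional index space amounts to, here is a CPU-only emulation. It is a sketch, not OpenCL code: the loop index stands in for get_global_id(0), the function names are illustrative, and both inputs are assumed to have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of the dp_mul kernel for a single work-item; `id` plays the role
// of get_global_id(0).
void dp_mul_work_item(std::size_t id, const float* a, const float* b, float* c) {
    c[id] = a[id] * b[id];
}

// Emulates a 1-D NDRange launch with one work-item per element.
// On a device the iterations may run concurrently and in any order.
std::vector<float> dp_mul_emulated(const std::vector<float>& a, const std::vector<float>& b) {
    std::vector<float> c(a.size());
    for (std::size_t id = 0; id < a.size(); ++id) {
        dp_mul_work_item(id, a.data(), b.data(), c.data());
    }
    return c;
}
```

Because each work-item writes only its own element of c, no synchronization between work-items is needed for this kernel.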
29
Host programs
30
Structure Of OpenCL Main Program
The structure of an OpenCL host program is as follows:
1. Get information about platform and devices available on system
2. Select devices to use
3. Create an OpenCL command queue
4. Create memory buffers on device
5. Create kernel program object
6. Build (compile) kernel in-line (or load precompiled binary)
7. Create OpenCL kernel object
8. Set kernel arguments
9. Execute kernel
10. Read kernel memory and copy to host memory.
31
Setting Up The Host Program
• Khronos has defined a common C++ header file, cl.hpp, containing a high-level interface to OpenCL
• This interface is dramatically easier to work with
• Include key header files, both standard and custom
#include <CL/cl.hpp>  // Khronos C++ Wrapper API
#include <cstdio>     // For C-style I/O
#include <iostream>   // For C++-style I/O
#include <vector>     // For C++ vector types
32
Host Program Initialization
1. Get the list of available platforms. The list can be obtained using the cl::Platform::get method.
2. Set the properties that will be used to create a new context.
3. Create the context using the cl::Context constructor. The constructor takes the type of device to include in the context and a list of properties.
4. Create a command queue. The host code cannot call kernels directly; it has to put them in a queue.
33
Preparation of OpenCL Programs
The compilation process consists of four steps:
1. Load the sources into a list of sources (cl::Program::Sources).
2. Create a program using the cl::Program constructor.
3. Build the program using cl::Program::build.
4. Create a kernel using cl::Kernel.
Alternatively, the program can be cached and loaded from a binary.
34
Context and Command-Queues
• Context:
- The environment within which kernels execute and in which synchronization and memory management are defined.
• The context includes:
- One or more devices
- Device memory
- One or more command-queues
• All commands for a device (kernel execution, synchronization, and memory transfer operations) are submitted through a command-queue.
• Each command-queue points to a single device within a context.
[Figure: a context containing a queue, a device and device memory]
35
Command-Queues
• Commands include:
- Kernel executions
- Memory object management
- Synchronization
• The only way to submit commands to a device is through a command-queue.
• Each command-queue points to a single device within a context.
• Multiple command-queues can feed a single device.
- Used to define independent streams of commands that don't require synchronization
[Figure: a context with two queues feeding a GPU and a CPU]
36
Command-queue Execution Details
• Command queues can be configured in different ways to control how commands execute
• In-order queues:
- Commands are enqueued and complete in the order they appear in the program (program-order)
• Out-of-order queues:
- Commands are enqueued in program-order but can execute (and hence complete) in any order.
• Execution of commands in the command-queue is guaranteed to be completed at synchronization points
- Discussed later
[Figure: a context with two queues feeding a GPU and a CPU]
37
OpenCL Synchronization: Queues & Events
• Events connect command invocations. They can be used to synchronize executions inside out-of-order queues or between queues
• Example: 2 queues with 2 devices
[Figure: two timelines, each showing a GPU queue and a CPU queue with Kernel 1 and Kernel 2 enqueued]
- Left timeline: Kernel 2 starts before the results from Kernel 1 are ready
- Right timeline: Kernel 2 waits for an event from Kernel 1 and does not start until the results are ready
38
Why Events? Won’t A Barrier Do?
• A barrier defines a synchronization point: commands following a barrier wait to execute until all prior enqueued commands complete
- cl_int clEnqueueBarrier(cl_command_queue queue)
• Events provide fine-grained control, which can really matter with an out-of-order queue.
• Events work between commands in different queues, as long as they share a context
• Events convey more information than a barrier: they provide info on the state of a command, not just whether it's complete or not.
[Figure: a context with two queues feeding a GPU and a CPU, connected by an event]
39
Release of Resources
With the C++ wrapper, resources are released automatically when the wrapper objects go out of scope, for example on function exit. This is because the destructors of the OpenCL objects call the API functions that release resources. Explicit release is necessary only when using dynamically allocated memory and pointers to OpenCL objects.
40
Kernel programs
41
Working with Kernel
• The kernels are where all the action is in an OpenCL program.
• Steps to using kernels:
1. Load kernel source code into a program object from a file
2. Make a kernel functor from a function within the program
3. Initialize device memory
4. Call the kernel functor, specifying memory objects and global/local sizes
5. Read results back from the device
• Note that the argument list passed from the host must match the kernel function's definition.
42
OpenCL C for Compute Kernels
• Derived from ISO C99
- A few restrictions: no recursion, function pointers, functions in C99 standard headers ...
- Preprocessing directives defined by C99 are supported (#include etc.)
• Built-in data types
- Scalar and vector data types, pointers
- Data-type conversion functions:
• convert_type<_sat><_roundingmode>
- Image types:
• image2d_t, image3d_t and sampler_t
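As an illustration of what the `_sat` suffix in convert_type<_sat><_roundingmode> does, the following host-side sketch emulates convert_uchar_sat applied to a float. It is an emulation under stated assumptions, not the OpenCL built-in: the function name is hypothetical, the default round-toward-zero mode for float-to-integer conversion is assumed, and NaN is mapped to 0 as in saturated mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical host-side emulation of convert_uchar_sat(float):
// out-of-range inputs are clamped to [0, 255] instead of wrapping,
// and in-range values are truncated toward zero.
std::uint8_t convert_uchar_sat_emulated(float x) {
    if (std::isnan(x)) return 0;     // saturated mode maps NaN to 0
    if (x <= 0.0f) return 0;         // clamp below
    if (x >= 255.0f) return 255;     // clamp above
    return static_cast<std::uint8_t>(x);  // truncation = round toward zero
}
```

Without saturation, the result for out-of-range inputs would be implementation-defined, which is why `_sat` is commonly used when packing floating-point colors into 8-bit channels.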
43
OpenCL C for Compute Kernels
• Built-in functions — mandatory
- Work-Item functions, math.h, read and write image
- Relational, geometric functions, synchronization functions
- printf
• Built-in functions — optional (called “extensions”)
- Double precision, atomics to global and local memory
- Selection of rounding mode, writes to image3d_t surface
44
OpenCL C Language Highlights
• Function qualifiers
- __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued
- Kernels can call other kernel-side functions
• Address space qualifiers
- __global, __local, __constant, __private
- Pointer kernel arguments must be declared with an address space qualifier
• Work-item functions
- get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
- Barriers - all work-items within a work-group must execute the barrier function before
any work-item can continue
- Memory fences - provides ordering between memory operations
45
OpenCL C Language Restrictions
• Pointers to functions are not allowed
• Pointers to pointers allowed within a kernel, but not as an argument to a kernel
invocation
• Bit-fields are not supported
• Variable length arrays and structures are not supported
• Recursion is not supported
• The return type for a kernel function must be void.
• Arguments to kernel functions that are declared to be a struct cannot pass OpenCL
objects (such as buffers, images) as elements of the struct.
• The extern, static, auto, and register storage class specifiers are not supported.
46
Kernel Scalar Types
OpenCL C                API type    Size (bits)  Notes
bool                    -           -
char                    cl_char     8
unsigned char, uchar    cl_uchar    8
short                   cl_short    16
unsigned short, ushort  cl_ushort   16
int                     cl_int      32
unsigned int, uint      cl_uint     32
long                    cl_long     64
unsigned long, ulong    cl_ulong    64
float                   cl_float    32           IEEE 754
half                    cl_half     16           IEEE 754-2008
double                  cl_double   64           optional, IEEE 754
size_t                  -           32/64
ptrdiff_t               -           32/64
intptr_t                -           32/64
uintptr_t               -           32/64
void                    void        0
47
Kernel Vector Data Types 1/2
OpenCL C   API type
charn      cl_charn
ucharn     cl_ucharn
shortn     cl_shortn
ushortn    cl_ushortn
intn       cl_intn
uintn      cl_uintn
longn      cl_longn
ulongn     cl_ulongn
floatn     cl_floatn
float4 pos = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 swiz = pos.wzyx; // swiz = (4.0f, 3.0f, 2.0f, 1.0f)
float4 dup = pos.xxyy; // dup = (1.0f, 1.0f, 2.0f, 2.0f)
float4 f, a, b;
f.xyzw = a.s0123 + b.s0123;
Vector size    Numeric component indices
2-component    0, 1
3-component    0, 1, 2
4-component    0, 1, 2, 3
8-component    0, 1, 2, 3, 4, 5, 6, 7
16-component   0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, A, b, B, c, C, d, D, e, E, f, F
48
Kernel Vector Data Types 2/2
• The forms of the function that are available are the set of possible argument lists for
which all arguments have the same element type as the result vector, and the total
number of elements is equal to the number of elements in the result vector.
• For vector float4 following combinations are allowed:
(float4)( float, float, float, float )
(float4)( float2, float, float )
(float4)( float, float2, float )
(float4)( float, float, float2 )
(float4)( float2, float2 )
(float4)( float3, float )
(float4)( float, float3 )
(float4)( float )
49
Endianness And Memory Access
• The OpenCL standard does not specify how bytes are ordered in memory, because different devices and operating systems order bytes differently.
• There are two ways to determine whether a device is little-endian or big-endian:
- From the host, you can call clGetDeviceInfo with CL_DEVICE_ENDIAN_LITTLE as the parameter. If this returns CL_TRUE, the device is little-endian. If it returns CL_FALSE, the device is big-endian.
- Within the kernel, you can use #ifdef to determine whether the __ENDIAN_LITTLE__ macro is defined. If this macro is defined, the device is little-endian. If not, the device is big-endian.
uint4 vec = (uint4)(0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F);
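For comparison, the byte order of the host CPU can be checked with a few lines of standard C++. This sketch inspects the host only; the device's byte order still has to be queried as described above. The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the byte stored at the lowest address of a 32-bit value.
std::uint8_t first_byte(std::uint32_t value) {
    std::uint8_t bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[0];
}

// On a little-endian host the least significant byte (0x03) comes first;
// on a big-endian host the most significant byte (0x00) comes first.
bool host_is_little_endian() {
    return first_byte(0x00010203u) == 0x03;
}
```

If host and device disagree, multi-byte values such as the components of the uint4 above need byte swapping when buffers are shared.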
50
Alignment Of Data Types
• Every built-in and vector data type in OpenCL is aligned to the size of the data type
itself.
• A built-in data type that is not a power of two bytes in size must be aligned to the
next larger power of two. This rule applies to built-in types only, not structs or unions.
For example, a float3 contains 12 bytes. This vector will be stored on a 16-byte
boundary because 16 is the smallest power of 2 greater than or equal to 12.
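The alignment rule can be written as a small helper that rounds a type's size up to the next power of two. The function name is hypothetical and the helper only illustrates the arithmetic for built-in scalar and vector types.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two greater than or equal to size_in_bytes,
// i.e. the alignment the rule above assigns to a built-in type
// of that size.
std::size_t opencl_alignment(std::size_t size_in_bytes) {
    std::size_t alignment = 1;
    while (alignment < size_in_bytes) {
        alignment *= 2;
    }
    return alignment;
}
```

A float3 (12 bytes) therefore gets a 16-byte alignment, while a float8 (32 bytes) is aligned to its own size.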
51
Alignment Of Data Types (Example)
// Device-side (OpenCL C) definition
typedef struct
{
    float8 x; // Needs 32-byte alignment
    float3 y; // Needs 16-byte alignment
} OpenCLStruct;
// Matching host-side definition; member names and order must mirror the device struct
#if defined( _MSC_VER)
typedef struct
{
    cl_float8 __declspec(align(32)) x;
    cl_float3 __declspec(align(16)) y;
} OpenCLStruct;
#elif defined( __GNUC__ )
typedef struct
{
    cl_float8 __attribute__ ((aligned(32))) x;
    cl_float3 __attribute__ ((aligned(16))) y;
} OpenCLStruct;
#endif
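The layout that the host-side struct has to reproduce can be checked at compile time. The sketch below uses standard C++ `alignas` with plain float arrays standing in for cl_float8 and cl_float3, so it compiles without the OpenCL headers; the struct name and the use of arrays are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the host-side struct: float[8] models cl_float8 and
// float[4] models cl_float3, which occupies 16 bytes under the
// alignment rule for 3-component vectors.
struct HostStruct {
    alignas(32) float x[8];
    alignas(16) float y[4];
};

// Compile-time checks of the expected layout.
static_assert(offsetof(HostStruct, x) == 0, "x starts the struct");
static_assert(offsetof(HostStruct, y) == 32, "y follows the 32-byte x");
static_assert(alignof(HostStruct) == 32, "struct takes the strictest member alignment");
static_assert(sizeof(HostStruct) == 64, "size is padded to a multiple of 32");
```

Checks like these catch a host/device layout mismatch at build time rather than as corrupted data in a kernel.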
52
Kernel Synchronization Functions
void barrier ( cl_mem_fence_flags flags ) - All work-items in a work-group executing the kernel on
a processor must execute this function before any are allowed to continue execution beyond the
barrier.
void mem_fence ( cl_mem_fence_flags flags ) - Orders loads and stores of a work-item executing
a kernel.
void read_mem_fence ( cl_mem_fence_flags flags ) - Read memory barrier that orders only loads.
void write_mem_fence ( cl_mem_fence_flags flags ) - Write memory barrier that orders only stores.
Available options for flags are:
• CLK_GLOBAL_MEM_FENCE
• CLK_LOCAL_MEM_FENCE
• CLK_IMAGE_MEM_FENCE
53
Working with the OpenCL memory model
54
Memory organization viewed from the kernel perspective differs slightly from the view of the host program.
55
1. The host distinguishes between memory located on the host and memory located in the context
2. The host allocates memory in both regions
3. Context memory can reside in host memory or in device memory
4. The host program is responsible for memory transfers
5. The host program cannot access local memory
6. An OpenCL program operates with four memory pools: constant, global, local and private. Data can be explicitly copied between these regions
56
Relaxed Memory Consistency
• The different memory pools provide different memory consistency models.
• Local and global memory pools provide a relaxed consistency model.
• Relaxed Consistency - A memory consistency model in which the contents of memory visible to different work-items or commands may be different except at a barrier or other explicit synchronization point.
57
Explanation
Let's calculate the sum of the array elements and store it in every element.
58
• It is important to be aware that synchronization is possible only per workgroup.
• It is impossible to synchronize different work-groups.
• In the case of global memory usage, the work-groups should write to different
regions of it.
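The array-sum task and the synchronization limits above can be sketched as a CPU emulation of a two-phase reduction. This is not the kernel from the slides: the function names are hypothetical, the array length is assumed to be a multiple of the work-group size, and the work-group size is assumed to be a power of two. Comments mark where a real kernel would call barrier().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates one work-group reducing its own slice of `data`.
float reduce_work_group(const std::vector<float>& data, std::size_t group_id,
                        std::size_t local_size) {
    std::vector<float> local(local_size);  // stands in for __local memory
    for (std::size_t lid = 0; lid < local_size; ++lid) {
        local[lid] = data[group_id * local_size + lid];
    }
    // barrier(CLK_LOCAL_MEM_FENCE) would go here in a kernel
    for (std::size_t stride = local_size / 2; stride > 0; stride /= 2) {
        for (std::size_t lid = 0; lid < stride; ++lid) {
            local[lid] += local[lid + stride];
        }
        // barrier(CLK_LOCAL_MEM_FENCE) after every halving step
    }
    return local[0];
}

// Phase 1: every work-group writes its partial sum to its own slot.
// Phase 2 (a separate launch or the host): combine and broadcast.
void sum_into_every_element(std::vector<float>& data, std::size_t local_size) {
    const std::size_t groups = data.size() / local_size;
    std::vector<float> partial(groups);
    for (std::size_t g = 0; g < groups; ++g) {
        partial[g] = reduce_work_group(data, g, local_size);
    }
    float total = 0.0f;
    for (float p : partial) total += p;
    for (float& v : data) v = total;
}
```

Splitting the work into two phases is what replaces the missing synchronization between work-groups: the end of the first launch acts as the global synchronization point.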
59
Example: Barnes Hut n-Body Algorithm
(practice)
60
Barnes Hut Algorithm
• Set bodies’ initial position and velocity
• Iterate over time steps
1. Subdivide space until at most one body per cell
• Record this spatial hierarchy in an octree
2. Compute mass and center of mass of each cell
3. Compute force on bodies by traversing octree
• Stop traversal path when encountering a leaf (body) or an internal node (cell) that is far enough
away
4. Update each body’s position and velocity
61
Build Tree (Level 1)
[Figure: bodies in space and the tree after level 1 (root cell only)]
Subdivide space until at most one body per cell
62
Build Tree (Level 2)
[Figure: bodies in space and the tree after level 2]
Subdivide space until at most one body per cell
63
Build Tree (Level 3)
[Figure: bodies in space and the tree after level 3]
Subdivide space until at most one body per cell
64
Build Tree (Level 4)
[Figure: bodies in space and the tree after level 4]
Subdivide space until at most one body per cell
65
Build Tree (Level 5)
[Figure: bodies in space and the complete tree after level 5]
Subdivide space until at most one body per cell
66
Compute Cells’ Center of Mass
[Figure: bodies in space and the tree, with the centers of mass of two cells marked]
For each internal cell, compute sum of mass and weighted average
of position of all bodies in subtree; example shows two cells only
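The per-cell summary described here (total mass and mass-weighted average position) can be sketched as follows. The `Body` and `CellSummary` types and the function name are illustrative, and the bodies of a subtree are passed as a flat list rather than gathered by walking the tree.

```cpp
#include <cassert>
#include <vector>

struct Body { double mass; double x, y, z; };
struct CellSummary { double mass; double x, y, z; };

// Sum of masses and mass-weighted average position of a set of bodies.
CellSummary summarize(const std::vector<Body>& bodies) {
    CellSummary s{0.0, 0.0, 0.0, 0.0};
    for (const Body& b : bodies) {
        s.mass += b.mass;
        s.x += b.mass * b.x;
        s.y += b.mass * b.y;
        s.z += b.mass * b.z;
    }
    if (s.mass > 0.0) {  // avoid dividing by zero for an empty cell
        s.x /= s.mass;
        s.y /= s.mass;
        s.z /= s.mass;
    }
    return s;
}
```

In a tree-based implementation the same formula is usually applied bottom-up, combining the summaries of child cells instead of revisiting every body.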
67
Compute Forces
[Figure: bodies in space and the tree, with the body under consideration highlighted]
Compute force, for example, acting upon green body
68
Compute Force (Short Distance)
[Figure: bodies in space and the tree, with the already traversed part highlighted]
Scan tree depth first from left to right; green portion already completed
69
Compute Force (Down One Level)
[Figure: bodies in space and the tree, with a nearby cell's center of mass highlighted]
Red center of mass is too close, need to go down one level
70
Compute Force (Long Distance)
[Figure: bodies in space and the tree, with a distant cell's center of mass highlighted]
Yellow center of mass is far enough away
71
Compute Force (Skip Subtree)
[Figure: bodies in space and the tree, with the skipped subtree highlighted]
Therefore, entire subtree rooted in the yellow cell can be skipped
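The slides do not state how "far enough away" is decided. A commonly used Barnes-Hut test compares the width of a cell with its distance from the body against a threshold; the sketch below assumes that test, and the parameter name `theta` and the function name are not taken from the slides.

```cpp
#include <cassert>

// Assumed opening criterion: a cell of width cell_size whose center of
// mass lies at `distance` from the body is treated as a single point
// mass when cell_size / distance < theta. Otherwise the traversal
// descends one level into the cell's children.
bool far_enough(double cell_size, double distance, double theta) {
    return cell_size < theta * distance;
}
```

Smaller values of theta open more cells, which costs more work but gives forces closer to the direct all-pairs sum.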
72
Pseudocode
Set bodySet = ...
foreach timestep do {
    Octree octree = new Octree();
    foreach Body b in bodySet {
        octree.Insert(b);
    }
    OrderedList cellList = octree.CellsByLevel();
    foreach Cell c in cellList {
        c.Summarize();
    }
    foreach Body b in bodySet {
        b.ComputeForce(octree);
    }
    foreach Body b in bodySet {
        b.Advance();
    }
}
73
Complexity
Set bodySet = ...
foreach timestep do {                       // O(n log n)
    Octree octree = new Octree();
    foreach Body b in bodySet {             // O(n log n)
        octree.Insert(b);
    }
    OrderedList cellList = octree.CellsByLevel();
    foreach Cell c in cellList {            // O(n)
        c.Summarize();
    }
    foreach Body b in bodySet {             // O(n log n)
        b.ComputeForce(octree);
    }
    foreach Body b in bodySet {             // O(n)
        b.Advance();
    }
}
74
Parallelism
Set bodySet = ...
foreach timestep do {                       // sequential
    Octree octree = new Octree();
    foreach Body b in bodySet {             // tree building
        octree.Insert(b);
    }
    OrderedList cellList = octree.CellsByLevel();
    foreach Cell c in cellList {            // tree traversal
        c.Summarize();
    }
    foreach Body b in bodySet {             // fully parallel
        b.ComputeForce(octree);
    }
    foreach Body b in bodySet {             // fully parallel
        b.Advance();
    }
}
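The Advance loop is marked fully parallel because each iteration reads and writes only its own body. A minimal sketch of such a step is shown below. The integration scheme (velocity updated first, then position) and the one-dimensional `Particle` type are assumptions for illustration; the slides do not specify either.

```cpp
#include <cassert>
#include <vector>

// One-dimensional stand-in for a body: position, velocity, acceleration.
struct Particle { double x, v, a; };

// Assumed update rule: advance velocity by a*dt, then position by v*dt.
void advance(Particle& p, double dt) {
    p.v += p.a * dt;
    p.x += p.v * dt;
}

// Each iteration touches a single particle, so the loop body could run
// as one work-item per body with no synchronization.
void advance_all(std::vector<Particle>& particles, double dt) {
    for (Particle& p : particles) {
        advance(p, dt);
    }
}
```

The ComputeForce loop is parallel for the same reason, provided the octree is only read during that phase.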
75
Resources: https://www.khronos.org/opencl/
The OpenCL specification
Surprisingly approachable for a spec!
https://www.khronos.org/registry/cl/
OpenCL reference card
Useful to have on your desk(top)
Available on the same page as the spec.
76
Sep 2019
Thank you
Stanislav Donets
Lead Software Engineer
DRM Software Reviews

Mais conteúdo relacionado

Mais procurados

Tech Days 2015: ARM Programming with GNAT and Ada 2012
Tech Days 2015: ARM Programming with GNAT and Ada 2012Tech Days 2015: ARM Programming with GNAT and Ada 2012
Tech Days 2015: ARM Programming with GNAT and Ada 2012AdaCore
 
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce RichardsonThe 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardson
The 7 Deadly Sins of Packet Processing - Venky Venkatesan and Bruce Richardsonharryvanhaaren
 
Linux kernel modules
Linux kernel modulesLinux kernel modules
Linux kernel modulesHao-Ran Liu
 
File Systems: Why, How and Where
File Systems: Why, How and WhereFile Systems: Why, How and Where
File Systems: Why, How and WhereKernel TLV
 
Process scheduling
Process schedulingProcess scheduling
Process schedulingHao-Ran Liu
 
Linux Performance Tunning introduction
Linux Performance Tunning introductionLinux Performance Tunning introduction
Linux Performance Tunning introductionShay Cohen
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Brendan Gregg
 
Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt Ceph Community
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationLinaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
PFQ@ 9th Italian Networking Workshop (Courmayeur)
PFQ@ 9th Italian Networking Workshop (Courmayeur)PFQ@ 9th Italian Networking Workshop (Courmayeur)
PFQ@ 9th Italian Networking Workshop (Courmayeur)Nicola Bonelli
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
 
Functional approach to packet processing
Functional approach to packet processingFunctional approach to packet processing
Functional approach to packet processingNicola Bonelli
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Amazon Web Services
 

General Purpose GPU Computing

  • 3. Agenda: 1. Graphics Processing Units (GPUs): Architecture and Programming (theory):
      • An Introduction to OpenCL;
      • Host programs;
      • Kernel programs;
      • Writing Kernel Programs;
      • Working with the OpenCL memory model;
  • 4. Agenda: 2. Example: Barnes Hut n-Body Algorithm (practice):
      • Introduction, Problem Statement, and Context;
      • Core Methods;
      • Algorithms and Implementations;
  • 5. Graphics Processing Units (GPUs): Architecture and Programming (theory)
  • 7. Amdahl's Law vs Gustafson–Barsis's Law
      • Amdahl's law (workload remains constant): $S(p) = \dfrac{1}{\alpha + \dfrac{1 - \alpha}{p}}$
      • Gustafson–Barsis's law (when the workload increases with the number of processors, more speedup is obtained): $S(p) = p + \alpha (1 - p)$
      • Where: $\alpha$ is the strictly serial (non-parallelizable) fraction of the code; $p$ is the number of threads
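The two laws above can be compared numerically. The following is a minimal sketch, not part of the original slides; the function names and the 10% serial fraction are assumptions chosen only for illustration.

```python
def amdahl_speedup(alpha: float, p: int) -> float:
    """Fixed workload: S(p) = 1 / (alpha + (1 - alpha) / p)."""
    return 1.0 / (alpha + (1.0 - alpha) / p)


def gustafson_speedup(alpha: float, p: int) -> float:
    """Workload scaled with p: S(p) = p + alpha * (1 - p)."""
    return p + alpha * (1 - p)


if __name__ == "__main__":
    alpha = 0.1  # assumed: 10% of the work is strictly serial
    for p in (1, 8, 64, 1024):
        print(f"p={p:5d}  Amdahl={amdahl_speedup(alpha, p):7.2f}  "
              f"Gustafson={gustafson_speedup(alpha, p):8.2f}")
```

Under Amdahl's law the speedup can never exceed 1/α (10 for the assumed α), while the Gustafson–Barsis speedup keeps growing linearly in p because the parallel part of the workload grows with the machine.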
  • 8. Parallel Programming Techniques
      • OpenMP: an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran. It applies only to multi-core computer platforms with a shared memory subsystem.
      • MPI: the Message Passing Interface (MPI) has an advantage over OpenMP in that it can run on either shared or distributed memory architectures. Distributed memory computers are less expensive than large shared memory computers. One major disadvantage of the MPI parallel framework is that performance is limited by the communication network between the nodes.
      • OpenACC: the OpenACC Application Program Interface (API) describes a collection of compiler directives that mark loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators.
      • CUDA: Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA for graphics processing and GPGPU (General Purpose GPU) programming. The CUDA software framework has a fairly strong developer community.
  • 10. Typical System (diagram: a host connected to GPUs over PCI-Express)
      • Host-initiated memory transfers
      • Host-initiated computations on the GPU (kernels)
  • 11. Conventional CPU Architecture
      • Die space is devoted to control logic instead of ALUs
      • CPUs are optimized to minimize the latency of a single thread
          - Can efficiently handle control-flow-intensive workloads
      • Multi-level caches are used to hide latency
      • Limited number of registers, due to the smaller number of active threads
      • Control logic reorders execution, provides instruction-level parallelism (ILP), and minimizes pipeline stalls
      • Conventional CPU block diagram: Control Logic, ALU, L1 Cache, L2 Cache, L3 Cache, ~25 GB/s link to System Memory
      • A present-day multicore CPU could have more than one ALU (typically < 32), and some of the cache hierarchy is usually shared across cores
  • 12. Conventional GPU Architecture
      • Generic many-core GPU
          - Less space devoted to control logic and caches
          - Large register files to support multiple thread contexts
      • Low-latency, hardware-managed thread switching
      • Large number of ALUs per “core”, with a small user-managed cache per core
      • Memory bus optimized for bandwidth
          - ~150 GB/s of bandwidth allows a large number of ALUs to be serviced simultaneously
      • Block diagram: On-Board System Memory, High-Bandwidth Bus to ALUs, Simple ALUs, Cache
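To get a feel for the two bandwidth figures quoted on the CPU and GPU architecture slides, here is a rough back-of-the-envelope sketch. It assumes that "GBPS" means gigabytes per second and that 1 GB is 10**9 bytes; the function name and the 3 GB buffer size are hypothetical, and real transfers rarely reach peak bandwidth.

```python
def stream_time_ms(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to move num_bytes once over a memory bus."""
    bytes_per_second = bandwidth_gb_per_s * 1e9  # assumes 1 GB = 10**9 bytes
    return num_bytes / bytes_per_second * 1e3


buffer_bytes = 3 * 10**9                      # hypothetical 3 GB working set
cpu_ms = stream_time_ms(buffer_bytes, 25.0)   # ~25 figure from the CPU slide
gpu_ms = stream_time_ms(buffer_bytes, 150.0)  # ~150 figure from the GPU slide
print(f"CPU bus: {cpu_ms:.0f} ms, GPU bus: {gpu_ms:.0f} ms")
```

With these assumed numbers a single pass over the buffer takes about six times longer on the CPU memory bus, which is why bandwidth-bound kernels tend to benefit most from the GPU.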
  • 13. The Heterogeneous System
      - CPUs: x86, x86_64, ARM. Multicore architecture. Programming using OpenMP, POSIX Threads, etc.
      - GPUs: AMD, NVIDIA, MALI. Programming using proprietary tools.
      - TI DSPs, FPGAs, hardware accelerators. Programming using proprietary tools only.
      - OpenCL
  • 15. Platform Model
      - One host and one or more OpenCL devices
      - Each OpenCL device is composed of one or more compute units
      - Each compute unit is divided into one or more processing elements
      - Memory is divided into host memory and device memory
  • 16. Execution Model
      - The host defines a command queue and associates it with a context (devices, kernels, memory, etc.).
      - The host enqueues commands to the command queue.
      - Kernel execution commands launch work-items, i.e. one kernel instance for each point in an abstract index space:
          - A work-item is a single copy of the compute kernel, running on one data element
          - In data-parallel mode, a kernel execution contains multiple work-items
          - In task-parallel mode, a kernel execution contains a single work-item
      - Work-items execute together as a work-group.
  • 17. Execution Model (NDRange)
      - Synchronization between work-items is possible only within work-groups: barriers and memory fences
      - Work-groups cannot synchronize with each other within a kernel
      - Choose the dimensions (1, 2, or 3) that are “best” for your algorithm
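The relation between a work-item's global ID, its work-group ID, and its local ID in a 1-D NDRange can be sketched on the host in plain C++. This is only an illustration: `WorkItemIds`, `decompose`, and `compose` are hypothetical names, not OpenCL API, and the sketch assumes a zero global offset and a global size that is a multiple of the work-group size.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; not part of the OpenCL API.
struct WorkItemIds {
    std::size_t global_id;  // what get_global_id(0) would return
    std::size_t group_id;   // what get_group_id(0) would return
    std::size_t local_id;   // what get_local_id(0) would return
};

// Split a 1-D global ID into its work-group ID and local ID.
// Assumes a zero global offset and local_size > 0.
inline WorkItemIds decompose(std::size_t global_id, std::size_t local_size) {
    return {global_id, global_id / local_size, global_id % local_size};
}

// Inverse mapping: global_id = group_id * local_size + local_id.
inline std::size_t compose(std::size_t group_id, std::size_t local_id,
                           std::size_t local_size) {
    return group_id * local_size + local_id;
}
```

With a work-group size of 64, for example, global ID 70 belongs to work-group 1 and has local ID 6. Only work-items that share a `group_id` can meet at a barrier.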
  • 18. OpenCL Memory Model
      - Four memory types:
          - Global: default for images/buffers
          - Constant: global const variables
          - Local: shared between work-items
          - Private: kernel internal variables
      - Global and constant memory are visible to all work-items. They are the largest and slowest memory types used by devices. Constant memory is read-only for the device but is read/write for the host.
      - Local memory is sharable within a work-group. It is smaller but faster than global memory.
      - Private memory is available to individual work-items.
  • 19. OpenCL Memory Consistency
      - OpenCL uses a “relaxed consistency memory model”
          - The state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times
      - Memory has load/store consistency within a work-item
      - Local memory has consistency across work-items within a work-group at a barrier
      - Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups
      - Memory consistency for objects shared between commands is enforced at synchronization points
  • 20. The Memory Hierarchy
      Speeds and feeds, approximate, for a high-end discrete GPU (WI = work-item, WG = work-group):
      Memory           Size                   Bandwidth
      Private memory   O(10) words/WI         O(2-3) words/cycle/WI
      Local memory     O(1-10) KBytes/WG      O(10) words/cycle/WG
      Global memory    O(1-16) GBytes         O(100-300) GBytes/s
      Host memory      Above O(100) GBytes    Above O(100) GBytes/s
  • 21. Why Using Too Much Private Memory Can Be A Good Thing
      - In reality private memory is just hardware registers, so only dozens of these are available per work-item
      - Many kernels will allocate too many variables to private memory
      - So the compiler already has to be able to deal with this
      - It does so by spilling excess private variables to (global) memory
      - You still told the compiler something useful: that the data will only be accessed by a single work-item
      - This lets the compiler allocate the data in such a way as to enable more efficient memory access
  • 22. OpenCL Programming Model
      - Data Parallel, SPMD: Traditionally, this term refers to a programming model where concurrency is expressed as instructions from a single program applied to multiple elements within a set of data structures. The term has been generalized in OpenCL to refer to a model wherein a set of instructions from a single program is applied concurrently to each point within an abstract domain of indices.
      - Task Parallel: A programming model in which computations are expressed in terms of multiple concurrent tasks, where a task is a kernel executing in a single work-group of size one. The concurrent tasks can be running different kernels,
  • 23. OpenCL Compilation Model
      OpenCL uses a dynamic (runtime) compilation model (like DirectX and OpenGL)
      - Static compilation:
          - The code is compiled from source to machine code at a specific point in the past (when the developer compiled it using the IDE)
      - Dynamic compilation:
          - Also known as runtime compilation
          - Step 1: The code is compiled to an Intermediate Representation (IR), which is usually the assembly language of a virtual machine. This step is known as offline compilation, and it is done by the front-end compiler
          - Step 2: The IR is compiled to machine code for execution. This step is much shorter. It is known as online compilation, and it is done by the back-end compiler
      - In dynamic compilation, step 1 is usually done only once, and the IR is stored. The app loads the IR and performs step 2 during the app's runtime (hence the term)
  • 24. OpenCL Terms (1/4)
      - Application: The combination of the program running on the host and the OpenCL devices.
      - Buffer Object: A memory object that stores a linear collection of bytes. Buffer objects are accessible using a pointer in a kernel executing on a device. Buffer objects can be manipulated by the host, using OpenCL API calls.
      - Platform: The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform.
  • 25. OpenCL Terms (2/4)
      - Context: The environment within which the kernels execute and the domain in which synchronization and memory management are defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties, and one or more command-queues used to schedule execution of one or more kernels or operations on memory objects.
      - Command: An OpenCL operation that is submitted to a command-queue for execution. For example, OpenCL commands issue kernels for execution on a compute device, manipulate memory objects, etc.
  • 26. OpenCL Terms (3/4)
      - Command-Queue: An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in order but may be executed in order or out of order.
      - Kernel: A kernel is a function declared in a program and executed on an OpenCL device. A kernel is identified by the __kernel or kernel qualifier applied to any function defined in a program.
  • 27. OpenCL Terms (4/4)
      - Kernel Object: A kernel object encapsulates a specific kernel function declared in a program and the argument values to be used when executing this kernel function.
      - Program: An OpenCL program consists of a set of kernels. Programs may also contain auxiliary functions called by the kernel functions and constant data.
      - Program Object: A program object encapsulates the following information:
          - A reference to an associated context.
          - A program source or binary.
          - The latest successfully built program executable, the list of devices for which the program executable is built, the build options used, and a build log.
          - The number of kernel objects currently attached.
  • 28. OpenCL Runtime
      [Figure: a context spanning two compute devices (CPU and GPU) and holding programs (the dp_mul source with its CPU and GPU program binaries), kernels (dp_mul with arg[0], arg[1], and arg[2] values), memory objects (buffers, images), and command queues (in-order queue, out-of-order queue).]
      Kernel source shown in the figure:
          __kernel void dp_mul(global const float *a, global const float *b, global float *c)
          {
              int id = get_global_id(0);
              c[id] = a[id] * b[id];
          }
  • 30. Structure Of OpenCL Main Program
      The structure of an OpenCL host program is as follows:
      1. Get information about the platforms and devices available on the system
      2. Select devices to use
      3. Create an OpenCL command queue
      4. Create memory buffers on the device
      5. Create a kernel program object
      6. Build (compile) the kernel in-line (or load a precompiled binary)
      7. Create an OpenCL kernel object
      8. Set kernel arguments
      9. Execute the kernel
      10. Read kernel memory and copy it to host memory
  • 31. Setting Up The Host Program
      - Khronos has defined a common C++ header file containing a high-level interface to OpenCL, cl.hpp
      - This interface is dramatically easier to work with
      - Include key header files, both standard and custom:
          #include <CL/cl.hpp> // Khronos C++ Wrapper API
          #include <cstdio>    // For C style IO
          #include <iostream>  // For C++ style IO
          #include <vector>    // For C++ vector types
  • 32. Host Program Initialization
      1. Get the list of available platforms. The list of platforms can be obtained using the cl::Platform::get method.
      2. Set the properties that will be used to create a new context.
      3. Create the context with the cl::Context constructor. The constructor takes the type of device that will be included in the context and the list of properties.
      4. Create a command queue. The host code cannot directly call kernels; it has to put them in a queue.
  • 33. Preparation of OpenCL Programs
      - The compilation process consists of four steps:
          1. Load the sources into a list of sources (cl::Program::Sources).
          2. Create a program using the cl::Program constructor.
          3. Build the program using cl::Program::build.
          4. Create a kernel using cl::Kernel.
      - As an alternative, the program can be cached and loaded from a binary
  • 34. Context and Command-Queues
      - Context: the environment within which kernels execute and in which synchronization and memory management are defined.
      - The context includes:
          - One or more devices
          - Device memory
          - One or more command-queues
      - All commands for a device (kernel execution, synchronization, and memory transfer operations) are submitted through a command-queue.
      - Each command-queue points to a single device within a context.
      - [Figure: a context containing a queue, a device, and device memory]
  • 35. Command-Queues
      - Commands include:
          - Kernel executions
          - Memory object management
          - Synchronization
      - The only way to submit commands to a device is through a command-queue.
      - Each command-queue points to a single device within a context.
      - Multiple command-queues can feed a single device.
          - Used to define independent streams of commands that don’t require synchronization
      - [Figure: a context with two queues feeding a GPU and a CPU]
  • 36. Command-queue Execution Details
      - Command queues can be configured in different ways to control how commands execute
      - In-order queues:
          - Commands are enqueued and complete in the order they appear in the program (program-order)
      - Out-of-order queues:
          - Commands are enqueued in program-order but can execute (and hence complete) in any order.
      - Execution of commands in the command-queue is guaranteed to be completed at synchronization points
          - Discussed later
      - [Figure: a context with two queues feeding a GPU and a CPU]
  • 38. 38 Why Events? Won’t A Barrier Do? • A barrier defines a synchronization point … commands following a barrier wait to execute until all prior enqueued commands complete - cl_int clEnqueueBarrier(cl_command_queue queue) • Events provide fine grained control … this can really matter with an out-of-order queue. • Events work between commands in the different queues … as long as they share a context • Events convey more information than a barrier … provide info on state of a command, not just whether it’s complete or not. Queue Queue Context GPU CPU Event
  • 39. 39 Release of Resources In C++ the release of resources is done on the function exit. This is because destructor methods in OpenCL objects call the API functions to release resources. The release of resources would be necessary in the case of using dynamically allocated memory and pointers to OpenCL objects
  • 41. 41 Working with Kernel • The kernels are where all the action is in an OpenCL program. • Steps to using kernels: 1. Load kernel source code into a program object from a file 2. Make a kernel functor from a function within the program 3. Initialize device memory 4. Call the kernel functor, specifying memory objects and global/local sizes 5. Read results back from the device • Note the kernel function argument list must match the kernel definition on the host.
  • 42. 42 OpenCL C for Compute Kernels • Derived from ISO C99 - A few restrictions: no recursion, function pointers, functions in C99 standard headers ... - Preprocessing directives defined by C99 are supported (#include etc.) • Built-in data types - Scalar and vector data types, pointers - Data-type conversion functions: • convert_type<_sat><_roundingmode> - Image types: • image2d_t, image3d_t and sampler_t
  • 43. 43 OpenCL C for Compute Kernels • Built-in functions — mandatory - Work-Item functions, math.h, read and write image - Relational, geometric functions, synchronization functions - printf • Built-in functions — optional (called “extensions”) - Double precision, atomics to global and local memory - Selection of rounding mode, writes to image3d_t surface
  • 44. 44 OpenCL C Language Highlights • Function qualifiers - __kernel qualifier declares a function as a kernel • I.e. makes it visible to host code so it can be enqueued - Kernels can call other kernel-side functions • Address space qualifiers - __global, __local, __constant, __private - Pointer kernel arguments must be declared with an address space qualifier • Work-item functions - get_work_dim(), get_global_id(), get_local_id(), get_group_id() • Synchronization functions - Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue - Memory fences - provides ordering between memory operations
  • 45. 45 OpenCL C Language Restrictions • Pointers to functions are not allowed • Pointers to pointers allowed within a kernel, but not as an argument to a kernel invocation • Bit-fields are not supported • Variable length arrays and structures are not supported • Recursion is not supported • The return type for a kernel function must be void. • Arguments to kernel functions that are declared to be a struct cannot pass OpenCL objects (such as buffers, images) as elements of the struct. • The extern, static, auto, and register storage class specifiers are not supported.
  • 46. Kernel Scalar Types
      OpenCL C                  API type     Size (bits)   Notes
      bool                      -            -
      char                      cl_char      8
      unsigned char, uchar      cl_uchar     8
      short                     cl_short     16
      unsigned short, ushort    cl_ushort    16
      int                       cl_int       32
      unsigned int, uint        cl_uint      32
      long                      cl_long      64
      unsigned long, ulong      cl_ulong     64
      float                     cl_float     32            IEEE 754
      half                      cl_half      16            IEEE 754-2008
      double                    cl_double    64            optional, IEEE-754
      size_t                    -            32/64
      ptrdiff_t                 -            32/64
      intptr_t                  -            32/64
      uintptr_t                 -            32/64
      void                      void         0
  • 47. Kernel Vector Data Types 1/2
      Vector types (n = 2, 3, 4, 8, or 16):
      OpenCL C    API type
      charn       cl_charn
      ucharn      cl_ucharn
      shortn      cl_shortn
      ushortn     cl_ushortn
      intn        cl_intn
      uintn       cl_uintn
      longn       cl_longn
      ulongn      cl_ulongn
      floatn      cl_floatn
      Component access examples:
          float4 pos = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
          float4 swiz = pos.wzyx; // swiz = (4.0f, 3.0f, 2.0f, 1.0f)
          float4 dup = pos.xxyy;  // dup = (1.0f, 1.0f, 2.0f, 2.0f)
          float4 f, a, b;
          f.xyzw = a.s0123 + b.s0123;
      Numeric component indices:
      Vector width    Valid indices
      2-component     0, 1
      3-component     0, 1, 2
      4-component     0, 1, 2, 3
      8-component     0, 1, 2, 3, 4, 5, 6, 7
      16-component    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, A, b, B, c, C, d, D, e, E, f, F
  • 48. Kernel Vector Data Types 2/2
      - The available forms of a vector literal are the argument lists in which all arguments have the same element type as the result vector and the total number of elements equals the number of elements in the result vector. A single scalar argument is also allowed; it is replicated to all components.
      - For a float4 vector the following combinations are allowed:
          (float4)( float, float, float, float )
          (float4)( float2, float, float )
          (float4)( float, float2, float )
          (float4)( float, float, float2 )
          (float4)( float2, float2 )
          (float4)( float3, float )
          (float4)( float, float3 )
          (float4)( float )
  • 49. Endianness And Memory Access
      - The OpenCL standard says nothing about how bytes are ordered in memory. The reason for this is that different devices and operating systems order bytes differently.
      - There are two ways to determine whether a device is little-endian or big-endian:
          - From the host, you can call clGetDeviceInfo with CL_DEVICE_ENDIAN_LITTLE as the parameter. If this returns CL_TRUE, the device is little-endian. If it returns CL_FALSE, the device is big-endian.
          - Within the kernel, you can use #ifdef to determine whether the __ENDIAN_LITTLE__ macro is defined. If this macro is defined, the device is little-endian. If not, the device is big-endian.
      - Example vector whose byte layout in memory depends on the endianness:
          uint4 vec = (uint4)(0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F);
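To make the effect of byte order concrete, the following plain C++ sketch inspects how the host stores the first component of the example vector, 0x00010203. It is an illustration only: it tests the host CPU, not an OpenCL device (for a device, the clGetDeviceInfo query or the __ENDIAN_LITTLE__ macro described on the slide applies), and `bytes_in_memory` and `host_is_little_endian` are hypothetical helper names.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Return the four bytes of `value` in the order the host stores them in memory.
inline std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value) {
    std::array<unsigned char, 4> bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// On a little-endian host the least significant byte (0x03) comes first.
inline bool host_is_little_endian() {
    return bytes_in_memory(0x00010203u)[0] == 0x03;
}
```

On a little-endian machine the bytes appear as 03 02 01 00; on a big-endian machine they appear as 00 01 02 03. Data exchanged between a host and a device with different byte orders would therefore need its bytes swapped.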
  • 50. Alignment Of Data Types
      - Every built-in and vector data type in OpenCL is aligned to the size of the data type itself.
      - A built-in data type whose size is not a power of two bytes must be aligned to the next larger power of two. This rule applies to built-in types only, not structs or unions. For example, a float3 contains 12 bytes. This vector will be stored on a 16-byte boundary because 16 is the smallest power of 2 greater than or equal to 12.
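The alignment rule above amounts to rounding the type's size up to a power of two. A minimal C++ sketch of that arithmetic follows; `opencl_alignment` is a hypothetical helper name used for illustration, not an OpenCL function.

```cpp
#include <cstddef>

// Smallest power of two greater than or equal to size_in_bytes.
// Assumes size_in_bytes > 0.
inline std::size_t opencl_alignment(std::size_t size_in_bytes) {
    std::size_t alignment = 1;
    while (alignment < size_in_bytes) {
        alignment <<= 1;  // try the next power of two
    }
    return alignment;
}
```

For a float3 (12 bytes) this gives 16, matching the example on the slide; for a float8 (32 bytes) it gives 32, since the size is already a power of two.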
  • 51. Alignment Of Data Types (Example)
      // Kernel side (OpenCL C)
      typedef struct {
          float8 x; // Needs 32-byte alignment
          float3 y; // Needs 16-byte alignment
      } OpenCLStruct;

      // Host side: same member order, with alignment requested explicitly
      #if defined(_MSC_VER)
      typedef struct {
          __declspec(align(32)) cl_float8 x;
          __declspec(align(16)) cl_float3 y;
      } OpenCLStruct;
      #elif defined(__GNUC__)
      typedef struct {
          cl_float8 x __attribute__((aligned(32)));
          cl_float3 y __attribute__((aligned(16)));
      } OpenCLStruct;
      #endif
  • 52. Kernel Synchronization Functions
      - void barrier ( cl_mem_fence_flags flags ): All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier.
      - void mem_fence ( cl_mem_fence_flags flags ): Orders loads and stores of a work-item executing a kernel.
      - void read_mem_fence ( cl_mem_fence_flags flags ): Read memory barrier that orders only loads.
      - void write_mem_fence ( cl_mem_fence_flags flags ): Write memory barrier that orders only stores.
      - Available options for flags are:
          - CLK_GLOBAL_MEM_FENCE
          - CLK_LOCAL_MEM_FENCE
          - CLK_IMAGE_MEM_FENCE
  • 53. Working with the OpenCL memory model
  • 54. Memory organization viewed from the kernel perspective and from the host program is slightly different
  • 55.
      1. The host distinguishes between memory located in the host and memory located in the context
      2. The host allocates memory in both regions
      3. The context can be located in host memory or in device memory
      4. The host program is responsible for memory transfers
      5. The host program cannot access local memory
      6. An OpenCL program operates with four memory pools: constant, global, local, and private. Data can be explicitly copied between these regions
  • 56. Relaxed Memory Consistency
      - The different memory pools provide different memory consistency models.
      - Local and global memory pools provide a relaxed consistency model.
      - Relaxed consistency: a memory consistency model in which the contents of memory visible to different work-items or commands may be different except at a barrier or other explicit synchronization point.
  • 57. Explanation
      Let's calculate the sum of the array elements and put it into every element.
  • 58.
      - It is important to be aware that synchronization is possible only within a work-group.
      - It is impossible to synchronize different work-groups.
      - When global memory is used, the work-groups should write to different regions of it.
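One way to respect these constraints when summing an array is a two-pass scheme, sketched here as a sequential C++ simulation. This is an assumption about how the example could be organized, not a reconstruction of the deck's own code: each simulated work-group writes one partial sum to its own slot, and the combination of partial sums happens in a second pass, after the first pass has finished. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Pass 1: each simulated work-group sums its own slice of `data` and writes the
// result to its own slot, so no two groups write to the same location.
// Assumes group_size > 0.
inline std::vector<float> partial_sums(const std::vector<float>& data,
                                       std::size_t group_size) {
    const std::size_t num_groups = (data.size() + group_size - 1) / group_size;
    std::vector<float> partial(num_groups, 0.0f);
    for (std::size_t g = 0; g < num_groups; ++g) {
        const std::size_t begin = g * group_size;
        const std::size_t end = std::min(begin + group_size, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[g] += data[i];
        }
    }
    return partial;
}

// Pass 2: combine the partial sums and store the total in every element.
// The boundary between the two passes plays the role of the synchronization
// point that work-groups cannot create among themselves inside one kernel.
inline void sum_into_every_element(std::vector<float>& data,
                                   std::size_t group_size) {
    const std::vector<float> partial = partial_sums(data, group_size);
    const float total = std::accumulate(partial.begin(), partial.end(), 0.0f);
    std::fill(data.begin(), data.end(), total);
}
```

In a real OpenCL program the two passes would typically be separate kernel launches (or a kernel followed by a host-side step), with a barrier used only inside each work-group while it accumulates its partial sum in local memory.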
  • 59. Example: Barnes Hut n-Body Algorithm (practice)
  • 60. Barnes Hut Algorithm
      - Set bodies’ initial position and velocity
      - Iterate over time steps
          1. Subdivide space until at most one body per cell
              - Record this spatial hierarchy in an octree
          2. Compute mass and center of mass of each cell
          3. Compute force on bodies by traversing octree
              - Stop traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away
          4. Update each body’s position and velocity
  • 61. Build Tree (Level 1) [Figure: bodies (*) and cell centers (o)] Subdivide space until at most one body per cell
  • 62. Build Tree (Level 2) [Figure: bodies (*) and cell centers (o)] Subdivide space until at most one body per cell
  • 63. Build Tree (Level 3) [Figure: bodies (*) and cell centers (o)] Subdivide space until at most one body per cell
  • 64. Build Tree (Level 4) [Figure: bodies (*) and cell centers (o)] Subdivide space until at most one body per cell
  • 65. Build Tree (Level 5) [Figure: bodies (*) and cell centers (o)] Subdivide space until at most one body per cell
  • 66. Compute Cells’ Center of Mass [Figure: bodies (*) and cell centers (o)] For each internal cell, compute the sum of mass and the weighted average of position of all bodies in its subtree; the example shows two cells only
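The summary stored in a cell (total mass and mass-weighted average position) can be sketched in C++ as follows. `Body`, `CellSummary`, and `summarize` are hypothetical names chosen for illustration. For clarity the sketch loops directly over the bodies of a subtree; an octree implementation would more likely combine the summaries of a cell's children bottom-up, which yields the same result.

```cpp
#include <vector>

// Hypothetical types for illustration only.
struct Body {
    double mass;
    double x, y, z;
};

struct CellSummary {
    double mass;     // total mass of all bodies in the subtree
    double x, y, z;  // center of mass (mass-weighted average position)
};

inline CellSummary summarize(const std::vector<Body>& bodies) {
    CellSummary s{0.0, 0.0, 0.0, 0.0};
    for (const Body& b : bodies) {
        s.mass += b.mass;
        s.x += b.mass * b.x;
        s.y += b.mass * b.y;
        s.z += b.mass * b.z;
    }
    if (s.mass > 0.0) {  // avoid dividing by zero for an empty cell
        s.x /= s.mass;
        s.y /= s.mass;
        s.z /= s.mass;
    }
    return s;
}
```

For two bodies of mass 1 at x = 0 and mass 3 at x = 4, the cell has total mass 4 and its center of mass lies at x = 3, closer to the heavier body.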
  • 67. Compute Forces [Figure: bodies (*) and cell centers (o)] Compute force, for example, acting upon green body
  • 68. Compute Force (Short Distance) [Figure: bodies (*) and cell centers (o)] Scan tree depth first from left to right; green portion already completed
  • 69. Compute Force (Down One Level) [Figure: bodies (*) and cell centers (o)] Red center of mass is too close, need to go down one level
  • 70. Compute Force (Long Distance) [Figure: bodies (*) and cell centers (o)] Yellow center of mass is far enough away
  • 71. Compute Force (Skip Subtree) [Figure: bodies (*) and cell centers (o)] Therefore, the entire subtree rooted in the yellow cell can be skipped
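The slides do not say how "far enough away" is decided. A common choice in Barnes-Hut implementations, assumed here, is the opening criterion s / d < θ, where s is the side length of the cell, d is the distance from the body to the cell's center of mass, and θ is a tunable accuracy parameter. The function name and parameters below are hypothetical.

```cpp
#include <cmath>

// Assumed opening criterion: a cell of side length `cell_size` whose center of
// mass is offset by (dx, dy, dz) from the body is treated as a single point
// mass when cell_size / distance < theta. Written as a product so that a zero
// distance simply yields false.
inline bool far_enough(double cell_size, double dx, double dy, double dz,
                       double theta) {
    const double distance = std::sqrt(dx * dx + dy * dy + dz * dz);
    return cell_size < theta * distance;
}
```

With θ = 0.5, a cell of size 1 at distance 10 is far enough and its subtree is skipped, while the same cell at distance 1 is too close and the traversal goes down one level. Smaller values of θ open more cells, trading speed for accuracy.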
  • 72. Pseudocode
      Set bodySet = ...
      foreach timestep do {
          Octree octree = new Octree();
          foreach Body b in bodySet {
              octree.Insert(b);
          }
          OrderedList cellList = octree.CellsByLevel();
          foreach Cell c in cellList {
              c.Summarize();
          }
          foreach Body b in bodySet {
              b.ComputeForce(octree);
          }
          foreach Body b in bodySet {
              b.Advance();
          }
      }
  • 73. Complexity
      Set bodySet = ...
      foreach timestep do {                     // O(n log n)
          Octree octree = new Octree();
          foreach Body b in bodySet {           // O(n log n)
              octree.Insert(b);
          }
          OrderedList cellList = octree.CellsByLevel();
          foreach Cell c in cellList {          // O(n)
              c.Summarize();
          }
          foreach Body b in bodySet {           // O(n log n)
              b.ComputeForce(octree);
          }
          foreach Body b in bodySet {           // O(n)
              b.Advance();
          }
      }
  • 74. Parallelism
      Set bodySet = ...
      foreach timestep do {                     // sequential
          Octree octree = new Octree();
          foreach Body b in bodySet {           // tree building
              octree.Insert(b);
          }
          OrderedList cellList = octree.CellsByLevel();
          foreach Cell c in cellList {          // tree traversal
              c.Summarize();
          }
          foreach Body b in bodySet {           // fully parallel
              b.ComputeForce(octree);
          }
          foreach Body b in bodySet {           // fully parallel
              b.Advance();
          }
      }
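The Advance() loop is fully parallel because each body is updated from its own state only. A minimal C++ sketch of such an update is shown below. The integrator is an assumption (semi-implicit Euler, where the velocity is updated first and the new velocity then moves the body), since the slides only say that position and velocity are updated. The sketch is one-dimensional for brevity, and `BodyState` and `advance_all` are hypothetical names.

```cpp
#include <vector>

// Hypothetical 1-D body state for illustration only.
struct BodyState {
    double x;   // position
    double vx;  // velocity
    double ax;  // acceleration from the force computation step
};

// Assumed integrator: semi-implicit Euler with time step dt.
// No iteration reads or writes another body's state, so in OpenCL each
// iteration could be handled by its own work-item without synchronization.
inline void advance_all(std::vector<BodyState>& bodies, double dt) {
    for (BodyState& b : bodies) {
        b.vx += b.ax * dt;  // update velocity from acceleration
        b.x += b.vx * dt;   // update position from the new velocity
    }
}
```

For a body with x = 0, vx = 1, ax = 2 and dt = 0.5, the update gives vx = 2 and x = 1.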
  • 75. Resources
      - https://www.khronos.org/opencl/
      - The OpenCL specification: surprisingly approachable for a spec! https://www.khronos.org/registry/cl/
      - OpenCL reference card: useful to have on your desk(top). Available on the same page as the spec.
  • 76. Sep 2019 Thank you Stanislav Donets Lead Software Engineer DRM Software Reviews