April 4-7, 2016 | Silicon Valley
James Beyer, NVIDIA
Jeff Larkin, NVIDIA
Targeting GPUs with OpenMP 4.5 Device Directives
AGENDA
OpenMP Background
Step by Step Case Study
Parallelize on CPU
Offload to GPU
Team Up
Increase Parallelism
Improve Scheduling
Additional Experiments
Conclusions
Motivation
Multiple compilers are in development to support OpenMP offloading
to NVIDIA GPUs.
Articles and blog posts are being written by early adopters trying
OpenMP on NVIDIA GPUs, and most of them have gotten it wrong.
If you want to try OpenMP offloading to NVIDIA GPUs, we want you to
know what to expect and how to get reasonable performance.
A Brief History of OpenMP
1996 - Architecture Review Board (ARB) formed by several vendors implementing
their own directives for Shared Memory Parallelism (SMP).
1997 - 1.0 was released for C/C++ and Fortran with support for parallelizing loops
across threads.
2000, 2002 – Version 2.0 of Fortran, C/C++ specifications released.
2005 – Version 2.5 released, combining both specs into one.
2008 – Version 3.0 released, added support for tasking
2011 – Version 3.1 released, improved support for tasking
2013 – Version 4.0 released, added support for offloading (and more)
2015 – Version 4.5 released, improved support for offloading targets (and more)
OpenMP In Clang
Multi-vendor effort to implement OpenMP in Clang (including offloading)
Current status – interesting
How to get it – https://www.ibm.com/developerworks/community/blogs/8e0d7b52-b996-424b-bb33-345205594e0d?lang=en
OpenMP In Clang
How to get it, our way
Step one – make sure you have: gcc, cmake, python and cuda installed and updated
Step two – Look at
http://llvm.org/docs/GettingStarted.html
https://www.ibm.com/developerworks/community/blogs/8e0d7b52-b996-424b-bb33-345205594e0d?lang=en
Step three –
git clone https://github.com/clang-ykt/llvm_trunk.git
cd llvm_trunk/tools
git clone https://github.com/clang-ykt/clang_trunk.git clang
cd ../projects
git clone https://github.com/clang-ykt/openmp.git
OpenMP In Clang
How to build it
cd ..
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<DEBUG|RELEASE|MinSizeRel> \
      -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
      -DCMAKE_INSTALL_PREFIX="<where you want it>" \
      -DLLVM_ENABLE_ASSERTIONS=ON \
      -DLLVM_ENABLE_BACKTRACES=ON \
      -DLLVM_ENABLE_WERROR=OFF \
      -DBUILD_SHARED_LIBS=OFF \
      -DLLVM_ENABLE_RTTI=ON \
      -DCMAKE_C_COMPILER="<GCC you want used>" \
      -DCMAKE_CXX_COMPILER="<G++ you want used>" \
      -G "Unix Makefiles" \
      ../llvm_trunk
# "Unix Makefiles" is one generator among several; I like this one.
make [-j#]
make install
OpenMP In Clang
How to use it
export LIBOMP_LIB=<llvm-install-lib>
export OMPTARGET_LIBS=$LIBOMP_LIB
export LIBRARY_PATH=$OMPTARGET_LIBS
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$OMPTARGET_LIBS
export PATH=$PATH:<llvm_install-bin>
clang -O3 -fopenmp=libomp -omptargets=nvptx64sm_35-nvidia-linux …
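A quick way to check that the toolchain actually offloads is a tiny smoke test (a sketch, not from the slides; omp_is_initial_device() returns 0 when the region executes on the device):

// check_offload.c – illustrative offload smoke test
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int on_host = 1;
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();   // 0 when running on the GPU
    }
    printf("target region ran on the %s\n", on_host ? "host" : "device");
    return 0;
}

Compile it with the clang invocation above; if the binary prints "device", offloading is working.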
Case Study: Jacobi Iteration
Example: Jacobi Iteration
Iteratively converges to the correct value (e.g. temperature) by computing new
values at each point from the average of neighboring points.
Common, useful algorithm
Example: Solve the Laplace equation in 2D: $\nabla^{2} f(x,y) = 0$
[Figure: 5-point stencil – A(i,j) is computed from its neighbors A(i-1,j), A(i+1,j), A(i,j-1), and A(i,j+1)]
$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$
Jacobi Iteration
while ( err > tol && iter < iter_max ) {
err=0.0;
for( int j = 1; j < n-1; j++) {
for(int i = 1; i < m-1; i++) {
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]);
err = fmax(err, fabs(Anew[j][i] - A[j][i]));
}
}
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
iter++;
}
Convergence Loop
Calculate Next
Exchange Values
Parallelize on the CPU
OpenMP Worksharing
PARALLEL Directive
Spawns a team of threads
Execution continues redundantly on
all threads of the team.
All threads join at the end and the
master thread continues execution.
[Diagram: at OMP PARALLEL the master thread forks a thread team; the team joins at the end and the master thread continues.]
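A minimal sketch (not from the slides) of the redundant execution described above; only worksharing constructs divide work among the threads:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        // Every thread in the team executes this block once.
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // implicit barrier; the master thread continues alone after the join
    return 0;
}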
OpenMP Worksharing
FOR/DO (Loop) Directive
Divides (“workshares”) the
iterations of the next loop across
the threads in the team
How the iterations are divided is
determined by a schedule.
[Diagram: within an OMP PARALLEL region, OMP FOR divides the loop iterations across the thread team.]
CPU-Parallelism
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp parallel for
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}
Create a team of threads
and workshare this loop
across those threads.
Create a team of threads
and workshare this loop
across those threads.
CPU-Parallelism
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp parallel
{
#pragma omp for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp barrier
#pragma omp for
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
}
if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}
Create a team of threads
Prevent threads from
executing the second
loop nest until the first
completes
Workshare this loop
CPU-Parallelism
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
#pragma omp simd
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp parallel for
for( int j = 1; j < n-1; j++) {
#pragma omp simd
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}
Some compilers want a SIMD directive in order to vectorize on CPUs.
CPU Scaling (Smaller is Better)
[Chart: execution time in seconds vs. OpenMP threads, Intel Xeon E5-2690 v2 @3.00GHz]
OpenMP Threads:  1      2      4      8      16
Speedup:         1.00X  1.70X  2.94X  3.52X  3.40X
Targeting the GPU
OpenMP Offloading
TARGET Directive
Offloads execution and associated data from the CPU to the GPU
• The target device owns the data; accesses by the CPU during the execution of the
target region are forbidden.
• Data used within the region may be implicitly or explicitly mapped to the device.
• All of OpenMP is allowed within target regions, but only a subset will run well on
GPUs.
Target the GPU
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp target
{
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp parallel for
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
}
if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}
Moves this region of
code to the GPU and
implicitly maps data.
Target the GPU
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp target map(alloc:Anew[:n+2][:m+2]) map(tofrom:A[:n+2][:m+2])
{
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp parallel for
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
}
if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}
Moves this region of
code to the GPU and
explicitly maps data.
Execution Time (Smaller is Better)
[Chart: execution time in seconds, NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @3.00GHz]
Original:      1.00X
CPU Threaded:  5.12X
GPU Threaded:  0.12X (893 seconds)
GPU Architecture Basics
GPUs are composed of 1 or
more independent parts,
known as Streaming
Multiprocessors (“SMs”)
Threads are organized into
threadblocks.
Threads within the same
threadblock run on an SM and
can synchronize.
Threads in different
threadblocks (even if they’re
on the same SM) cannot
synchronize.
Teaming Up
OpenMP Teams
TEAMS Directive
To better utilize the GPU resources,
use many thread teams via the
TEAMS directive.
• Spawns 1 or more thread teams
with the same number of threads
• Execution continues on the
master threads of each team
(redundantly)
• No synchronization between
teams
[Diagram: OMP TEAMS spawns multiple thread teams, each with its own master thread.]
OpenMP Teams
DISTRIBUTE Directive
Distributes the iterations of the
next loop to the master threads of
the teams.
• Iterations are distributed
statically.
• There are no guarantees about the order in which teams will execute.
• No guarantee that all teams will
execute simultaneously
• Does not generate
parallelism/worksharing within
the thread teams.
[Diagram: OMP DISTRIBUTE assigns chunks of the loop iterations to the master thread of each OMP TEAMS team.]
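A minimal sketch of that last point, reusing the case-study arrays (A, Anew, n, m): DISTRIBUTE alone only spreads whole "j" iterations across team masters, so a nested PARALLEL FOR is still needed to workshare the "i" loop within each team:

#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
    // Without this nested directive, each team's master runs the i loop serially.
    #pragma omp parallel for
    for( int i = 1; i < m-1; i++ )
        A[j][i] = Anew[j][i];
}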
OpenMP Data Offloading
TARGET DATA Directive
Offloads data from the CPU to the GPU, but not execution
• The target device owns the data; accesses by the CPU during the execution of
contained target regions are forbidden.
• Useful for sharing data between TARGET regions
• NOTE: A TARGET region is a TARGET DATA region.
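A minimal sketch (the array name, size N, and the functions f and g are placeholders, not from the slides) of TARGET DATA keeping an array resident across two TARGET regions, with TARGET UPDATE copying intermediate values back to the host:

#pragma omp target data map(tofrom: A[0:N])     // A stays on the device for both regions
{
    #pragma omp target teams distribute parallel for
    for( int i = 0; i < N; i++ )
        A[i] = f(A[i]);                         // first kernel

    #pragma omp target update from(A[0:N])      // optional: refresh the host copy

    #pragma omp target teams distribute parallel for
    for( int i = 0; i < N; i++ )
        A[i] = g(A[i]);                         // second kernel reuses the device copy
}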
Teaming Up
#pragma omp target data map(alloc:Anew) map(A)
while ( error > tol && iter < iter_max )
{
error = 0.0;
#pragma omp target teams distribute parallel for reduction(max:error)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
Explicitly maps arrays
for the entire while
loop.
• Spawns thread teams
• Distributes iterations
to those teams
• Workshares within
those teams.
Execution Time (Smaller is Better)
[Chart: execution time in seconds, NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @3.00GHz]
Original:      1.00X
CPU Threaded:  5.12X
GPU Threaded:  0.12X (893 seconds)
GPU Teams:     1.01X
Increasing Parallelism
Currently both our distributed and workshared parallelism comes from the same
loop.
• We could move the PARALLEL to the inner loop
• We could collapse them together
The COLLAPSE(N) clause
• Turns the next N loops into one, linearized loop.
• This will give us more parallelism to distribute, if we so choose.
Splitting Teams & Parallel
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for reduction(max:error)
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Distribute the “j” loop
over teams.
Workshare the “i” loop
over threads.
Collapse
#pragma omp target teams distribute parallel for reduction(max:error) collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Collapse the two loops
into one.
Execution Time (Smaller is Better)
[Chart: execution time in seconds, NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @3.00GHz]
Original:      1.00X
CPU Threaded:  5.12X
GPU Threaded:  0.12X (893 seconds)
GPU Teams:     1.01X
GPU Split:     2.47X
GPU Collapse:  0.96X
Improve Loop Scheduling
Most OpenMP compilers will apply a static schedule to workshared loops,
assigning iterations in N / num_threads chunks.
• Each thread will execute contiguous loop iterations, which is very cache &
SIMD friendly
• This is great on CPUs, but bad on GPUs
The SCHEDULE() clause can be used to adjust how loop iterations are
scheduled.
Effects of Scheduling
!$OMP PARALLEL FOR SCHEDULE(STATIC)
Thread 0: iterations 0 to n/2-1
Thread 1: iterations n/2 to n-1
Cache and vector friendly

!$OMP PARALLEL FOR SCHEDULE(STATIC,1)*
Thread 0: iterations 0, 2, 4, …, n-2
Thread 1: iterations 1, 3, 5, …, n-1
Memory coalescing friendly

*There's no reason a compiler couldn't do this for you.
Improved Schedule (Split)
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for reduction(max:error) schedule(static,1)
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
#pragma omp parallel for schedule(static,1)
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Assign adjacent
threads adjacent loop
iterations.
Improved Schedule (Collapse)
#pragma omp target teams distribute parallel for \
            reduction(max:error) collapse(2) schedule(static,1)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma omp target teams distribute parallel for \
            collapse(2) schedule(static,1)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
A[j][i] = Anew[j][i];
}
}
Assign adjacent
threads adjacent loop
iterations.
Execution Time (Smaller is Better)
[Chart: execution time in seconds, NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @3.00GHz]
Original:            1.00X
CPU Threaded:        5.12X
GPU Threaded:        0.12X (893 seconds)
GPU Teams:           1.01X
GPU Split:           2.47X
GPU Collapse:        0.96X
GPU Split Sched:     4.00X
GPU Collapse Sched:  17.42X
How to Write Portable Code
#pragma omp 
#ifdef GPU
target teams distribute 
#endif
parallel for reduction(max:error) 
#ifdef GPU
collapse(2) schedule(static,1)
#endif
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
Ifdefs can be used to
choose particular
directives per device
at compile-time
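As written, the snippet above is best read as pseudocode: a preprocessor conditional cannot sit inside a single continued pragma. A working variant of the same compile-time selection (a sketch; the macro name is ours and it reuses the case-study variables) wraps the whole directive in a macro via the C99 _Pragma operator:

#ifdef GPU
#define JACOBI_LOOP _Pragma("omp target teams distribute parallel for collapse(2) schedule(static,1) reduction(max:error)")
#else
#define JACOBI_LOOP _Pragma("omp parallel for reduction(max:error)")
#endif

JACOBI_LOOP
for( int j = 1; j < n-1; j++)
{
    for( int i = 1; i < m-1; i++ )
    {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
}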
How to Write Portable Code
usegpu = 1;
#pragma omp target teams distribute parallel for reduction(max:error) 
#ifdef GPU
collapse(2) schedule(static,1) 
#endif
if(target:usegpu)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
The OpenMP if clause
can help some too (4.5
improves this).
Note: This example
assumes that a compiler
will choose to generate 1
team when not in a target,
making it the same as a
standard “parallel for.”
Additional Experiments
Increase the Number of Teams
By default, CLANG will poll the number of SMs on your GPU and run that many teams
of 1024 threads.
This is not always ideal, so we tried increasing the number of teams using the
num_teams clause.
Test SMs 2*SMs 4*SMs 8*SMs
A 1.00X 1.00X 1.00X 1.00X
B 1.00X 1.02X 1.16X 1.09X
C 1.00X 0.87X 0.94X 0.96X
D 1.00X 1.00X 1.00X 0.99X
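For reference, a hedged sketch of the change behind that table (nsms, the SM count, is our placeholder and would come from a device query, not from the slides):

int nteams = 4 * nsms;   // e.g. the 4*SMs column above
#pragma omp target teams distribute parallel for collapse(2) \
            schedule(static,1) num_teams(nteams)
for( int j = 1; j < n-1; j++)
    for( int i = 1; i < m-1; i++ )
        A[j][i] = Anew[j][i];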
Decreased Threads per Team
CLANG always generates CUDA threadblocks of 1024 threads, even when the
num_threads clause is used.
This number is frequently not ideal, but setting num_threads does not reduce the
threadblock size.
Ideally we’d like to use num_threads and num_teams to generate more, smaller
threadblocks
We suspect the best performance would be collapsing, reducing the threads per
team, and then using the remaining iterations to generate many teams, but are
unable to do this experiment.
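What that experiment would look like, as a sketch of the clause placement OpenMP 4.5 allows (nsms as in the previous sketch; whether the compiler honors the sizes is exactly the open question above):

#pragma omp target teams distribute parallel for collapse(2) \
            schedule(static,1) num_teams(8*nsms) thread_limit(128) num_threads(128)
for( int j = 1; j < n-1; j++)
    for( int i = 1; i < m-1; i++ )
        A[j][i] = Anew[j][i];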
Scalar Copy Overhead
In OpenMP 4.0 scalars are implicitly mapped “tofrom”, resulting in very high
overhead. Application impact: ~10%.
OpenMP 4.5 remedied this by making the default behavior of scalars “firstprivate”.
Note: In the
meantime, some of
this overhead can be
mitigated by
explicitly mapping
your scalars “to”.
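A sketch of that workaround on the copy kernel (under OpenMP 4.0 semantics the scalars n and m would otherwise be mapped tofrom; under 4.5 they default to firstprivate and need no clause):

// OpenMP 4.0 workaround: map read-only scalars "to" so they are not copied back
#pragma omp target teams distribute parallel for collapse(2) map(to: n, m)
for( int j = 1; j < n-1; j++)
    for( int i = 1; i < m-1; i++ )
        A[j][i] = Anew[j][i];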
Conclusions
It is now possible to use OpenMP to program for GPUs, but the software is still very
immature.
OpenMP for a GPU will not look like OpenMP for a CPU.
Performance will vary significantly depending on the exact directives you use (a 149X range in our example code).