This report was presented at the Frontiers in Computational Astrophysics conference (Lyon, France, 11-15 October 2010). I give a brief, light introduction to the CUDA architecture and its benefits for scientific HPC, along with a short description of the KIPT in-house package for N-body simulations. This talk, with minor differences, was also presented at seminars at the Institute for Single Crystals (Kharkov) and the Kharkov Institute of Physics and Technology.
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
1. Computations on GPU: a road towards desktop supercomputing
Computations on GPU: a road towards desktop supercomputing
Glib Ivashkevych
Institute of Theoretical Physics, NSC KIPT
November 24, 2010
3. Computations on GPU: a road towards desktop supercomputing
Quick outline
GPU – Graphic Processing Unit
programmable
manycore
multithreaded
with very high memory bandwidth
We are going to talk about:
how the GPU became useful for scientific computations
GPU intrinsics and programming
how to get as much as possible from the GPU and survive :)
the future of GPUs and GPU programming
We will mostly talk about CUDA (Nvidia's Compute Unified Device Architecture), and also about OpenCL (Open Computing Language)
7. Computations on GPU: a road towards desktop supercomputing
Should we care?
But first of all: do we really need GPU computing?
Short answer: yes!
high performance
transparent scalability
More accurate answer: yes, for problems with high parallelism.
large datasets
portions of the data can be processed independently
Most accurate answer: yes, for problems with high data
parallelism.
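What "high data parallelism" means here can be illustrated with a minimal plain-Python sketch (no GPU involved, names illustrative): one operation applied independently to every element of a dataset, so every element could in principle be computed by its own thread.

```python
# Data parallelism in miniature: one operation, applied independently
# to every element of a dataset. On a GPU each element would be
# handled by its own thread; here we just map over the elements.
def saxpy(a, xs, ys):
    # y <- a*x + y, elementwise; each output depends on one input
    # pair only, so all elements can be computed in parallel
    return [a * x + y for x, y in zip(xs, ys)]

xs = list(range(8))
ys = [1.0] * 8
print(saxpy(2.0, xs, ys))  # each element independent of the others
```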
8. Computations on GPU: a road towards desktop supercomputing
Reference
For reference
GFLOPs – 10⁹ FLoating point Operations Per second
∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel)
∼ 125 GFLOPs on AMD Opteron Istanbul 2435
∼ 500 GFLOPs in double and ∼ 2 TFLOPs in single precision on Nvidia Tesla C2050
∼ 3.2 · 10³ GFLOPs on ASCI Red at Sandia National Laboratory – the fastest supercomputer as of November 1999
∼ 87 · 10³ GFLOPs on the TSUBAME-1 grid cluster at Tokyo Institute of Technology – the first GPU-based supercomputer, №88 in Top500 as of November 2010 (№56 in November 2009)
∼ 2.56 · 10⁶ GFLOPs on Tianhe-1A at the National Supercomputing Center in Tianjin – the fastest supercomputer as of November 2010 – GPU-based
9. Computations on GPU: a road towards desktop supercomputing
Examples
Matrix and vector operations
CUBLAS¹ on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL² 10.2 on Intel Core i7 Nehalem (4 threads)
∼ 8x in double precision
CULA³ on Nvidia Tesla C2050 (CUDA 3.2)
up to ∼ 220 GFLOPs in double precision
up to ∼ 450 GFLOPs in single precision
vs Intel MKL 10.2: ∼ 4–6x speed-up
¹ CUDA-accelerated Basic Linear Algebra Subprograms
² Math Kernel Library
³ LAPACK for Heterogeneous systems
10. Computations on GPU: a road towards desktop supercomputing
Examples
Fast Fourier Transform
CUFFT on Nvidia Tesla C2070 (CUDA 3.2)
up to 65 GFLOPs in double precision
up to 220 GFLOPs in single precision
vs Intel MKL on Intel Core i7 Nehalem
∼ 9x in double precision
∼ 20x in single precision
11. Computations on GPU: a road towards desktop supercomputing
Examples
Physics: Computational Fluid Dynamics
Simulation of transition to turbulence¹
Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3 GHz)
∼ 20x over serial code
∼ 10x over the OpenMP implementation (2 threads)
∼ 5x over the OpenMP implementation (4 threads)
¹ A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010-0525
12. Computations on GPU: a road towards desktop supercomputing
Examples
Quantum chemistry
Calculations of molecular orbitals¹
Nvidia GeForce GTX 280 vs Intel Core2Quad Q6600 (2.4 GHz)
∼ 173x over serial non-optimized code
∼ 14x over parallel optimized code (4 threads)
¹ D.J. Hardy et al., GPGPU 2009
13. Computations on GPU: a road towards desktop supercomputing
Examples
Medical Imaging
Isosurface reconstruction from scalar volumetric data¹
Nvidia GeForce GTX 285 vs ?
∼ 68x over optimized CPU code
nearly real-time processing of data
¹ T. Kalbe et al., Proceedings of the 5th International Symposium on Visual Computing (ISVC 2009)
14. Computations on GPU: a road towards desktop supercomputing
Examples
GPUGrid.net
Biomolecular simulations, accelerated by Nvidia CUDA boards and Sony PlayStation
∼ 8000 users from 101 countries
∼ 145 TFLOPs on average ≈ №25 in Top500
∼ 50 GFLOPs from every active user
15. Computations on GPU: a road towards desktop supercomputing
Examples
ATLAS experiment on Large Hadron Collider
Particle tracking, triggering, event simulation¹
possible Higgs events – tracking a large number of particles
∼ 32x in tracking, ∼ 35x in triggering on Nvidia Tesla C1060
¹ P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC (GTC 2010)
16. Computations on GPU: a road towards desktop supercomputing
Examples
And even more examples in:
N–body simulations
seismic simulations
molecular dynamics
SETI@Home & MilkyWay@Home
finance
neural networks
...
and, of course, graphics
VFX, rendering
image editing, video
17. Computations on GPU: a road towards desktop supercomputing
Examples
GPU Technology Conference 2010 (September 20-23)
29. Computations on GPU: a road towards desktop supercomputing
Outline
Outline
1 History
2 CUDA: architecture overview and programming model
3 Threads and memories hierarchy
4 Toolbox
5 PyCUDA&PyOpenCL
6 EnSPy functionality
7 EnSPy architecture
8 Example: D5 potential
9 Example: Hill problem
10 Example: Hill problem, N–body version
11 Performance results
12 GPU computing prospects
30. Computations on GPU: a road towards desktop supercomputing
History
31. Computations on GPU: a road towards desktop supercomputing
History
History in brief:
33. Computations on GPU: a road towards desktop supercomputing
History
GPGPU in 2001-2006:
through graphics API (OpenGL or DirectX)
extremely hard
only in single precision
GPGPU today:
straightforward
easy
in double precision.
34. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
35. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Hardware model: GT200 architecture
consists of
multiprocessors
each MP has:
8 stream processors
1 unit for double-precision operations
shared memory
global memory
36. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Hardware model: Fermi architecture
each MP has:
32 stream processors
4 SFUs (Special Function Units)
each SP has:
1 FP unit & 1 INT unit
37. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Hardware model: Multiprocessors and threads
an MP can launch numerous threads
threads are "lightweight" – little creation and switching overhead
threads run the same code
thread synchronization within an MP
cooperation via shared memory
each thread has a unique identifier – the thread ID
Efficiency is achieved by hiding latency with computation, not by cache usage as on a CPU
38. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Software model: C for CUDA
a set of extensions to C
runtime library
function and variable type qualifiers
built-in vector types: float4, double2 etc.
built-in variables
Kernels
map the parallel part of the program to the GPU
execution: N times in parallel by N CUDA threads
CUDA Driver API
low-level control over the execution
no need for the nvcc compiler if kernels are precompiled – only the driver is needed
41. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Software model: Example

// Some function - executed on device (GPU)
__device__ float DeviceFunction(float* A, float* B)
{
    // Some math
    return smth;
}

// Kernel definition
__global__ void SomeKernel(float* A, float* B, float* C)
{
    // Some math
    *C = DeviceFunction(A, B);
}

// Host code
int main()
{
    // Kernel invocation
    SomeKernel<<<1, N>>>(A, B, C);
}
43. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Software model: Explanations
the __device__ qualifier defines a function that is:
executed on the device
callable from the device only
the __global__ qualifier defines a function that is:
executed on the device
callable from the host only
44. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Execution model
45. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Scalability
the underlying hardware architecture is hidden
threads can synchronize only within an MP
↓
we do not need to know the exact number of MPs
↓
scalable applications – from the GeForce 8800 GTX to Fermi
46. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
47. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Single threads
each thread has private local memory
threads are identified by the built-in variable threadIdx (uint3 type):
int idx = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
threads form a 1-, 2- or 3-dimensional array – a vector, matrix or field
Threads are organized into thread blocks
48. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Thread blocks
each block has shared memory, visible to all threads within the block
blocks are identified by the built-in variable blockIdx (uint3 type):
int bidx = blockIdx.x + blockIdx.y * gridDim.x;
the dimensions of the block are given by the built-in variable blockDim (dim3 type)
Blocks are organized into a grid
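How threadIdx, blockIdx and blockDim combine into a unique per-thread index can be checked with a small host-side sketch (plain Python standing in for the CUDA built-ins; 1D case for brevity):

```python
def global_index(block_idx, thread_idx, block_dim):
    # 1D case: each thread's unique position in the whole grid
    return block_idx * block_dim + thread_idx

# Enumerate a grid of 4 blocks x 256 threads and verify uniqueness
block_dim = 256
ids = [global_index(b, t, block_dim)
       for b in range(4) for t in range(block_dim)]
assert len(ids) == len(set(ids))  # every thread gets a distinct index
```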
49. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Grid of thread blocks
global device memory is accessible by all threads in the grid
the dimensions of the grid are given by the built-in variable gridDim (dim3 type)
50. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Threads and memories hierarchy
51. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Example: vector addition
int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);
    // Copy data from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Prepare the kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
52. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Example: vector addition
// Kernel code
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
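The kernel's logic can be checked without a GPU by emulating the grid with nested loops, each iteration playing the role of one CUDA thread (a pedagogical sketch, not how CUDA actually executes):

```python
def vec_add_emulated(A, B, n, threads_per_block=4):
    # Emulate a 1D CUDA grid: outer loop over blocks, inner over threads.
    C = [0.0] * n
    blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(blocks):                  # grid of blocks
        for thread_idx in range(threads_per_block):  # threads in a block
            i = block_idx * threads_per_block + thread_idx
            if i < n:                                # same guard as in the kernel
                C[i] = A[i] + B[i]
    return C

print(vec_add_emulated([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 5))
```

The `if i < n` guard matters for the same reason as on the device: the last block is usually only partially filled.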
55. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Performance analysis and optimization
there must be enough thread blocks per MP to hide latency
try not to under-populate blocks
use the memory bandwidth (∼ 100 GB/s!) efficiently
coalescing
non-optimized access to global memory can reduce performance by an order (or orders) of magnitude
try to achieve high arithmetic intensity
never diverge threads within one warp:
divergence → serialization = no parallelism
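"Arithmetic intensity" here means FLOPs per byte moved through memory. A back-of-the-envelope comparison (double precision, counting only global memory traffic, idealized operation counts) shows why vector addition is memory-bound while matrix multiplication is not:

```python
def intensity(flops, bytes_moved):
    # Arithmetic intensity: floating-point operations per byte of traffic
    return flops / bytes_moved

n = 1_000_000
# Vector add: n FLOPs, 3*n doubles moved (two reads, one write)
vec_add = intensity(n, 3 * n * 8)

# Naive m x m matmul: ~2*m^3 FLOPs over ~3*m^2 doubles of traffic
m = 1000
matmul = intensity(2 * m**3, 3 * m**2 * 8)

print(vec_add, matmul)  # roughly 0.04 vs 83 FLOPs per byte
```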
56. Computations on GPU: a road towards desktop supercomputing
Toolbox
57. Computations on GPU: a road towards desktop supercomputing
Toolbox
Start-up tools
drivers
CUDA Toolkit
nvcc compiler, runtime library, header files, CUBLAS, CUFFT, Visual Profiler etc.
CUDA SDK
examples, Occupancy Calculator etc.
Free download at http://developer.nvidia.com/object/cuda 2 3 downloads.html
Support for 32- and 64-bit Windows, Linux¹ & Mac OS X
¹ Supported distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04
59. Computations on GPU: a road towards desktop supercomputing
Toolbox
Developer Tools
CUDA-gdb
integration into gdb
CUDA C support
works on all 32/64-bit Linux distros
breakpoints and single-step execution
CUDA Visual Profiler
tracks events with hardware counters
global memory loads/stores
total branches and divergent branches taken by threads
instruction count
number of serialized thread warps due to address conflicts (shared and constant memory)
60. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
61. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
Python
easy to learn
dynamically typed
rich built-in functionality
interpreted
very well documented
has a large and active community
64. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
Scientific tools:
Scipy – modeling and simulation
Fourier transforms
ODE
Optimization
scipy.weave.inline – C inlining with little or no
overhead
···
NumPy – arrays
flexible array creation routines
sorting, random sampling and statistics
···
Python is a convenient way of interfacing C/C++ libraries
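A minimal taste of the NumPy functionality listed above (assuming only that NumPy is installed, as the talk does):

```python
import numpy as np

# Array creation and statistics
x = np.linspace(0.0, 1.0, 5)
print(x.mean())  # 0.5

# FFT round trip: the inverse transform recovers the signal
sig = np.sin(2 * np.pi * np.arange(8) / 8)
recovered = np.fft.ifft(np.fft.fft(sig)).real
print(np.allclose(recovered, sig))  # True
```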
65. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
PyCUDA
provides complete access to CUDA features
automatically manages resources
error handling and translation into Python exceptions
convenient abstractions: GPUArray
metaprogramming: CUDA source code can be created dynamically
interactive!
PyOpenCL is pretty much the same in concept – but not only for Nvidia GPUs: also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
66. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
Python and CUDA
We could interface with:
Python C API – low-level approach: overkill
SWIG, Boost::Python – high-level approach: overkill
PyCUDA – the simplest and most straightforward way, for CUDA only
scipy.weave.inline – a simple and straightforward way for both CUDA and plain C/C++
67. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
68. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
Motivation
Combine the flexibility of Python with the efficiency of C++ → CUDA for N-body simulations
the interface of EnSPy is written in Python
the core of EnSPy is written in C++
joined together by scipy.weave.inline
the C++ core can be used without Python – just include the header and link with the precompiled shared library
easily extensible, both through the high-level Python interface and the low-level C++ core – new algorithms, initial distributions etc.
multi-GPU parallelization
it's easy to experiment with EnSPy!
69. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
EnSPy functionality
Types of ensembles:
"simple" ensemble – no interaction, only an external potential
N-body ensemble – both an external potential and gravitational interaction between particles
Current algorithms:
4th-order Runge–Kutta for "simple" ensembles
Hermite scheme with shared time steps for the N-body ensemble
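The 4th-order Runge–Kutta step used for "simple" ensembles, as a minimal scalar sketch (EnSPy of course applies it to whole ensembles of trajectories on the GPU):

```python
import math

def rk4_step(f, t, y, h):
    # Classical 4th-order Runge-Kutta step for dy/dt = f(t, y)
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Integrate dy/dt = -y from y(0) = 1 up to t = 1; exact answer is exp(-1)
y, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    y = rk4_step(lambda t, y: -y, t, y, h)
    t += h
print(abs(y - math.exp(-1.0)) < 1e-9)  # True: 4th-order accuracy
```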
70. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
Predefined initial distributions:
uniform, point and spherical for "simple" ensembles
uniform sphere with 2T/|U| = 1 for the N-body ensemble
the user can supply functions (in Python) for initial ensemble generation
User-specified values and expressions:
parameters of the initial distribution
potential, forces, parameters of the integration scheme
an arbitrary number of triggers – Ni(t), the number of particles which do not cross the given hypersurface Fi(q, p) = 0 before time t
an arbitrary number of averages – F̄i(q, p, t) – quantities to be averaged over the ensembles
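A trigger in the sense above can be sketched as follows (a plain-Python stand-in, names hypothetical): count the particles whose trajectory has not yet crossed the hypersurface F(q, p) = 0, detected by a sign change of F along the stored states.

```python
def count_remaining(trajectories, F):
    # A particle "survives" while F(q, p) keeps its initial sign;
    # a sign change means the hypersurface F(q, p) = 0 was crossed.
    remaining = 0
    for traj in trajectories:
        s0 = F(*traj[0])
        if all(F(*state) * s0 > 0 for state in traj):
            remaining += 1
    return remaining

# Trigger x = 0 (the D5 example below): states are (x, px) pairs;
# particle 1 stays at x > 0, particle 2 crosses into x < 0
F = lambda x, px: x
trajs = [[(1.0, 0.1), (0.8, -0.2), (0.5, -0.3)],
         [(1.0, -0.5), (0.2, -0.6), (-0.4, -0.6)]]
print(count_remaining(trajs, F))  # 1
```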
71. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
Runtime generation and compilation of C and CUDA code:
user-specified expressions (as Python strings) are wrapped by the EnSPy template subpackage into C functions and a CUDA module
compiled at runtime
High usability and calculation efficiency:
flexible Python interface
all actual calculations are performed by a runtime-generated C extension and a precompiled shared library
Drawback:
extra time for generation and compilation of new code
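The runtime code generation can be sketched like this (a simplified stand-in for EnSPy's template subpackage; the template and function names are hypothetical): a user-supplied expression string is wrapped into CUDA C source, which would then be compiled and loaded at runtime.

```python
# Wrap a user-supplied expression (a Python string) into CUDA C source,
# the way a template engine would before runtime compilation.
KERNEL_TEMPLATE = """\
__device__ double potential(double x, double y)
{{
    return {expression};
}}
"""

def generate_device_code(expression):
    # Pure string templating - the actual compile step would follow
    return KERNEL_TEMPLATE.format(expression=expression)

src = generate_device_code("2.0*a*y*y - x*x + x*y*y + 0.25*x*x*x*x")
print("__device__" in src and "0.25*x*x*x*x" in src)  # True
```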
72. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
73. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
Execution flow and architecture
Input parameters
↓
Ensemble population (predefined or user-specified distribution)
↓
Code generation and compilation
↓
Launching NGPUs threads
74. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
GPU parallelization scheme for N–body simulations
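The all-pairs force evaluation that this scheme parallelizes can be written as a plain-Python reference (O(N²) direct summation in 2D; the softening length ε is an assumption of this sketch, not taken from the talk):

```python
def accelerations(pos, masses, G=1.0, eps=1e-3):
    # Direct-summation O(N^2) gravitational accelerations in 2D.
    # On the GPU, each thread computes one particle's acceleration,
    # streaming the other bodies through shared memory in tiles.
    n = len(pos)
    acc = [(0.0, 0.0)] * n
    for i in range(n):
        ax = ay = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps  # softened distance
            inv_r3 = r2 ** -1.5
            ax += G * masses[j] * dx * inv_r3
            ay += G * masses[j] * dy * inv_r3
        acc[i] = (ax, ay)
    return acc

# Two equal masses attract each other with opposite accelerations
a = accelerations([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0])
print(a[0][0] > 0 and abs(a[0][0] + a[1][0]) < 1e-12)  # True
```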
75. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
Order of force calculation
76. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
77. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Overview
Problem: escape from a potential well.
Watched value (trigger): N(t) – the number of particles remaining in the well at time t
Potential: UD5 = 2ay² − x² + xy² + x⁴/4
"Critical" energy: Ecr = ES = 0
78. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Potential and structure of phase space:
[Figure: level lines of the D5 potential in the (x, y) plane, and the phase-space portrait in the (x, px) plane]
79. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Calculation setup:
"simple" ensemble
uniform initial distribution of N = 10240 particles in x > 0 ∩ U(x, y) < E
trigger: x = 0 → q0 = 0.
12 lines of simple Python code (examples/d5.py): specification of the integration parameters
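The setup above can be sketched in plain Python (the actual examples/d5.py uses EnSPy's interface; the names and sampling box here are illustrative, assuming the potential UD5 = 2ay² − x² + xy² + x⁴/4 with a = 1):

```python
import random

def U_D5(x, y, a=1.0):
    # D5 potential; the saddle at the origin gives E_cr = E_S = 0
    return 2.0 * a * y * y - x * x + x * y * y + 0.25 * x ** 4

def populate(n, E):
    # Uniform rejection sampling in the well: x > 0 and U(x, y) < E
    pts = []
    while len(pts) < n:
        x, y = random.uniform(0.0, 3.0), random.uniform(-2.0, 2.0)
        if x > 0.0 and U_D5(x, y) < E:
            pts.append((x, y))
    return pts

print(U_D5(0.0, 0.0))           # 0.0: the critical ("saddle") energy
print(len(populate(100, 0.5)))  # 100
```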
80. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Results:
Regular particles are trapped in the well → the initial "mixed state" splits
[Figure: N(t)/N(0) vs t for E = 0.1 and E = 0.9]
81. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
82. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Overview
Problem: a toy model of escape from a star cluster: escape of a star from the potential of a point rotating star cluster Mc and a point galaxy core Mg ≫ Mc
Watched value (trigger): N(t) – the number of particles remaining in the cluster at time t
"Potential" in the cluster frame of reference (tidal approximation): UHill = −3ω²x²/2 − GMc/r
"Critical" energy: Ecr = ES = −4.5ω²
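The critical energy can be checked numerically, assuming the tidal-approximation potential U = −(3/2)ω²x² − GMc/r along the x axis and the conventions G = Mc = 1, ω = 1/√3 used in the calculation setup that follows:

```python
import math

omega = 1.0 / math.sqrt(3.0)   # rotation frequency from the setup
GMc = 1.0                      # G = Mc = 1

def U_hill(x):
    # Effective potential along the x axis in the rotating cluster frame
    return -1.5 * omega**2 * x**2 - GMc / abs(x)

# Tidal (Lagrange) radius: tidal force 3*omega^2*x balances GMc/x^2
rt = (GMc / (3.0 * omega**2)) ** (1.0 / 3.0)
print(rt)                                            # 1.0 with this omega
print(abs(U_hill(rt) - (-4.5 * omega**2)) < 1e-9)    # True: E_cr = -4.5*omega^2
```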
83. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Potential:
[Figure: Hill curves (zero-velocity curves) in the (x, y) plane]
84. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Calculation setup:
"simple" ensemble
uniform initial distribution of N = 10240 particles in |x| < rt ∩ U(x, y) < E, with ω = 1/√3 → rt = 1
trigger: |x| − rt = 0 → abs(q0) - 1. = 0.
12 lines of simple Python code (examples/hill plain.py): specification of the integration parameters
85. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Results:
Trapping of regular particles (some tricky physics here):
[Figure: N(t) vs number of time steps nt for E = −1.3, −0.8 and −0.3]
86. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
87. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
Overview
Problem: a simplified model of escape from a star cluster: escape of a star from the potential of a rotating star cluster with total mass Mc and the point potential of a galaxy core with mass Mg ≫ Mc (2D)
Watched values: the configuration of the cluster
Potential of the galaxy core in the cluster frame of reference (tidal approximation): UHillNB = −3ω²x²/2
88. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
”Toy” Hill model vs N–body Hill model:
89. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
Calculation setup:
N-body ensemble
2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R with zero initial velocities
Mc = 1, R = 200, ω = 1/√3
14 lines of simple Python code (examples/hill nbody.py): specification of the integration parameters
90. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
Results: cluster configuration
[Figure: six snapshots of the cluster in the (x, y) plane at steps 201, 401, 601, 801, 1001 and 1201]
91. Computations on GPU: a road towards desktop supercomputing
Performance results
OpenSUSE 11.2, GCC 4.4, CUDA 3.0. AMD Athlon X2 4400+ (2.3 GHz) / Intel Core2Duo E8500 (3.16 GHz), Nvidia GeForce GTX 260. Not as good as it could be – subject to improvement.
Estimate: ∼ 1 TFLOPs on 2x recent Fermi graphics processors
[Figure: left – speed-up of CUDA vs OpenMP and SSE-optimized code as a function of N; right – GFLOPs on the GTX 260 in double precision for the N-body and "simple" ensembles vs number of particles]
92. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
93. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
Yesterday:
uniform programming with OpenCL: no need to care about the concrete implementation
desktop supercomputers (full ATX form factor):
Nvidia Tesla C1060 x4: ∼ 300 GFLOPs / 4 TFLOPs, Windows & Linux 32/64-bit support
ATI FireStream x4: ∼ 960 GFLOPs / 4.8 TFLOPs, Windows & Linux 32/64-bit support
94. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
Today:
CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading
Nvidia Tesla C2050/C2070 x4: ∼ 2 TFLOPs / 4 TFLOPs, concurrent kernel execution; ∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs / 73 GFLOPs)
ATI FireStream 9350/9370 x4: ∼ 2 TFLOPs / 8 TFLOPs, stable double-precision support (12 August 2010)
LOEWE-CSC (University of Frankfurt): №22 in Top500
Tianhe-1A, Nebulae, Tsubame-2: №1, 3 and 4 supercomputers in Top500
95. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
Tomorrow:
OpenCL 1.2 (?) → matrix and "field" complex and real types
new libraries: GPU programming as simple as CPU programming
Nvidia GeForce GTX 580: ∼ 0.75 TFLOPs / 1.5 TFLOPs
ATI Radeon 6950 "Cayman": ∼ 0.75 TFLOPs / 3 TFLOPs
96. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
This presentation is available for download at
http://www.scribd.com/doc/27751403