This report was presented at the Frontiers in Computational Astrophysics conference (Lyon, France, 11-15 October 2010). I give a brief, light introduction to the CUDA architecture and its benefits for scientific HPC, along with a short description of the KIPT in-house package for N-body simulations. This talk, with minor differences, was also presented at seminars at the Institute for Single Crystals (Kharkov) and the Kharkov Institute of Physics and Technology.
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010)
1. Computations on GPU: a road towards desktop supercomputing
Computations on GPU: a road towards desktop supercomputing
Glib Ivashkevych
Institute of Theoretical Physics, NSC KIPT
November 24, 2010
3. Computations on GPU: a road towards desktop supercomputing
Quick outline
GPU – Graphic Processing Unit
programmable
manycore
multithreaded
with very high memory bandwidth
We are going to talk about:
how the GPU became useful for scientific computations
GPU intrinsics and programming
how to get as much as possible from the GPU and survive :)
the future of GPUs and GPU programming
We will mostly talk about CUDA (Nvidia's Compute Unified Device Architecture), and also about OpenCL (Open Computing Language)
7. Computations on GPU: a road towards desktop supercomputing
Should we care?
But first of all: do we really need GPU computing?
Short answer: yes!
high performance
transparent scalability
More accurate answer: yes, for problems with high parallelism.
large datasets
portions of the data can be processed independently
Most accurate answer: yes, for problems with high data
parallelism.
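What "high data parallelism" means here can be illustrated with a minimal plain-Python sketch (no GPU involved, names illustrative): one operation applied independently to every element of a dataset, so every element could in principle be computed by its own thread.

```python
# Data parallelism in miniature: one operation, applied independently
# to every element of a dataset. On a GPU each element would be
# handled by its own thread; here we just map over the elements.
def saxpy(a, xs, ys):
    # y <- a*x + y, elementwise; each output depends on one input
    # pair only, so all elements can be computed in parallel
    return [a * x + y for x, y in zip(xs, ys)]

xs = list(range(8))
ys = [1.0] * 8
print(saxpy(2.0, xs, ys))  # each element independent of the others
```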
8. Computations on GPU: a road towards desktop supercomputing
Reference
For reference
GFLOPs – 10⁹ FLoating point Operations Per second
∼ 55 GFLOPs on Intel Core i7 Nehalem 975 (according to Intel)
∼ 125 GFLOPs on AMD Opteron Istanbul 2435
∼ 500 GFLOPs in double and ∼ 2 TFLOPs in single precision on Nvidia Tesla C2050
∼ 3.2 · 10³ GFLOPs on ASCI Red at Sandia National Laboratory – the fastest supercomputer as of November 1999
∼ 87 · 10³ GFLOPs on the TSUBAME-1 grid cluster at Tokyo Institute of Technology – the first GPU-based supercomputer, №88 in Top500 as of November 2010 (№56 in November 2009)
∼ 2.56 · 10⁶ GFLOPs on Tianhe-1A at the National Supercomputing Center in Tianjin – the fastest supercomputer as of November 2010 – GPU-based
9. Computations on GPU: a road towards desktop supercomputing
Examples
Matrix and vector operations
CUBLAS¹ on Nvidia Tesla C2050 (CUDA 3.2) vs Intel MKL² 10.2 on Intel Core i7 Nehalem (4 threads)
∼ 8x in double precision
CULA³ on Nvidia Tesla C2050 (CUDA 3.2)
up to ∼ 220 GFLOPs in double precision
up to ∼ 450 GFLOPs in single precision
vs Intel MKL 10.2: ∼ 4–6x speed-up
¹ CUDA-accelerated Basic Linear Algebra Subprograms
² Math Kernel Library
³ LAPACK for Heterogeneous systems
10. Computations on GPU: a road towards desktop supercomputing
Examples
Fast Fourier Transform
CUFFT on Nvidia Tesla C2070 (CUDA 3.2)
up to 65 GFLOPs in double precision
up to 220 GFLOPs in single precision
vs Intel MKL on Intel Core i7 Nehalem
∼ 9x in double precision
∼ 20x in single precision
11. Computations on GPU: a road towards desktop supercomputing
Examples
Physics: Computational Fluid Dynamics
Simulation of transition to turbulence¹
Nvidia Tesla S1070 vs quad-core Intel Xeon X5450 (3 GHz)
∼ 20x over serial code
∼ 10x over the OpenMP implementation (2 threads)
∼ 5x over the OpenMP implementation (4 threads)
¹ A.S. Antoniou et al., American Institute of Aeronautics and Astronautics Paper 2010-0525
12. Computations on GPU: a road towards desktop supercomputing
Examples
Quantum chemistry
Calculations of molecular orbitals¹
Nvidia GeForce GTX 280 vs Intel Core2Quad Q6600 (2.4 GHz)
∼ 173x over serial non-optimized code
∼ 14x over parallel optimized code (4 threads)
¹ D.J. Hardy et al., GPGPU 2009
13. Computations on GPU: a road towards desktop supercomputing
Examples
Medical Imaging
Isosurface reconstruction from scalar volumetric data¹
Nvidia GeForce GTX 285 vs ?
∼ 68x over optimized CPU code
nearly real-time processing of data
¹ T. Kalbe et al., Proceedings of the 5th International Symposium on Visual Computing (ISVC 2009)
14. Computations on GPU: a road towards desktop supercomputing
Examples
GPUGrid.net
Biomolecular simulations, accelerated by Nvidia CUDA boards and Sony PlayStation
∼ 8000 users from 101 countries
∼ 145 TFLOPs on average ≈ №25 in Top500
∼ 50 GFLOPs from every active user
15. Computations on GPU: a road towards desktop supercomputing
Examples
ATLAS experiment on Large Hadron Collider
Particle tracking, triggering, event simulation¹
possible Higgs events – tracking a large number of particles
∼ 32x in tracking, ∼ 35x in triggering on Nvidia Tesla C1060
¹ P.J. Clark et al., Processing Petabytes per Second with the ATLAS Experiment at the LHC (GTC 2010)
16. Computations on GPU: a road towards desktop supercomputing
Examples
And even more examples in:
N–body simulations
seismic simulations
molecular dynamics
SETI@Home & MilkyWay@Home
finance
neural networks
...
and, of course, graphics
VFX, rendering
image editing, video
17. Computations on GPU: a road towards desktop supercomputing
Examples
GPU Technology Conference 2010 (September 20-23)
29. Computations on GPU: a road towards desktop supercomputing
Outline
Outline
1 History
2 CUDA: architecture overview and programming model
3 Threads and memories hierarchy
4 Toolbox
5 PyCUDA&PyOpenCL
6 EnSPy functionality
7 EnSPy architecture
8 Example: D5 potential
9 Example: Hill problem
10 Example: Hill problem, N–body version
11 Performance results
12 GPU computing prospects
30. Computations on GPU: a road towards desktop supercomputing
History
31. Computations on GPU: a road towards desktop supercomputing
History
History in brief:
33. Computations on GPU: a road towards desktop supercomputing
History
GPGPU in 2001-2006:
through graphics API (OpenGL or DirectX)
extremely hard
only in single precision
GPGPU today:
straightforward
easy
in double precision.
34. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
35. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Hardware model: GT200 architecture
consists of
multiprocessors
each MP has:
8 stream processors
1 unit for double-precision operations
shared memory
global memory
36. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Hardware model: Fermi architecture
each MP has:
32 stream processors
4 SFUs (Special Function Units)
each SP has:
1 FP unit & 1 INT unit
37. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Hardware model: Multiprocessors and threads
an MP can launch numerous threads
threads are "lightweight" – little creation and switching overhead
threads run the same code
thread synchronization within an MP
cooperation via shared memory
each thread has a unique identifier – the thread ID
Efficiency is achieved by hiding latency with computation, not by cache usage as on a CPU
38. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Software model: C for CUDA
a set of extensions to C
runtime library
function and variable type qualifiers
built-in vector types: float4, double2 etc.
built-in variables
Kernels
map the parallel part of the program to the GPU
execution: N times in parallel by N CUDA threads
CUDA Driver API
low-level control over the execution
no need for the nvcc compiler if kernels are precompiled – only the driver is needed
41. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Software model: Example

// Some function - executed on device (GPU)
__device__ float DeviceFunction(float* A, float* B)
{
    // Some math
    return smth;
}

// Kernel definition
__global__ void SomeKernel(float* A, float* B, float* C)
{
    // Some math
    *C = DeviceFunction(A, B);
}

// Host code
int main()
{
    // Kernel invocation
    SomeKernel<<<1, N>>>(A, B, C);
}
43. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Software model: Explanations
the __device__ qualifier defines a function that is:
executed on the device
callable from the device only
the __global__ qualifier defines a function that is:
executed on the device
callable from the host only
44. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Execution model
45. Computations on GPU: a road towards desktop supercomputing
CUDA: architecture overview and programming model
Scalability
the underlying hardware architecture is hidden
threads can synchronize only within an MP
↓
we do not need to know the exact number of MPs
↓
scalable applications – from the GeForce 8800 GTX to Fermi
46. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
47. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Single threads
each thread has private local memory
threads are identified by the built-in variable threadIdx (uint3 type):
int idx = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;
threads form a 1-, 2- or 3-dimensional array – a vector, matrix or field
Threads are organized into thread blocks
48. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Thread blocks
each block has shared memory, visible to all threads within the block
blocks are identified by the built-in variable blockIdx (uint3 type):
int bidx = blockIdx.x + blockIdx.y * gridDim.x;
the dimensions of the block are given by the built-in variable blockDim (dim3 type)
Blocks are organized into a grid
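How threadIdx, blockIdx and blockDim combine into a unique per-thread index can be checked with a small host-side sketch (plain Python standing in for the CUDA built-ins; 1D case for brevity):

```python
def global_index(block_idx, thread_idx, block_dim):
    # 1D case: each thread's unique position in the whole grid
    return block_idx * block_dim + thread_idx

# Enumerate a grid of 4 blocks x 256 threads and verify uniqueness
block_dim = 256
ids = [global_index(b, t, block_dim)
       for b in range(4) for t in range(block_dim)]
assert len(ids) == len(set(ids))  # every thread gets a distinct index
```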
49. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Grid of thread blocks
global device memory is accessible by all threads in the grid
the dimensions of the grid are given by the built-in variable gridDim (dim3 type)
50. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Threads and memories hierarchy
51. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Example: vector addition
int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float* d_A;
    cudaMalloc((void**)&d_A, size);
    float* d_B;
    cudaMalloc((void**)&d_B, size);
    float* d_C;
    cudaMalloc((void**)&d_C, size);
    // Copy data from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Prepare the kernel launch
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
52. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Example: vector addition
// Kernel code
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
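The kernel's logic can be checked without a GPU by emulating the grid with nested loops, each iteration playing the role of one CUDA thread (a pedagogical sketch, not how CUDA actually executes):

```python
def vec_add_emulated(A, B, n, threads_per_block=4):
    # Emulate a 1D CUDA grid: outer loop over blocks, inner over threads.
    C = [0.0] * n
    blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(blocks):                  # grid of blocks
        for thread_idx in range(threads_per_block):  # threads in a block
            i = block_idx * threads_per_block + thread_idx
            if i < n:                                # same guard as in the kernel
                C[i] = A[i] + B[i]
    return C

print(vec_add_emulated([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 5))
```

The `if i < n` guard matters for the same reason as on the device: the last block is usually only partially filled.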
55. Computations on GPU: a road towards desktop supercomputing
Threads and memories hierarchy
Performance analysis and optimization
there must be enough thread blocks per MP to hide latency
try not to under-populate blocks
use the memory bandwidth (∼ 100 GB/s!) efficiently
coalescing
non-optimized access to global memory can reduce performance by an order (or orders) of magnitude
try to achieve high arithmetic intensity
never diverge threads within one warp:
divergence → serialization = no parallelism
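"Arithmetic intensity" here means FLOPs per byte moved through memory. A back-of-the-envelope comparison (double precision, counting only global memory traffic, idealized operation counts) shows why vector addition is memory-bound while matrix multiplication is not:

```python
def intensity(flops, bytes_moved):
    # Arithmetic intensity: floating-point operations per byte of traffic
    return flops / bytes_moved

n = 1_000_000
# Vector add: n FLOPs, 3*n doubles moved (two reads, one write)
vec_add = intensity(n, 3 * n * 8)

# Naive m x m matmul: ~2*m^3 FLOPs over ~3*m^2 doubles of traffic
m = 1000
matmul = intensity(2 * m**3, 3 * m**2 * 8)

print(vec_add, matmul)  # roughly 0.04 vs 83 FLOPs per byte
```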
56. Computations on GPU: a road towards desktop supercomputing
Toolbox
57. Computations on GPU: a road towards desktop supercomputing
Toolbox
Start-up tools
drivers
CUDA Toolkit
nvcc compiler, runtime library, header files, CUBLAS, CUFFT, Visual Profiler etc.
CUDA SDK
examples, Occupancy Calculator etc.
Free download at http://developer.nvidia.com/object/cuda 2 3 downloads.html
Support for 32- and 64-bit Windows, Linux¹ & Mac OS X
¹ Supported distros in CUDA 3.2: Fedora 13, RH Enterprise 4.8 & 5.5, OpenSUSE 11.2, SLED 11.0, Ubuntu 10.04
59. Computations on GPU: a road towards desktop supercomputing
Toolbox
Developer Tools
CUDA-gdb
integration into gdb
CUDA C support
works on all 32/64-bit Linux distros
breakpoints and single-step execution
CUDA Visual Profiler
tracks events with hardware counters
global memory loads/stores
total branches and divergent branches taken by threads
instruction count
number of serialized thread warps due to address conflicts (shared and constant memory)
60. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
61. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
Python
easy to learn
dynamically typed
rich built-in functionality
interpreted
very well documented
has a large and active community
64. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
Scientific tools:
Scipy – modeling and simulation
Fourier transforms
ODE
Optimization
scipy.weave.inline – C inlining with little or no
overhead
···
NumPy – arrays
flexible array creation routines
sorting, random sampling and statistics
···
Python is a convenient way of interfacing C/C++ libraries
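A minimal taste of the NumPy functionality listed above (assuming only that NumPy is installed, as the talk does):

```python
import numpy as np

# Array creation and statistics
x = np.linspace(0.0, 1.0, 5)
print(x.mean())  # 0.5

# FFT round trip: the inverse transform recovers the signal
sig = np.sin(2 * np.pi * np.arange(8) / 8)
recovered = np.fft.ifft(np.fft.fft(sig)).real
print(np.allclose(recovered, sig))  # True
```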
65. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
PyCUDA
provides complete access to CUDA features
automatically manages resources
error handling and translation into Python exceptions
convenient abstractions: GPUArray
metaprogramming: CUDA source code can be created dynamically
interactive!
PyOpenCL is pretty much the same in concept – but not only for Nvidia GPUs: also for ATI/AMD cards, AMD & Intel processors etc. (IBM Cell?)
66. Computations on GPU: a road towards desktop supercomputing
PyCUDA&PyOpenCL
Python and CUDA
We could interface with:
Python C API – low-level approach: overkill
SWIG, Boost::Python – high-level approach: overkill
PyCUDA – the simplest and most straightforward way, for CUDA only
scipy.weave.inline – a simple and straightforward way for both CUDA and plain C/C++
67. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
68. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
Motivation
Combine the flexibility of Python with the efficiency of C++ → CUDA for N-body simulations
the interface of EnSPy is written in Python
the core of EnSPy is written in C++
joined together by scipy.weave.inline
the C++ core can be used without Python – just include the header and link with the precompiled shared library
easily extensible, both through the high-level Python interface and the low-level C++ core – new algorithms, initial distributions etc.
multi-GPU parallelization
it's easy to experiment with EnSPy!
69. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
EnSPy functionality
Types of ensembles:
"simple" ensemble – no interaction, only an external potential
N-body ensemble – both an external potential and gravitational interaction between particles
Current algorithms:
4th-order Runge–Kutta for "simple" ensembles
Hermite scheme with shared time steps for the N-body ensemble
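The 4th-order Runge–Kutta step used for "simple" ensembles, as a minimal scalar sketch (EnSPy of course applies it to whole ensembles of trajectories on the GPU):

```python
import math

def rk4_step(f, t, y, h):
    # Classical 4th-order Runge-Kutta step for dy/dt = f(t, y)
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

# Integrate dy/dt = -y from y(0) = 1 up to t = 1; exact answer is exp(-1)
y, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    y = rk4_step(lambda t, y: -y, t, y, h)
    t += h
print(abs(y - math.exp(-1.0)) < 1e-9)  # True: 4th-order accuracy
```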
70. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
Predefined initial distributions:
uniform, point and spherical for "simple" ensembles
uniform sphere with 2T/|U| = 1 for the N-body ensemble
the user can supply functions (in Python) for initial ensemble generation
User-specified values and expressions:
parameters of the initial distribution
potential, forces, parameters of the integration scheme
an arbitrary number of triggers – Ni(t), the number of particles which do not cross the given hypersurface Fi(q, p) = 0 before time t
an arbitrary number of averages – F̄i(q, p, t) – quantities to be averaged over the ensembles
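A trigger in the sense above can be sketched as follows (a plain-Python stand-in, names hypothetical): count the particles whose trajectory has not yet crossed the hypersurface F(q, p) = 0, detected by a sign change of F along the stored states.

```python
def count_remaining(trajectories, F):
    # A particle "survives" while F(q, p) keeps its initial sign;
    # a sign change means the hypersurface F(q, p) = 0 was crossed.
    remaining = 0
    for traj in trajectories:
        s0 = F(*traj[0])
        if all(F(*state) * s0 > 0 for state in traj):
            remaining += 1
    return remaining

# Trigger x = 0 (the D5 example below): states are (x, px) pairs;
# particle 1 stays at x > 0, particle 2 crosses into x < 0
F = lambda x, px: x
trajs = [[(1.0, 0.1), (0.8, -0.2), (0.5, -0.3)],
         [(1.0, -0.5), (0.2, -0.6), (-0.4, -0.6)]]
print(count_remaining(trajs, F))  # 1
```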
71. Computations on GPU: a road towards desktop supercomputing
EnSPy functionality
Runtime generation and compilation of C and CUDA code:
user-specified expressions (as Python strings) are wrapped by the EnSPy template subpackage into C functions and a CUDA module
compiled at runtime
High usability and calculation efficiency:
flexible Python interface
all actual calculations are performed by a runtime-generated C extension and a precompiled shared library
Drawback:
extra time for generation and compilation of new code
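The runtime code generation can be sketched like this (a simplified stand-in for EnSPy's template subpackage; the template and function names are hypothetical): a user-supplied expression string is wrapped into CUDA C source, which would then be compiled and loaded at runtime.

```python
# Wrap a user-supplied expression (a Python string) into CUDA C source,
# the way a template engine would before runtime compilation.
KERNEL_TEMPLATE = """\
__device__ double potential(double x, double y)
{{
    return {expression};
}}
"""

def generate_device_code(expression):
    # Pure string templating - the actual compile step would follow
    return KERNEL_TEMPLATE.format(expression=expression)

src = generate_device_code("2.0*a*y*y - x*x + x*y*y + 0.25*x*x*x*x")
print("__device__" in src and "0.25*x*x*x*x" in src)  # True
```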
72. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
73. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
Execution flow and architecture
Input parameters
↓
Ensemble population (predefined or user-specified distribution)
↓
Code generation and compilation
↓
Launching NGPUs threads
74. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
GPU parallelization scheme for N–body simulations
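The all-pairs force evaluation that this scheme parallelizes can be written as a plain-Python reference (O(N²) direct summation in 2D; the softening length ε is an assumption of this sketch, not taken from the talk):

```python
def accelerations(pos, masses, G=1.0, eps=1e-3):
    # Direct-summation O(N^2) gravitational accelerations in 2D.
    # On the GPU, each thread computes one particle's acceleration,
    # streaming the other bodies through shared memory in tiles.
    n = len(pos)
    acc = [(0.0, 0.0)] * n
    for i in range(n):
        ax = ay = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps * eps  # softened distance
            inv_r3 = r2 ** -1.5
            ax += G * masses[j] * dx * inv_r3
            ay += G * masses[j] * dy * inv_r3
        acc[i] = (ax, ay)
    return acc

# Two equal masses attract each other with opposite accelerations
a = accelerations([(0.0, 0.0), (1.0, 0.0)], [1.0, 1.0])
print(a[0][0] > 0 and abs(a[0][0] + a[1][0]) < 1e-12)  # True
```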
75. Computations on GPU: a road towards desktop supercomputing
EnSPy architecture
Order of force calculation
76. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
77. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Overview
Problem: escape from a potential well.
Watched value (trigger): N(t) – the number of particles remaining in the well at time t
Potential: UD5 = 2ay² − x² + xy² + x⁴/4
"Critical" energy: Ecr = ES = 0
78. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Potential and structure of phase space:
[Figure: level lines of the D5 potential in the (x, y) plane, and the phase-space portrait in the (x, px) plane]
79. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Calculation setup:
"simple" ensemble
uniform initial distribution of N = 10240 particles in x > 0 ∩ U(x, y) < E
trigger: x = 0 → q0 = 0.
12 lines of simple Python code (examples/d5.py): specification of the integration parameters
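The setup above can be sketched in plain Python (the actual examples/d5.py uses EnSPy's interface; the names and sampling box here are illustrative, assuming the potential UD5 = 2ay² − x² + xy² + x⁴/4 with a = 1):

```python
import random

def U_D5(x, y, a=1.0):
    # D5 potential; the saddle at the origin gives E_cr = E_S = 0
    return 2.0 * a * y * y - x * x + x * y * y + 0.25 * x ** 4

def populate(n, E):
    # Uniform rejection sampling in the well: x > 0 and U(x, y) < E
    pts = []
    while len(pts) < n:
        x, y = random.uniform(0.0, 3.0), random.uniform(-2.0, 2.0)
        if x > 0.0 and U_D5(x, y) < E:
            pts.append((x, y))
    return pts

print(U_D5(0.0, 0.0))           # 0.0: the critical ("saddle") energy
print(len(populate(100, 0.5)))  # 100
```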
80. Computations on GPU: a road towards desktop supercomputing
Example: D5 potential
Results:
Regular particles are trapped in the well → the initial "mixed state" splits
[Figure: N(t)/N(0) vs t for E = 0.1 and E = 0.9]
81. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
82. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Overview
Problem: a toy model of escape from a star cluster: escape of a star from the potential of a point rotating star cluster Mc and a point galaxy core Mg ≫ Mc
Watched value (trigger): N(t) – the number of particles remaining in the cluster at time t
"Potential" in the cluster frame of reference (tidal approximation): UHill = −3ω²x²/2 − GMc/r
"Critical" energy: Ecr = ES = −4.5ω²
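The critical energy can be checked numerically, assuming the tidal-approximation potential U = −(3/2)ω²x² − GMc/r along the x axis and the conventions G = Mc = 1, ω = 1/√3 used in the calculation setup that follows:

```python
import math

omega = 1.0 / math.sqrt(3.0)   # rotation frequency from the setup
GMc = 1.0                      # G = Mc = 1

def U_hill(x):
    # Effective potential along the x axis in the rotating cluster frame
    return -1.5 * omega**2 * x**2 - GMc / abs(x)

# Tidal (Lagrange) radius: tidal force 3*omega^2*x balances GMc/x^2
rt = (GMc / (3.0 * omega**2)) ** (1.0 / 3.0)
print(rt)                                            # 1.0 with this omega
print(abs(U_hill(rt) - (-4.5 * omega**2)) < 1e-9)    # True: E_cr = -4.5*omega^2
```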
83. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Potential:
[Figure: Hill curves (zero-velocity curves) in the (x, y) plane]
84. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Calculation setup:
"simple" ensemble
uniform initial distribution of N = 10240 particles in |x| < rt ∩ U(x, y) < E, with ω = 1/√3 → rt = 1
trigger: |x| − rt = 0 → abs(q0) - 1. = 0.
12 lines of simple Python code (examples/hill plain.py): specification of the integration parameters
85. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem
Results:
Trapping of regular particles (some tricky physics here):
[Figure: N(t) vs number of time steps nt for E = −1.3, −0.8 and −0.3]
86. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
87. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
Overview
Problem: a simplified model of escape from a star cluster: escape of a star from the potential of a rotating star cluster with total mass Mc and the point potential of a galaxy core with mass Mg ≫ Mc (2D)
Watched values: the configuration of the cluster
Potential of the galaxy core in the cluster frame of reference (tidal approximation): UHillNB = −3ω²x²/2
88. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
”Toy” Hill model vs N–body Hill model:
89. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
Calculation setup:
N-body ensemble
2D (z = 0) initial distribution of N = 10240 particles inside a circle of radius R with zero initial velocities
Mc = 1, R = 200, ω = 1/√3
14 lines of simple Python code (examples/hill nbody.py): specification of the integration parameters
90. Computations on GPU: a road towards desktop supercomputing
Example: Hill problem, N–body version
Results: cluster configuration
[Figure: six snapshots of the cluster in the (x, y) plane at steps 201, 401, 601, 801, 1001 and 1201]
91. Computations on GPU: a road towards desktop supercomputing
Performance results
OpenSUSE 11.2, GCC 4.4, CUDA 3.0. AMD Athlon X2 4400+ (2.3 GHz) / Intel Core2Duo E8500 (3.16 GHz), Nvidia GeForce GTX 260. Not as good as it could be – subject to improvement.
Estimate: ∼ 1 TFLOPs on 2x recent Fermi graphics processors
[Figure: left – speed-up of CUDA vs OpenMP and SSE-optimized code as a function of N; right – GFLOPs on the GTX 260 in double precision for the N-body and "simple" ensembles vs number of particles]
92. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
93. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
Yesterday:
uniform programming with OpenCL: no need to care about the concrete implementation
desktop supercomputers (full ATX form factor):
Nvidia Tesla C1060 x4: ∼ 300 GFLOPs / 4 TFLOPs, Windows & Linux 32/64-bit support
ATI FireStream x4: ∼ 960 GFLOPs / 4.8 TFLOPs, Windows & Linux 32/64-bit support
94. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
Today:
CUDA 3.2 → C++: classes, namespaces, default parameters, operator overloading
Nvidia Tesla C2050/C2070 x4: ∼ 2 TFLOPs / 4 TFLOPs, concurrent kernel execution; ∼ 8x in GFLOPs, ∼ 6x in GFLOPs/$, ∼ 5x in GFLOPs/W vs four Intel Xeon X5550 (85 GFLOPs / 73 GFLOPs)
ATI FireStream 9350/9370 x4: ∼ 2 TFLOPs / 8 TFLOPs, stable double-precision support (12 August 2010)
LOEWE-CSC (University of Frankfurt): №22 in Top500
Tianhe-1A, Nebulae, Tsubame-2: №1, 3 and 4 supercomputers in Top500
95. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
Tomorrow:
OpenCL 1.2 (?) → matrix and "field" complex and real types
new libraries: GPU programming as simple as CPU programming
Nvidia GeForce GTX 580: ∼ 0.75 TFLOPs / 1.5 TFLOPs
ATI Radeon 6950 "Cayman": ∼ 0.75 TFLOPs / 3 TFLOPs
96. Computations on GPU: a road towards desktop supercomputing
GPU computing prospects
This presentation is available for download at
http://www.scribd.com/doc/27751403