Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
1. Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Koichi Shirahata¹, Jun Doi², Mikio Takeuchi²
1: Tokyo Institute of Technology
2: IBM Research - Tokyo
2. Programming Models for Exascale Computing
• GPU-based heterogeneous clusters
  – e.g., TSUBAME 2.5 (3 GPUs x 1,408 nodes)
  – Acceleration using GPUs
    • High peak performance and memory bandwidth
• Programming models for GPU-based clusters
  – Message passing (e.g., MPI)
    • High tuning efficiency
    • High programming cost
  – APGAS (Asynchronous Partitioned Global Address Space)
    • Abstracts distributed memory and the deep memory hierarchy
    • X10: an instance of the APGAS programming languages
→ Highly scalable and productive computing on GPUs using APGAS
3. Problem Statement
• How much do GPUs accelerate applications using APGAS?
  – Tradeoff between performance and productivity
• Performance
  – The abstraction of the memory hierarchy may limit performance
  – Scalability on multi-GPU systems
• Productivity
  – Can we use GPUs with little programming cost?
4. Goal and Contributions
• Goal
  – Scalable and productive computing on GPUs
• Approach
  – Performance analysis of lattice QCD in X10 CUDA
    • Implement lattice QCD in X10 CUDA
    • Comparative performance analysis of X10 on GPUs
• Confirm acceleration using GPUs in X10
  – 19.4x speedup over X10 by using X10 CUDA on 32 nodes of TSUBAME 2.5
5. The APGAS Model using GPUs
• X10 provides CUDA support for GPUs [1]
(Diagram: CPU places 0 to N-1, each with child GPU places 0 to M-1; activities and objects are created or moved across places with at/async, and data is transferred with asyncCopy.)
[1] Cunningham, D. et al. "GPU programming in a high level language: compiling X10 to CUDA." Proceedings of the 2011 ACM SIGPLAN X10 Workshop. ACM, 2011.
6. Lattice QCD
• Lattice QCD
  – Simulation of Quantum Chromodynamics (QCD) of quarks and gluons on a 4D grid of points in space and time
  – A grand challenge in high-performance computing
    • Requires high memory/network bandwidth and computational power
• Computing lattice QCD
  – Monte Carlo simulations on the 4D grid
  – Dominated by solving linear equations by matrix-vector multiplication using iterative methods (e.g., the CG method), as sketched below
  – Parallelizable by dividing the 4D grid into partial grids for each place
    • Boundary exchanges are required between places in each direction
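To make the iteration structure concrete, here is a minimal CG solver in CUDA C++ in which a simple 1D stencil stands in for the Wilson-Dirac operator; the kernel names, the stencil, and the problem size are illustrative assumptions rather than the authors' X10 CUDA code. It shows the per-iteration mix the slides refer to: one matrix-vector product plus a few BLAS level 1 operations (dot products and vector updates).

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Stand-in for the Wilson-Dirac operator: a 1D nearest-neighbor stencil
// (the real operator couples spins and colors on a 4D lattice).
__global__ void matvec(const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double l = (i > 0)     ? x[i - 1] : 0.0;
    double r = (i < n - 1) ? x[i + 1] : 0.0;
    y[i] = 4.0 * x[i] - l - r;                    // symmetric positive definite
}

// BLAS level 1 updates used once per CG iteration.
__global__ void xr_update(double a, const double* p, const double* q,
                          double* x, double* r, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { x[i] += a * p[i]; r[i] -= a * q[i]; }
}
__global__ void p_update(double beta, const double* r, double* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = r[i] + beta * p[i];
}

static double dot(const thrust::device_vector<double>& a,
                  const thrust::device_vector<double>& b) {
    return thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    // Solve A x = b with b = 1 and x0 = 0, so r0 = b and p0 = r0.
    thrust::device_vector<double> x(n, 0.0), r(n, 1.0), p(n, 1.0), q(n);
    double* xp = thrust::raw_pointer_cast(x.data());
    double* rp = thrust::raw_pointer_cast(r.data());
    double* pp = thrust::raw_pointer_cast(p.data());
    double* qp = thrust::raw_pointer_cast(q.data());

    double rr = dot(r, r);
    for (int it = 0; it < 500 && rr > 1e-12; ++it) {  // typically a few hundred iterations
        matvec<<<blocks, threads>>>(pp, qp, n);       // one matrix-vector product
        double alpha = rr / dot(p, q);
        xr_update<<<blocks, threads>>>(alpha, pp, qp, xp, rp, n);
        double rr_new = dot(r, r);
        p_update<<<blocks, threads>>>(rr_new / rr, rp, pp, n);
        rr = rr_new;
    }
    std::printf("final |r|^2 = %e\n", rr);
    return 0;
}
```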
7. Implementation of Lattice QCD in X10 CUDA
• We extend our lattice QCD in X10 to X10 CUDA
  – Port the whole solver into CUDA kernels
    • Wilson-Dirac operator, BLAS level 1 operations
    • Avoids wasteful memory copy overheads between CPU and GPU, except for boundary data transfers
  – Implement boundary data transfer among GPUs
    • Add memory copy operations
      – (1) GPU → CPU, (2) CPU → CPU, (3) CPU → GPU (see the sketch below)
  – Optimizations
    • Data layout transformation
    • Communication overlapping
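The three memory copy steps can be pictured with the following CUDA + MPI sketch; the buffer names, face size, and neighbor ranks are illustrative assumptions, and the X10 CUDA version expresses the same steps with asyncCopy between the GPU place, its host place, and the remote place instead of explicit MPI calls.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Exchange one face of the local sub-lattice with the neighbor ranks in one
// direction: (1) GPU -> CPU, (2) CPU -> CPU, (3) CPU -> GPU.
void exchange_boundary(double* d_send_face, double* d_recv_face,
                       size_t face_elems, int rank_prev, int rank_next) {
    std::vector<double> h_send(face_elems), h_recv(face_elems);
    size_t bytes = face_elems * sizeof(double);

    // (1) copy the packed boundary face from GPU to CPU
    cudaMemcpy(h_send.data(), d_send_face, bytes, cudaMemcpyDeviceToHost);

    // (2) exchange between CPUs (send to the next rank, receive from the previous)
    MPI_Sendrecv(h_send.data(), face_elems, MPI_DOUBLE, rank_next, 0,
                 h_recv.data(), face_elems, MPI_DOUBLE, rank_prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // (3) copy the received face from CPU back to GPU
    cudaMemcpy(d_recv_face, h_recv.data(), bytes, cudaMemcpyHostToDevice);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const size_t face = 24 * 24 * 24;           // one t-face of a 24^3 x 96 local grid
    double *d_send, *d_recv;
    cudaMalloc(&d_send, face * sizeof(double));
    cudaMalloc(&d_recv, face * sizeof(double));
    cudaMemset(d_send, 0, face * sizeof(double));
    exchange_boundary(d_send, d_recv, face,
                      (rank - 1 + size) % size, (rank + 1) % size);
    cudaFree(d_send); cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```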
8. Data Layout Transformation
• Two types of data layouts for quarks and gluons
  – AoS (Array of Structures)
    • Non-contiguous data
    • Used in our original CPU implementation
  – SoA (Structure of Arrays)
    • Contiguous data
    • Suitable for vectorization
• We translate from AoS to SoA (see the sketch below)
  – GPUs are suited to coalesced memory access
(Figure: the same lattice sites (x, y, z, t) from (0, 0, 0, 0) to (n-1, n-1, n-1, n-1) with spin components 1..m, stored per-site in AoS and per-component in SoA.)
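A minimal CUDA C++ sketch of the layout change is shown below; the three-component "spinor" and the field names are simplified assumptions (actual lattice QCD spinors carry more components), but they show why SoA enables coalesced access: consecutive threads touch consecutive addresses of the same component array.

```cuda
#include <cuda_runtime.h>

// AoS: all components of one lattice site are adjacent in memory.
struct SpinorAoS { double s0, s1, s2; };

// SoA: each component is a separate contiguous array.
struct SpinorSoA { double *s0, *s1, *s2; };

// Convert AoS to SoA. Thread i handles site i; with SoA, the writes of a warp
// to out.s0[i] hit consecutive addresses, i.e. they are coalesced, whereas the
// AoS reads are strided by sizeof(SpinorAoS).
__global__ void aos_to_soa(const SpinorAoS* in, SpinorSoA out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out.s0[i] = in[i].s0;
    out.s1[i] = in[i].s1;
    out.s2[i] = in[i].s2;
}

int main() {
    const int n = 24 * 24 * 24 * 96;            // sites of the default problem size
    SpinorAoS* d_aos;
    SpinorSoA d_soa;
    cudaMalloc(&d_aos, n * sizeof(SpinorAoS));
    cudaMalloc(&d_soa.s0, n * sizeof(double));
    cudaMalloc(&d_soa.s1, n * sizeof(double));
    cudaMalloc(&d_soa.s2, n * sizeof(double));
    aos_to_soa<<<(n + 255) / 256, 256>>>(d_aos, d_soa, n);
    cudaDeviceSynchronize();
    return 0;
}
```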
9. Communication Optimizations in X10 CUDA
• Two communication optimizations
  – Multi-dimensional partitioning
  – Communication overlapping (see the stream sketch below)
    • Overlap memory copy between GPU and CPU in addition to between CPU and CPU
    • The region that can be overlapped is limited by finish-based synchronization
(Figure: per-direction timeline (T, X, Y, Z, Inner) of GPU kernels, transfers between CPU and GPU, exchanges between CPU and CPU, and finish-based synchronization points.)
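Since X10 CUDA cannot create CUDA streams (see the drawbacks slide), the kind of overlap it approximates with asyncCopy is what an MPI CUDA code obtains explicitly with streams; the sketch below is a generic two-stream pattern with placeholder kernels and sizes, not the authors' implementation.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels: "bulk" updates interior sites that need no halo data,
// "boundary" packs/updates the faces that will be exchanged.
__global__ void bulk_kernel(double* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) u[i] *= 1.0;
}
__global__ void boundary_kernel(double* face, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) face[i] *= 1.0;
}

int main() {
    const int n_bulk = 1 << 20, n_face = 24 * 24 * 24;
    double *d_bulk, *d_face, *h_face;
    cudaMalloc(&d_bulk, n_bulk * sizeof(double));
    cudaMalloc(&d_face, n_face * sizeof(double));
    cudaMallocHost((void**)&h_face, n_face * sizeof(double)); // pinned memory for async copies

    cudaStream_t s_comp, s_comm;
    cudaStreamCreate(&s_comp);
    cudaStreamCreate(&s_comm);

    // Boundary path on the communication stream: pack and copy the face to the host...
    boundary_kernel<<<(n_face + 255) / 256, 256, 0, s_comm>>>(d_face, n_face);
    cudaMemcpyAsync(h_face, d_face, n_face * sizeof(double),
                    cudaMemcpyDeviceToHost, s_comm);
    // ...while the bulk computation runs concurrently on the compute stream.
    bulk_kernel<<<(n_bulk + 255) / 256, 256, 0, s_comp>>>(d_bulk, n_bulk);

    cudaStreamSynchronize(s_comm);   // face is on the host; the CPU-CPU exchange goes here
    cudaMemcpyAsync(d_face, h_face, n_face * sizeof(double),
                    cudaMemcpyHostToDevice, s_comm);
    cudaDeviceSynchronize();         // wait for both streams before the next iteration

    cudaStreamDestroy(s_comp); cudaStreamDestroy(s_comm);
    cudaFreeHost(h_face); cudaFree(d_face); cudaFree(d_bulk);
    return 0;
}
```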
10. Experimental Setup
• Performance comparison with other implementations
  – X10 C++, MPI C, MPI CUDA
  – Use one GPU per node
• Measurements
  – Weak scaling
  – Strong scaling
• Configuration
  – Measure the average iteration time of one convergence of the CG method
    • Typically 300-500 iterations
  – Problem size
    • (x, y, z, t) = (24, 24, 24, 96) (unless specified)
    • Fits on one Tesla K20X GPU
11. Experimental Environments
• TSUBAME 2.5 supercomputer (unless specified)
  – Use up to 32 nodes (32 GPUs)
  – CPU-GPU: PCI-E 2.0 x16 (8 GB/sec)
  – Internode: QDR InfiniBand dual rail (10 GB/sec)
• Setup
  – 1 place per node
  – 1 GPU per place (X10 CUDA)
  – 12 threads per place (X10 C++, MPI C)
• Software
  – X10 version 2.4.3.2
  – CUDA version 6.0
  – OpenMPI 1.6.5
• Node hardware (2 CPUs / node, 3 GPUs / node)
               CPU (Intel Xeon X5670)   GPU (Tesla K20X)
  # Cores      6                        2688
  Frequency    2.93 GHz                 0.732 GHz
  Memory       54 GB                    6 GB
  Memory BW    32 GB/sec                250 GB/sec
  Compiler     gcc 4.3.4                nvcc 6.0
12. Comparison of Weak Scaling with CPU-based Implementations
• X10 CUDA exhibits good weak scaling
  – 19.4x and 11.0x faster than X10 C++ and MPI C, respectively, on 32 nodes
  – X10 CUDA does not incur a communication penalty with a large amount of data on the GPUs
(Figures: weak scaling on 1 to 32 nodes of MPI C, X10 C++, X10 CUDA DP, and X10 CUDA SP; performance [Gflops] and elapsed time per iteration [ms] versus number of nodes.)
13. Comparison of Weak Scaling with MPI CUDA
• Comparison on TSUBAME-KFC
  – X10 2.5.1, CUDA 5.5, OpenMPI 1.7.2
  – Problem size: 24 x 24 x 24 x 24 per node
• X10 CUDA shows similar scalability to MPI CUDA
• The performance gap increases as the number of nodes grows
(Figures: weak scaling on 1 to 32 nodes of MPI CUDA DP and X10 CUDA DP; performance [Gflops] and elapsed time per iteration [msec] versus number of nodes.)
14. Strong Scaling Compared with CPU-based Implementations
• X10 CUDA outperforms both X10 C++ and MPI C
  – 8.27x and 4.57x faster, respectively, on 16 places
  – Scalability of X10 CUDA gets poorer as the number of nodes increases
(Figure: strong scaling on 1 to 32 nodes of MPI C, X10 C++, X10 CUDA DP, and X10 CUDA SP; performance [Gflops] versus number of nodes.)
15. Comparison of Strong Scaling with MPI CUDA
• Comparison on TSUBAME-KFC
  – X10 2.5.1, CUDA 5.5, OpenMPI 1.7.2
• X10 CUDA exhibits comparable performance up to 4 nodes
• X10 CUDA suffers heavy overheads when using more than 8 nodes
(Figures: strong scaling on 1 to 32 nodes of MPI CUDA DP and X10 CUDA DP; performance [Gflops] and elapsed time per iteration [msec] versus number of nodes.)
16. Performance Breakdown of Lattice QCD in X10 CUDA
• Communication overhead increases
  – Both boundary communication and MPI Allreduce
• The computation parts also do not scale linearly
(Figure: elapsed time per iteration [ms] on 1 to 32 nodes broken down into boundary communication, Allreduce, BLAS, and Wilson-Dirac, with the communication ratio [%] overlaid.)
17. Comparison with MPI CUDA using Different Problem Sizes
• Comparison on TSUBAME-KFC
  – X10 2.5.1, CUDA 5.5
• X10 CUDA suffers heavy overhead on small problem sizes
  – 6.02x slower on 4 x 4 x 4 x 8
  – We consider that X10 CUDA suffers a constant overhead
(Figure: performance [Gflops] of MPI CUDA DP and X10 CUDA DP for problem sizes (X x Y x Z x T) of 4x4x4x8, 8x8x8x16, 12x12x12x48, and 24x24x24x96.)
18. Comparison of Productivity with MPI CUDA
• Lines of code
  – The X10 CUDA version contains 1.92x more lines of code than MPI CUDA in total
    • Because X10 CUDA currently cannot call device functions inside CUDA kernels (see the sketch below)
• Compiling time
  – X10 CUDA takes 11.3x longer to compile

  Lines of code          MPI CUDA   X10 CUDA
  Total                  4667       8942
  Wilson-Dirac           1590       6512

  Compiling time [sec]   MPI CUDA   X10 CUDA
                         15.19      171.50
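As a rough illustration of why the Wilson-Dirac source grows, the plain CUDA snippet below factors a small complex multiply-accumulate into a __device__ helper, which an MPI CUDA kernel can reuse at every call site; because X10 CUDA could not call such functions from inside a kernel at the time of this work, the equivalent arithmetic has to be expanded inline by hand. The helper and kernels are hypothetical, not taken from the authors' code.

```cuda
#include <cuda_runtime.h>

// A small reusable helper, e.g. one complex multiply-accumulate of the kind
// that appears hundreds of times in a Wilson-Dirac kernel.
__device__ inline double2 cmul_add(double2 a, double2 b, double2 acc) {
    acc.x += a.x * b.x - a.y * b.y;
    acc.y += a.x * b.y + a.y * b.x;
    return acc;
}

// With device functions: one call site per use, the helper is shared.
__global__ void kernel_with_helper(const double2* u, const double2* v,
                                   double2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = cmul_add(u[i], v[i], out[i]);
}

// Without device functions, every call site repeats the body by hand,
// which is what inflates the kernel source.
__global__ void kernel_inlined_by_hand(const double2* u, const double2* v,
                                       double2* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x += u[i].x * v[i].x - u[i].y * v[i].y;
        out[i].y += u[i].x * v[i].y + u[i].y * v[i].x;
    }
}

int main() {
    const int n = 1024;
    double2 *u, *v, *out;
    cudaMalloc(&u, n * sizeof(double2));
    cudaMalloc(&v, n * sizeof(double2));
    cudaMalloc(&out, n * sizeof(double2));
    cudaMemset(out, 0, n * sizeof(double2));
    kernel_with_helper<<<(n + 255) / 256, 256>>>(u, v, out, n);
    kernel_inlined_by_hand<<<(n + 255) / 256, 256>>>(u, v, out, n);
    cudaDeviceSynchronize();
    return 0;
}
```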
19. Pros/Cons of X10 CUDA from Our Study
• Advantages of X10 CUDA
  – Straightforward porting from X10 to X10 CUDA
    • Simply port the computation kernels to CUDA
    • Insert memory copy operations between CPU and GPU
  – X10 CUDA exhibits good weak scaling
• Drawbacks of the current version of X10 CUDA
  – Limitations of programmability
    • X10 CUDA cannot call a function inside a kernel
  – Limitations of performance
    • finish-based synchronization incurs overhead
    • X10 CUDA does not support creating CUDA streams
20. Related Work
• High-performance large-scale lattice QCD computation
  – Peta-scale lattice QCD on a Blue Gene/Q supercomputer [Doi et al. 2012]
    • Fully overlaps communication and applies a node-mapping optimization for BG/Q
• Lattice QCD using many-core accelerators
  – QUDA: a QCD library on GPUs [Clark et al. 2010]
    • Invokes multiple CUDA streams for overlapping
  – Lattice QCD on Intel Xeon Phi [Joo et al. 2013]
• PGAS language extensions for multi-node GPU clusters
  – XcalableMP extension for GPUs [Lee et al. 2012]
    • Demonstrated that their N-body implementation scales well
21. Conclusion
• Conclusion
  – GPUs accelerate lattice QCD significantly in the APGAS programming model
    • X10 CUDA exhibits good scalability in weak scaling
    • 19.4x speedup over X10 by using X10 CUDA on 32 nodes of TSUBAME 2.5
  – We reveal limitations in the current X10 CUDA
    • Performance overheads in strong scaling
    • Increase in lines of code
• Future work
  – Performance improvement in strong scaling
    • More detailed analysis of the overheads in X10 CUDA
23. Breakdown of Wilson-Dirac Computation in X10 CUDA
• Communication becomes dominant when using more than 16 places
  – A cause of the limited strong scaling
• Possible ways to improve the scalability
  – Applying one-to-one synchronization (see the sketch below)
  – Improving the communication and synchronization operations themselves in X10
(Figure: elapsed time [ms] on 1 to 32 places broken down into Gamma5, Bnd Set, Copy CPU to GPU, max(Bulk, Bnd), Bulk, and Bnd (Make + Comm.).)
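As a sketch of what one-to-one synchronization buys (expressed in MPI terms, since the X10 mechanism is not shown in the slides), the code below replaces a global barrier after the boundary exchange with nonblocking point-to-point transfers: each place waits only for its own neighbors, so one slow pair of places no longer stalls all the others. The buffer size and the ring neighbor pattern are illustrative assumptions.

```cuda
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int face = 24 * 24 * 24;                 // illustrative boundary size
    std::vector<double> send_prev(face), send_next(face);
    std::vector<double> recv_prev(face), recv_next(face);
    int prev = (rank - 1 + size) % size, next = (rank + 1) % size;

    // One-to-one: post receives and sends to the two neighbors only.
    MPI_Request reqs[4];
    MPI_Irecv(recv_prev.data(), face, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_next.data(), face, MPI_DOUBLE, next, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send_next.data(), face, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(send_prev.data(), face, MPI_DOUBLE, prev, 1, MPI_COMM_WORLD, &reqs[3]);

    // Each rank waits only on its own four requests; there is no global
    // barrier here, unlike a finish that spans all places.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```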
24. Comparison with Different Dimensions of Lattice Partitioning
• Comparison between 4D and 1D partitioning
  – 1D partitioning exhibits better strong scalability
    • 1.35x better on 32 GPUs
  – Performance still saturates when using more than 16 GPUs
(Figure: strong scaling (24x24x24x96) in Gflops versus number of nodes (= number of GPUs) for X10 CUDA SP and X10 CUDA SP (div t).)
25. Comparison with QUDA
• QUDA [1]: a QCD library on GPUs
  – Highly optimized for GPUs
• QUDA exhibits better strong scalability
  – X10 CUDA is currently 30.4x slower
(Figure: strong scaling in Gflops versus number of places (= number of GPUs) for X10 CUDA DP (div t), X10 CUDA SP (div t), QUDA SP recon 18, and QUDA SP recon 12.)
[1] Clark, Michael A., et al. "Solving Lattice QCD systems of equations using mixed precision solvers on GPUs." Computer Physics Communications 181.9 (2010): 1517-1528.