PL-4089, Accelerating and Evaluating OpenCL Graph Applications, by Shuai Che, Bradford Beckmann, Steve Reinhardt and Kevin Skadron at the AMD Developer Summit (APU13), November 11-13, 2013.
1. ACCELERATING AND EVALUATING OPENCL GRAPH APPLICATIONS
SHUAI CHE, BRAD BECKMANN, STEVE REINHARDT AND KEVIN SKADRON
2. AGENDA
• Background and Graph Applications
• Pannotia OpenCL™ Graph Applications
• Performance Evaluation and Discussion
2 | Accelerating and Evaluating OpenCL Graph Applications | November 20, 2013 | CONFIDENTIAL
3. GRAPH APPLICATIONS
• Intelligence
  ‒ Business analytics, security and scientific discovery
• Social networks
  ‒ Facebook, Twitter, LinkedIn, Weibo, etc.
• Life science and healthcare
  ‒ Disease and drug research, life system research
• Infrastructure
  ‒ Transportation, power grid, energy and water supply
• Scientific and engineering simulations
4. GRAPH APPLICATIONS
• Low arithmetic intensity and data reuse
• Not floating-point intensive
• Branch divergence
  ‒ Only some of the threads in a wavefront are active
• Memory divergence
  ‒ Data distributed in different regions of memory
  ‒ A challenge to optimize data layouts and memory accesses
• Load imbalance
  ‒ Uneven work distribution across different threads
  ‒ Short-running threads wait for long-running threads
• Parallelism
  ‒ Changing degree of parallelism across iterations
  ‒ Underutilization of compute units for certain phases
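The load-imbalance point can be made concrete with a small model. The sketch below is an illustration only, not a measurement from the talk: it assumes a vertex-per-thread mapping in which each lane loops over its own vertex's neighbors, so a wavefront keeps issuing until its highest-degree vertex is done. The function name `wavefront_utilization` and the default wavefront size of 64 lanes are assumptions chosen for this example.

```python
def wavefront_utilization(degrees, wavefront_size=64):
    """Return the fraction of issued lane-iterations that do useful work.

    Assumed model: one vertex per lane, each lane iterates over its
    vertex's neighbors, and a wavefront runs for max(degree) iterations.
    """
    useful = 0
    issued = 0
    for start in range(0, len(degrees), wavefront_size):
        group = degrees[start:start + wavefront_size]
        longest = max(group)
        useful += sum(group)                # neighbor visits actually needed
        issued += longest * wavefront_size  # lane-iterations the wavefront occupies
    return useful / issued if issued else 1.0


uniform = [4] * 64          # every vertex has the same degree
skewed = [1] * 63 + [64]    # one high-degree vertex stalls the other 63 lanes

print(wavefront_utilization(uniform))
print(wavefront_utilization(skewed))
```

Under this model the uniform case keeps every lane busy, while the skewed case leaves most lanes idle waiting for the single long-running thread.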
5. PANNOTIA
• A graph application suite for GPGPU
• Eight diverse graph algorithms, e.g., shortest path, graph partitioning, web analysis and social network
• Implemented in C + OpenCL™
• Some are OpenCL implementations based on algorithms of prior work
• Initial implementation is for a single GPU node
• Further algorithmic and hardware-specific optimizations are in progress
• Details of Pannotia can be found in our paper published in the 2013 IEEE International Symposium on Workload Characterization
6. PANNOTIA
Applications | Domains
Single-Source Shortest Path | Shortest Path
Connected Component Labeling | Graph Clustering
Graph Coloring | Graph Partitioning
Floyd-Warshall | Shortest Path
Maximal Independent Set | Graph Partitioning
Betweenness Centrality | Social Network
Friend Recommendation | Social Network
Page Rank | Web Analysis
7. GRAPH INPUT AND DATA STRUCTURE
• Real-world graphs
  ‒ The University of Florida Sparse Matrix Collection
  ‒ The 9th DIMACS Implementation Challenges
  ‒ The 10th DIMACS Implementation Challenges
• Synthetic graphs
  ‒ Random-graph generator from Georgia Tech
• Graph input formats
  ‒ Coordinate Format
  ‒ METIS
  ‒ Matrix Market
• Data structure representation
  ‒ CSR, COO, ELL …
  ‒ 2D adjacency matrix
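To show how two of the representations listed above relate, here is a minimal sketch that converts a coordinate (COO) edge list into compressed sparse row (CSR) arrays. It is plain Python rather than the suite's C + OpenCL, and the names `coo_to_csr`, `row_ptr`, `col_idx` and `weights` are chosen for this example, not taken from Pannotia.

```python
def coo_to_csr(num_vertices, edges):
    """Convert (src, dst, weight) triples into CSR arrays.

    row_ptr[v] .. row_ptr[v + 1] delimits the outgoing edges of vertex v
    inside col_idx and weights.
    """
    row_ptr = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        row_ptr[src + 1] += 1           # count out-degree of each vertex
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]    # prefix sum gives row offsets

    col_idx = [0] * len(edges)
    weights = [0] * len(edges)
    next_slot = row_ptr[:-1]            # slicing copies; next free slot per row
    for src, dst, w in edges:
        slot = next_slot[src]
        col_idx[slot] = dst
        weights[slot] = w
        next_slot[src] += 1
    return row_ptr, col_idx, weights


edges = [(0, 1, 5), (0, 2, 3), (2, 1, 1), (3, 0, 7)]
row_ptr, col_idx, weights = coo_to_csr(4, edges)
print(row_ptr, col_idx, weights)
```

CSR stores each vertex's neighbors contiguously, which is why it is a common choice when a thread walks one vertex's adjacency list.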
8. SINGLE SOURCE SHORTEST PATH
• Finds the shortest path from the source node to every other node in the graph
[Figure: example weighted graph with a table of vertex IDs (Vid) and their shortest distances (Dist) from the source]
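As an illustration of the problem statement, the following sketch computes single-source shortest paths by repeated edge relaxation over a set of active vertices. This is a sequential Python stand-in, assuming non-negative edge weights; it is not the OpenCL kernel from the suite, and the names `sssp` and `active_counts` are hypothetical. It also records how many vertices are active in each iteration, which is the quantity plotted later in the deck.

```python
import math


def sssp(num_vertices, edges, source):
    """Relaxation-based SSSP over directed (u, v, weight) edges.

    Assumes non-negative weights. Returns the distance array and the
    number of active vertices in each iteration.
    """
    adj = [[] for _ in range(num_vertices)]
    for u, v, w in edges:
        adj[u].append((v, w))

    dist = [math.inf] * num_vertices
    dist[source] = 0
    active = {source}
    active_counts = []
    while active:
        active_counts.append(len(active))
        next_active = set()
        for u in active:
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:   # found a shorter path to v
                    dist[v] = dist[u] + w
                    next_active.add(v)      # v must relax its own edges next
        active = next_active
    return dist, active_counts


edges = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]
dist, active_counts = sssp(4, edges, source=0)
print(dist)
```

On a GPU each active vertex would typically map to a work-item, so the size of the active set bounds the available parallelism in an iteration.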
9. CONNECTED COMPONENT LABELING
• Detect connected regions in graphs and images
• A connected component is the set of nodes in a graph that point to the same root
[Figure: example with nodes p, q, r and s]
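The "same root" idea can be sketched with a union-find structure: every node keeps a parent pointer, and two nodes are in the same component when following the pointers reaches the same root. This is a sequential illustration under that assumption, not the parallel algorithm used in the suite; `connected_components` and `find` are names chosen here.

```python
def connected_components(num_vertices, edges):
    """Label each vertex with the root of its component (undirected edges)."""
    parent = list(range(num_vertices))  # every vertex starts as its own root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving shortens the chain
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[max(ru, rv)] = min(ru, rv)  # smaller id becomes the root

    return [find(v) for v in range(num_vertices)]


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)
```

Vertices with equal labels belong to the same connected component; an isolated vertex is its own root.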
10. GRAPH COLORING
• Assign colors (integers) to vertices such that no two adjacent vertices share the same color
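A minimal way to satisfy the coloring constraint is a sequential greedy pass, sketched below. It is only an illustration of the problem definition: the greedy strategy is a stand-in chosen for brevity, not necessarily the parallel method used in the suite, and `greedy_coloring` is a hypothetical name.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest integer color unused by its neighbors.

    adj is an adjacency list for an undirected graph.
    """
    color = [-1] * len(adj)             # -1 means "not colored yet"
    for v in range(len(adj)):
        used = {color[n] for n in adj[v] if color[n] != -1}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# A 4-cycle 0-1-2-3-0 (hypothetical input).
adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
colors = greedy_coloring(adj)
print(colors)
```

The result is always a valid coloring, although the number of colors used depends on the order in which vertices are visited.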
11. FLOYD-WARSHALL
• Solves the all-pairs shortest path (APSP) problem: finding the shortest path from every possible source to every possible destination
• A dynamic programming approach
shortestPath(i, j, k) returns the shortest path from i to j using only vertices from {1, 2, ..., k} as intermediate vertices
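The dynamic-programming definition above translates directly into three nested loops: the outer loop admits one more intermediate vertex k, and the inner loops update every pair (i, j). The sketch below is a sequential Python version with 0-based vertex ids; the function name and the sample matrix are assumptions for this example.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest path distances from a dense weight matrix.

    weights[i][j] is the edge weight from i to j, math.inf if absent,
    and 0 on the diagonal.
    """
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is left untouched
    for k in range(n):                  # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


INF = math.inf
w = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
result = floyd_warshall(w)
print(result)
```

For a fixed k the (i, j) updates are independent of each other, which is what makes the two inner loops a natural fit for a data-parallel kernel.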
12. MAXIMAL INDEPENDENT SET
• Independent set: no two vertices are neighbors
• Maximal independent set: impossible to add another vertex and still keep the set independent
[Figure: example graph with vertices 0 to 7]
S = {0, 4, 6} is a Maximal Independent Set
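The two definitions can be checked mechanically. The sketch below builds a maximal independent set with a simple sequential greedy pass and then verifies both properties. The greedy pass is a stand-in for illustration, not the parallel algorithm from the suite, and the path graph used as input is hypothetical because the edges of the slide's figure are not recoverable.

```python
def maximal_independent_set(adj):
    """Greedily pick vertices, blocking the neighbors of each pick."""
    in_set = set()
    blocked = set()
    for v in range(len(adj)):
        if v not in blocked:
            in_set.add(v)
            blocked.add(v)
            blocked.update(adj[v])      # neighbors can no longer join
    return in_set


def is_maximal_independent(adj, s):
    """Check independence and maximality of s against the definitions."""
    independent = all(n not in s for v in s for n in adj[v])
    # Maximal: every vertex outside s has at least one neighbor inside s.
    maximal = all(v in s or any(n in s for n in adj[v])
                  for v in range(len(adj)))
    return independent and maximal


# Hypothetical path graph 0-1-2-3-4.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
mis = maximal_independent_set(adj)
print(mis)
```

Note that "maximal" is weaker than "maximum": the set cannot be extended, but a larger independent set may still exist in other graphs.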
13. BETWEENNESS CENTRALITY
• Centrality determines the relative importance of a vertex within the graph (e.g. degree, betweenness, closeness …)
• Betweenness Centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
$\sigma_{st}$: no. of shortest paths between nodes s and t
$\sigma_{st}(v)$: no. of shortest paths between nodes s and t passing through v
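One well-known way to evaluate this sum on an unweighted graph is Brandes' algorithm: a breadth-first search from each source counts shortest paths, and a reverse sweep accumulates each vertex's share. The sketch below is a sequential Python illustration of that technique, assuming an unweighted graph given as an adjacency dictionary; it is not presented as the suite's implementation. The sum runs over ordered pairs (s, t), so on an undirected graph each unordered pair contributes twice.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes-style betweenness for an unweighted graph (dict of lists)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}     # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma[s] = 1
        dist[s] = 0
        order = []                      # vertices in BFS visiting order
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:         # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}   # dependency of s on each vertex
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Hypothetical path graph 0 - 1 - 2: only vertex 1 lies between others.
path = {0: [1], 1: [0, 2], 2: [1]}
print(betweenness_centrality(path))
```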
14. FRIEND RECOMMENDATION
• Recommend friend connections: a common feature in social websites
• A simple Map-Reduce-like algorithm
“Andy” = [“Brad”, “Derek”, “Shuai”, …]
Andy →
<“Brad”, “Derek”, “Andy”>
<“Brad”, “Shuai”, “Andy”>
<“Derek”, “Brad”, “Andy”>
<“Derek”, “Shuai”, “Andy”>
<“Shuai”, “Derek”, “Andy”>
<“Shuai”, “Brad”, “Andy”>
Andy recommends Brad to Shuai
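The triples above can be generated and aggregated in two small phases. The sketch below is a hedged reading of the slide: it assumes a triple <a, b, via> means "via recommends b to a", and it assumes the reduce step counts common friends and skips pairs that are already friends. The function names `map_phase` and `reduce_phase` and the sample friend lists are hypothetical.

```python
from collections import Counter
from itertools import permutations


def map_phase(friends):
    """Emit (a, b, via) for every ordered pair of via's friends."""
    triples = []
    for via, friend_list in friends.items():
        for a, b in permutations(friend_list, 2):
            triples.append((a, b, via))
    return triples


def reduce_phase(triples, friends):
    """Count common friends per (a, b), skipping existing friendships."""
    counts = Counter()
    for a, b, _ in triples:
        if b not in friends.get(a, []):
            counts[(a, b)] += 1
    return counts


friends = {
    "Andy": ["Brad", "Derek", "Shuai"],
    "Brad": ["Andy"],
    "Derek": ["Andy"],
    "Shuai": ["Andy"],
}
triples = map_phase(friends)
scores = reduce_phase(triples, friends)
print(scores[("Shuai", "Brad")])
```

A higher count for a pair means more mutual friends, which a recommender could use to rank suggestions.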
15. PAGERANK
16. PERFORMANCE BENEFITS
• Speedups are up to 11x (an AMD “Tahiti” discrete GPU vs. 4 CPU cores on A8)
• PCI-E overhead is included
• Performance benefits depend on graph input datasets
[Figure: parallel speedup per application; y-axis: Parallel Speedup, 0 to 15]
17. EXECUTION TIME BREAKDOWN (D-GPU)
• The portion of GPU execution: 8% - 99%
• Some further GPU offload can be done (e.g. FRD and MIS)
18. PARALLELISM (ACTIVE VERTICES OVER TIME)
[Figure: active vertices over time. Panels: Single-Source Shortest Path (Road Network - NY), y-axis 0 to 120000; Graph Coloring (G3 Circuit), y-axis 0 to 400000; x-axis: Time]
21. L2 HIT RATE OVER TIME (SSSP)
• The cache hit rate first improves, then degrades, improves again and finally degrades, with some fluctuations in the middle
[Figure: L2 hit rate over time; y-axis: Hit Rate, 0 to 60; x-axis: Time]
22. ARCHITECTURAL IMPLICATIONS (SCALAR UNIT)
[Figure: graph traversal timeline alternating between Scalar and SIMD execution, with phases marked A and B; x-axis: Time]
23. ARCHITECTURAL IMPLICATIONS
• Possibly include narrower SIMD units or heterogeneous SIMD units
[Figure: Scalar, Narrow SIMD and Wide SIMD units]
• Resource management and scheduling
  ‒ Switch the task between the CPU and the GPU based on parallelism
  ‒ Use only “enough” SIMD engines and save power
[Figure: active vertices over time (y-axis 0 to 120000) with phases A and B, annotated CPU, GPU, GPU; x-axis: Time]
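The idea of switching between the CPU and the GPU based on parallelism can be sketched as a per-iteration decision on the number of active vertices. This is a toy illustration only: the slides do not specify a policy, and the threshold value and the names `choose_device` and `schedule` are assumptions made for this example.

```python
def choose_device(active_vertices, threshold=1024):
    """Pick a device for one iteration (hypothetical threshold policy)."""
    return "GPU" if active_vertices >= threshold else "CPU"


def schedule(active_counts, threshold=1024):
    """Map each iteration's active-vertex count to a device."""
    return [choose_device(n, threshold) for n in active_counts]


# Hypothetical active-vertex counts for five successive iterations.
plan = schedule([1, 40, 5000, 120000, 300])
print(plan)
```

A real policy would also have to weigh data-transfer and kernel-launch costs, which this sketch ignores.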
24. CONCLUSION AND FUTURE WORK
• Graph applications are an emerging workload domain
• Pannotia is a first-step attempt to evaluate diverse graph building blocks on GPUs
Next-Step Goals:
• Add more applications (e.g. matching, spanning tree, flow)
• Optimize Pannotia applications
• Extend to multiple GPU nodes and across CPU and GPU