NUMA-aware thread-parallel breadth-first search
for Graph500 and Green Graph500 Benchmarks on SGI UV 2000
Yuichiro Yasui & Katsuki Fujisawa
Kyushu University
ISM High Performance Computing Conference
11:00 − 11:50, Oct 9-10, 2015
Outline
•  Introduction
•  NUMA-aware threading
–  NUMA architecture and NUMA based system
–  Our library “ULIBC” for NUMA-aware threading
•  Efficient BFS algorithm for Graph500 benchmark
–  NUMA-based Distributed Graph Representation … [BD13]
–  Efficient algorithm considering the vertex degree [ISC14, HPCS15]
•  Conclusion
Graph processing for large-scale networks
•  Large-scale graphs arise in various fields:
  –  US road network: 24 million vertices & 58 million edges
  –  Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
  –  Cyber-security: 15 billion log entries / day
  –  Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges
•  Goal: fast and scalable graph processing by using HPC
•  Application fields
  –  Transportation
  –  Social network
  –  Cyber-security
  –  Bioinformatics
Graph analysis and important kernel BFS
•  Graph analysis helps us understand the relationships in real
   networks across many application fields
•  Many analysis algorithms build on graph-processing kernels:
   ・Breadth-first search ・Single-source shortest path
   ・Maximum flow ・Maximal independent set
   ・Centrality metrics ・Clustering ・Graph mining
[Figure: Graph500 processing pipeline; input parameters (SCALE,
 edgefactor) feed Step 1: graph generation, Step 2: graph construction,
 and Step 3: BFS and validation over 64 iterations, producing results
 (BFS time, traversed edges, TEPS ratio)]
Breadth-first search (BFS)
•  One of the most important and fundamental algorithms for traversing
   graph structures
•  Many algorithms and applications are based on BFS (e.g. maximum flow
   and centrality)
•  A linear-time algorithm, but it requires many wide-ranging memory
   accesses without reuse
Inputs
•  Graph
•  Source vertex
Outputs
•  Predecessor tree
•  Distance (BFS level) of each vertex
[Figure: BFS from a source vertex, expanding level by level
 (Lv. 1, Lv. 2, Lv. 3)]
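As a concrete reference point (our own minimal sketch, not the NUMA-aware parallel implementation presented later), a sequential BFS that produces exactly these outputs, a predecessor tree and a distance per vertex, can be written as:

```c
#include <stdlib.h>

/* Minimal sequential BFS sketch. `xadj`/`adj` are assumed CSR arrays:
 * the neighbors of vertex v are adj[xadj[v] .. xadj[v+1]-1].
 * Fills pi (predecessor map, pi[source] = source, -1 = unreached)
 * and dist (BFS level, -1 = unreached). */
void bfs(int n, const int *xadj, const int *adj,
         int source, int *pi, int *dist)
{
    int *queue = malloc(n * sizeof(int));
    int head = 0, tail = 0;
    for (int v = 0; v < n; ++v) { pi[v] = -1; dist[v] = -1; }
    pi[source] = source;
    dist[source] = 0;
    queue[tail++] = source;
    while (head < tail) {
        int v = queue[head++];                 /* pop frontier vertex */
        for (int e = xadj[v]; e < xadj[v + 1]; ++e) {
            int w = adj[e];
            if (pi[w] < 0) {                   /* w not yet visited */
                pi[w] = v;
                dist[w] = dist[v] + 1;
                queue[tail++] = w;
            }
        }
    }
    free(queue);
}
```

The parallel versions later in the talk keep this level-by-level structure but visit the frontier in parallel with atomic visited checks.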
Betweenness centrality (BC)
•  Computes an importance for each vertex and edge utilizing all-to-all
   shortest paths (breadth-first search), without vertex coordinates
•  BC requires #vertices BFS runs, because one BFS obtains only
   one-to-all shortest paths

    CB(v) = Σ_{s ≠ v ≠ t ∈ V} σst(v) / σst

    σst    : number of shortest (s, t)-paths
    σst(v) : number of shortest (s, t)-paths passing through vertex v

[Figure: BFS tree of the Osaka road network (13,076 vertices and
 40,528 edges), colored by importance from low to high; highways and
 bridges around Osaka station score highest]
Our software “NETAL” can solve BC for the Osaka road network within
one second.
Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama, and M. Takamatsu:
NETAL: High-performance Implementation of Network Analysis Library
Considering Computer Memory Hierarchy, JORSJ, Vol. 54-4, 2011.
Single-node NUMA system
•  Single-node (single-OS) or multi-node system?
•  Uniform memory access or not?
  –  UMA (uniform memory access): every CPU reaches every RAM at the
     same cost
  –  NUMA (non-uniform memory access): each NUMA node (a CPU with its
     local RAM) has fast local access but slow non-local access;
     currently the major CPU architecture
•  Many configurations:
  –  UV 2000 @ ISM (256 CPUs)
  –  Intel Xeon server (4 CPUs)
  –  Laptop PC (1 CPU)
  –  Smartphone (1 CPU)
Threading or process-parallel?
•  Which parallel programming model should we choose?
•  OpenMP (Pthreads): single-process
  –  Shared memory
  –  Implicit memory access, which reduces the programming cost
•  MPI-OpenMP hybrid: multi-process
  –  Distributed memory
  –  Explicit memory access between processes for good locality,
     using MPI_Send() & MPI_Recv()
NUMA system
•  NUMA = non-uniform memory access
•  A NUMA node consists of one CPU socket (processor cores with L2
   caches and a shared L3 cache) and its local RAM; local memory access
   is fast, remote (non-local) memory access is slow
•  Example: 4-way Intel Xeon E5-4640 (SandyBridge-EP)
  –  4 (# of CPU sockets)
  –  8 (# of physical cores per socket)
  –  2 (# of threads per core)
  –  Max. 4 x 8 x 2 = 64 threads
Memory bandwidth between NUMA nodes
•  Local memory access is about 2.6x faster than remote access
•  Measured with STREAM TRIAD over vectors a[n], b[n], c[n], varying
   the number of elements (log2 n from 10 to 30), with the threads on
   NUMA node 0 and the data placed on NUMA nodes 0 to 3:
  –  Local access (NUMA 0 to NUMA 0): 13 GB/s
  –  Remote access (NUMA 0 to NUMA 1, 2, or 3): 5 GB/s

double a[N], b[N], c[N];

void STREAM_Triad(double scalar)
{
  long j;
#pragma omp parallel for
  for (j = 0; j < N; ++j)
    a[j] = b[j] + scalar * c[j];
}
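One common way to keep such accesses local in practice is first-touch placement. The sketch below is our own illustration (not from the slides); it assumes a Linux first-touch page policy and a pinned OpenMP runtime, so that initializing an array with the same static loop schedule that later computes on it leaves each thread's chunk on that thread's NUMA node. Without OpenMP the pragmas are ignored and the code simply runs serially.

```c
#include <stdlib.h>

#define N (1 << 20)

/* First-touch placement sketch: the pages of each thread's chunk end
 * up on that thread's NUMA node because this loop touches them first. */
double *alloc_first_touch(void)
{
    double *a = malloc(N * sizeof(double));
#pragma omp parallel for schedule(static)
    for (long j = 0; j < N; ++j)
        a[j] = 0.0;                /* first touch places the page */
    return a;
}

/* Same static schedule as the initialization, so with pinned threads
 * each thread reads and writes only its locally placed pages. */
void triad(double *a, const double *b, const double *c, double s)
{
#pragma omp parallel for schedule(static)
    for (long j = 0; j < N; ++j)
        a[j] = b[j] + s * c[j];
}
```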
Problem and motivation
•  Goal: develop an efficient graph algorithm on a NUMA system
•  Default threading: the OS may move threads (T) between cores and
   place their data (D) on other NUMA nodes, so threads end up
   accessing remote memory
•  NUMA-aware threading: pinning threads and memory keeps every thread
   accessing local memory
[Figure: on the 4-socket 8-core Xeon E5-4640 system, the default
 placement scatters threads and data across NUMA nodes, while the
 NUMA-aware placement keeps each thread next to its data]
Example entry of /proc/cpuinfo:

processor       : 23    <- Processor ID
model name      : Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
stepping        : 7
cpu MHz         : 1200.000
cache size      : 20480 KB
physical id     : 2     <- NUMA node ID
siblings        : 16
core id         : 7     <- Core ID in NUMA node
cpu cores       : 8
apicid          : 78
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
Programming cost for NUMA-aware threading
•  Applying NUMA-aware threading by hand is not easy:
  –  Linux provides the processor topology as a machine file only
     (/proc/cpuinfo is 8.0 KB on a desktop PC but 2.4 MB on UV 2000)
  –  The pinning function sched_setaffinity() takes a raw processor ID

#define _GNU_SOURCE
#include <sched.h>

int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);  /* procid = processor ID */
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
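For illustration, a hedged usage sketch of the same sched_setaffinity() pinning (ours, not ULIBC code): it pins the calling thread to the processor it currently runs on, where a real NUMA-aware runtime would instead pick the target processor ID from the detected topology, once per worker thread.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Same helper as on the slide: pin the calling thread to one logical
 * processor identified by its Linux processor ID. */
static int bind_thread(int procid)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(procid, &set);
    return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
```

Pinning can fail (returning -1) under a restricted cpuset, which is exactly why the talk's library first detects the *online* topology before choosing processor IDs.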
NUMA-aware computation with ULIBC
•  ULIBC is a callable library for CPU and memory affinity settings
•  Detects the processor topology of a system at run time
•  Initializes the affinity and pins threads and memory so that each
   thread accesses its local memory instead of remote memory

#include <stdio.h>
#include <ulibc.h>
#include <omp.h>

int main(void) {
  ULIBC_init();
#pragma omp parallel
  {
    const struct numainfo_t ni = ULIBC_get_current_numainfo();
    printf("[%02d] Node: %d of %d, Core: %d of %d\n",
           ni.id, ni.node, ULIBC_get_online_nodes(),
           ni.core, ULIBC_get_online_cores(ni.node));
  }
  return 0;
}

struct numainfo_t {
  int id;   /* Thread ID */
  int proc; /* Processor ID */
  int node; /* NUMA node ID */
  int core; /* Core ID in NUMA node */
};

Example output (thread ID, NUMA node ID, core ID):
[04] Node: 0 of 4, Core: 1 of 16
[55] Node: 3 of 4, Core: 13 of 16
[16] Node: 0 of 4, Core: 4 of 16
[37] Node: 1 of 4, Core: 9 of 16
[30] Node: 2 of 4, Core: 7 of 16
. . .

https://bitbucket.org/yuichiro_yasui/ulibc
CPU affinity construction with ULIBC
1.  Detects the entire topology:
      CPU 0: P0, P4, P8, P12
      CPU 1: P1, P5, P9, P13
      CPU 2: P2, P6, P10, P14
      CPU 3: P3, P7, P11, P15
2.  Detects the online (available) topology, e.g. when a job manager
    (PBS) or `numactl --cpunodebind=1,2` leaves CPU 1 and CPU 2 to us
    while other processes use the rest:
      CPU 1: P1, P5, P9, P13
      CPU 2: P2, P6, P10, P14
3.  Constructs the ULIBC affinity (2 types):
    •  Scatter mapping: threads are distributed round-robin across the
       online NUMA nodes
         ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
         NUMA 0: threads 0(P1), 2(P5), 4(P9), 6(P13)
         NUMA 1: threads 1(P2), 3(P6), 5(P10)
    •  Compact mapping: threads fill one NUMA node before the next
         ULIBC_set_affinity_policy(7, COMPACT_MAPPING, THREAD_TO_CORE)
         NUMA 0: threads 0(P1), 1(P5), 2(P9), 3(P13)
         NUMA 1: threads 4(P2), 5(P6), 6(P10)
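The two policies can be illustrated by how a thread index maps to a NUMA node. This is our own sketch with hypothetical helper names, not the ULIBC internals, for a machine with `nodes` online NUMA nodes of `cores_per_node` online cores each:

```c
/* Compact policy: consecutive threads fill one node, then the next. */
int compact_node(int thread, int nodes, int cores_per_node)
{
    (void)nodes;                         /* unused in this policy */
    return thread / cores_per_node;
}

/* Scatter policy: consecutive threads round-robin across nodes. */
int scatter_node(int thread, int nodes, int cores_per_node)
{
    (void)cores_per_node;                /* unused in this policy */
    return thread % nodes;
}
```

Compact favors cache sharing between neighboring threads; scatter spreads memory bandwidth demand across all NUMA nodes.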
Graph500 benchmark (www.graph500.org)
•  Measures the performance of irregular memory accesses
•  Score: TEPS (# of traversed edges per second) of a BFS; the reported
   score is the median over 64 BFS iterations
•  Input parameters for the problem size: SCALE & edgefactor (= 16)
1.  Generation: generates a synthetic scale-free Kronecker network with
    2^SCALE vertices and 2^SCALE x edgefactor edges by using SCALE
    recursive Kronecker products
2.  Construction: builds the graph data structure
3.  BFS x 64: each of the 64 BFS iterations is timed, validated, and
    reported as a TEPS ratio (results: SCALE, edgefactor, BFS time,
    traversed edges, TEPS)
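The scoring rule can be sketched as follows (our own hedged illustration, not the reference implementation): each run's TEPS ratio is the traversed-edge count divided by the BFS time, and the reported score is the median of the 64 runs.

```c
#include <stdlib.h>

/* TEPS ratio for a single BFS run: traversed edges per second. */
double teps(double traversed_edges, double bfs_seconds)
{
    return traversed_edges / bfs_seconds;
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n scores (n = 64 in Graph500); sorts the array in place. */
double median_teps(double *scores, int n)
{
    qsort(scores, n, sizeof(double), cmp_double);
    return (n % 2) ? scores[n / 2]
                   : 0.5 * (scores[n / 2 - 1] + scores[n / 2]);
}
```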
Green Graph500 benchmark (http://green.graph500.org)
•  Measures power efficiency using the TEPS/W score: the median TEPS of
   the 64 BFS iterations divided by the measured power in watts
•  Same pipeline as Graph500 (generation, construction, BFS x 64 with
   validation), with a power measurement added during the BFS phase
•  We have results on various systems such as the SGI UV series, Xeon
   servers, and Android devices
Level-synchronized parallel BFS (Top-down)
•  Starts from the source vertex and executes the following two phases
   for each level:
  –  Traversal finds the neighbors QN (level k+1) of the current
     frontier QF (level k) among the unvisited candidate vertices
  –  Swap exchanges the frontier QF and the neighbors QN for the next
     level
•  Background (from the paper): the benchmark iterates the timed BFS
   phase and an untimed verify phase 64 times; the BFS phase runs a BFS
   for each source, and the verify phase checks the output of the BFS.
   The ranking is based on the TEPS ratio, computed from the given
   graph and the BFS output; submissions to the benchmark must report
   five TEPS ratios: the minimum, first quartile, median, third
   quartile, and maximum. The input of a BFS is a graph G = (V, E) with
   a set of vertices V and a set of edges E; the edges of G are
   contained as pairs (v, w), and an adjacency list A(v) contains the
   edges (v, w) ∈ E for each vertex v ∈ V. A BFS from the source vertex
   s ∈ V in a given graph outputs a predecessor map π, which spans all
   other reachable vertices.

Algorithm 1: Level-synchronized parallel BFS.
  Input    : G = (V, A) : unweighted directed graph
             s : source vertex
  Variables: QF : frontier queue
             QN : neighbor queue
             visited : vertices already visited
  Output   : π(v) : predecessor map of the BFS tree
   1  π(v) ← −1, ∀v ∈ V
   2  π(s) ← s
   3  visited ← {s}
   4  QF ← {s}
   5  QN ← ∅
   6  while QF ≠ ∅ do
   7    for v ∈ QF in parallel do
   8      for w ∈ A(v) do
   9        if w ∉ visited (atomic) then
  10          π(w) ← v
  11          visited ← visited ∪ {w}
  12          QN ← QN ∪ {w}
  13    QF ← QN
  14    QN ← ∅
Observing data accesses in forward and backward search
•  Data writes in the forward (top-down) search go v → w:

  Input : Directed graph G = (V, AF), Queue QF
  Data  : Queue QN, visited, Tree π(v)
  QN ← ∅
  for v ∈ QF in parallel do
    for w ∈ AF(v) do
      if w ∉ visited (atomic) then
        π(w) ← v
        visited ← visited ∪ {w}
        QN ← QN ∪ {w}
  QF ← QN

•  Data writes in the backward (bottom-up) search go w → v:

  Input : Directed graph G = (V, AB), Queue QF
  Data  : Queue QN, visited, Tree π(v)
  QN ← ∅
  for w ∈ V \ visited in parallel do
    for v ∈ AB(w) do
      if v ∈ QF then
        π(w) ← v
        visited ← visited ∪ {w}
        QN ← QN ∪ {w}
        break
  QF ← QN
Direction-optimizing BFS
•  Top-down direction
  –  Efficient for a small frontier
  –  Uses outgoing edges
•  Bottom-up direction
  –  Efficient for a large frontier
  –  Uses incoming edges
•  Top-down scans the outgoing edges of the current frontier to find
   its unvisited neighbors
•  Bottom-up scans the incoming edges of the candidate neighbors and
   stops as soon as a parent in the current frontier is found, which
   skips unnecessary edge traversals
•  Hybrid-BFS chooses the direction, top-down or bottom-up, at each
   level [Beamer @ SC12], and thereby reduces unnecessary edge
   traversals
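The per-level choice can be sketched as a threshold rule in the spirit of Beamer's SC12 heuristic; the constants alpha and beta below are illustrative tuning parameters, not values taken from the slides:

```c
enum direction { TOP_DOWN, BOTTOM_UP };

/* Choose the traversal direction for the next BFS level.
 * m_f : edges incident to the frontier
 * m_u : edges incident to still-unexplored vertices
 * n_f : number of frontier vertices, n : total vertices
 * cur : direction used at the current level */
enum direction choose_direction(long m_f, long m_u, long n_f, long n,
                                enum direction cur)
{
    const long alpha = 14, beta = 24;      /* illustrative constants */
    if (cur == TOP_DOWN && m_f > m_u / alpha)
        return BOTTOM_UP;                  /* frontier got edge-heavy */
    if (cur == BOTTOM_UP && n_f < n / beta)
        return TOP_DOWN;                   /* frontier got small again */
    return cur;
}
```

This reproduces the pattern in the table that follows: top-down near the source and at the tail, bottom-up at the edge-heavy middle levels.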
Direction-optimizing BFS
•  Forward (top-down) and backward (bottom-up) traversals compared per
   level, as # of traversed edges on a Kronecker graph with SCALE 26
   (|V| = 2^26, |E| = 2^30):

  Level | Top-down      | Bottom-up     | Hybrid
  ------+---------------+---------------+-----------
  0     |             2 | 2,103,840,895 |          2
  1     |        66,206 | 1,766,587,029 |     66,206
  2     |   346,918,235 |    52,677,691 | 52,677,691
  3     | 1,727,195,615 |    12,820,854 | 12,820,854
  4     |    29,557,400 |       103,184 |    103,184
  5     |        82,357 |        21,467 |     21,467
  6     |           221 |        21,240 |        227
  ------+---------------+---------------+-----------
  Total | 2,103,820,036 | 3,936,072,360 | 65,689,631
  Ratio |       100.00% |       187.09% |      3.12%

•  Top-down is efficient close to the source, where the frontier is
   small; bottom-up is efficient at the middle distances from the
   source, where the frontier is large
•  Choosing the direction per level [Beamer @ SC12] reduces the
   traversed edges to 3.12% of the pure top-down count
GTEPS progression on the 4-way Xeon server:

  Reference (2011)                               87 MTEPS  (x1)
  NUMA-aware [SC10]                             800 MTEPS  (x9)
  Dir.Opt. [SC12]                                 5 GTEPS  (x58)
  NUMA-Opt. [BigData13]                          11 GTEPS  (x125)
  NUMA-Opt.+Deg.aware [ISC14]                    29 GTEPS  (x334)
  NUMA-Opt.+Deg.aware+Vtx.Sort [G500,ISC14]      42 GTEPS  (x489)
NUMA-Opt. + Dir. Opt. BFS [BD13] (our previous result, 2013)
•  Manages memory accesses on a NUMA system carefully
  –  Both the top-down and bottom-up phases are NUMA-aware: each CPU
     socket works on data held in its local RAM
  –  The adjacency matrix is partitioned column-wise (0th to 3rd
     partitions for NUMA nodes 0 to 3), and our library binds each
     partial adjacency matrix into its NUMA node
Degree-aware + NUMA-Opt. + Dir. Opt. BFS [ISC14] (our previous result, 2014)
•  Manages memory accesses on a NUMA system
  –  Each NUMA node contains a CPU socket and its local memory
•  Adds two degree-aware optimizations:
  1.  Deleting isolated vertices from the adjacency lists A(v)
  2.  Sorting the adjacency lists by degree to reduce the bottom-up loop
Improved locality [HPCS15]
•  Both the top-down and bottom-up phases remain NUMA-aware, with each
   CPU accessing only its local RAM
•  Our latest version is 489x faster than the reference code
Degree-aware + NUMA-Opt. + Dir. Opt. BFS
•  Forward graph GF for top-down: the input frontier spans V, while the
   visited data and the output neighbors are restricted to the partial
   vertex set Vk, all held in the local RAM of NUMA node k
   (A0 to A3 are the partial adjacency lists)
•  Backward graph GB for bottom-up, with the same local layout:

  Bottom-up(G, CQ, VS, π):
    NQ ← ∅
    for w ∈ V \ VS in parallel do
      for v ∈ AB(w) do
        if v ∈ CQ then
          π(w) ← v
          VS ← VS ∪ {w}
          NQ ← NQ ∪ {w}
          break
    return NQ
NUMA-based 1-D partitioned graph representation
•  NUMA nodes connect to one another via an interconnect such as the
   Intel QPI, AMD HyperTransport, or SGI NUMAlink 6. On such systems,
   processor cores can access their local memory faster than they can
   access remote (non-local) memory, i.e., memory local to another
   processor or memory shared between processors. To some degree, the
   performance of BFS depends on the speed of memory access, and the
   complexity of memory accesses is greater than that of computation.
   Therefore, we propose a general management approach for processor
   and memory affinities on a NUMA system.
•  We bind the graph representation and working variables so that a BFS
   runs over the local memory before the traversal. In our approach,
   all accesses to remote memory are avoided in the bottom-up phase
   using the following column-wise partitioning:

     V = V0 | V1 | · · · | Vℓ−1,
     A = A0 | A1 | · · · | Aℓ−1,

   where each set of partial vertices Vk on the k-th NUMA node is
   defined by

     Vk = { vj ∈ V | j ∈ [ (k/ℓ)·n, ((k+1)/ℓ)·n ) },

   n is the number of vertices, and the divisor ℓ is set to the number
   of NUMA nodes (CPU sockets). In addition, to avoid accessing remote
   memory, we define partial adjacency lists AF_k and AB_k for the
   top-down and bottom-up policies as follows:

     AF_k(v) = { w | w ∈ Vk ∩ A(v) },  v ∈ V,
     AB_k(w) = { v | v ∈ A(w) },       w ∈ Vk.

   Furthermore, the working spaces NQk, VSk, and πk for the partial
   vertices Vk are allocated to the local memory on the k-th NUMA node
   with the memory pinned. Note that the range of each current queue
   CQk is all vertices V in a given graph, and these are allocated to
   the local memory on the k-th NUMA node as well.
•  These sub-graphs represent the same area of the adjacency matrix,
   but not the same data structures.
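The column-wise partitioning can be sketched as follows (our own illustration with hypothetical helper names, not the paper's code): vertex index j belongs to the k-th NUMA node exactly when j falls in the half-open range [k·n/ℓ, (k+1)·n/ℓ).

```c
/* Boundaries of the k-th partial vertex set Vk, for n vertices split
 * over l NUMA nodes (integer arithmetic matches the [k*n/l, (k+1)*n/l)
 * ranges, so the Vk cover V exactly once). */
long part_begin(long k, long n, long l) { return k * n / l; }
long part_end(long k, long n, long l)   { return (k + 1) * n / l; }

/* NUMA node owning vertex j, or -1 if j is out of range. */
int owner(long j, long n, long l)
{
    for (int k = 0; k < l; ++k)
        if (j >= part_begin(k, n, l) && j < part_end(k, n, l))
            return k;
    return -1;
}
```

Working arrays indexed by Vk (visited bits, the next queue NQk, the partial predecessor map πk) are then allocated and pinned on the node returned by this mapping.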
Vertex sorting for BFS
•  The number of traversals of each vertex is equal to its out-degree
•  The locality of accesses to a vertex depends on its index
•  Vertex sorting relabels the vertices in order of decreasing degree,
   so high-degree vertices receive small indices and the many accesses
   concentrate on a compact index range, improving cache hit ratios
•  After applying the vertex sorting, the access frequency distribution
   is similar to the degree distribution
[Figure: degree distribution vs. access frequency with vertex sorting;
 original indices relabeled to sorted indices by degree]
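The relabeling step can be sketched as a degree-descending sort (our own illustration with hypothetical names, not the paper's code): compute a permutation that gives the highest-degree vertex the new index 0.

```c
#include <stdlib.h>

static const int *g_degree;     /* comparator context (single-threaded) */

static int by_degree_desc(const void *a, const void *b)
{
    int u = *(const int *)a, v = *(const int *)b;
    return g_degree[v] - g_degree[u];   /* larger degree sorts first */
}

/* Fill rank[] so that rank[v] is the new index of vertex v, with
 * higher-degree vertices receiving smaller new indices. */
void degree_rank(int n, const int *degree, int *rank)
{
    int *order = malloc(n * sizeof(int));
    for (int v = 0; v < n; ++v) order[v] = v;
    g_degree = degree;
    qsort(order, n, sizeof(int), by_degree_desc);
    for (int i = 0; i < n; ++i) rank[order[i]] = i;
    free(order);
}
```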
Two strategies for implementation
1.  Highest TEPS for Graph500
  –  The Graph500 list uses TEPS scores only
2.  Largest SCALE (problem size) for Green Graph500
  –  The Green Graph500 list is separated into two categories by
     problem size; the Big Data category collects entries over
     SCALE 30, the median size of all entries
On the 4-way NUMA system (4-socket Xeon E5-4640):
•  The highest-TEPS model obtains 42 GTEPS for SCALE 27
•  The largest-SCALE model can solve up to SCALE 30
   (#1 on the 5th Green Graph500 list)
[Figure: GTEPS vs. SCALE (20 to 30) for the DG-V, DG-S, and SG
 variants]
Comparison of two implementations
•  Dual-directional graphs (DG) or a single graph (SG)?
•  Highest-TEPS mode (DG)
  –  Forward graph GF for top-down: input frontier over V; visited data
     and output neighbors restricted to Vk, all in local RAM (A0 to A3)
  –  Backward graph GB for bottom-up: same local layout
•  Largest-SCALE mode (SG)
  –  The transposed graph GB is reused for top-down: the output
     neighbors span all of V, using both local and remote RAM
  –  Backward graph GB for bottom-up: same as in the DG mode
•  The bottom-up sides of the two modes are the same; only the top-down
   side differs
Results on the 4-way NUMA system (Xeon)
•  CPU: 4-way Intel Xeon E5-4640 (64 threads), SandyBridge-EP;
   RAM: 512 GB; CC: GCC-4.4.7
•  GTEPS vs. SCALE (20 to 30): the highest-TEPS models (DG-V, DG-S)
   handle up to SCALE 29 with the highest TEPS scores, while the
   largest-SCALE model (SG) can solve up to SCALE 30
   (the #1 entry on the 5th Green Graph500 list)
•  Strong scaling for SCALE 27, with 1 to 64 threads arranged as
   #NUMA nodes x #cores x #threads from 1x1x1 up to 4x8x2:
  –  With 64 threads, these models achieve speedups of over 20x
     compared with the sequential run
  –  Comparing 32 and 64 threads, Hyper-Threading (HT) produces a
     speedup of more than 20%
SGI UV 2000 system
•  SGI UV 2000
  –  Shared-memory supercomputer based on the cc-NUMA architecture
  –  Runs a single Linux OS
  –  Users can handle a large memory space through thread
     parallelization, e.g. OpenMP or Pthreads (MPI can also be used)
  –  A full-spec UV 2000 (4 racks) has 2,560 cores and 64 TB of memory
•  ISM, SGI, and we collaborate on the Graph500 benchmarks
  –  The Institute of Statistical Mathematics (ISM) is Japan's national
     research institute for statistical science
  –  ISM has two full-spec UV 2000 systems (8 racks in total)
System configuration of UV 2000
•  The UV 2000 has hierarchical hardware topologies: sockets, nodes,
   cubes, inner-racks, and inter-racks, connected by NUMAlink
   (6.7 GB/s)
  –  Node = 2 sockets = 2 NUMA nodes (20 cores, 512 GB)
  –  Cube = 8 nodes = 16 NUMA nodes (160 cores, 4 TB)
  –  Rack = 32 nodes = 64 NUMA nodes (640 cores, 16 TB)
•  We used NUMA-based flat parallelization
  –  Each NUMA node contains a Xeon CPU E5-2470 v2 and 256 GB of RAM
Results on UV 2000 (highest-TEPS model)
[Figure: GTEPS vs. SCALE and #sockets, from SCALE 26 (1 socket) to
 SCALE 33 (128 sockets); variants: DG-V (SCALE 26 per NUMA node),
 DG-V (SCALE 25 per NUMA node), DG-S (SCALE 26 per NUMA node), and
 SG (SCALE 26 per NUMA node)]
•  DG-V is the fastest from 1 to 32 sockets
•  At 64 sockets, DG-S becomes faster than DG-V: both implementations
   use all-to-all communication, and DG-V suffers a performance
   degradation at this size
•  With 128 CPU sockets (1,280 threads) on two UV 2000 racks, SG (the
   large-problem model) is the fastest and most scalable: 174 GTEPS for
   SCALE 33 (8.59 billion vertices and 137.44 billion edges), the
   fastest single-node entry, ranked 9th on the SC14 and 10th on the
   ISC15 Graph500 lists
Breakdown with 2 racks of UV 2000
•  Breakdown of SCALE 33 on UV 2000 with 128 CPUs: CPU time (ms) per
   BFS level (Init. and levels 0 to 7), split into traversal time and
   remote-memory communication time
•  Levels 0 and 1 run top-down; levels 2 to 7 run bottom-up
•  Most of the CPU time is spent at the middle levels
•  Computation (57%) > communication (43%), so the implementation
   remains scalable
Our achievements in the Graph500 benchmarks
•  4-way Intel Xeon server
  –  The DG-V (highest-TEPS) model achieves the fastest single-server
     entries (#7 on the 3rd list, #9 on the 4th list)
  –  The SG (largest-SCALE) model won #1 on the 3rd, 4th, and 5th
     Green Graph500 lists:
       3rd list: 59.12 MTEPS/W, 28.48 GTEPS
       4th list: 61.48 MTEPS/W, 28.61 GTEPS
       5th list: 62.93 MTEPS/W, 31.33 GTEPS
•  UV 2000
  –  The DG-S (middle) model achieves 131 GTEPS with 640 threads and is
     the most power-efficient result among commercial supercomputers
  –  The SG (largest-SCALE) model achieves 174 GTEPS for SCALE 33 with
     1,280 threads, the fastest single-node entry (ISC15, last week)
BFS performances
[Figure: GTEPS (log scale) vs. SCALE (26 to 41) for UV 2000 (one CPU
 per SCALE 26), IBM BG/Q (one CPU per SCALE 24), and HP DragonHawk
 (480 threads)]
•  IBM BG/Q (8,192 cores): SCALE 33, 172 GTEPS, 21.0 MTEPS/core
•  SGI UV 2000 (1,280 cores), our NUMA-aware result:
   SCALE 33, 175 GTEPS, 136.5 MTEPS/core
•  HP Superdome X (240 cores), our NUMA-aware result:
   SCALE 33, 127 GTEPS, 530.3 MTEPS/core
Conclusion
1.  An efficient graph algorithm considering the processor topology on
    a single-node NUMA system
2.  NUMA-aware programming utilizing our library ULIBC, which pins
    threads and memory so that each thread accesses local memory
3.  Our implementation works well on many computers:
  –  Scales up to 1,280 threads on the UV 2000 at ISM
  –  The UV 2000 achieves the fastest single-node entries on the 9th
     and 10th Graph500 lists
  –  The Xeon server won the most energy-efficient entries on the 3rd,
     4th, and 5th Green Graph500 lists
Our library ULIBC is available at Bitbucket:
https://bitbucket.org/yuichiro_yasui/ulibc
Reference
•  [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel
   Breadth-first Search on Multicore Single-node System, IEEE BigData
   2013. (this talk)
•  [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient
   Breadth-first Search on a Single NUMA System, ISC'14, 2014.
   (this talk)
•  [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based
   Thread-parallel Breadth-first Search, HPCS 2015, ACM/IEEE/IFIP,
   2015. (this talk)
•  [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno,
   Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing &
   Optimization Infrastructure for Extremely Large-Scale Graphs on
   Post Peta-Scale Supercomputers, Proceedings of the Optimization in
   the Real World -- Toward Solving Real-World Optimization Problems --,
   Springer, 2015. (other results of our Graph500 team)
Development of Routing for Car Navigation SystemsDevelopment of Routing for Car Navigation Systems
Development of Routing for Car Navigation Systems
 
Orthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
Orthogonal Faster than Nyquist Transmission for SIMO Wireless SystemsOrthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
Orthogonal Faster than Nyquist Transmission for SIMO Wireless Systems
 
On Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and ExperimentsOn Extending MapReduce - Survey and Experiments
On Extending MapReduce - Survey and Experiments
 
Chenchu
ChenchuChenchu
Chenchu
 
Axes Tech
Axes TechAxes Tech
Axes Tech
 

Semelhante a NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph500 Benchmarks on SGI UV 2000

Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
Mp So C 18 Apr
Mp So C 18 AprMp So C 18 Apr
Mp So C 18 AprFNian
 
Netmap presentation
Netmap presentationNetmap presentation
Netmap presentationAmir Razmjou
 
Rethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceRethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceIntel Nervana
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!Linaro
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioFast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioOPNFV
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)micchie
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)byteLAKE
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxAkshitAgiwal1
 

Semelhante a NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph500 Benchmarks on SGI UV 2000 (20)

Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Mp So C 18 Apr
Mp So C 18 AprMp So C 18 Apr
Mp So C 18 Apr
 
Netmap presentation
Netmap presentationNetmap presentation
Netmap presentation
 
mTCP使ってみた
mTCP使ってみたmTCP使ってみた
mTCP使ってみた
 
Lec05
Lec05Lec05
Lec05
 
Processors selection
Processors selectionProcessors selection
Processors selection
 
Rethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligenceRethinking computation: A processor architecture for machine intelligence
Rethinking computation: A processor architecture for machine intelligence
 
uCluster
uClusteruCluster
uCluster
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.ioFast datastacks - fast and flexible nfv solution stacks leveraging fd.io
Fast datastacks - fast and flexible nfv solution stacks leveraging fd.io
 
Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)Recent advance in netmap/VALE(mSwitch)
Recent advance in netmap/VALE(mSwitch)
 
PF_DIRECT@TMA12
PF_DIRECT@TMA12PF_DIRECT@TMA12
PF_DIRECT@TMA12
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
NSCC Training - Introductory Class
NSCC Training - Introductory ClassNSCC Training - Introductory Class
NSCC Training - Introductory Class
 

Último

Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 

Último (20)

Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 

NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph500 Benchmarks on SGI UV 2000

  • 5. Breadth-first search (BFS)
•  One of the most important and fundamental algorithms for traversing graph structures
•  Many algorithms and applications are based on BFS (e.g., maximum flow and centrality)
•  A linear-time algorithm, but it requires many wide memory accesses without reuse
•  Inputs: a graph and a source vertex; outputs: a predecessor tree (BFS tree) and the distance of each vertex from the source (Lv. 1, Lv. 2, Lv. 3, ...)
  • 6. Betweenness centrality (BC)
•  Computes an importance for each vertex and edge utilizing all-to-all shortest paths (breadth-first search) w/o vertex coordinates
•  BC requires #vertices-times BFS, because one BFS obtains the one-to-all shortest paths

    CB(v) = Σ_{s ≠ v ≠ t ∈ V} σst(v) / σst

    σst    : number of shortest (s, t)-paths
    σst(v) : number of shortest (s, t)-paths passing through vertex v

Example: Osaka road network (13,076 vertices and 40,528 edges); importance ranges from low to high (highway, bridge, Osaka station).
Our software "NETAL" can solve BC for the Osaka road network within one second.
Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama, and M. Takamatsu: NETAL: High-performance Implementation of Network Analysis Library Considering Computer Memory Hierarchy, JORSJ, Vol. 54-4, 2011.
  • 7. Single-node NUMA system
•  Single-node or multi-node system?
•  Uniform memory access, or not?
–  UMA (uniform memory access): every CPU accesses RAM at the same cost
–  NUMA (non-uniform memory access): fast local access to the memory in a NUMA node, slow non-local access; currently the major CPU architecture, in many configurations
•  Example single-node (single-OS) systems: UV 2000 @ ISM (256 CPUs), Intel Xeon server (4 CPUs), laptop PC (1 CPU), smartphone (1 CPU)
  • 8. Threading or process-parallel
•  Which parallel programming model should we choose?
–  OpenMP (Pthreads), single-process: shared memory; implicit memory access, which reduces programming cost
–  MPI–OpenMP hybrid, multi-process: distributed memory; explicit memory access between processes (using MPI_send() & MPI_recv()) for good locality
  • 9. NUMA system
•  NUMA = non-uniform memory access; each NUMA node consists of a CPU socket (processor cores with L2 caches and a shared L3 cache) and its local RAM
•  Local memory access is fast; remote (non-local) memory access is slow
•  Example: 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
–  4 (# of CPU sockets)
–  8 (# of physical cores per socket)
–  2 (# of threads per core)
–  4 x 8 x 2 = 64 threads (max.)
  • 10. Memory bandwidth between NUMA nodes
•  Local memory is 2.6x faster than remote memory
–  Local access (NUMA 0 → NUMA 0): 13 GB/s
–  Remote access (NUMA 0 → NUMA 1/2/3): 5 GB/s, i.e., approx. 2.6x slower
•  Measured with the STREAM TRIAD kernel over vectors a[N], b[N], c[N]:

    double a[N], b[N], c[N];
    void STREAM_Triad(double scalar) {
      long j;
    #pragma omp parallel for
      for (j = 0; j < N; ++j)
        a[j] = b[j] + scalar * c[j];
    }

    (Figure: bandwidth (GB/s) vs. number of elements log2 n, for NUMA 0 → NUMA 0/1/2/3.)
  • 11. Problem and motivation
•  Goal: to develop an efficient graph algorithm on a NUMA system
–  Default: threads (T) migrate to other cores and their data (D) ends up accessed in remote memory
–  NUMA-aware: pinning threads and memory so that each thread accesses only its local memory
  • 12. Programming cost for NUMA-aware
•  Not easy to apply NUMA-aware threading
–  Linux provides the processor topology only as a machine file (/proc/cpuinfo): 8.0 KB on a desktop PC, 2.4 MB on UV 2000
–  The pinning function sched_setaffinity() uses processor IDs

    /proc/cpuinfo entry (processor ID, NUMA node ID = physical id, core ID in NUMA node):
    processor       : 23
    model name      : Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
    stepping        : 7
    cpu MHz         : 1200.000
    cache size      : 20480 KB
    physical id     : 2
    siblings        : 16
    core id         : 7
    cpu cores       : 8
    apicid          : 78
    cache_alignment : 64
    address sizes   : 46 bits physical, 48 bits virtual

    #define _GNU_SOURCE
    #include <sched.h>
    int bind_thread(int procid) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(procid, &set);
      return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
    }
  • 13. NUMA-aware computation with ULIBC
•  ULIBC is a callable library for CPU and memory affinity settings
•  Detects the processor topology of a system at run time
•  https://bitbucket.org/yuichiro_yasui/ulibc

    struct numainfo_t {
      int id;   /* Thread ID */
      int proc; /* Processor ID */
      int node; /* NUMA node ID */
      int core; /* Core ID */
    };

    #include <ulibc.h>
    #include <omp.h>
    #include <stdio.h>
    int main(void) {
      ULIBC_init();                              /* Init. */
      #pragma omp parallel
      {                                          /* Thread pinning */
        const struct numainfo_t ni = ULIBC_get_current_numainfo();
        printf("[%02d] Node: %d of %d, Core: %d of %d\n",
               ni.id, ni.node, ULIBC_get_online_nodes(),
               ni.core, ULIBC_get_online_cores(ni.node));
      }
      return 0;
    }

    Sample output (thread ID, NUMA node ID, core ID):
    [04] Node 0 of 4, Core: 1 of 16
    [55] Node 3 of 4, Core: 13 of 16
    [16] Node 0 of 4, Core: 4 of 16
    [37] Node 1 of 4, Core: 9 of 16
    [30] Node 2 of 4, Core: 7 of 16
    ...
  • 14. CPU affinity construction with ULIBC
1.  Detects the entire topology:
    CPU 0: P0, P4, P8, P12
    CPU 1: P1, P5, P9, P13
    CPU 2: P2, P6, P10, P14
    CPU 3: P3, P7, P11, P15
2.  Detects the online (available) topology — e.g., a job manager (PBS) or "numactl --cpunodebind=1,2" leaves only CPU 1 and CPU 2 usable, while the other cores serve other processes:
    NUMA 0: P1, P5, P9, P13
    NUMA 1: P2, P6, P10
3.  Constructs the ULIBC affinity (2 types):
    ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
      NUMA 0: threads 0(P1), 2(P5), 4(P9), 6(P13)
      NUMA 1: threads 1(P2), 3(P6), 5(P10)
    ULIBC_set_affinity_policy(7, COMPACT_MAPPING, THREAD_TO_CORE)
      NUMA 0: threads 0(P1), 1(P5), 2(P9), 3(P13)
      NUMA 1: threads 4(P2), 5(P6), 6(P10)
  • 15. Graph500 benchmark (www.graph500.org)
•  Measures the performance of irregular memory accesses
•  TEPS score (# of traversed edges per second) in a BFS
•  Input parameters for the problem size: SCALE and edgefactor (= 16)
1.  Generation: generates a synthetic scale-free network with 2^SCALE vertices and 2^SCALE x edgefactor edges by SCALE-times recursive Kronecker products (G1, G2, G3, G4, ...)
2.  Construction x 64: builds the graph data structure
3.  BFS x 64: runs and validates 64 BFS iterations; the reported TEPS ratio is the median TEPS
  • 16. Green Graph500 benchmark (http://green.graph500.org)
•  Measures power efficiency using the TEPS/W score: the median TEPS of the 64-iteration BFS phase divided by the watts measured during that phase
•  Our results cover various systems such as the SGI UV series, Xeon servers, and Android devices
  • 17. Level-synchronized parallel BFS (top-down)
•  The input of a BFS is a graph G = (V, A) with a set of vertices V, where the adjacency list A(v) contains the edges (v, w) ∈ E for each vertex v ∈ V; a BFS from the source vertex s ∈ V produces a predecessor map π
•  Starts from the source vertex and executes the following two phases at each level:
–  Traversal: finds the (unvisited) neighbors QN from the current frontier QF
–  Swap: exchanges the frontier QF and the neighbors QN for the next level (level k → level k+1)

    Algorithm 1: Level-synchronized Parallel BFS.
    Input    : G = (V, A) : unweighted directed graph; s : source vertex.
    Variables: QF : frontier queue; QN : neighbor queue;
               visited : vertices already visited.
    Output   : π(v) : predecessor map of BFS tree.
     1  π(v) ← −1, ∀v ∈ V
     2  π(s) ← s
     3  visited ← {s}
     4  QF ← {s}
     5  QN ← ∅
     6  while QF ≠ ∅ do
     7    for v ∈ QF in parallel do
     8      for w ∈ A(v) do
     9        if w ∉ visited atomic then
    10          π(w) ← v
    11          visited ← visited ∪ {w}
    12          QN ← QN ∪ {w}
    13    QF ← QN
    14    QN ← ∅
  • 18. Direction-optimizing BFS (Beamer @ SC12)
•  Chooses the direction of each level from top-down or bottom-up
–  Top-down direction: efficient for a small frontier; uses outgoing edges; traverses from the current frontier to the unvisited neighbors
–  Bottom-up direction: efficient for a large frontier; uses incoming edges; each candidate neighbor scans for a parent in the current frontier, skipping unnecessary edge traversals
•  Observation of data writes in the forward (v → w) and backward (w → v) search:

    Top-down (forward) step:
    Input : directed graph G = (V, AF), queue QF
    Data  : queue QN, visited, tree π(v)
    QN ← ∅
    for v ∈ QF in parallel do
      for w ∈ AF(v) do
        if w ∉ visited atomic then
          π(w) ← v
          visited ← visited ∪ {w}
          QN ← QN ∪ {w}
    QF ← QN

    Bottom-up (backward) step:
    Input : directed graph G = (V, AB), queue QF
    Data  : queue QN, visited, tree π(v)
    QN ← ∅
    for w ∈ V \ visited in parallel do
      for v ∈ AB(w) do
        if v ∈ QF then
          π(w) ← v
          visited ← visited ∪ {w}
          QN ← QN ∪ {w}
          break
    QF ← QN
  • 19. # of traversed edges of the Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30)
•  Hybrid-BFS reduces unnecessary edge traversals by choosing the direction of each level from top-down (for a small frontier) or bottom-up (for a large frontier) (Beamer @ SC12):

    Level   Top-down        Bottom-up       Hybrid
    0       2               2,103,840,895   2
    1       66,206          1,766,587,029   66,206
    2       346,918,235     52,677,691      52,677,691
    3       1,727,195,615   12,820,854      12,820,854
    4       29,557,400      103,184         103,184
    5       82,357          21,467          21,467
    6       221             21,240          227
    Total   2,103,820,036   3,936,072,360   65,689,631
    Ratio   100.00%         187.09%         3.12%

  (Level = distance from the source; the frontier is large at levels 2–3, where the hybrid switches to the bottom-up direction.)
• 20. NUMA-Opt. + Dir.-Opt. BFS [BD13] — our previous result (2013)
[Chart: GTEPS progression from the reference code (87M, ×1) through NUMA-aware (800M, ×9), Dir.Opt. (5G, ×58), NUMA-Opt. (11G, ×125), NUMA-Opt.+Deg.aware (29G, ×334), to NUMA-Opt.+Deg.aware+Vtx.Sort (42G, ×489)]
•  Carefully manages memory accesses on a NUMA system: in the NUMA-aware version, both Top-down and Bottom-up phases stay within each CPU–RAM pair (4-way 8-core Xeon E5-4640; each NUMA node 0–3 has its own RAM and a shared L3 cache)
•  Binds a partial adjacency matrix to each NUMA node, using our library for thread and memory binding
• 21. Degree-aware + NUMA-Opt. + Dir.-Opt. BFS [ISC14] — our previous result (2014)
[Chart: same GTEPS progression as the previous slide]
•  Manages memory accesses on the NUMA system; each NUMA node contains a CPU socket and its local memory
•  1. Deleting isolated vertices, so they never appear in the adjacency lists A(v_a)
•  2. Sorting adjacency lists by degree to shorten the Bottom-up loop
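The two degree-aware preprocessing steps can be sketched as follows. This is an illustrative sketch over a plain adjacency dict, not the authors' CSR-based implementation; the function name and data layout are assumptions.

```python
def degree_aware_preprocess(adj):
    """Sketch of the two degree-aware steps:
    1. Drop isolated (degree-0) vertices and relabel the rest compactly.
    2. Sort each adjacency list by neighbor degree, descending, so the
       Bottom-up scan meets likely parents (high-degree vertices) first
       and can break out of its inner loop earlier."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = [v for v in adj if degree[v] > 0]            # step 1
    new_id = {v: i for i, v in enumerate(alive)}
    out = {}
    for v in alive:
        nbrs = sorted(adj[v], key=lambda w: -degree[w])  # step 2
        out[new_id[v]] = [new_id[w] for w in nbrs]
    return out, new_id
```

Removing isolated vertices shrinks the `visited` bitmap and the Bottom-up scan range; the degree sort shortens the average number of in-edges examined before the `break` fires.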
• 22. Degree-aware + NUMA-Opt. + Dir.-Opt. BFS [HPCS15] — improved locality
[Chart: same GTEPS progression as the previous slides]
•  Our latest version is 489× faster than the reference code
• 23. NUMA-based 1-D partitioned graph representation
•  Forward graph G^F for Top-down; backward graph G^B for Bottom-up. These sub-graphs cover the same area of the adjacency matrix, but are not the same data structures.
•  On a NUMA system, the sockets connect to one another via an interconnect such as Intel QPI, AMD HyperTransport, or SGI NUMAlink 6. Processor cores can access their local memory faster than they can access remote (non-local) memory, i.e., memory local to another processor or shared between processors. The performance of BFS depends largely on the speed of memory access, since the complexity of its memory accesses is greater than that of its computation. We therefore place the graph representation and the working variables in local memory before the traversal, so that all accesses to remote memory are avoided in the traversal phase.
•  Column-wise partitioning over the ℓ NUMA nodes (CPU sockets), with n the number of vertices:
  V = V_0 | V_1 | · · · | V_{ℓ−1},  A = A_0 | A_1 | · · · | A_{ℓ−1}
•  The partial vertex set on the k-th NUMA node:
  V_k = { v_j ∈ V | j ∈ [ (k/ℓ)·n, ((k+1)/ℓ)·n ) }
•  To avoid accessing remote memory, we define partial adjacency lists A^F_k and A^B_k for the Top-down and Bottom-up policies:
  A^F_k(v) = { w | w ∈ V_k ∩ A(v) },  v ∈ V
  A^B_k(w) = { v | v ∈ A(w) },  w ∈ V_k
•  The working spaces NQ_k, VS_k, and π_k for the partial vertices V_k are allocated in the local memory of the k-th NUMA node, with the memory pinned. Note that each current queue CQ_k ranges over all vertices V of the given graph, and is also allocated in the local memory of the k-th node.
•  Bottom-up step on node k (input: frontier CQ over V; data: visited VS over V_k; output: neighbors NQ over V_k; all in local RAM):
  NQ ← ∅
  for w ∈ V_k \ VS in parallel do
    for v ∈ A^B(w) do
      if v ∈ CQ then
        π(w) ← v; VS ← VS ∪ {w}; NQ ← NQ ∪ {w}; break
  return NQ
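The column-wise partitioning above can be sketched directly from the definitions. This is an illustrative sketch with plain Python dicts; in the real implementation each piece is allocated and pinned on NUMA node k, which Python cannot express.

```python
def partition_1d(adj_out, adj_in, n, ell):
    """Column-wise 1-D partitioning: NUMA node k owns the vertex block
    V_k = [k*n/ell, (k+1)*n/ell).  A^F_k keeps, for every vertex v, only
    the out-neighbors that fall in V_k (Top-down writes stay node-local);
    A^B_k keeps the full in-neighbor lists of the owned vertices w in V_k
    (Bottom-up reads the shared frontier but writes only to local w)."""
    parts = []
    for k in range(ell):
        lo, hi = k * n // ell, (k + 1) * n // ell
        AFk = {v: [w for w in adj_out[v] if lo <= w < hi] for v in range(n)}
        ABk = {w: list(adj_in[w]) for w in range(lo, hi)}
        parts.append((range(lo, hi), AFk, ABk))
    return parts
```

The A^F_k blocks are disjoint column slices of the adjacency structure, so their union recovers the full out-adjacency, while each A^B_k is a complete row slice for the owned vertices only.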
• 24. Vertex sorting for BFS
•  The number of times a vertex is traversed equals its out-degree, and the locality of vertex accesses depends on the vertex index
•  Sorting vertices in descending order of degree gives high-degree vertices small indices, so the many accesses to them concentrate on a small index range
[Figure: degree distribution vs. access frequency with vertex sorting; example remapping from original indices to sorted indices]
•  After vertex sorting, the access-frequency distribution matches the degree distribution, improving cache hit ratios
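The relabeling can be sketched as follows. This is an illustrative sketch (function name and list-based layout are assumptions), showing only the index permutation that puts high-degree vertices at small indices.

```python
def sort_vertices_by_degree(adj):
    """Relabel vertices in descending-degree order: high-degree vertices
    get small indices, so the most frequently touched entries of the
    visited bitmap and the tree array π cluster at the front of their
    arrays and cache better."""
    n = len(adj)
    order = sorted(range(n), key=lambda v: -len(adj[v]))  # old ids, hot first
    new_id = [0] * n
    for new, old in enumerate(order):
        new_id[old] = new
    # Rebuild adjacency lists under the new numbering.
    new_adj = [sorted(new_id[w] for w in adj[old]) for old in order]
    return new_adj, new_id
```

Since real-world and Kronecker graphs have skewed degree distributions, a small prefix of the sorted index range receives most of the accesses, which is exactly the cache-friendly shape the slide's access-frequency plot shows.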
• 25. Two strategies for implementation
1.  Highest TEPS for Graph500: the Graph500 list ranks by TEPS score only
2.  Largest SCALE (problem size) for the Green Graph500: the Green Graph500 list is split into two categories by problem size (the median size of all entries); the big-data category collects entries over SCALE 30
[Chart: GTEPS vs. SCALE 20–30 for the DG-V, DG-S, and SG implementations]
On the 4-way NUMA system (8-core Xeon E5-4640 nodes, each with local RAM and shared L3 cache):
•  The Highest-TEPS model obtains 42 GTEPS for SCALE 27
•  The Largest-SCALE model can solve up to SCALE 30 (#1 on the 5th Green Graph500)
• 26. Comparison of the two implementations: dual directed graphs vs. a single graph
Highest-TEPS mode (dual graphs):
•  Forward graph G^F for Top-down — input: frontier over V; data: visited over V_k; output: neighbors over V_k; local RAM only
•  Backward graph G^B for Bottom-up — data: frontier over V, visited over V_k; output: neighbors over V_k; local RAM only
Largest-SCALE mode (single graph):
•  The Bottom-up use of the backward graph G^B is the same as above
•  Top-down reuses the transposed G^B — input: visited over V; data: frontier over V; output: neighbors over V; local and remote RAM, saving the memory of the second graph at the cost of remote accesses
• 27. Results on the 4-way NUMA system (Xeon)
CPU: 4-way Intel Xeon E5-4640 (SandyBridge-EP base architecture, 64 threads); RAM: 512 GB; CC: GCC 4.4.7
[Chart: GTEPS vs. SCALE 20–30 for DG-V, DG-S, and SG]
•  The Highest-TEPS model (DG-V) achieves the highest TEPS score and handles up to SCALE 29
•  The Largest-SCALE model (SG) can solve up to SCALE 30 (#1 entry on the 5th Green Graph500)
Strong scaling for SCALE 27 (threads = #NUMA nodes × #cores × #threads per core, from 1 up to 64):
•  With 64 threads, these models run over 20× faster than sequential
•  Comparing 32 and 64 threads, Hyper-Threading (HT) produces a speedup of more than 20%
• 28. SGI UV 2000 system
•  SGI UV 2000:
  –  Shared-memory supercomputer based on a cc-NUMA architecture, running a single Linux OS
  –  Users handle the large memory space with thread parallelization, e.g. OpenMP or Pthreads (MPI can also be used)
  –  A full-spec UV 2000 (4 racks) has 2,560 cores and 64 TB of memory
•  ISM, SGI, and our team collaborate on the Graph500 benchmarks
•  The Institute of Statistical Mathematics (ISM): Japan's national research institute for statistical science. ISM has two full-spec UV 2000 systems (8 racks in total).
• 29. System configuration of UV 2000
•  The UV 2000 has hierarchical hardware topologies, connected via NUMAlink 6 (6.7 GB/s): sockets, nodes, cubes, inner-rack, and inter-rack
  –  Node = 2 sockets; Cube = 8 nodes; Rack = 32 nodes
•  We used NUMA-based flat parallelization; each NUMA node contains a Xeon CPU E5-2470 v2 and 256 GB RAM
  –  Node = 2 NUMA nodes (20 cores, 512 GB); Cube = 16 NUMA nodes (160 cores, 4 TB); Rack = 64 NUMA nodes (640 cores, 16 TB)
  –  The levels of the hierarchy above the node cannot be detected
• 30. Results on UV 2000 (Highest-TEPS model)
[Chart: GTEPS vs. SCALE 26–33 (1–128 sockets), weak scaling for DG-V (SCALE 26 or 25 per NUMA node), DG-S (SCALE 26 per NUMA node), and SG (SCALE 26 per NUMA node)]
•  DG-V is the fastest from 1 to 32 sockets
• 31. Results on UV 2000 (Highest-TEPS model)
[Chart: same weak-scaling chart as the previous slide]
•  DG-V is the fastest from 1 to 32 sockets, but at 64 sockets DG-S becomes faster than DG-V
•  Both implementations use all-to-all communication, which causes performance degradation at scale
• 32. Results on UV 2000 (Highest-TEPS and Large-problem models)
[Chart: same weak-scaling chart as the previous slides]
•  DG-V is the fastest from 1 to 32 sockets; at 64 sockets DG-S is faster than DG-V
•  SG is the fastest and most scalable on 128 CPU sockets (1,280 threads, two UV 2000 racks): 174 GTEPS at SCALE 33 (8.59 billion vertices, 137.44 billion edges)
•  Fastest single-node entry: 9th on the SC14 and 10th on the ISC15 Graph500 lists
• 33. Breakdown with the 2-rack UV 2000 (SCALE 33, 128 CPUs)
[Chart: per-level CPU time (ms), split into traversal and remote-memory communication; levels 0–1 run Top-down, levels 2–7 run Bottom-up]
•  Computation (57%) exceeds communication (43%), and the computation part is scalable
•  Most of the CPU time is spent at the middle levels
• 34. Our achievements in the Graph500 benchmarks
•  4-way Intel Xeon server:
  –  The DG-V (Highest-TEPS) model achieves the fastest single-server entries
  –  The SG (Largest-SCALE) model won #1 on the 3rd (59.12 MTEPS/W, 28.48 GTEPS), 4th (61.48 MTEPS/W, 28.61 GTEPS), and 5th (62.93 MTEPS/W, 31.33 GTEPS) Green Graph500 lists
•  UV 2000:
  –  The DG-S (middle) model achieves 131 GTEPS with 640 threads, the most power-efficient result among commercial supercomputers (#7 on the 3rd list, #9 on the 4th list)
  –  The SG (Largest-SCALE) model achieves 174 GTEPS for SCALE 33 with 1,280 threads, the fastest single-node entry (ISC15, last week)
• 35. BFS performance comparison
[Chart: GTEPS vs. SCALE 26–41 for UV 2000 (one CPU per SCALE 26), BG/Q (one CPU per SCALE 24), and DragonHawk (480 threads)]
•  SGI UV 2000 (1,280 cores; our NUMA-aware result): SCALE 33, 175 GTEPS, 136.5 MTEPS/core
•  IBM BG/Q (8,192 cores): SCALE 33, 172 GTEPS, 21.0 MTEPS/core
•  HP Superdome X (240 cores; our NUMA-aware result): SCALE 33, 127 GTEPS, 530.3 MTEPS/core
• 36. Conclusion
1.  An efficient graph algorithm that considers the processor topology on a single-node NUMA system
2.  NUMA-aware programming using our library ULIBC: pinning threads and memory so that each thread accesses local memory
3.  Our implementation works well on many computers:
  –  Scales up to 1,280 threads on the UV 2000 at ISM
  –  The UV 2000 achieves the fastest single-node entry on the 9th and 10th Graph500 lists
  –  The Xeon server won the most energy-efficient entries on the 3rd, 4th, and 5th Green Graph500 lists
Our library ULIBC is available at Bitbucket: https://bitbucket.org/yuichiro_yasui/ulibc
• 37. References
This talk:
•  [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
•  [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a single NUMA system, IEEE ISC'14, 2014.
•  [HPCS15] Y. Yasui and K. Fujisawa: Fast and scalable NUMA-based thread parallel breadth-first search, HPCS 2015, ACM, IEEE, IFIP, 2015.
Other results of our Graph500 team:
•  [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of Optimization in the Real World — Toward Solving Real-World Optimization Problems, Springer, 2015.