NUMA-aware scalable graph traversal
on SGI UV systems
*Yuichiro Yasui, Katsuki Fujisawa
Kyushu University
Eng Lim Goh, John Baron, Atsushi Sugiura
SGI Corp.
Takashi Uchiyama
SGI Japan, Ltd.
HPGP'16 @ Kyoto, May 31, 2016
Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
Our motivations
• NUMA / cc-NUMA architecture • Graph algorithm; BFS
• Efficient NUMA-aware BFS algorithm
– improves locality of reference in memory accesses
– exploits multithreading on many-socket system (SGI UV2000, UV300)
[Figure: a many-socket system in which each CPU accesses its own RAM locally and the RAM of other sockets remotely; a graph structure represents many relationships, and a partial CSR graph is assigned to each socket]
• Kronecker graph w/ SCALE 34: 17 billion nodes and 275 billion edges
• SGI UV 300: 32 sockets of 18-core Xeon and 16 TB RAM
Our contributions (Previous work and this paper)
• Efficient graph data structure
– NUMA-aware graph (BD13): the CSR graph is divided into sub-graphs A0, A1, A2, A3, each bound to a NUMA node
– Adjacency list sorting (ISC14) and vertex sorting (HPCS15), both sorting by outdegree
• Efficient BFS based on Beamer's Direction-optimizing (SC12)
– Top-down direction: Agarwal's Top-down (SC10) with socket-queues, plus pruning edges (this paper: reduction of remote edges)
– Bottom-up direction: NUMA-aware Bottom-up (BD13); input CQ, data VSk, output NQk, all local
• New result: 219 GTEPS (NEW) on UV 300 w/ 32 sockets; updated the highest score on a single node
Graph processing for large-scale networks
• Large-scale networks are generated in a wide range of application areas
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
– Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing with HPC
– categorized as a data-intensive application
Graph analysis and important kernel BFS
• Used to understand relationships in real-world networks
• Application fields
– Transportation
– Social network
– Cyber-security
– Bioinformatics
• Step 1: constructing a graph from the relationships
• Step 2: graph processing to obtain results
• Step 3: understanding the results
• Graph processing kernels
・Breadth-first search ・Single-source shortest path
・Maximum flow ・Maximal independent set
・Centrality metrics ・Clustering ・Graph mining
Breadth-first search (BFS)
• One of the most important and fundamental algorithms for traversing graph structures
• Many algorithms and applications are based on BFS (e.g., max. flow and centrality)
• The well-known algorithm takes O(n+m) time for a digraph G with n vertices and m edges
Inputs
• digraph G = (V, E)
• Source vertex
Outputs
• BFS tree
• Distance
[Figure: BFS levels Lv. 1, Lv. 2, and Lv. 3 expanding from the source vertex]
BFS on Twitter follow-ship network (Twitter2009)
• Follow-ship network
– #Users (#vertices): 41,652,230
– #Follow-ships (#edges): 2,405,026,092
• BFS result (BFS tree) from user 21,804,357, excluding unconnected users

Lv.    #users      ratio (%)  percentile (%)
0      1           0.00       0.00
1      7           0.00       0.00
2      6,188       0.01       0.01
3      510,515     1.23       1.24
4      29,526,508  70.89      72.13
5      11,314,238  27.16      99.29
6      282,456     0.68       99.97
7      11,536      0.03       100.00
8      673         0.00       100.00
9      68          0.00       100.00
10     19          0.00       100.00
11     10          0.00       100.00
12     5           0.00       100.00
13     2           0.00       100.00
14     2           0.00       100.00
15     2           0.00       100.00
Total  41,652,230  100.00     -

• Six degrees of separation: “everyone and everything is six or fewer steps away”
• Ours: 60 milliseconds per BFS
Graph500 and Green Graph500
• New benchmarks using graph processing (breadth-first search)
• Measure the performance and energy efficiency of irregular memory accesses
– Graph500 benchmark: TEPS score (# of traversed edges per second), for measuring the performance of irregular memory accesses
– Green Graph500 benchmark: TEPS per Watt score, for measuring power-efficient performance
• Input parameters: SCALE & edgefactor (= 16)
– Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges, generated by applying the recursive Kronecker product SCALE times (G1, G2, G3, G4)
• Benchmark flow
1. Generation
2. Construction
3. BFS x 64, each followed by validation
• Results: SCALE, edgefactor, BFS time, traversed edges, TEPS
– Graph500: median of the 64 TEPS values
– Green Graph500: median TEPS divided by the power consumption in watts (TEPS per Watt)
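The size and score arithmetic on this slide can be sketched in a few lines of C. The function names below are illustrative only and are not part of the Graph500 reference code; the sketch simply evaluates 2^SCALE vertices, 2^SCALE × edgefactor edges, and TEPS as traversed edges divided by BFS time.

```c
#include <assert.h>
#include <stdint.h>

/* Number of vertices of a Kronecker graph: 2^SCALE. */
uint64_t kron_num_vertices(int scale) {
    assert(scale >= 0 && scale < 63);
    return (uint64_t)1 << scale;
}

/* Number of generated edges: 2^SCALE * edgefactor. */
uint64_t kron_num_edges(int scale, int edgefactor) {
    assert(edgefactor > 0);
    return kron_num_vertices(scale) * (uint64_t)edgefactor;
}

/* TEPS = traversed edges / BFS time in seconds. */
double teps(uint64_t traversed_edges, double bfs_seconds) {
    assert(bfs_seconds > 0.0);
    return (double)traversed_edges / bfs_seconds;
}
```

With SCALE 34 and edgefactor 16 this gives about 17 billion vertices and 275 billion edges, matching the problem size quoted for the UV 300 run.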
NUMA (Non-uniform memory access) system
• NUMA system w/ 4 sockets (NUMA 0 to NUMA 3); each CPU has its own RAM
– Fast local access, slow non-local access
– Different distances between NUMA nodes
– Thread placement and memory placement both matter
• (Example) 4-socket Xeon system
– 4 (# of CPU sockets)
– 8 (# of physical cores per socket)
– 2 (# of threads per core)
• Bandwidth in GB/s between source NUMA node and target NUMA node (diagonal elements = local access)

     0     1     2     3
0    24.2  3.4   3.0   3.4
1    3.3   23.9  3.5   3.0
2    3.0   3.4   24.3  3.4
3    3.5   3.0   3.4   24.2

– Local access: 24 GB/s
– Remote access: 3 GB/s
SGI UV 2000
• UV 2000
– Single OS: SUSE Linux 11 (x86_64)
– Hypercube interconnection
– Up to 2,560 cores and 64 TB RAM
– (= 128 UV 2000 chassis x 2 sockets x 10 cores)
• ISM has two full-spec. UV 2000
• Hierarchical network topologies
– Sockets, Chassis, Cubes, Inner-racks, and Outer-racks
– UV 2000 chassis = 2 sockets; Cube = 8 chassis; Rack = 32 nodes
– Chassis connect to other chassis via NUMAlink6 (6.7 GB/s)
[Figure: UV 2000 chassis, cube, and rack; each NUMA node is a CPU with its RAM; one label reads "Cannot detect"]
[Photos: ISM, Kyushu U.]
SGI UV 300
• UV 300
– Single OS: SUSE Linux 11 (x86_64)
– All-to-all interconnection
– Up to 1,152 cores and 16 TB RAM
– (= 8 UV 300 chassis x 4 sockets x 18 cores x 2 SMT)
• UV 300 chassis
– 4-socket 18-core Intel Xeon E7-8867 (Haswell)
– 2 TB RAM (512 GB per NUMA node)
• NUMA node
– 18-core Xeon E7-8867
– HT enabled (2 SMT)
– 512 GB RAM
• UV 300 rack at Kyushu U.: 8 chassis, connected all-to-all
Memory Bandwidths on UV 2000 and UV 300
• Bandwidths in GB/s between NUMA nodes using STREAM TRIAD
• Local access is clearly faster than remote access
• UV 2000 (64 sockets): each chassis has 2 sockets, and the chassis connect to each other by hypercube topology
– Local: 33 GB/s
– Remote: 3-7 GB/s
• UV 300 (32 sockets): each chassis has 4 sockets, and the chassis connect to each other by all-to-all topology
– Local: 56 GB/s
– Within a UV 300 chassis: 12-14 GB/s
– Between chassis: 6 GB/s
[Figures: bandwidth heat maps for UV 2000 and UV 300, thread placement vs. memory placement]
Programming cost for NUMA-aware computation
• Thread and memory binding
– Reduce remote access
– Avoid thread migration
• Linux provides low-level interfaces
– sched_{set,get}affinity()
• binds a thread to a processor set (specified by processor id)
– mbind()
• binds pages to a NUMA node set (specified by NUMA node id)
– Linux exposes processor ids and NUMA node ids as system files:
/proc/cpuinfo, /sys/devices/system/{node,cpu}/
• Reducing programming cost using ULIBC
– provides some APIs for NUMA-aware programming
– available at
https://bitbucket.org/yuichiro_yasui/ulibc
Specifying by processor id:
#define _GNU_SOURCE
#include <sched.h>
/* Bind the calling thread to the processor with id procid. */
int bind_thread(int procid) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(procid, &set);
    /* pid 0 selects the calling thread */
    return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
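As an illustration of what reading the system files involves, the sketch below parses a CPU list string such as "0-3,8-11", the format Linux uses in files like /sys/devices/system/node/node0/cpulist. The function parse_cpulist is a hypothetical helper written for this note; it is not part of ULIBC or of any Linux API, and reading the file itself is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a Linux "cpulist" string such as "0-3,8-11" into processor ids.
 * Returns the number of ids stored (at most max_ids), or -1 if the
 * string is malformed. A trailing newline is accepted. */
int parse_cpulist(const char *s, int *ids, int max_ids) {
    assert(s != NULL && ids != NULL && max_ids >= 0);
    int n = 0;
    while (*s != '\0' && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        if (end == s || lo < 0) return -1;       /* no digits, or negative id */
        long hi = lo;
        s = end;
        if (*s == '-') {                         /* range "lo-hi" */
            s++;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo) return -1;
            s = end;
        }
        for (long c = lo; c <= hi; c++) {
            if (n == max_ids) return n;          /* output buffer is full */
            ids[n++] = (int)c;
        }
        if (*s == ',') s++;                      /* next item */
        else if (*s != '\0' && *s != '\n') return -1;
    }
    return n;
}
```

The resulting ids can then be passed one by one to a binding routine such as bind_thread() above.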
CPU Affinity construction using ULIBC
1. Detects entire topology
– e.g.) Socket 0 to Socket 3 with RAM 0 to RAM 3
2. Detects online topology
– e.g.) numactl --cpunodebind=1,2 \
--membind=1,2
3. Constructs two-type affinities
• Compact-type affinity
– assigns threads in a position close to each other.
– e.g.) export ULIBC_AFFINITY=compact:fine
export OMP_NUM_THREADS=7
– NUMA node 0: thread 0, thread 1; NUMA node 1: thread 2, thread 3 (each using local RAM)
• Scatter-type affinity
– distributes the threads as evenly as possible across online processors.
– e.g.) export ULIBC_AFFINITY=scatter:fine
export OMP_NUM_THREADS=7
– NUMA node 0: thread 0, thread 2; NUMA node 1: thread 1, thread 3 (each using local RAM)
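The difference between the two affinity types can be illustrated with a small mapping from thread id to NUMA node. This is only a sketch of the idea under the assumption that every node has the same number of cores; compact_node and scatter_node are hypothetical names and do not reflect how ULIBC implements its affinities.

```c
#include <assert.h>

/* Compact: fill one NUMA node before moving to the next one.
 * Assumes every node offers cores_per_node cores. */
int compact_node(int tid, int cores_per_node) {
    assert(tid >= 0 && cores_per_node > 0);
    return tid / cores_per_node;
}

/* Scatter: deal threads out round-robin across the NUMA nodes. */
int scatter_node(int tid, int num_nodes) {
    assert(tid >= 0 && num_nodes > 0);
    return tid % num_nodes;
}
```

For four threads on two nodes with two cores each, compact places threads 0 and 1 on node 0 and threads 2 and 3 on node 1, while scatter places threads 0 and 2 on node 0 and threads 1 and 3 on node 1.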
NUMA-aware computation with ULIBC
• ULIBC is a callable library for NUMA-aware computation
• Detects processor topology at run time
• Constructs thread and memory affinity settings
• ULIBC is available at https://bitbucket.org/yuichiro_yasui/ulibc
#include <stdio.h>
#include <omp.h>
#include <ulibc.h>                                 /* include header file */
int main(void) {
    ULIBC_init();                                  /* initialize */
    _Pragma("omp parallel") {
        const int tid = ULIBC_get_thread_num();    /* get thread id */
        ULIBC_bind_thread();                       /* bind current thread */
        const struct numainfo_t loc = ULIBC_get_numainfo( tid );  /* get NUMA placement */
        printf("Thread: %2d, NUMA-node: %d, NUMA-core: %d\n",
               loc.id, loc.node, loc.core);
        /* do something */
    }
    return 0;
}
Execution log on 4-socket (Thread ID, NUMA node ID, Core ID):
Thread: 4, NUMA-node: 0, NUMA-core: 1
Thread: 55, NUMA-node: 3, NUMA-core: 13
Thread: 16, NUMA-node: 0, NUMA-core: 4
Thread: 37, NUMA-node: 1, NUMA-core: 9
Thread: 30, NUMA-node: 2, NUMA-core: 7
. . .
Level-synchronized parallel BFS (Top-down)
• Starts from the source vertex and executes the following two phases at each level
– Traversal phase finds unvisited vertices from CQ (the frontier at level k) and appends them into NQ (the neighbors at level k+1)
– Swap phase exchanges CQ and NQ for the next level
• Threads synchronize (Sync.) between levels
[Figure: the frontier CQ and its neighbors NQ advancing from the source (Level 0) through Levels 1, 2, and 3 over visited and unvisited vertices]
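The two phases can be written down as a short sequential sketch in C. It assumes a CSR graph given as rowptr (n+1 entries) and colidx; these array names and the function bfs_top_down are choices made for this example, not the authors' code, and the parallel version would split the traversal loop across threads and synchronize before the swap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Level-synchronized top-down BFS over a CSR graph.
 * parent[v] receives the BFS-tree parent (parent[source] == source),
 * or -1 if v is unreachable. Returns the number of levels visited,
 * or -1 if memory allocation fails. */
int bfs_top_down(int64_t n, const int64_t *rowptr, const int64_t *colidx,
                 int64_t source, int64_t *parent) {
    assert(n > 0 && source >= 0 && source < n);
    int64_t *cq = malloc((size_t)n * sizeof *cq);   /* current queue */
    int64_t *nq = malloc((size_t)n * sizeof *nq);   /* next queue */
    if (cq == NULL || nq == NULL) { free(cq); free(nq); return -1; }

    for (int64_t v = 0; v < n; v++) parent[v] = -1;
    parent[source] = source;
    int64_t cq_len = 0;
    cq[cq_len++] = source;

    int levels = 0;
    while (cq_len > 0) {
        int64_t nq_len = 0;
        /* Traversal phase: find unvisited neighbors of CQ, append to NQ. */
        for (int64_t i = 0; i < cq_len; i++) {
            int64_t v = cq[i];
            for (int64_t e = rowptr[v]; e < rowptr[v + 1]; e++) {
                int64_t w = colidx[e];
                if (parent[w] < 0) {
                    parent[w] = v;
                    nq[nq_len++] = w;
                }
            }
        }
        /* Swap phase: NQ becomes the CQ of the next level. */
        int64_t *tmp = cq; cq = nq; nq = tmp;
        cq_len = nq_len;
        levels++;
    }
    free(cq);
    free(nq);
    return levels;
}
```

Each vertex enters NQ at most once, so queues of n entries are sufficient.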
Direction-optimizing BFS [Beamer, SC12]
• Two directions for breadth-first search: forward search (Top-down) and backward search (Bottom-up)
– Top-down direction uses outgoing edges, from the frontier (level k) to its neighbors (level k+1)
– Bottom-up direction uses incoming edges, from the candidates of neighbors (level k+1) back to the frontier (level k)
• Number of traversed edges at each level (distance from source)

Level  Top-down       Bottom-up      Hybrid
0      2              2,103,840,895  2
1      66,206         1,766,587,029  66,206
2      346,918,235    52,677,691     52,677,691
3      1,727,195,615  12,820,854     12,820,854
4      29,557,400     103,184        103,184
5      82,357         21,467         21,467
6      221            21,240         227
Total  2,103,820,036  3,936,072,360  65,689,631
Ratio  100.00%        187.09%        3.12%

• Direction-opt. BFS: Top-down at the first levels, Bottom-up while the frontier is large, then Top-down again
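A single bottom-up step can be sketched as follows. Every unvisited vertex scans its adjacency list and stops at the first neighbor that belongs to the current frontier, which is why the bottom-up direction traverses few edges when the frontier is large. The byte arrays in_cq and in_nq stand in for the frontier bitmaps, and all names are assumptions made for this example; for an undirected graph the same CSR arrays serve as incoming and outgoing edges.

```c
#include <assert.h>
#include <stdint.h>

/* One bottom-up BFS step over a CSR graph.
 * in_cq[v] != 0 marks v as a member of the current frontier.
 * in_nq[w] is set to 1 for every vertex found in this step.
 * parent[w] < 0 marks w as unvisited.
 * Returns the number of vertices added to the next frontier. */
int64_t bfs_bottom_up_step(int64_t n, const int64_t *rowptr,
                           const int64_t *colidx,
                           const unsigned char *in_cq,
                           unsigned char *in_nq, int64_t *parent) {
    assert(n > 0);
    int64_t found = 0;
    for (int64_t w = 0; w < n; w++) {
        in_nq[w] = 0;
        if (parent[w] >= 0) continue;            /* already visited */
        for (int64_t e = rowptr[w]; e < rowptr[w + 1]; e++) {
            int64_t v = colidx[e];
            if (in_cq[v]) {                      /* neighbor is in the frontier */
                parent[w] = v;
                in_nq[w] = 1;
                found++;
                break;                           /* skip the remaining edges */
            }
        }
    }
    return found;
}
```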
Our contributions (Previous work and this paper)
• Efficient graph data structure
– NUMA-aware graph (BD13): the CSR graph is divided into sub-graphs A0, A1, A2, A3, each bound to a NUMA node
– Adjacency list sorting (ISC14) and vertex sorting (HPCS15), both sorting by outdegree
• Efficient BFS based on Beamer's Direction-optimizing (SC12)
– Top-down direction: Agarwal's Top-down (SC10) with socket-queues, plus pruning edges (this paper: reduction of remote edges)
– Bottom-up direction: NUMA-aware Bottom-up (BD13); input CQ, data VSk, output NQk, all local
• UV 2000 w/ 64 sockets
– 131 GTEPS
– 152 GTEPS (NEW)
• UV 300 w/ 32 sockets
– 219 GTEPS (NEW)
• New results: 16 % faster than the highest single-node entry
Ours: NUMA-aware 1-D part. graph [BD13]
• Divides the adjacency matrix into sub-graphs A0, A1, A2, A3 (1-D partitioning) and assigns each to the local RAM of a NUMA node
• Each sub-graph is represented as a CSR graph
• Bottom-up direction
– In the bottom-up direction (the bottleneck component), each NUMA node computes a partial NQ using a locally copied CQ and its locally assigned VS.
– Input: frontier CQ; data: visited VSk; output: neighbors NQ (all in local RAM)
• Top-down direction
– Uses the inverse of G (G is undirected)
– Modified version of Agarwal's NUMA-aware BFS
– Input: frontier CQk; data: visited VSk; output: neighbors NQ (local and remote accesses)
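To make the 1-D partitioning concrete, the sketch below maps a vertex to the NUMA node that owns it. It assumes a contiguous block partition with equally sized blocks, which the slide does not spell out; part_size, owner_node, and is_remote_edge are hypothetical helpers for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Vertices per NUMA node for a block partition of n vertices
 * (assumed layout; the last block may be smaller). */
int64_t part_size(int64_t n, int num_nodes) {
    assert(n > 0 && num_nodes > 0);
    return (n + num_nodes - 1) / num_nodes;
}

/* NUMA node that owns vertex v, i.e. index k of sub-graph A_k. */
int owner_node(int64_t v, int64_t n, int num_nodes) {
    assert(v >= 0 && v < n);
    return (int)(v / part_size(n, num_nodes));
}

/* An edge (v, w) is remote when its endpoints live on different nodes. */
int is_remote_edge(int64_t v, int64_t w, int64_t n, int num_nodes) {
    return owner_node(v, n, num_nodes) != owner_node(w, n, num_nodes);
}
```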
Ours: Adjacency list sorting [ISC14]
• Reduces unnecessary edge traversals in the bottom-up direction
– The bottom-up loop over an adjacency list A(v) finds a frontier vertex and breaks after a loop count of τ
– Adjacency vertices before the break are traversed; the remaining adjacency vertices are skipped
• Sorts the adjacency list of each vertex vi by the outdegree of the corresponding adjacency vertices, from high to low
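A minimal sketch of this sorting step, assuming the CSR arrays rowptr and colidx and taking the outdegree of a vertex as the length of its adjacency list. It uses qsort with a file-scope pointer as comparator context, so it is not thread-safe; the names are illustrative and this is not the authors' implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static const int64_t *g_degree;   /* comparator context (not thread-safe) */

/* Order vertex ids by descending outdegree. */
static int by_degree_desc(const void *a, const void *b) {
    int64_t da = g_degree[*(const int64_t *)a];
    int64_t db = g_degree[*(const int64_t *)b];
    return (da < db) - (da > db);
}

/* Sort every adjacency list of a CSR graph so that high-outdegree
 * neighbors come first. Returns 0 on success, -1 on allocation failure. */
int sort_adjacency_lists(int64_t n, const int64_t *rowptr, int64_t *colidx) {
    assert(n > 0);
    int64_t *degree = malloc((size_t)n * sizeof *degree);
    if (degree == NULL) return -1;
    for (int64_t v = 0; v < n; v++)
        degree[v] = rowptr[v + 1] - rowptr[v];
    g_degree = degree;
    for (int64_t v = 0; v < n; v++)
        qsort(colidx + rowptr[v], (size_t)degree[v], sizeof *colidx,
              by_degree_desc);
    g_degree = NULL;
    free(degree);
    return 0;
}
```

High-outdegree vertices are more likely to be in the frontier, so placing them first tends to shorten the bottom-up loop before it breaks.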
Ours: Vertex sorting [HPCS15]
• # of vertex traversals equals the outdegree of the corresponding vertex
– Access frequency and outdegree are correlated
• Our vertex sorting reorders vertex indices by the outdegrees
– The vertex with the highest outdegree gets the smallest index
– Many accesses go to small-index vertices
[Figures: degree distribution; access frequency with vertex sorting; original indices vs. indices sorted by outdegree]
NUMA-aware Top-down BFS
• Original version was proposed by Agarwal [Agarwal-SC10]
• Reduces random remote accesses using socket-queues
– Local : Remote = 1 : ℓ on ℓ sockets
• Phase 1: CQ → NQ or socket-queue
– Local edges: append unvisited vertices into NQ using the local VS
– Remote edges: append into the socket-queue of the target NUMA node
– synchronize
• Phase 2: socket-queue → NQ
– Each NUMA node (e.g., focusing on NUMA 2) reads its socket-queue and appends unvisited vertices into NQ using the local VS
– synchronize
• Next level: swap CQ and NQ
NUMA-aware Top-down w/ Pruning remote edges
• Prunes remote edges to reduce remote accesses (e.g., focusing on remote edge traversal on NUMA 2)
• w/o pruning (original), proposed in Agarwal's SC10 paper
– Each NUMA node appends all remote edges (v,w) into the corresponding socket-queue
• With pruning (this paper)
– Each NUMA node appends a remote edge (v,w) into the corresponding socket-queue only if F does not contain w, and then adds w to F
– F reuses the CQ bitmap of the bottom-up direction; CQ itself is a vector queue
– F is not re-initialized as long as the search direction does not change
– Each vertex is searched only once
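The pruning rule can be sketched with a bitmap test-and-set in front of the socket-queue append. This is a sequential illustration under assumed names (edge_t, bitmap_test_and_set, push_remote_edge); a multithreaded version would need an atomic test-and-set on F, and the sketch does not claim to match the authors' code.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t v, w; } edge_t;   /* remote edge (v, w) */

/* Set bit w of bitmap F and report whether it was already set. */
static int bitmap_test_and_set(uint64_t *F, int64_t w) {
    assert(w >= 0);
    uint64_t mask = (uint64_t)1 << (w & 63);
    int was_set = (F[w >> 6] & mask) != 0;
    F[w >> 6] |= mask;
    return was_set;
}

/* Append remote edge (v, w) to a socket-queue unless w is already in F.
 * Returns 1 if the edge was appended and 0 if it was pruned. */
int push_remote_edge(uint64_t *F, edge_t *queue, int64_t *len,
                     int64_t v, int64_t w) {
    if (bitmap_test_and_set(F, w)) return 0;   /* pruned: w already queued */
    queue[*len].v = v;
    queue[*len].w = w;
    (*len)++;
    return 1;
}
```

Without pruning, every remote edge pointing at w would be sent to the owning socket; with the bitmap, only the first one is.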
Effects of pruning & Updated TEPS score
• Pruned many remote edges
– Figure 5: Ratio of traversed edges (local, pruned-remote, remote) on each NUMA node (0 to 3) in the top-down algorithm with remote edge traversal pruning for a Kronecker graph with SCALE 29 on a four-socket server (SB4). Each number in brackets represents the total number of traversed edges at each level.

Level            0  1      2     3      4      5      6      7
Traversed edges  4  9.04K  221M  15.3B  1.50B  4.55M  11.1K  29

– Search direction over the levels: previous TD → BU; this paper TD → BU → TD
– Algorithm 3: Top-down with pruning remote traversal, Procedure NUMA-aware-Top-down(G, CQ, VS, π)
• However, this method may not be effective on a few sockets, because the algorithm switches to the bottom-up direction at the middle levels.
• On many sockets, updated TEPS score
– UV 2000 with 64 sockets (used only 64 sockets)
– w/o pruning: 131 GTEPS
– w/ pruning: 152 GTEPS
– Figure 6: Weak scaling on UV 2000, GTEPS vs. number of NUMA nodes (CPU sockets)

#sockets                             1    2     4     8     16    32    64     128
HPCS15-SG (SCALE 26 per NUMA node)   7.7  15.3  24.2  42.1  59.4  94.8  131.4  174.7
This paper (SCALE 27 per NUMA node)  8.3  14.2  25.1  38.6  61.5  91.8  152.2  -

Excerpt from the paper: "5.2 SGI UV 300. In this study, we obtained new results on SGI UV 300, which has 32 CPU sockets and 16 TB memory. Fig. 7 depicts TEPS versus number of CPU sockets (NUMA nodes). Table 8 shows the TEPS obtained for 32 CPU sockets. We discuss the results with the following parameters:"
Weak scaling performance
• UV 300 clearly outperforms UV 2000.
• GTEPS vs. number of sockets (weak scaling)

#sockets                                                1     2     4     8      16     32     64
UV 2000, SCALE 27 per socket                            8.3   14.2  25.1  38.6   61.5   91.8   152.2
UV 300 (HT, Remote-mode), SCALE 27 per socket           16.3  29.2  53.5  83.9   129.4  171.0  -
UV 300 (HT, Remote-mode), SCALE 29 per socket           -     -     -     91.4   147.6  203.7  -
UV 300 (HT and THP, Remote-mode), SCALE 29 per socket   15.9  28.0  57.9  93.5   151.4  209.3  -
UV 300 (HT and THP, Local-mode), SCALE 29 per socket    18.7  32.5  64.7  100.3  161.5  219.4  -

• The 32-socket results of UV 2000 and UV 300 are compared on the next slide.
Breakdown of system configuration on UV300
• UV 300 is 2x faster than UV 2000
– same sockets (32 sockets)
– #ThreadsPerSockets = #logical cores
• Best perf. of UV 300 obtained with
– Larger problem size
– THP (transparent huge page) enabled
– Set memory reference mode as local-mode
– HT (Hyperthreading) enabled

System  #sockets  SCALE  HT       THP  Mem-Ref-mode  GTEPS
UV2000  32        32                   −             92
UV300   32        32     yes      no   Remote        171
UV300   32        34     yes      no   Remote        204
UV300   32        34     yes      yes  Remote        209
UV300   32        34     no (*1)  yes  Local         188
UV300   32        34     yes      yes  Local         219

• Breakdown of the gains on UV 300
– +19.3% by using a larger memory space (171 → 204 GTEPS)
– +2.5% by THP enabled (204 → 209 GTEPS)
– +4.8% by memory reference mode (209 → 219 GTEPS)
– +16.5% by HT enabled (188 → 219 GTEPS)
– 28.3 % perf. gap
*1: uses the same number of threads as physical cores, emulating "Hyperthreading disabled".
New results and Nov. 2015 list
• New result (ours): updated fastest single-node
– SGI UV300 (1 node / 576 cores), SCALE 34, 219 GTEPS
– HT enabled, THP enabled, local-ref. mode
• Nov. 2015 list (ours: fastest of single-node)
– SGI UV2000 (1280 cores), SCALE 33, 174.7 GTEPS
– SGI UV2000 (640 cores), SCALE 33, 149.8 GTEPS
Bandwidth and TEPS
• BW and TEPS of our implementations on 3 systems
– GB/s: STREAM TRIAD with 10 M elements per socket
– TEPS: SCALE 27 (n=134M, m=2.15B) per socket
• Bandwidth and GTEPS are correlated on three Xeon processors
• Systems
– UV300: 32-socket Haswell
– UV2000: 64-socket Ivy Bridge
– SB4: 4-socket Sandy Bridge-EP
[Figure (a) GTEPS: GTEPS vs. memory bandwidth (GB/s), both on log scales. Series: UV300 (Haswell) (HT, THP, Local), this paper; UV2000 (Ivy Bridge), this paper and our previous [BD13]; SB4 (Sandy Bridge-EP) (HT, THP), this paper and our previous [BD13]]
Excerpt from the paper: "... via a modified implementation using ULIBC, in which each thread computed the partial TRIAD operation for vectors on local memory only, shown in subsection 3.2. Figure shows the correlation between the memory bandwidth and the graph traversal performance. The optimized Graph500 implementation and our previous implementation are scalable, like the memory bandwidth. In contrast, the reference code of Graph500 is not scalable and cannot exploit the NUMA system efficiently."
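For reference, the TRIAD kernel behind these bandwidth numbers is a single loop, a[i] = b[i] + scalar * c[i], and the bandwidth is derived from the bytes it moves. The sketch below shows the kernel and that conversion for 8-byte doubles; it leaves out timing, threading, and NUMA-local allocation, so it is an illustration rather than the measurement code used for the slides.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM TRIAD kernel: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bandwidth in GB/s for one TRIAD pass over n elements:
 * two reads and one write of a double per element. */
double triad_gbps(size_t n, double seconds) {
    assert(seconds > 0.0);
    return 3.0 * (double)sizeof(double) * (double)n / seconds / 1e9;
}
```

In a NUMA-aware run, each thread would apply triad() to vectors placed in its local memory, and the per-socket rates would be summed.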
Conclusion
Motivations
• NUMA / cc-NUMA architecture
• Graph algorithm; BFS
• Efficient NUMA-aware BFS algorithm
– NUMA-aware, to improve the locality of memory accesses
– Exploits multithreading on many-socket systems (SGI UV2000, UV300)
Contributions
• NUMA-aware scalable BFS algorithm
– Pruning edge traversal to reduce remote edges
– Scales to more than a thousand threads on SGI UV 2000 and SGI UV 300
– Updated the highest single-node score to 219 GTEPS on SGI UV300 with 32 sockets
• “ULIBC”: Callable library for NUMA-aware computation
– available at https://bitbucket.org/yuichiro_yasui/ulibc
References
NUMA-aware BFS algorithm
• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a single NUMA system, IEEE ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and scalable NUMA-based thread parallel breadth-first search, HPCS 2015, ACM, IEEE, IFIP, 2015.
Other results of our Graph500 team
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of the Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.

Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Air breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsAir breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animals
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
Causes of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCauses of poverty in France presentation.pptx
Causes of poverty in France presentation.pptx
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 

NUMA-aware Scalable Graph Traversal on SGI UV Systems

  • 1. NUMA-aware scalable graph traversal on SGI UV systems
      *Yuichiro Yasui, Katsuki Fujisawa (Kyushu University)
      Eng Lim Goh, John Baron, Atsushi Sugiura (SGI Corp.)
      Takashi Uchiyama (SGI Japan, Ltd.)
      HPGP’16, Kyoto, May 31, 2016
  • 2. Outline
      • Introduction
        – Graph analysis for large-scale networks
        – Graph500 benchmark and Breadth-first search
        – NUMA-aware computation
      • Our proposal: Pruning of remote edge traversals
      • Numerical results on SGI UV systems
  • 3. Outline
      • Introduction
        – Graph analysis for large-scale networks
        – Graph500 benchmark and Breadth-first search
        – NUMA-aware computation
      • Our proposal: Pruning of remote edge traversals
      • Numerical results on SGI UV systems
  • 4. Our motivations
      • NUMA / cc-NUMA architecture
        – Many-socket system: each CPU has its own RAM and is assigned a partial CSR graph; accesses are either local or remote
      • Graph algorithm: BFS
        – A graph structure represents many relationships
      • Efficient NUMA-aware BFS algorithm
        – improves locality of reference in memory accesses
        – exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
      • Kronecker graph w/ SCALE 34: 17 billion nodes and 275 billion edges
      • SGI UV 300: 32 sockets of 18-core Xeon and 16 TB RAM
  • 5. Our contributions (previous work and this paper)
      • Efficient graph data structure
        – NUMA-aware graph (BD13): CSR subgraphs A0 to A3, bound to NUMA nodes
        – Adjacency list sorting (ISC14): sorting by outdegree
        – Vertex sorting (HPCS15)
      • Efficient BFS based on Beamer’s Direction-optimizing (SC12)
        – Top-down direction: Agarwal’s Top-down (SC10) with pruning of edges (this paper), which reduces remote edges
        – Bottom-up direction: NUMA-aware Bottom-up (BD13); input CQ, data VSk, output NQk, all local
      • New result: 219 GTEPS on UV 300 w/ 32 sockets, the updated highest single-node score
  • 6. Graph processing for large-scale networks
      • Large-scale networks arise in a wide range of application areas
        – US road network: 24 million vertices & 58 million edges
        – Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
        – Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges
        – Cyber-security: 15 billion log entries / day
      • Fast and scalable graph processing with HPC
        – categorized as a data-intensive application
  • 7. Graph analysis and important kernel BFS
      • Used to understand relationships in real-world networks
      • Application fields: transportation, social network, cyber-security, bioinformatics
      • Three steps: constructing a graph from the relationships, graph processing, and understanding the results
      • Graph processing kernels
        – Breadth-first search
        – Single-source shortest path
        – Maximum flow
        – Maximal independent set
        – Centrality metrics
        – Clustering
        – Graph mining
  • 8. Breadth-first search (BFS)
      • One of the most important and fundamental algorithms for traversing graph structures
      • Many algorithms and applications are based on BFS (max. flow and centrality)
      • The well-known algorithm takes O(n+m) time for a digraph G with n vertices and m edges
      • Inputs: digraph G = (V, E) and a source vertex
      • Outputs: BFS tree and distances (levels Lv. 1, Lv. 2, Lv. 3 from the source)
  • 9. BFS on Twitter follow-ship network (Twitter2009)
      • Follow-ship network
        – #Users (#vertices): 41,652,230
        – Follow-ships (#edges): 2,405,026,092
      • BFS result from user 21,804,357, excluding unconnected users
          Lv.    #users       ratio (%)  cumulative (%)
          0      1            0.00       0.00
          1      7            0.00       0.00
          2      6,188        0.01       0.01
          3      510,515      1.23       1.24
          4      29,526,508   70.89      72.13
          5      11,314,238   27.16      99.29
          6      282,456      0.68       99.97
          7      11,536       0.03       100.00
          8      673          0.00       100.00
          9      68           0.00       100.00
          10     19           0.00       100.00
          11     10           0.00       100.00
          12     5            0.00       100.00
          13     2            0.00       100.00
          14     2            0.00       100.00
          15     2            0.00       100.00
          Total  41,652,230   100.00     -
      • Six degrees of separation: “everyone and everything is six or fewer steps away”
      • Ours: 60 milliseconds per BFS
  • 10. Graph500 and Green Graph500
      • New benchmarks based on graph processing (breadth-first search)
      • They measure the performance and energy efficiency of irregular memory accesses
      • Graph500 benchmark: TEPS score (# of traversed edges per second), for measuring the performance of irregular memory accesses
      • Green Graph500 benchmark: TEPS per Watt score, for measuring power-efficient performance
      • Benchmark flow
        – Input parameters: SCALE & edgefactor (=16)
        – 1. Generation: Kronecker graph with 2^SCALE vertices and 2^SCALE x edgefactor edges, built by applying the recursive Kronecker product SCALE times
        – 2. Construction
        – 3. BFS x 64, each followed by validation
        – Results: BFS time, traversed edges, and TEPS; the score is the median of the 64 TEPS values
      • Green Graph500 additionally measures the power consumption in watts during the 64 BFS iterations and reports TEPS per Watt
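The problem size and the score on this slide follow from simple arithmetic, which the following minimal C sketch works through. The function names (`kron_vertices`, `kron_edges`, `teps`) are hypothetical helpers written for illustration; they are not part of the Graph500 reference code or of ULIBC.

```c
#include <assert.h>
#include <stdint.h>

/* Number of vertices of a Kronecker graph: 2^SCALE. */
uint64_t kron_vertices(int scale) {
    assert(scale >= 0 && scale < 60);   /* keep the edge count below 2^64 for small edgefactors */
    return (uint64_t)1 << scale;
}

/* Number of generated edges: 2^SCALE * edgefactor. */
uint64_t kron_edges(int scale, int edgefactor) {
    assert(edgefactor > 0);
    return kron_vertices(scale) * (uint64_t)edgefactor;
}

/* TEPS of a single BFS run: traversed edges divided by the BFS time in seconds. */
double teps(uint64_t traversed_edges, double bfs_seconds) {
    assert(bfs_seconds > 0.0);
    return (double)traversed_edges / bfs_seconds;
}
```

With SCALE 34 and edgefactor 16 these helpers give 17,179,869,184 vertices and 274,877,906,944 edges, matching the "17 billion nodes and 275 billion edges" quoted for the UV 300 run.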
  • 11. NUMA (non-uniform memory access) system
      • NUMA system w/ 4 sockets (NUMA 0 to NUMA 3): each CPU has its own RAM
        – Fast local access, slow non-local access
        – NUMA nodes are at different distances from each other
        – Both thread placement and memory placement matter
      • (Example) 4-socket Xeon system
        – 4 (# of CPU sockets)
        – 8 (# of physical cores per socket)
        – 2 (# of threads per core)
      • Bandwidth in GB/s between source and target NUMA nodes (diagonal elements = local access)
                    0      1      2      3
          NUMA 0    24.2   3.4    3.0    3.4
          NUMA 1    3.3    23.9   3.5    3.0
          NUMA 2    3.0    3.4    24.3   3.4
          NUMA 3    3.5    3.0    3.4    24.2
      • Local access: 24 GB/s; remote access: 3 GB/s
  • 12. SGI UV 2000
      • UV 2000
        – Single OS: SUSE Linux 11 (x86_64)
        – Hypercube interconnection
        – Up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis x 2 sockets x 10 cores)
        – ISM has two full-spec. UV 2000 systems (photos: ISM, Kyushu U.)
      • Hierarchical network topologies
        – Sockets, chassis, cubes, inner-racks, and outer-racks
        – Chassis = 2 sockets (each a CPU with its RAM); cube = 8 chassis; rack = 32 nodes
        – NUMAlink6: 6.7 GB/s
        – Cannot detect NUMA node
  • 13. SGI UV 300
      • UV 300
        – Single OS: SUSE Linux 11 (x86_64)
        – All-to-all interconnection
        – Up to 1,152 cores and 16 TB RAM (= 8 UV 300 chassis x 4 sockets x 18 cores x 2 SMT)
      • UV 300 chassis
        – 4-socket 18-core Intel Xeon E7-8867 (Haswell)
        – 2 TB RAM (512 GB per NUMA node)
        – Each CPU connects to the other chassis
      • NUMA node: 18-core Xeon E7-8867, HT enabled (2 SMT), 512 GB RAM
      • UV 300 rack (Kyushu U.): 8 chassis, connected all-to-all
  • 14. Memory bandwidths on UV 2000 and UV 300
      • Bandwidths in GB/s between NUMA nodes using STREAM TRIAD
      • Local access is clearly faster than remote access
      • UV 2000 (64 sockets)
        – Each chassis has 2 sockets, and the chassis are connected to each other in a hypercube topology
        – Local: 33 GB/s; remote: 3-7 GB/s
      • UV 300 (32 sockets)
        – Each chassis has 4 sockets, and the chassis are connected to each other in an all-to-all topology
        – Local: 56 GB/s; 12-14 GB/s (UV 300 chassis); 6 GB/s otherwise
      • (Figure: heat maps of bandwidth by thread placement and memory placement for both systems)
  • 15. Programming cost for NUMA-aware programming
      • Thread and memory binding
        – Reduces remote accesses
        – Avoids thread migration
      • Linux provides primitive interfaces
        – sched_{set,get}affinity(): binds a thread to a processor set (specified by processor id)
        – mbind(): binds pages to a NUMA node set (specified by NUMA node id)
        – Linux exposes processor ids and NUMA node ids as system files: /proc/cpuinfo, /sys/devices/system/{node,cpu}/
      • Reducing the programming cost using ULIBC
        – provides some APIs for NUMA-aware programming
        – available at https://bitbucket.org/yuichiro_yasui/ulibc
      • Example: binding the calling thread, specified by processor id

          #define _GNU_SOURCE
          #include <sched.h>

          /* Bind the calling thread to the processor with id procid. */
          int bind_thread(int procid) {
              cpu_set_t set;
              CPU_ZERO(&set);
              CPU_SET(procid, &set);
              /* pid 0 selects the calling thread; returns 0 on success, -1 on error. */
              return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
          }
  • 16. CPU affinity construction using ULIBC
      • 1. Detects the entire topology (Socket 0 to Socket 3, RAM 0 to RAM 3)
      • 2. Detects the online topology, e.g. when launched under

          numactl --cpunodebind=1,2 --membind=1,2

      • 3. Constructs two types of affinity
        – Compact-type affinity: assigns threads to positions close to each other (threads 0 and 1 on NUMA node 0, threads 2 and 3 on NUMA node 1, each using local RAM), e.g.

          export ULIBC_AFFINITY=compact:fine
          export OMP_NUM_THREADS=7

        – Scatter-type affinity: distributes the threads as evenly as possible across the online processors (threads 0 and 2 on NUMA node 0, threads 1 and 3 on NUMA node 1, each using local RAM), e.g.

          export ULIBC_AFFINITY=scatter:fine
          export OMP_NUM_THREADS=7
  • 17. NUMA-aware computation with ULIBC
      • ULIBC is a callable library for NUMA-aware computation
      • Detects the processor topology at run time
      • Constructs thread and memory affinity settings
      • ULIBC is available at https://bitbucket.org/yuichiro_yasui/ulibc

          #include <stdio.h>
          #include <omp.h>
          #include <ulibc.h>                               /* include header file */

          int main(void) {
              ULIBC_init();                                /* initialize */
              _Pragma("omp parallel") {
                  const int tid = ULIBC_get_thread_num();  /* get thread id */
                  ULIBC_bind_thread();                     /* bind current thread */
                  const struct numainfo_t loc = ULIBC_get_numainfo(tid);  /* get NUMA placement */
                  printf("Thread: %2d, NUMA-node: %d, NUMA-core: %d\n",
                         loc.id, loc.node, loc.core);
                  /* do something */
              }
              return 0;
          }

      • Execution log on a 4-socket system (columns: thread ID, NUMA node ID, core ID)

          Thread:  4, NUMA-node: 0, NUMA-core: 1
          Thread: 55, NUMA-node: 3, NUMA-core: 13
          Thread: 16, NUMA-node: 0, NUMA-core: 4
          Thread: 37, NUMA-node: 1, NUMA-core: 9
          Thread: 30, NUMA-node: 2, NUMA-core: 7
          . . .
  • 18. Level-synchronized parallel BFS (Top-down)
      • Starts from the source vertex and executes the following two phases at each level
        – Traversal phase: finds the unvisited vertices adjacent to the vertices in CQ (level k) and appends them to NQ (level k+1)
        – Swap phase: exchanges CQ and NQ for the next level
      • The threads synchronize between levels: level 0 (source) in CQ and level 1 in NQ, then level 1 in CQ and level 2 in NQ, and so on
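The two phases above can be sketched as a serial C routine. This is a minimal illustration assuming a CSR graph with integer vertex ids; the type `csr_t` and the function `bfs_top_down` are hypothetical names, and the sketch omits the multithreading and NUMA placement that the slides are about.

```c
#include <assert.h>
#include <stdlib.h>

/* CSR graph: the neighbours of v are adj[off[v]] .. adj[off[v+1]-1]. */
typedef struct {
    int n;
    const int *off;
    const int *adj;
} csr_t;

/* Serial level-synchronized top-down BFS.
 * Fills level[v] with the BFS level of v (-1 if unreachable).
 * Returns the number of levels processed, or -1 if allocation fails. */
int bfs_top_down(const csr_t *g, int source, int *level) {
    assert(source >= 0 && source < g->n);
    int *cq = malloc((size_t)g->n * sizeof *cq);   /* current queue */
    int *nq = malloc((size_t)g->n * sizeof *nq);   /* next queue */
    if (!cq || !nq) { free(cq); free(nq); return -1; }

    for (int v = 0; v < g->n; v++) level[v] = -1;
    int cq_len = 0, depth = 0;
    level[source] = 0;
    cq[cq_len++] = source;

    while (cq_len > 0) {
        int nq_len = 0;
        /* Traversal phase: append unvisited neighbours of CQ to NQ. */
        for (int i = 0; i < cq_len; i++) {
            int v = cq[i];
            for (int e = g->off[v]; e < g->off[v + 1]; e++) {
                int w = g->adj[e];
                if (level[w] < 0) {
                    level[w] = depth + 1;
                    nq[nq_len++] = w;
                }
            }
        }
        /* Swap phase: NQ becomes the CQ of the next level. */
        int *tmp = cq; cq = nq; nq = tmp;
        cq_len = nq_len;
        depth++;
    }
    free(cq);
    free(nq);
    return depth;
}
```

In a parallel version the inner loops would be split across threads, with a synchronization point before the swap, as the slide indicates.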
  • 19. Direction-optimizing BFS [Beamer, SC12]
      • Two directions for breadth-first search: forward search (Top-down) and backward search (Bottom-up)
        – Top-down direction uses outgoing edges: from the frontier at level k to its neighbors at level k+1
        – Bottom-up direction uses incoming edges: from the candidates of neighbors at level k+1 back to the frontier at level k
      • Number of traversed edges at each level (distance from the source)
          Level   Top-down        Bottom-up       Hybrid
          0       2               2,103,840,895   2
          1       66,206          1,766,587,029   66,206
          2       346,918,235     52,677,691      52,677,691
          3       1,727,195,615   12,820,854      12,820,854
          4       29,557,400      103,184         103,184
          5       82,357          21,467          21,467
          6       221             21,240          227
          Total   2,103,820,036   3,936,072,360   65,689,631
          Ratio   100.00%         187.09%         3.12%
      • Direction-optimizing BFS runs top-down near the source, switches to bottom-up while the frontier is large, and returns to top-down afterwards
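The saving in the bottom-up column comes from stopping the scan of an adjacency list as soon as one frontier vertex is found. The following serial C sketch shows one bottom-up step under assumed data structures (CSR arrays `off` and `adj` for an undirected graph, boolean frontier arrays, and a `parent` array with -1 for unvisited vertices). The function name `bottom_up_step` is hypothetical, and the rule for choosing the direction is not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* One bottom-up step over all n vertices.
 * in_frontier[v] is true if v belongs to the current frontier.
 * On return, in_next[w] is true for every newly visited vertex w,
 * and parent[w] holds the frontier vertex that reached it.
 * Returns the number of scanned edges. */
long bottom_up_step(int n, const int *off, const int *adj,
                    const bool *in_frontier, bool *in_next, int *parent) {
    assert(n >= 0);
    long scanned = 0;
    for (int w = 0; w < n; w++) {
        in_next[w] = false;
        if (parent[w] >= 0) continue;          /* already visited */
        for (int e = off[w]; e < off[w + 1]; e++) {
            scanned++;
            int v = adj[e];
            if (in_frontier[v]) {              /* found a parent in the frontier */
                parent[w] = v;
                in_next[w] = true;
                break;                         /* skip the rest of the list */
            }
        }
    }
    return scanned;
}
```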
  • 20. Outline
      • Introduction
        – Graph analysis for large-scale networks
        – Graph500 benchmark and Breadth-first search
        – NUMA-aware computation
      • Our proposal: Pruning of remote edge traversals
      • Numerical results on SGI UV systems
  • 21. Our contributions (previous work and this paper)
      • Efficient graph data structure
        – NUMA-aware graph (BD13): CSR subgraphs A0 to A3, bound to NUMA nodes
        – Adjacency list sorting (ISC14): sorting by outdegree
        – Vertex sorting (HPCS15)
      • Efficient BFS based on Beamer’s Direction-optimizing (SC12)
        – Top-down direction: Agarwal’s Top-down (SC10) with pruning of edges (this paper), which reduces remote edges
        – Bottom-up direction: NUMA-aware Bottom-up (BD13); input CQ, data VSk, output NQk, all local
      • Results
        – UV 2000 w/ 64 sockets: 131 GTEPS, and 152 GTEPS (NEW)
        – UV 300 w/ 32 sockets: 219 GTEPS (NEW)
        – New results are 16 % faster than the highest single-node entry
  • 22. Ours: NUMA-aware 1-D partitioned graph [BD13]
      • Divides the adjacency matrix into subgraphs A0 to A3 (1-D partitioning) and assigns each one to a NUMA node (CPU and RAM)
      • Each subgraph is represented as a CSR graph
      • Bottom-up direction (input: frontier CQ; data: visited VSk; output: neighbors NQ)
        – This is the bottleneck component; each NUMA node computes a partial NQ using a locally copied CQ and its locally assigned VS, so it accesses local RAM only
      • Top-down direction (input: frontier CQk; data: visited VSk; output: neighbors NQ)
        – Uses the inverse of G (G is undirected) and performs both local and remote accesses
        – Modified version of Agarwal’s NUMA-aware BFS
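One way to picture the 1-D partitioning is as a mapping from a vertex id to the NUMA node that owns it. The sketch below assumes a contiguous block partition of the vertex range, which the slide does not spell out, and all names (`part_begin`, `part_owner`, `is_remote_edge`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First vertex owned by NUMA node k when n vertices are split
 * into `nodes` contiguous blocks (assumed partitioning scheme). */
int64_t part_begin(int64_t n, int nodes, int k) {
    assert(nodes > 0 && k >= 0 && k <= nodes);
    return n * k / nodes;
}

/* NUMA node that owns vertex v under the same block partition. */
int part_owner(int64_t n, int nodes, int64_t v) {
    assert(v >= 0 && v < n);
    int k = (int)(v * nodes / n);
    /* Correct the estimate where integer rounding moved a block boundary. */
    while (v < part_begin(n, nodes, k)) k--;
    while (v >= part_begin(n, nodes, k + 1)) k++;
    return k;
}

/* An edge (v, w) is remote when its endpoints live on different NUMA nodes. */
bool is_remote_edge(int64_t n, int nodes, int64_t v, int64_t w) {
    return part_owner(n, nodes, v) != part_owner(n, nodes, w);
}
```

Under this assumption, an edge whose endpoints fall in different blocks is what the later slides call a remote edge.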
  • 23. Ours: Adjacency list sorting [ISC14]
      • Reduces unnecessary edge traversals in the bottom-up direction
        – The bottom-up loop over an adjacency list A(v) breaks as soon as it finds a frontier vertex
        – The loop count τ is the number of traversed adjacency vertices; the remaining adjacency vertices are skipped
      • Sorts the adjacency list of each vertex vi by the outdegree of the corresponding adjacency vertices, from high to low
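The sorting step can be sketched in C as follows. This is a minimal serial version under assumptions: the graph is undirected and stored in CSR form, so a vertex's outdegree is the length of its own adjacency list; ties are broken by vertex id, which the slide does not specify. The names `keyed_t` and `sort_adjacency_by_degree` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int deg; int v; } keyed_t;

/* Order by degree, high to low; break ties by vertex id (an assumption). */
static int by_degree_desc(const void *a, const void *b) {
    const keyed_t *x = a, *y = b;
    if (x->deg != y->deg) return (x->deg < y->deg) - (x->deg > y->deg);
    return (x->v > y->v) - (x->v < y->v);
}

/* Sort every adjacency list of a CSR graph in place by the degree of
 * the neighbour. Returns 0 on success, -1 if allocation fails. */
int sort_adjacency_by_degree(int n, const int *off, int *adj) {
    assert(n >= 0);
    int maxdeg = 1;
    for (int v = 0; v < n; v++)
        if (off[v + 1] - off[v] > maxdeg) maxdeg = off[v + 1] - off[v];

    keyed_t *buf = malloc((size_t)maxdeg * sizeof *buf);
    if (!buf) return -1;

    for (int v = 0; v < n; v++) {
        int d = off[v + 1] - off[v];
        for (int i = 0; i < d; i++) {
            int w = adj[off[v] + i];
            buf[i].deg = off[w + 1] - off[w];   /* degree of the neighbour */
            buf[i].v = w;
        }
        qsort(buf, (size_t)d, sizeof *buf, by_degree_desc);
        for (int i = 0; i < d; i++) adj[off[v] + i] = buf[i].v;
    }
    free(buf);
    return 0;
}
```

Placing high-degree neighbours first makes it more likely that the bottom-up loop meets a frontier vertex early and breaks.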
  • 24. Ours: Vertex sorting [HPCS15]
      • The # of traversals of a vertex equals the outdegree of that vertex, so access frequency and outdegree are correlated
      • Our vertex sorting reorders the vertex indices by outdegree
        – The vertex with the highest outdegree receives the smallest index, so most accesses go to small-index vertices
      • (Figures: degree distribution; access frequency w/ vertex sorting; original indices 4 3 2 0 1 and indices sorted by outdegree 1 3 0 4 2)
  • 25. NUMA-aware Top-down BFS
      • The original version was proposed by Agarwal [Agarwal-SC10]
      • Reduces random remote accesses using socket-queues
      • Local : Remote = 1 : ℓ on ℓ sockets
      • Phase 1 (CQ to NQ or socket-queue), e.g. focused on NUMA 2: traverses CQ, appends unvisited local vertices to NQ using the local VS, and sends remote edges to the socket-queues of the other NUMA nodes; then synchronize
      • Phase 2 (socket-queue to NQ): each NUMA node appends the unvisited vertices in its socket-queue to NQ using the local VS; then synchronize
      • Swap CQ and NQ for the next level
  • 26. NUMA-aware Top-down w/ pruning of remote edges
      • Prunes remote edges to reduce remote accesses (e.g. focused on remote edge traversal on NUMA 2)
      • w/o pruning (original, proposed in Agarwal’s SC10 paper): each NUMA node appends all remote edges (v,w) to the corresponding socket-queue
      • With pruning (this paper): each NUMA node appends a remote edge (v,w) to the corresponding socket-queue only if F does not contain w, and then adds w to F
        – F reuses the CQ bitmap of the bottom-up direction; CQ itself is a vector queue
        – F is not re-initialized as long as the search direction does not change
        – Each vertex is searched only once
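The pruning rule amounts to a test-and-set on the bitmap F before an edge is queued. The serial C sketch below illustrates it under assumptions: F is an array of 64-bit words, the socket-queue is a plain array of edges, and all names (`bitmap_test_and_set`, `edge_t`, `push_remote_pruned`) are hypothetical. A multithreaded implementation would need an atomic test-and-set, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Set bit i of the bitmap and report whether it was already set. */
static bool bitmap_test_and_set(uint64_t *bits, int64_t i) {
    uint64_t mask = (uint64_t)1 << (i & 63);
    bool was_set = (bits[i >> 6] & mask) != 0;
    bits[i >> 6] |= mask;
    return was_set;
}

typedef struct { int64_t v, w; } edge_t;

/* Append the remote edge (v, w) to a socket-queue unless w is already in F.
 * Returns the new queue length. */
size_t push_remote_pruned(edge_t *queue, size_t len, uint64_t *F,
                          int64_t v, int64_t w) {
    assert(v >= 0 && w >= 0);
    if (bitmap_test_and_set(F, w)) return len;   /* pruned: w was queued before */
    queue[len].v = v;
    queue[len].w = w;
    return len + 1;
}
```

In this sketch a second edge that targets the same remote vertex w leaves the queue unchanged, which corresponds to the "each vertex is searched only once" remark on the slide.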
  • 27. Effects of pruning & updated TEPS score
      • Pruning removes many remote edge traversals
      • Figure 5: Ratio of traversed edges (local, pruned-remote, remote) on each NUMA node (0 to 3) in the top-down algorithm with remote edge traversal pruning, for a Kronecker graph with SCALE 29 on a four-socket Xeon server (SB4). Total number of traversed edges at each level:
          Level             0   1      2     3      4      5      6      7
          Traversed edges   4   9.04K  221M  15.3B  1.50B  4.55M  11.1K  29
      • Algorithm 3: Top-down with pruning remote traversal; Procedure NUMA-aware-Top-down(G, CQ, VS, π)
      • However, this method may not be effective on a few sockets, because the algorithm switches to the bottom-up direction at the middle levels
        – Previous: top-down (TD), then bottom-up (BU)
        – This paper: TD, then BU, then TD
      • Figure 6: Weak scaling on UV 2000 (this paper used only 64 sockets), GTEPS versus number of NUMA nodes (CPU sockets)
          NUMA nodes                            1     2     4     8     16    32    64     128
          HPCS15-SG (SCALE 26 per NUMA node)    7.7   15.3  24.2  42.1  59.4  94.8  131.4  174.7
          This paper (SCALE 27 per NUMA node)   8.3   14.2  25.1  38.6  61.5  91.8  152.2
      • On many sockets, updated TEPS score on UV 2000 with 64 sockets
        – w/o pruning: 131 GTEPS
        – w/ pruning: 152 GTEPS
      • Paper excerpt shown on the slide (5.2 SGI UV 300): In this study, we obtained new results on SGI UV 300, which has 32 CPU sockets and 16 TB memory. Fig. 7 depicts TEPS versus number of CPU sockets (NUMA nodes). Table 8 shows the TEPS obtained for 32 CPU sockets. We discuss the results with the following parameters:
  • 28. Outline
      • Introduction
        – Graph analysis for large-scale networks
        – Graph500 benchmark and Breadth-first search
        – NUMA-aware computation
      • Our proposal: Pruning of remote edge traversals
      • Numerical results on SGI UV systems
  • 29. Weak scaling performance
      • UV 300 clearly outperforms UV 2000.
      • GTEPS versus number of sockets (weak scaling)
          Number of sockets                                      1     2     4     8      16     32     64
          UV2000, SCALE 27 per socket                            8.3   14.2  25.1  38.6   61.5   91.8   152.2
          UV300 (HT, Remote-mode), SCALE 27 per socket           16.3  29.2  53.5  83.9   129.4  171.0
          UV300 (HT, Remote-mode), SCALE 29 per socket                             91.4   147.6  203.7
          UV300 (HT and THP, Remote-mode), SCALE 29 per socket   15.9  28.0  57.9  93.5   151.4  209.3
          UV300 (HT and THP, Local-mode), SCALE 29 per socket    18.7  32.5  64.7  100.3  161.5  219.4
      • UV 2000 and UV 300 are compared on the next slide.
  • 30. Breakdown of system configuration on UV300
      • UV 300 is 2x faster than UV 2000
        – same number of sockets (32 sockets)
        – #ThreadsPerSocket = #logical cores
      • The best performance of UV 300 is obtained with
        – a larger problem size
        – THP (transparent huge pages) enabled
        – memory reference mode set to local-mode
        – HT (Hyperthreading) enabled
      • Configurations and results
          System   #sockets  SCALE  HT    THP   Mem-Ref-mode  GTEPS
          UV2000   32        32                 −             92
          UV300    32        32     yes         Remote        171
          UV300    32        34     yes         Remote        204
          UV300    32        34     yes   yes   Remote        209
          UV300    32        34     *1    yes   Local         188
          UV300    32        34     yes   yes   Local         219
        *1: uses the same number of threads as physical cores, which emulates "Hyperthreading disabled".
      • Contribution of each setting
        – +19.3% by using a larger memory space
        – +2.5% by THP enabled
        – +4.8% by memory reference mode
        – +16.5% by HT enabled
        – 28.3 % performance gap
  • 31. New results and Nov. 2015 list
      • Ours, updated fastest single-node result: SCALE 34, 219 GTEPS on SGI UV300 (1 node / 576 cores)
        – HT enabled
        – THP enabled
        – local-ref. mode
      • Ours, fastest single-node entry on the list: SGI UV2000 (1280 cores), SCALE 33, 174.7 GTEPS
      • SGI UV2000 (640 cores), SCALE 33, 149.8 GTEPS
  • 32. Bandwidth and TEPS
      • BW and TEPS of our implementations on 3 systems
        – GB/s: STREAM TRIAD with 10 M elements per socket
        – TEPS: SCALE 27 (n=134M, m=2.15B) per socket
      • Bandwidth and GTEPS are correlated on three Xeon processors
      • Systems
        – UV300: 32-socket Haswell
        – UV2000: 64-socket Ivy Bridge
        – SB4: 4-socket Sandy Bridge-EP
      • Figure (a): GTEPS versus memory bandwidth (GB/s), comparing our previous work [BD13] with this paper
        – UV300 (Haswell) (HT, THP, Local): this paper
        – UV2000 (Ivy Bridge): this paper and BD13
        – SB4 (Sandy Bridge-EP) (HT, THP): this paper and BD13
      • Paper excerpt shown on the slide: "... via a modified implementation using ULIBC, in which each thread computed the partial TRIAD operation for vectors on local memory only, shown in subsection 3.2. Figure shows correlativity between the memory bandwidth and the graph traversal performance. The optimized Graph500 implementation and our previous implementation are scalable, like the memory bandwidth. In contrast, the reference code of Graph500 is not scalable and cannot exploit the NUMA system efficiently."
  • 33. Conclusion
      • Motivations
        – NUMA / cc-NUMA architecture: many-socket systems with local and remote accesses
        – Graph algorithm: BFS; a graph structure represents many relationships
        – Efficient NUMA-aware BFS algorithm: NUMA-aware, to improve the locality of memory accesses, and exploiting multithreading on many-socket systems (SGI UV 2000, UV 300)
      • Contributions
        – NUMA-aware scalable BFS algorithm: pruning of edge traversals to reduce remote edges
        – Scales to more than a thousand threads on SGI UV 2000 and SGI UV 300
        – Updated the highest single-node score to 219 GTEPS on SGI UV300 with 32 sockets
        – "ULIBC": callable library for NUMA-aware computation, available at https://bitbucket.org/yuichiro_yasui/ulibc
  • 34. References
      • NUMA-aware BFS algorithm
        – [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
        – [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a single NUMA system, IEEE ISC'14, 2014.
        – [HPCS15] Y. Yasui and K. Fujisawa: Fast and scalable NUMA-based thread parallel breadth-first search, HPCS 2015, ACM, IEEE, IFIP, 2015.
      • Other results of our Graph500 team
        – [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of the Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.