The document discusses NUMA-aware scalable graph traversal on SGI UV systems. It proposes an efficient NUMA-aware breadth-first search (BFS) algorithm for large-scale graph processing by pruning remote edge traversals. Numerical results on SGI UV 300 systems with 32 sockets show the algorithm achieves 219 billion traversed edges per second (GTEPS), setting a new single-node performance record on the Graph500 benchmark.
NUMA-aware Scalable Graph Traversal on SGI UV Systems
1. NUMA-aware scalable graph traversal on SGI UV systems
*Yuichiro Yasui, Katsuki Fujisawa
Kyushu University
Eng Lim Goh, John Baron, Atsushi Sugiura
SGI Corp.
Takashi Uchiyama
SGI Japan, Ltd.
HPGP'16 @ Kyoto, May 31, 2016
2. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
3. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
4. Our motivations
• NUMA / cc-NUMA architectures and a graph algorithm: BFS
• Efficient NUMA-aware BFS algorithm
– improves locality of reference in memory accesses
– exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
• Target problem: Kronecker graph with SCALE 34 (17 billion vertices and 275 billion edges)
• Target system: SGI UV 300 with 32 sockets of 18-core Xeon and 16 TB RAM
[Figures: the Graph500 pipeline (input parameters SCALE and edgefactor → graph generation → graph construction → BFS) and a many-socket system with fast local and slow remote accesses, where a partial CSR graph is assigned to each CPU/RAM pair]
5. Our contributions (previous work and this paper)
• Efficient graph data structure (CSR)
– NUMA-aware 1-D partitioned graph (BD13): partial graphs A0-A3 bound to NUMA nodes
– Adjacency list sorting by outdegree (ISC14)
– Vertex sorting (HPCS15)
• Efficient BFS based on Beamer's direction-optimizing BFS (SC12)
– Top-down direction: Agarwal's NUMA-aware top-down (SC10) with socket queues, plus pruning of remote edges (this paper)
– Bottom-up direction: NUMA-aware bottom-up (BD13); each node k computes NQk from CQ and VSk with local accesses only
• New result: 219 GTEPS on UV 300 with 32 sockets, updating the highest single-node Graph500 score
6. Graph processing for large-scale networks
• Large-scale networks are generated in a wide range of application areas
– US road network: 24 million vertices and 58 million edges
– Twitter follow-ship network (social network): 61.6 million vertices and 1.47 billion edges
– Cyber-security: 15 billion log entries per day
– Neuronal network (Human Brain Project): 89 billion vertices and 100 trillion edges
• Fast and scalable graph processing with HPC
– categorized as a data-intensive application
7. Graph analysis and its important kernel, BFS
• Used to understand relationships in real-world networks
• Application fields: transportation, social networks, cyber-security, bioinformatics
• Graph processing built on BFS: breadth-first search, single-source shortest path, maximum flow, maximal independent set, centrality metrics, clustering, graph mining
[Figure: pipeline from input parameters (SCALE, edgefactor) through Step 1 graph generation, Step 2 graph construction, and Step 3 BFS (64 iterations with validation) to results (BFS time, traversed edges, TEPS)]
8. Breadth-first search (BFS)
• One of the most important and fundamental algorithms for traversing graph structures
• Many algorithms and applications are based on BFS (e.g., maximum flow and centrality)
• The well-known algorithm takes O(n + m) time for a digraph G with n vertices and m edges
• Inputs: digraph G = (V, E) and a source vertex
• Outputs: BFS tree and the distance (level) of each vertex from the source
[Figure: BFS from a source vertex expanding through levels 1-3, producing the BFS tree]
9. BFS on Twitter follow-ship network
• follow-ship network
– #Users (#vertices): 41,652,230
– Follow-ships (#edges): 2,405,026,092
Lv. #users ratio (%) percentile (%)
0 1 0.00 0.00
1 7 0.00 0.00
2 6,188 0.01 0.01
3 510,515 1.23 1.24
4 29,526,508 70.89 72.13
5 11,314,238 27.16 99.29
6 282,456 0.68 99.97
7 11536 0.03 100.00
8 673 0.00 100.00
9 68 0.00 100.00
10 19 0.00 100.00
11 10 0.00 100.00
12 5 0.00 100.00
13 2 0.00 100.00
14 2 0.00 100.00
15 2 0.00 100.00
Total 41,652,230 100.00 -
BFS result from User 21,804,357
excluding unconnected users
Six degrees of separation: "everyone and everything is six or fewer steps away."
Ours: 60 milliseconds per BFS on the Twitter2009 dataset.
10. Graph500 and Green Graph500
• New benchmarks using graph processing (breadth-first search)
• Measure the performance and energy efficiency of irregular memory accesses
– Graph500: TEPS score (number of traversed edges per second)
– Green Graph500: TEPS-per-watt score for power-efficient performance
• Benchmark flow with input parameters SCALE and edgefactor (= 16):
1. Generation: Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges, built by applying the recursive Kronecker product SCALE times
2. Construction: build the graph data structure
3. BFS × 64: 64 BFS iterations, each followed by validation
• Results: BFS time, traversed edges, and TEPS; the reported score is the median of the 64 TEPS values (Green Graph500 additionally measures power consumption in watts)
11. NUMA (non-uniform memory access) system
• Example: 4-socket Xeon system with 4 CPU sockets × 8 physical cores per socket × 2 threads per core
• Measured bandwidth (GB/s) between source and target NUMA nodes; the diagonal elements are local accesses:

      0     1     2     3
  0  24.2   3.4   3.0   3.4
  1   3.3  23.9   3.5   3.0
  2   3.0   3.4  24.3   3.4
  3   3.5   3.0   3.4  24.2

• Local access: ~24 GB/s; remote access: ~3 GB/s
• Thread placement and memory placement therefore matter: local access is fast, non-local access is slow, and distances differ between node pairs
12. SGI UV 2000
• Single OS: SUSE Linux 11 (x86_64)
• Hypercube interconnect (NUMAlink6, 6.7 GB/s)
• Up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis × 2 sockets × 10 cores)
• Hierarchical network topology: sockets, chassis (= 2 sockets), cubes (= 8 chassis), inner racks (= 32 nodes), and outer racks
– some levels of this hierarchy cannot be detected as NUMA nodes
• Installed at ISM and Kyushu University; ISM has two full-spec UV 2000 systems
13. SGI UV 300
• Single OS: SUSE Linux 11 (x86_64)
• All-to-all interconnect
• Up to 1,152 cores and 16 TB RAM (= 8 UV 300 chassis × 4 sockets × 18 cores × 2 SMT)
• UV 300 chassis
– 4-socket 18-core Intel Xeon E7-8867 (Haswell), HT enabled (2 SMT)
– 2 TB RAM (512 GB per NUMA node)
• UV 300 rack: 8 chassis, installed at Kyushu University
14. Memory bandwidths on UV 2000 and UV 300
• Bandwidths in GB/s between NUMA nodes, measured with STREAM TRIAD
• Local access is clearly faster than remote access
– UV 2000 (64 sockets): each chassis has 2 sockets and chassis connect to each other by a hypercube topology; local 33 GB/s, remote 3-7 GB/s
– UV 300 (32 sockets): each chassis has 4 sockets and chassis connect to each other by an all-to-all topology; local 56 GB/s, intra-chassis 12-14 GB/s, inter-chassis 6 GB/s
[Figure: heatmaps of bandwidth over thread placement × memory placement for both systems]
15. Programming cost for NUMA-aware code
• Thread and memory binding
– reduces remote accesses
– avoids thread migration
• Linux provides low-level interfaces
– sched_{set,get}affinity(): binds a thread to a processor set (specified by processor id)
– mbind(): binds pages to a NUMA node set (specified by NUMA node id)
– Linux exposes processor ids and NUMA node ids as system files: /proc/cpuinfo, /sys/devices/system/{node,cpu}/
• Reducing programming cost using ULIBC
– provides APIs for NUMA-aware programming
– available at https://bitbucket.org/yuichiro_yasui/ulibc

Binding a thread by processor id:
#define _GNU_SOURCE
#include <sched.h>
int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
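The memory-side counterpart, mbind(), can be sketched the same way. This is a minimal Linux-only illustration, invoking the raw system call so libnuma is not required; bind_pages() is a hypothetical helper of ours, not part of ULIBC, and node 0 is assumed to exist:

```c
/* Sketch: bind freshly mapped (not yet faulted) pages to one NUMA node.
 * Linux-only; uses syscall(SYS_mbind, ...) to avoid a libnuma dependency. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_BIND
#define MPOL_BIND 2   /* allocate only from the nodes in the nodemask */
#endif

void *bind_pages(size_t len, int node) {
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return NULL;
  unsigned long nodemask = 1UL << node;  /* one bit per NUMA node id */
  /* Bind the pages of [p, p+len) to the given node before first touch. */
  if (syscall(SYS_mbind, p, len, MPOL_BIND, &nodemask,
              8 * sizeof(nodemask), 0UL) != 0) {
    munmap(p, len);
    return NULL;  /* e.g., a kernel built without NUMA support */
  }
  return p;
}
```

In a NUMA-aware BFS this is the pattern used to place each partial CSR graph on its own node before threads bound there touch it.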
16. CPU affinity construction using ULIBC
1. Detects the entire topology (e.g., sockets 0-3 with RAM 0-3)
2. Detects the online topology, e.g.:
   numactl --cpunodebind=1,2 \
           --membind=1,2
3. Constructs two types of affinities
• Compact-type affinity: assigns threads in positions close to each other, e.g.:
   export ULIBC_AFFINITY=compact:fine
   export OMP_NUM_THREADS=7
• Scatter-type affinity: distributes the threads as evenly as possible across online processors, e.g.:
   export ULIBC_AFFINITY=scatter:fine
   export OMP_NUM_THREADS=7
17. NUMA-aware computation with ULIBC
• ULIBC is a callable library for NUMA-aware computation
• Detects the processor topology at run time
• Constructs thread and memory affinity settings
• ULIBC is available at https://bitbucket.org/yuichiro_yasui/ulibc

#include <stdio.h>
#include <omp.h>
#include <ulibc.h>   /* include header file */
int main(void) {
  ULIBC_init();   /* initialize */
  _Pragma("omp parallel") {
    const int tid = ULIBC_get_thread_num();   /* get thread id */
    ULIBC_bind_thread();                      /* bind current thread */
    const struct numainfo_t loc = ULIBC_get_numainfo(tid);   /* get NUMA placement */
    printf("Thread: %2d, NUMA-node: %d, NUMA-core: %d\n",
           loc.id, loc.node, loc.core);
    /* do something */
  }
}

Execution log on a 4-socket system (thread id, NUMA node id, core id):
Thread:  4, NUMA-node: 0, NUMA-core: 1
Thread: 55, NUMA-node: 3, NUMA-core: 13
Thread: 16, NUMA-node: 0, NUMA-core: 4
Thread: 37, NUMA-node: 1, NUMA-core: 9
Thread: 30, NUMA-node: 2, NUMA-core: 7
. . .
18. Level-synchronized parallel BFS (top-down)
• Starts from the source vertex and executes the following two phases at each level k:
– Traversal phase: finds unvisited vertices adjacent to CQ (current queue, level k) and appends them to NQ (next queue, level k+1)
– Swap phase: exchanges CQ and NQ for the next level
• All threads synchronize between levels
[Figure: levels 0-3 expanding from the source; CQ and NQ are swapped and synchronized after each level]
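The two phases above can be sketched in C as follows. This is a minimal serial version on a CSR graph; the deck's implementation parallelizes the CQ scan across threads, and the array names here are our choice:

```c
/* Level-synchronized top-down BFS on a CSR graph (serial sketch).
 * head[v]..head[v+1] indexes v's adjacency list in adj[]; level[v] = -1
 * marks an unvisited vertex. cq/nq play the roles of the slides' CQ/NQ. */
#include <stdlib.h>

void bfs_top_down(int n, const int *head, const int *adj,
                  int src, int *level) {
  int *cq = malloc(n * sizeof(int)), *nq = malloc(n * sizeof(int));
  int cq_len = 0, nq_len = 0;
  for (int v = 0; v < n; ++v) level[v] = -1;   /* all unvisited */
  level[src] = 0;
  cq[cq_len++] = src;
  for (int lv = 0; cq_len > 0; ++lv) {
    /* Traversal phase: scan CQ, append unvisited neighbors to NQ */
    for (int i = 0; i < cq_len; ++i) {
      int v = cq[i];
      for (int e = head[v]; e < head[v + 1]; ++e) {
        int w = adj[e];
        if (level[w] < 0) { level[w] = lv + 1; nq[nq_len++] = w; }
      }
    }
    /* Swap phase: NQ becomes the next level's CQ */
    int *t = cq; cq = nq; nq = t;
    cq_len = nq_len; nq_len = 0;
  }
  free(cq); free(nq);
}
```

In the parallel version, the "append to NQ" step is exactly where the per-thread or per-socket queues of the later slides come in.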
19. Direction-optimizing BFS [Beamer, SC12]
• Two directions: top-down or bottom-up
– Top-down direction: scans the frontier (level k) and follows outgoing edges to its neighbors (level k+1)
– Bottom-up direction: scans the candidates of neighbors (level k+1) and follows incoming edges back to the frontier (level k)
• Forward search (top-down) vs. backward search (bottom-up) for BFS; traversed edges per level on an example graph:

  Level   Top-down        Bottom-up       Hybrid
  0                   2   2,103,840,895               2
  1              66,206   1,766,587,029          66,206
  2         346,918,235      52,677,691      52,677,691
  3       1,727,195,615      12,820,854      12,820,854
  4          29,557,400         103,184         103,184
  5              82,357          21,467          21,467
  6                 221          21,240             227
  Total   2,103,820,036   3,936,072,360      65,689,631
  Ratio         100.00%         187.09%           3.12%

• The hybrid runs top-down while the frontier is small and switches to bottom-up around the levels with the largest frontier (levels 2-3 here)
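A single bottom-up step can be sketched as follows, assuming a CSR array of incoming edges and byte-per-vertex CQ/NQ bitmaps; the names mirror the slides' queues, but the code is our illustration, not Beamer's implementation:

```c
/* One bottom-up step of direction-optimizing BFS (sketch).
 * head/adj hold incoming edges in CSR form; cq[w] = 1 means w is in the
 * frontier at level lv. Each unvisited vertex scans its in-neighbors and
 * stops at the first one found in the frontier. */
int bottom_up_step(int n, const int *head, const int *adj,
                   const unsigned char *cq, unsigned char *nq,
                   int *level, int lv) {
  int found = 0;
  for (int v = 0; v < n; ++v) {
    if (level[v] >= 0)
      continue;                      /* already visited */
    for (int e = head[v]; e < head[v + 1]; ++e) {
      if (cq[adj[e]]) {              /* parent found in the frontier */
        level[v] = lv + 1;
        nq[v] = 1;
        ++found;
        break;                       /* early exit: skip the rest of A(v) */
      }
    }
  }
  return found;                      /* 0 means the search has finished */
}
```

The early `break` is what the adjacency list sorting of a later slide exploits: the sooner a frontier parent appears in A(v), the fewer edges are traversed.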
20. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
21. Our contributions (previous work and this paper)
• Efficient graph data structure (CSR)
– NUMA-aware 1-D partitioned graph (BD13), adjacency list sorting by outdegree (ISC14), vertex sorting (HPCS15)
• Efficient BFS based on Beamer's direction-optimizing BFS (SC12)
– Top-down direction: Agarwal's NUMA-aware top-down (SC10) with socket queues, plus pruning of remote edges (this paper)
– Bottom-up direction: NUMA-aware bottom-up (BD13); each node k computes NQk from CQ and VSk with local accesses only
• New results with the reduction of remote edges:
– UV 2000 with 64 sockets: 131 GTEPS → 152 GTEPS (new)
– UV 300 with 32 sockets: 219 GTEPS (new), 16% faster than the highest previous single-node entry
22. Ours: NUMA-aware 1-D partitioned graph [BD13]
• Divides the graph into subgraphs A0-A3 (a 1-D partition of the adjacency matrix) and assigns one to each NUMA node; each subgraph is represented as a CSR graph in the node's local RAM
• Bottom-up direction (the bottleneck component): each NUMA node k computes its partial NQk using a locally copied frontier CQ and the locally assigned visited set VSk, so all accesses are local
• Top-down direction: uses the inverse of G (G is undirected); a modified version of Agarwal's NUMA-aware BFS, scanning a local CQk with both local and remote accesses to NQ
23. Ours: Adjacency list sorting [ISC14]
• Reduces unnecessary edge traversals in the bottom-up direction
• Each adjacency list A(v) is sorted by the outdegree of the adjacent vertices, from high to low
• The bottom-up loop over A(v) breaks as soon as it finds a frontier vertex; because high-outdegree vertices are checked first, the loop count τ shrinks and the remaining adjacency vertices are skipped
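The preprocessing step can be sketched as follows; a minimal illustration assuming plain CSR arrays and a precomputed outdegree table (the helper names are ours, not the paper's):

```c
/* Sketch: sort each CSR adjacency list by outdegree, descending, so the
 * bottom-up loop tends to break early. A file-scope outdegree table is
 * used because qsort's comparator takes no extra context argument. */
#include <stdlib.h>

static const int *g_outdeg;          /* outdeg[v] for every vertex v */

static int by_outdeg_desc(const void *a, const void *b) {
  int da = g_outdeg[*(const int *)a], db = g_outdeg[*(const int *)b];
  return (da < db) - (da > db);      /* descending order */
}

void sort_adjacency_lists(int n, const int *head, int *adj,
                          const int *outdeg) {
  g_outdeg = outdeg;
  for (int v = 0; v < n; ++v)        /* sort each list A(v) independently */
    qsort(adj + head[v], head[v + 1] - head[v], sizeof(int),
          by_outdeg_desc);
}
```

This is a one-time cost at graph construction, amortized over the 64 BFS iterations of the benchmark.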
24. Ours: Vertex sorting [HPCS15]
• The number of traversals of a vertex equals the outdegree of the corresponding vertex, so access frequency and outdegree are correlated
• Our vertex sorting reorders the vertex indices by outdegree: the highest-outdegree vertices receive the smallest indices, so many accesses concentrate on the small-index vertices
[Figure: degree distribution, and access frequency with vertex sorting; original indices vs. indices sorted by outdegree]
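Building the reordering permutation can be sketched as follows; again a minimal illustration with hypothetical names, assuming outdegrees are already computed:

```c
/* Sketch: build the old-to-new index permutation for vertex sorting.
 * Vertices are ranked by outdegree, descending; the highest-outdegree
 * vertex receives new index 0. old2new[v] is v's new index. */
#include <stdlib.h>

static const int *g_deg;

static int cmp_deg_desc(const void *a, const void *b) {
  int da = g_deg[*(const int *)a], db = g_deg[*(const int *)b];
  return (da < db) - (da > db);
}

void vertex_sorting(int n, const int *outdeg, int *old2new) {
  int *order = malloc(n * sizeof(int));   /* vertex ids sorted by outdegree */
  for (int v = 0; v < n; ++v) order[v] = v;
  g_deg = outdeg;
  qsort(order, n, sizeof(int), cmp_deg_desc);
  for (int rank = 0; rank < n; ++rank)
    old2new[order[rank]] = rank;          /* new index = rank in the order */
  free(order);
}
```

The CSR arrays and queues are then rebuilt under this relabeling, so the hottest vertices share a small, dense index range.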
25. NUMA-aware top-down BFS
• The original version was proposed by Agarwal [SC10]
• Reduces random remote accesses using socket queues; on ℓ sockets, local : remote edges ≈ 1 : ℓ for each node (e.g., focusing on NUMA 2 of 4)
• Phase 1 (CQ → NQ or socket queue): each NUMA node scans CQ (local + remote entries), appends unvisited local vertices to NQ, and appends remote edges to the target sockets' queues; then synchronize
• Phase 2 (socket queue → NQ): each NUMA node drains its own socket queue and appends unvisited vertices to NQ using only local accesses; then synchronize and swap CQ and NQ for the next level
26. NUMA-aware top-down BFS with pruning of remote edges
• This paper prunes remote edges to reduce remote accesses (e.g., focusing on remote edge traversals from NUMA 2)
• Without pruning (the original scheme of Agarwal's SC10 paper): each NUMA node appends all remote edges (v, w) to the corresponding socket queue
• With pruning (this paper): each NUMA node appends a remote edge (v, w) to the corresponding socket queue only if the bitmap F does not yet contain w; F then records w, so each vertex is forwarded to a socket queue only once
• F reuses the CQ bitmap of the bottom-up direction and is not reinitialized while the search direction does not change
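The pruning test itself reduces to a check-and-set on the bitmap F. A minimal sketch (the function name is ours, the per-vertex byte layout is an assumption, and appending (v, w) to the target socket queue is left to the caller):

```c
/* Sketch: decide whether a remote edge (v, w) should be forwarded.
 * F is a byte-per-vertex bitmap of already-forwarded targets; it is
 * reset only when the search direction changes, as on the slide. */
#include <stdbool.h>

/* Returns true exactly once per target vertex w. */
bool prune_remote_edge(unsigned char *F, int w) {
  if (F[w])
    return false;   /* w was already forwarded: prune this remote edge */
  F[w] = 1;         /* record w; later remote edges to w are pruned */
  return true;
}
```

In a threaded setting this check-and-set would need an atomic test-and-set per entry (or a benign race tolerated at the cost of an occasional duplicate), a detail the sketch omits.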
27. Effects of pruning & updated TEPS score
• Pruning removes many remote edges. Figure 5 shows the ratio of traversed edges (local / pruned-remote / remote) on each NUMA node per level in the top-down algorithm with remote edge traversal pruning, for a Kronecker graph with SCALE 29 on a four-socket server (SB4); the total numbers of traversed edges at levels 0-7 are 4, 9.04K, 221M, 15.3B, 1.50B, 4.55M, 11.1K, and 29
[Algorithm 3: top-down with pruning of remote traversals, omitted]
• Direction schedule: previous work switched top-down → bottom-up; this paper uses top-down → bottom-up → top-down, applying pruning in the final top-down levels. However, this method may not be effective on a few sockets, because the algorithm switches to the bottom-up direction at the middle levels.
• Figure 6 (weak scaling on UV 2000, GTEPS over the number of NUMA nodes):
– HPCS15-SG (SCALE 26 per NUMA node): 7.7, 15.3, 24.2, 42.1, 59.4, 94.8, 131.4, 174.7 for 1-128 nodes
– This paper (SCALE 27 per NUMA node): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2 for 1-64 nodes
• Updated TEPS score on many sockets, UV 2000 with 64 sockets:
– without pruning: 131 GTEPS
– with pruning: 152 GTEPS
28. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
29. Weak scaling performance
• UV 300 clearly outperforms UV 2000
• GTEPS over the number of sockets:
– UV 2000, weak scaling with SCALE 27 per socket: 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2 (1-64 sockets)
– UV 300 (HT, remote mode), SCALE 27 per socket: 16.3, 29.2, 53.5, 83.9, 129.4, 171.0 (1-32 sockets)
– UV 300 (HT, remote mode), SCALE 29 per socket: 15.9, 28.0, 57.9, 93.5, 151.4, 209.3 (1-32 sockets)
– UV 300 (HT and THP, remote mode), SCALE 29 per socket: 91.4, 147.6, 203.7
– UV 300 (HT and THP, local mode), SCALE 29 per socket: 18.7, 32.5, 64.7, 100.3, 161.5, 219.4 (1-32 sockets)
• The 32-socket configurations are compared on the next slide
30. Breakdown of system configuration on UV 300
• UV 300 is about 2x faster than UV 2000 at the same socket count (32 sockets), with #threads per socket = #logical cores
• The best UV 300 performance is obtained with a larger problem size, THP (transparent huge pages) enabled, the local memory-reference mode, and HT (hyper-threading) enabled; overall, a 28.3% performance gap separates the baseline and best UV 300 configurations

  System   #sockets  SCALE  HT     THP  Mem-ref mode  GTEPS
  UV 2000  32        32     −      −    −              92
  UV 300   32        32     on     −    Remote        171
  UV 300   32        34     on     −    Remote        204   (+19.3% by using a larger memory space)
  UV 300   32        34     on     on   Remote        209   (+2.5% by THP enabled)
  UV 300   32        34     off*1  on   Local         188
  UV 300   32        34     on     on   Local         219   (+4.8% by memory-reference mode; +16.5% by HT enabled)

  *1: uses the same number of threads as physical cores, emulating "hyper-threading disabled"
32. Bandwidth and TEPS
• Bandwidth and TEPS of our implementations on three systems
– GB/s: STREAM TRIAD with 10M elements per socket, via a modified implementation using ULIBC in which each thread computes its partial TRIAD operation on local memory only (see subsection 3.2)
– TEPS: SCALE 27 (n = 134M, m = 2.15B) per socket
• Systems: UV 300 (32-socket Haswell), UV 2000 (64-socket Ivy Bridge), and SB4 (4-socket Sandy Bridge-EP)
• Memory bandwidth and GTEPS are correlated on all three Xeon systems: the optimized Graph500 implementation of this paper and our previous implementation [BD13] scale like the memory bandwidth, whereas the Graph500 reference code is not scalable and cannot exploit the NUMA system efficiently
[Figure: log-log plot of GTEPS vs. memory bandwidth (GB/s) for this paper and BD13 on the three systems]
33. Conclusion
• Motivations
– NUMA / cc-NUMA architectures and the BFS graph algorithm
– an efficient NUMA-aware BFS algorithm improves the locality of memory accesses and exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
• Contributions
– NUMA-aware scalable BFS algorithm with pruning of edge traversals to reduce remote edges, scaling to more than a thousand threads on SGI UV 2000 and SGI UV 300
– updated the highest single-node score to 219 GTEPS on SGI UV 300 with 32 sockets
– "ULIBC": a callable library for NUMA-aware computation, available at https://bitbucket.org/yuichiro_yasui/ulibc
34. References
NUMA-aware BFS algorithm:
• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM, IEEE, IFIP, 2015.
Other results of our Graph500 team:
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.