Please cite the following paper:
Toyotaro Suzumura and Koji Ueno, "ScaleGraph: A high-performance library for billion-scale graph analytics," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 76-84.
doi: 10.1109/BigData.2015.7363744
Recently, large-scale graph analytics has become a very popular topic owing to the emergence of gigantic graphs whose numbers of vertices and edges reach into the millions, billions, or even trillions. Many graph analytics libraries and frameworks with various computational models and programming languages have been proposed to deal with such graphs. The X10 programming language is a PGAS language that aims at both software performance and programmer productivity. We introduce the ScaleGraph library, developed in X10, to illustrate the use of X10 for large-scale graph analytics. ScaleGraph provides the XPregel framework, inspired by Google's Pregel computation model, as a building block for implementing graph kernels. We also optimized parts of the X10 runtime, such as collective communication and memory management. We evaluated the performance and scalability of the ScaleGraph library; the results show that most graph kernels achieve good performance and scalability. ScaleGraph is 9.4 times faster than Giraph in a PageRank experiment with 16 machine nodes. To the best of our knowledge, ScaleGraph is the first X10-based library to address performance, scalability, and productivity issues in large-scale graph analytics.
ScaleGraph - A High-Performance Library for Billion-Scale Graph Analytics
1. ScaleGraph: A High-Performance Library for Billion-Scale Graph Analytics
Toyotaro Suzumura (1,2) and Koji Ueno (2)
(1) IBM T.J. Watson Research Center, New York, USA
(2) Tokyo Institute of Technology, Tokyo, Japan
2. Billion-Scale Data
§ World population: 7.15 billion (2013/07)
§ Social networks
– Facebook: 1.23 billion users (2013/12)
– WhatsApp: 1 billion users (2015/08)
§ Internet of Things / M2M: 26 billion devices by 2020 (2013/12, Gartner)
§ RDF (Linked Data) graph: 2.46 billion triples in DBpedia
§ Human brain: 100 billion neurons with 100 trillion connections
3. Large-Scale Graph Mining is Everywhere
[Figure collage: Internet map, social networks, symbolic networks (protein interactions), cybersecurity (15 billion log entries per day for a large enterprise), medical informatics, data enrichment]
6. Project Goal: ScaleGraph Library
§ Build an open-source, highly scalable large-scale graph analytics library that goes beyond the scale of billions of vertices and edges on distributed systems
7. Research Challenges and Problem Statement
§ Programming Model
– Should have sufficient capability to represent various graph algorithms
– Should be an easy-to-use programming model for users; synchronous vs. asynchronous?
§ Data Representation and Distribution
– Should be as efficient as possible and must handle highly skewed workload imbalance
§ Programming Language
– Java, C/C++, or a new HPCS language?
– Should cope with advances in the underlying hardware infrastructure (e.g., accelerators)
§ Communication Abstractions: MPI, PAMI (BG/Q), GASNet (LLNL), threads, ...
How do you design and implement a high-performance graph analytics platform that can handle various distributed-memory or many-core environments in a highly productive manner?
9. Pregel Programming Model [SIGMOD’10]
§ Each vertex initializes its state.
[1] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010.
12. Pregel Programming Model
§ Each vertex sends messages to other vertices.
§ In each subsequent superstep, every vertex computes using the messages it received and sends new messages, repeating until the computation halts.
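To make the vertex-centric model concrete, below is a minimal, single-process C++ sketch of such a superstep loop. All names here are illustrative, not ScaleGraph's API: each vertex consumes the messages addressed to it, updates its value, and emits messages that are delivered in the next superstep.

#include <cstddef>
#include <cstdio>
#include <vector>

// Toy vertex-centric (Pregel-style) superstep loop; names are illustrative.
struct Message { std::size_t target; double value; };

int main() {
    const std::size_t n = 4;                              // number of vertices
    std::vector<double> state(n, 1.0);                    // per-vertex value
    std::vector<std::vector<std::size_t>> out = {         // out-neighbors
        {1, 2}, {2}, {3}, {0}};

    std::vector<Message> inbox;                           // messages delivered this superstep
    for (int superstep = 0; superstep < 3; ++superstep) {
        std::vector<Message> outbox;
        std::vector<double> received(n, 0.0);
        for (const Message& m : inbox) received[m.target] += m.value;

        // "compute": every vertex updates its value and messages its neighbors
        for (std::size_t v = 0; v < n; ++v) {
            if (superstep > 0) state[v] = received[v];
            for (std::size_t u : out[v])
                outbox.push_back({u, state[v] / out[v].size()});
        }
        inbox.swap(outbox);                               // barrier: deliver in the next superstep
    }
    for (std::size_t v = 0; v < n; ++v)
        std::printf("vertex %zu: %f\n", v, state[v]);
    return 0;
}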
13. Design of ScaleGraph
§ Language Choice : X10 (IBM Research)
§ Programming Model:
– Pregel computation model or SpMV Model
§ Graph Representation
– Distributed Sparse Matrix (1D or 2D)
§ Performance and Memory Management Optimization
– Optimized collective routines (e.g., alltoall, allgather, scatter and barrier)
– Message Optimization
– Highly optimized array data structure (MemoryChunk) for allocating very large chunks of memory
15. Why X10 as the underlying language ?
§ High Productivity
– X10 allows us to write a platform for distributed systems more productively than C/C++/Fortran with MPI.
– Examples:
• Graph algorithm (degree distribution) → 60 lines of X10 code
• XPregel (graph processing system) → 1,600 lines of X10 code (Apache Giraph: around 11,000 lines for the communication package alone)
§ Interoperability with existing C/C++ code
– An X10 program can call functions written in a native language (C/C++) without performance loss.
– It is easy to integrate existing native libraries (such as ScaLAPACK, ParMETIS, and PARPACK).
– We can also write performance-critical code in C/C++ and integrate it with an X10 program.
§ Communication Abstraction
16. ScaleGraph Software Stack
[Diagram: ScaleGraph software stack, layers from top to bottom]
– User Program / Graph Algorithm
– XPregel (Graph Processing System), BLAS for Sparse Matrix, File IO
– ScaleGraph Core Lib, Third-Party Library Interface (X10 & C++)
– Third-Party Libraries (ARPACK, METIS)
– X10 Core Lib, Optimized Team
– X10 Native Runtime
– MPI
17. Two Models for Computing Graph Algorithms
§ Pregel [G. Malewicz, SIGMOD '10]
– Programming model and system for graph processing.
– Based on the Bulk Synchronous Parallel model [Valiant, 1990]
– We built a Pregel-model platform with X10 named XPregel
§ Sparse Matrix Vector Multiplication
– PageRank, Random Walk with Restart, Spectral Clustering (which uses eigenvector computation)
18. XPregel : X10-based Pregel Runtime
§ An X10-based Pregel-model runtime platform that aims at running on various computing environments, from many-core systems to distributed systems
§ Performance Optimization
1. Utilize native MPI collective communication for message exchange.
2. Avoid serialization, which enables utilizing the fast interconnects of supercomputers.
3. The destination place of a message can be computed by simple bit manipulation, thanks to vertex-id renumbering.
4. An optimized message communication method that can be used when a vertex sends the same message to all of its neighbor vertices.
19. Programming Model
§ The core algorithm of a graph kernel can be implemented by calling the iterate method of XPregelGraph, as shown in the example.
§ Users also specify the type of messages (M) as well as the type of the aggregated value (A).
§ The method accepts three closures: a compute closure, an aggregator closure, and an end closure.
§ In each superstep (iteration), a vertex contributes its value, divided by its number of out-links, to its neighbors.
§ Each vertex sums the scores received from its neighbors and sets the result as its new value.
§ The computation continues until the aggregated change in vertex values falls below a given threshold or the number of iterations reaches a given limit.
xpgraph.iterate[Double,Double](
    // Compute closure
    (ctx :VertexContext[Double, Double, Double, Double],
     messages :MemoryChunk[Double]) => {
        val value :Double;
        if (ctx.superstep() == 0) {
            // calculate the initial PageRank score of each vertex
            value = 1.0 / ctx.numberOfVertices();
        } else {
            // from the second superstep onward, sum the incoming scores
            value = (1.0 - damping) / ctx.numberOfVertices() +
                    damping * MathAppend.sum(messages);
        }
        // aggregate the change of this vertex's score
        ctx.aggregate(Math.abs(value - ctx.value()));
        // set the new rank score
        ctx.setValue(value);
        // broadcast the score to all neighbors
        ctx.sendMessageToAllNeighbors(value / ctx.outEdgesId().size());
    },
    // Aggregator closure: combine the per-vertex aggregated values
    (values :MemoryChunk[Double]) => MathAppend.sum(values),
    // End closure: return true to terminate the iteration
    (superstep :Int, aggVal :Double) => {
        return (superstep >= maxIter || aggVal < eps);
    });
PageRank Example

public def iterate[M,A](
    compute    :(ctx:VertexContext[V,E,M,A], messages:MemoryChunk[M]) => void,
    aggregator :(MemoryChunk[A]) => A,
    end        :(Int,A) => Boolean)
20. Graph representation and its 1D row-wise distribution on distributed systems
§ A directed weighted graph is represented as a distributed adjacency matrix, where row indices represent source vertices and column indices represent target vertices.
§ The local id and the owning place of a vertex can be determined from the vertex id itself using only bit-wise operations.
§ This reduces the computational overhead of graph algorithms, which usually need to check frequently which place owns a given vertex.
[Figure: an example directed weighted graph with eight vertices (0-7) and its 8×8 adjacency matrix; rows are source vertices, columns are target vertices, and ∞ marks a missing edge. The non-∞ entries are (0,1)=1, (1,3)=2, (1,4)=1, (1,5)=3, (2,1)=2, (6,5)=4, (7,5)=5.]
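The bit-wise mapping mentioned above can be sketched as follows, assuming (as an illustration only, not ScaleGraph's actual scheme) that the number of places is a power of two and renumbered vertex ids are assigned cyclically: the low bits of the id select the owning place and the remaining bits form the local id.

#include <cstdio>

// Sketch: cyclic vertex distribution over 2^LOG_PLACES places.
// Low bits of a vertex id give the owning place; the remaining bits give the local id.
typedef unsigned long long VertexId;
const int      LOG_PLACES = 2;                        // 4 places in this example
const VertexId PLACE_MASK = (1ULL << LOG_PLACES) - 1;

VertexId placeOf(VertexId id)   { return id & PLACE_MASK; }
VertexId localIdOf(VertexId id) { return id >> LOG_PLACES; }
VertexId globalIdOf(VertexId place, VertexId localId) {
    return (localId << LOG_PLACES) | place;           // inverse mapping
}

int main() {
    for (VertexId v = 0; v < 8; ++v)
        std::printf("vertex %llu -> place %llu, local id %llu\n",
                    v, placeOf(v), localIdOf(v));
    return 0;
}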
21. Various distributions of a distributed sparse matrix on four places
§ For two-dimensional block distribution, the sparse matrix is partitioned into blocks. The number of blocks, R × C, must match the number of places, where R is the number of row partitions and C is the number of column partitions.
§ Examples: 2D block (R=2, C=2), 1D column-wise (R=1, C=4), and 1D row-wise (R=4, C=1)
[Figure: the 8×8 adjacency matrix from slide 20 partitioned across four places P0-P3 in the three ways listed above: 2D block, 1D column-wise, and 1D row-wise]
22. Graph Representation
§ Edge list file
– A text file that contains the edge list.
§ Distributed edge list
§ Distributed sparse matrix
– CSR (Compressed Sparse Row) format
source,target
0,10
0,13
1,2
3,5
…
[Figure: data flow from the edge list file, read in parallel into a distributed edge list across Places 0-3, through graph construction into a distributed sparse matrix (offset, vertices, and weight arrays per place), and written back as output]
§ ScaleGraph supports renumbering vertex IDs when it loads graphs from file.
§ ScaleGraph uses a cyclic vertex distribution.
§ ScaleGraph supports both 1D and 2D matrix distribution.
§ XPregel uses the CSR format; the SpMV model uses CSC.
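As a rough illustration of the CSR layout, the example adjacency matrix from slide 20 would be stored as three flat arrays. The field names offset, vertices, and weight follow the slide, but this struct is only a sketch (distribution across places is ignored), not ScaleGraph's MemoryChunk-based data structure.

#include <cstdio>
#include <vector>

// Sketch of a CSR (Compressed Sparse Row) adjacency for the 8-vertex example
// graph of slide 20. offset[v]..offset[v+1] delimits vertex v's out-edges.
struct CsrGraph {
    std::vector<long>   offset;    // size = numVertices + 1
    std::vector<long>   vertices;  // target vertex of each edge
    std::vector<double> weight;    // weight of each edge
};

int main() {
    CsrGraph g;
    g.offset   = {0, 1, 4, 5, 5, 5, 5, 6, 7};
    g.vertices = {1,  3, 4, 5,  1,  5,  5};
    g.weight   = {1,  2, 1, 3,  2,  4,  5};

    for (long v = 0; v + 1 < (long)g.offset.size(); ++v)
        for (long e = g.offset[v]; e < g.offset[v + 1]; ++e)
            std::printf("%ld -> %ld (weight %.0f)\n", v, g.vertices[e], g.weight[e]);
    return 0;
}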
24. Our Proposed Optimization (1): Efficient Memory Management for Big Graphs
§ Our proposed Explicit Memory Management (EMM) is used through an array type, MemoryChunk (used in the same way as X10's native array).
§ It is designed to deal with a very large number of items.
§ Memory allocation in MemoryChunk has two modes, for small memory requests and large memory requests, respectively.
– The appropriate mode is determined internally from the size of the requested memory and a memory-size threshold.
§ For small memory requests, MemoryChunk uses the Boehm GC (garbage collection) allocation scheme, while for large memory requests it calls malloc and free directly.
[Figure: PageRank on an RMAT scale-24 graph]
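The two-mode idea can be pictured with the small sketch below; the threshold value and the gcAlloc placeholder are illustrative assumptions, not ScaleGraph's actual implementation.

#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Sketch of a two-mode allocator: small requests go to a garbage-collected
// heap, large requests bypass the GC and use malloc/free directly.
// THRESHOLD and gcAlloc() are illustrative placeholders.
constexpr std::size_t THRESHOLD = 256 * 1024;   // e.g. 256 KiB

void* gcAlloc(std::size_t bytes) {
    // Placeholder for a GC-managed allocation (e.g. Boehm GC's GC_MALLOC).
    return std::malloc(bytes);
}

void* allocChunk(std::size_t bytes, bool* usedMalloc) {
    *usedMalloc = bytes >= THRESHOLD;
    return *usedMalloc ? std::malloc(bytes) : gcAlloc(bytes);
}

void freeChunk(void* p, bool usedMalloc) {
    if (usedMalloc) std::free(p);               // explicit release for large chunks
    // GC-managed chunks are reclaimed by the collector.
}

int main() {
    bool big;
    void* edges = allocChunk(1 << 26, &big);    // 64 MiB: explicit malloc/free path
    std::printf("large chunk used malloc: %d\n", big);
    freeChunk(edges, big);
    return 0;
}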
25. Our Proposed Optimization (2): Optimizing Collective Communication
§ We modified X10 so that native MPI collective communication can be used via x10.util.Team.
§ We implemented parallel serialization for Team collective communication.
[Figure: speedup of the optimized Team over X10's existing communication methods on 128 nodes, exchanging 8 MB per place, on TSUBAME]
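For reference, the sketch below shows what a native MPI collective exchange looks like at the C/C++ level (a generic MPI_Alltoall example, not ScaleGraph's internal code); the optimized Team routes collective operations to MPI calls of this kind instead of point-to-point messaging.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal MPI_Alltoall example: every rank sends one long to every other rank.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<long> sendbuf(size), recvbuf(size);
    for (int p = 0; p < size; ++p) sendbuf[p] = rank * 100 + p;  // payload for rank p

    MPI_Alltoall(sendbuf.data(), 1, MPI_LONG,
                 recvbuf.data(), 1, MPI_LONG, MPI_COMM_WORLD);

    for (int p = 0; p < size; ++p)
        std::printf("rank %d received %ld from rank %d\n", rank, recvbuf[p], p);
    MPI_Finalize();
    return 0;
}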
26. Our Proposed Optimization (3): Reducing Communication Messages
§ Our proposed "SendAll" technique aims at reducing messages when a vertex sends the same message to all of its neighbors; normally this creates many identical messages, many of which are sent to the same place (e.g., in PageRank and BFS).
§ When SendAll is enabled by calling the sendMessageToAllNeighbors() method, the source place sends only one message per vertex to each destination place, and each destination place then duplicates the message and delivers it to the respective destination vertices.
[Figure: wall-clock time of PageRank on 16, 32, 64, and 128 nodes with the normal, SendAll, and Combine configurations]
[Figure: number of messages transferred (billions) during PageRank with the normal, Combine, and SendAll configurations on 16 and 128 machine nodes]
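The effect can be sketched as follows (illustrative code, not XPregel's implementation): instead of one message per out-edge, the sender emits at most one message per destination place, and each receiving place fans the value out to that vertex's local neighbors.

#include <cstdio>
#include <set>
#include <vector>

// Sketch of the "SendAll" optimization (illustrative, not XPregel's code):
// a vertex sending the same value to all neighbors emits one message per
// destination *place*, and the receiver duplicates it for the local neighbors.
const int NUM_PLACES = 2;
int placeOf(long v) { return (int)(v % NUM_PLACES); }   // toy cyclic distribution

int main() {
    long src = 1;                                        // vertex 1 of the slide-20 graph
    std::vector<long> neighbors = {3, 4, 5};             // its out-neighbors
    double value = 0.25;                                 // same value for everyone

    // Sender side: one message per destination place instead of one per edge.
    std::set<int> destPlaces;
    for (long u : neighbors) destPlaces.insert(placeOf(u));
    std::printf("edges: %zu, messages actually sent: %zu\n",
                neighbors.size(), destPlaces.size());

    // Receiver side (per place): duplicate the single message locally.
    for (int p : destPlaces)
        for (long u : neighbors)
            if (placeOf(u) == p)
                std::printf("place %d delivers %.2f from vertex %ld to vertex %ld\n",
                            p, value, src, u);
    return 0;
}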
27. Parallel Text File Reader/Writer for Graph
§ Motivation
– Loading and writing data from and to I/O storage is as important as executing graph kernels.
– When loading a large graph, a poorly designed graph loader can take significantly longer than the graph kernel itself, because of network communication overhead and the high latency of I/O storage.
§ Solution
– ScaleGraph provides a parallel text file reader/writer.
– First, the input file is split into even chunks, one for each available place.
– Each place loads only its own chunk and then splits it into smaller, even chunks, one per worker thread, assigning these smaller chunks to the respective threads.
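The chunking step can be sketched as follows (illustrative only; the real reader works on file offsets rather than an in-memory string): the byte range is split evenly, one chunk per place, and each internal boundary is advanced to the next newline so that no line of the edge list is split between two readers.

#include <cstdio>
#include <string>
#include <vector>

// Sketch: split a byte range [0, fileSize) into one chunk per place, then
// shift every internal chunk boundary to the next '\n' so lines are never split.
struct Chunk { long begin; long end; };

std::vector<Chunk> splitIntoChunks(const std::string& contents, int places) {
    const long fileSize = (long)contents.size();
    std::vector<long> cut(places + 1);
    for (int p = 0; p <= places; ++p)
        cut[p] = fileSize * p / places;                   // even split by bytes
    for (int p = 1; p < places; ++p)                      // align to line boundaries
        while (cut[p] > 0 && cut[p] < fileSize && contents[cut[p] - 1] != '\n')
            ++cut[p];
    std::vector<Chunk> chunks(places);
    for (int p = 0; p < places; ++p) chunks[p] = {cut[p], cut[p + 1]};
    return chunks;
}

int main() {
    std::string edgeList = "0,10\n0,13\n1,2\n3,5\n7,6\n";
    for (const Chunk& c : splitIntoChunks(edgeList, 2))
        std::printf("chunk [%ld, %ld):\n%s", c.begin, c.end,
                    edgeList.substr(c.begin, c.end - c.begin).c_str());
    return 0;
}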
28. Graph Algorithms
Currently supported algorithms and algorithms planned for future support:
PageRank, Degree Distribution, Betweenness Centrality, Shortest Path, Breadth-First Search, Minimum Spanning Tree (Forest), Strongly Connected Components, Spectral Clustering, Degrees of Separation (HyperANF), Clustering Coefficient, Blondel Clustering, Eigensolver for Sparse Matrices, Connected Components, Random Walk with Restart, etc.
29. Weak Scaling and Strong Scaling Performance up to 128 Nodes (1,536 cores)
Evaluation environment: TSUBAME 2.5. Each node has two Intel Xeon X5670 2.93 GHz CPUs (6 cores and 12 hardware threads per CPU) and 54 GB of memory. All compute nodes are connected with InfiniBand QDR.
[Table: weak-scaling performance of each algorithm (seconds), RMAT graph of scale 22 per node]
[Table: strong-scaling performance of each algorithm (seconds), RMAT graph of scale 28]
30. Degree Distribution
[Figure: strong-scaling result of degree distribution (scale 28), elapsed time (s) vs. number of machines (16, 32, 64, 128), for RMAT and Random graphs]
The scale-28 graphs we used have 2^28 (≈268 million) vertices and 16 × 2^28 (≈4.29 billion) edges.
32. Degrees of Separation
The scale-28 graphs we used have 2^28 (≈268 million) vertices and 16 × 2^28 (≈4.29 billion) edges.
[Figure: strong-scaling result of HyperANF (scale 28), elapsed time (s) vs. number of machines (16, 32, 64, 128), for RMAT and Random graphs]
33. Performance of XPregel
Framework | Execution time (seconds)
Giraph | 153
GPS | 100
Optimized XPregel | 2.4

Execution time of 30 PageRank iterations on a scale-20 RMAT graph (1 million vertices, 16 million edges) with 4 TSUBAME nodes.
Giraph and GPS data are from [Bao and Suzumura, LSNA 2013 WWW Workshop].
34. ScaleGraph vs. Apache Giraph and PBGL
[Figure: PageRank in strong scaling (RMAT graph, scale 25, 30 iterations), elapsed time (s) vs. number of nodes (1-16), for ScaleGraph and PBGL]
[Figure: PageRank in weak scaling (RMAT graph, scale 22, 30 iterations), elapsed time (s) vs. number of nodes (1-128), for ScaleGraph and PBGL]

Strong-scaling performance on RMAT scale 25:

Nodes | ScaleGraph (s) | Giraph (s) | PBGL (s)
1 | 158.9 | - | -
2 | 85.0 | - | 966.8
4 | 44.9 | 2885.1 | 470.3
8 | 23.4 | 443.1 | 309.5
16 | 13.3 | 125.3 | 290.9
37. Steps Towards Billion-Scale Graph Processing: Performance Speed-ups from Version 1.0 to the Latest Version, 2.2

Ver. | Date | Problem size (max) | Kernel | # of nodes (max) | Elapsed time | Features
1.0 | '12/06 | 42 million vertices (Twitter KAIST) | Degree distribution | 8 | More than 1 hour | Initial design
2.1 | '13/09 | Scale 26 (67 million vertices) | PageRank | 128 | 1.35 sec per iteration | Team library wrapping native MPI collective communication; XPregel communication optimization
2.2 | '14/03 | Scale 32 (4.3 billion vertices) | PageRank | 128 | 0.88 sec per iteration | Explicit Memory Management; optimized X10 activity scheduler, etc.
38. Performance Summary for ScaleGraph 2.2
§ Synthetic big graphs that follow various characteristics of social networks
– Largest data: 4.3 billion vertices and 68.7 billion edges (RMAT scale 32, 128 nodes)
– PageRank: 16.7 seconds for one iteration
– HyperANF (b=5): 71 seconds
§ Twitter graph (0.47 billion vertices and 7 billion edges, around scale 28.8)
– PageRank (128 nodes): 76 seconds
– Spectral clustering (128 nodes): 1,839 seconds
– Degrees of separation (128 nodes): 56 seconds
– Degree distribution (128 nodes): 128 seconds
39. Concluding Remarks
§ ScaleGraph official web site: http://www.scalegraph.org/
– License: Eclipse Public License v1.0
– Project information and documentation
– Source code distribution / VM image
– Source code repository: http://github.com/scalegraph/
§ Ongoing/Future Work
– Integration with graph databases such as IBM System G Native Store
– Other domains: RDF graphs, Human Brain Project (EU)
– More temporal web analytics on our whole Twitter follower-followee network and all user profiles as of 2012/10
Special thanks to the contributors to this work, including my current and past students Koji Ueno, Charuwat Houngkaew, Hiroki Kanezashi, Hidefumi Ogata, and Masaru Watanabe, and the ScaleGraph team.