Benchmarking Tool for Graph Algorithms
1. Benchmarking Tool for
Graph Algorithms
IIIT-H Cloud Computing - Major Project
By:
Abhinaba Sarkar 201405616
Malavika Reddy 201201193
Yash Khandelwal 201302164
Nikita Kad 201330030
2. Description
● In computer science and mathematics, graphs are abstract data structures that model
structural relationships among objects. They are now widely used for data modeling in
application domains for which identifying relationship patterns, rules, and anomalies is useful.
● These domains include the web graph, social networks, etc. The ever-increasing size of
graph-structured data in these applications creates a critical need for scalable systems that
can process large amounts of it efficiently.
● The project aims to build a benchmarking tool that tests the performance of graph
algorithms such as BFS, PageRank, etc. on MapReduce, Giraph, GraphLab, and Neo4j, and
determines which approach works best on which kinds of graphs.
3. Motivation
● Analyze the runtime of different types of graph algorithms on different
types of distributed systems.
● Performing computation on a graph data structure requires processing at
each node.
● Each node contains node-specific data as well as links (edges) to other
nodes, so computation must traverse the graph, which can take a huge
amount of time.
4. Approach
The BFS/SSSP algorithm is broken into two tasks:
● Map Task: in each map task, we discover all the neighbors of the nodes currently in the queue
(we use the color GRAY to mark queued nodes) and add them to our graph.
● Reduce Task: in each reduce task, we set the correct level of the nodes and update the graph.
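The two BFS tasks above can be sketched in Python. This is a minimal illustration, not the project's code: function names are hypothetical, a node record is (adjacency_list, level, color), and a plain dictionary plus a sequential loop stands in for Hadoop's shuffle.

```python
from collections import defaultdict

WHITE, GRAY, BLACK = "WHITE", "GRAY", "BLACK"
INF = float("inf")

def bfs_map(node_id, record):
    """Map task: expand every frontier (GRAY) node to its neighbors."""
    adj, level, color = record
    if color == GRAY:
        for nbr in adj:
            # Newly discovered neighbor: one level deeper, joins the queue.
            yield nbr, ([], level + 1, GRAY)
        # The expanded node itself is now fully explored.
        yield node_id, (adj, level, BLACK)
    else:
        yield node_id, (adj, level, color)

def bfs_reduce(node_id, records):
    """Reduce task: merge duplicate records for a node, keeping the
    smallest level and the darkest color seen."""
    rank = {WHITE: 0, GRAY: 1, BLACK: 2}
    adj, best_level, color = [], INF, WHITE
    for a, level, c in records:
        if a:
            adj = a                          # keep the real adjacency list
        best_level = min(best_level, level)  # correct (shallowest) level wins
        if rank[c] > rank[color]:
            color = c
    return node_id, (adj, best_level, color)

def bfs_iteration(graph):
    """One full map + shuffle + reduce round over the whole graph."""
    groups = defaultdict(list)
    for node_id, record in graph.items():
        for key, value in bfs_map(node_id, record):
            groups[key].append(value)
    return dict(bfs_reduce(k, v) for k, v in groups.items())
```

Each call to `bfs_iteration` advances the frontier by one level, which is why the number of rounds (and hence runtime) grows with the depth of the graph.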
The PageRank algorithm is also broken into two steps:
● Map Task: each page emits its neighbors and its current PageRank.
● Reduce Task: for each key (page), the new PageRank is calculated from the ranks emitted in
the map task:
○ PR(A) = (1 - d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where C(P) is the cardinality
(out-degree) of page P and d is the damping ("random URL") factor.
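One PageRank iteration in this map/reduce form can be sketched as follows. Again the names are illustrative and a sequential driver stands in for the shuffle; a page record is (out_links, rank), and the reducer applies the formula on the slide.

```python
from collections import defaultdict

D = 0.85  # damping ("random URL") factor d

def pagerank_map(page, record):
    out_links, rank = record
    # Pass the link structure through so the reducer can rebuild the graph.
    yield page, ("GRAPH", out_links)
    # Each page emits an equal share of its current rank to every out-link.
    for nbr in out_links:
        yield nbr, ("RANK", rank / len(out_links))

def pagerank_reduce(page, values):
    out_links, incoming = [], 0.0
    for kind, val in values:
        if kind == "GRAPH":
            out_links = val
        else:
            incoming += val
    # PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
    return page, (out_links, (1 - D) + D * incoming)

def pagerank_iteration(graph):
    groups = defaultdict(list)
    for page, record in graph.items():
        for key, value in pagerank_map(page, record):
            groups[key].append(value)
    return dict(pagerank_reduce(k, v) for k, v in groups.items())
```

Note that every edge produces one intermediate ("RANK", share) record per iteration, which is the source of the (nodes + relations) * iterations cost discussed under Complexity.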
Dijkstra:
● Map Task: in each map task, neighbors are discovered and put into the
queue, color-coded GRAY.
● Reduce Task: in each reduce task, we keep for every node the shortest
distance found so far from the source node.
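The weighted (SSSP) variant can be sketched the same way. This hypothetical version omits the GRAY coloring for brevity, so every reached node re-emits its relaxations each round; a node record is (adjacency, distance), where adjacency is a list of (neighbor, edge_weight) pairs and unreached nodes start at infinity.

```python
from collections import defaultdict

INF = float("inf")

def sssp_map(node_id, record):
    adj, dist = record
    yield node_id, (adj, dist)          # preserve the node's own record
    if dist < INF:
        for nbr, weight in adj:
            # Tentative relaxed distance through this node.
            yield nbr, ([], dist + weight)

def sssp_reduce(node_id, records):
    adj, best = [], INF
    for a, d in records:
        if a:
            adj = a
        best = min(best, d)  # the reducer keeps the shortest distance seen
    return node_id, (adj, best)

def sssp_iteration(graph):
    groups = defaultdict(list)
    for node_id, record in graph.items():
        for key, value in sssp_map(node_id, record):
            groups[key].append(value)
    return dict(sssp_reduce(k, v) for k, v in groups.items())
```

Because a shorter path may use more edges than the direct one, this needs more iterations to converge than plain BFS, matching the observation under Complexity.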
5. Approach contd.
Giraph and Hadoop
All computations are done on a cluster of 2 nodes.
GraphLab
All computations are performed on a single machine.
6. Applications
In today's world, dynamic social graphs (such as
LinkedIn, Twitter, and Facebook) are not feasible to
process on a single node. Therefore we need to
benchmark the runtime of different graph
algorithms on distributed systems.
Example graph: LinkedIn's social graph
7. Complexity
● BFS: the complexity of the standard BFS algorithm is O(V + E), but because of
the read/write overhead in distributed computing, the order reaches
O(E * depth).
● The case is similar for Dijkstra's algorithm, but the number of iterations is
higher than for BFS.
● PageRank: the complexity of PageRank in a distributed system is
(no. of nodes + no. of relations) * iterations.
8. Benchmarking - Giraph

BFS
Nodes       Time
1000        4 min 7.836 sec
1 million   10 min 11.443 sec

Dijkstra
Nodes       Time
1000        3 min 5.655 sec
1 million   11 min 0.05 sec

PageRank
Nodes       Time
1000        5 min 12.111 sec
1 million   16 min 8.652 sec
9. Benchmarking - GraphLab

PageRank
Nodes       Time
1000        6.029 sec
10,000      20.154 sec
1 million   1 min 11.124 sec

Dijkstra
Nodes       Time
1000        4.852 sec
10,000      13.029 sec
1 million   1 min 10.576 sec
10. Benchmarking - Hadoop

BFS
Nodes       Time
1000        4 min 7.836 sec
1 million   10 min 11.443 sec

Dijkstra
Nodes       Time
1000        3 min 5.655 sec
1 million   11 min 0.05 sec

PageRank
Nodes       Time
1000        5 min 12.111 sec
1 million   16 min 8.652 sec

The runtimes of BFS and Dijkstra depend on the depth of the input graph.
11. Problems we faced
● Poor locality of memory access.
● Very little work per vertex.
● Changing degree of parallelism.
● Running over many machines makes these problems worse.
12. Conclusion and Future Work
● Although GraphLab is fast, it is memory-constrained: a machine needs enough memory to
hold all the edges, and their associated values, of any single vertex in the graph.
● From the experimental results, the time taken by the PageRank algorithm is directly
proportional to the number of relations in the graph when the number of nodes and iterations
is held constant. This explains the large differences in time.
● The runtime of BFS is directly proportional to the depth of the graph: the greater the depth,
the more iterations are needed, and hence the longer the runtime.
Future Work:
Taking the input graph from a file adds a huge overhead of reading and writing files in each
iteration. If we could instead store the graph and its properties in a database, this read/write
overhead would disappear and query time would be reduced. So we plan to add database support.