Introduces what graphs are and explores what happens behind some of the applications (PageRank, Maps, Facebook etc) that use graph processing. Introduces, at a high level, the different frameworks and software behind graph processing.
2. Agenda
Introduction to Graphs
Representing graphs
Different types of graphs
Algorithms in graphs
What constitutes a graph application
Graph databases (examples and how they work)
Graph computing engines (examples and how they work)
Questions & Answers
4. How is a graph represented?
[Figure: a graph of six numbered vertices (1 to 6) connected by edges, with one vertex and one edge labelled.]
A collection of vertices connected to each other using edges, with both vertices and edges
having properties. A vertex can be a person, place, account or any item which needs to be
tracked.
5. A social graph
Who are Arun's friends?
Whom should I recommend Sheetal to be friends with?
[Figure: a social graph whose vertices are people (Arun, Bob, Tom, Deepak, Sheetal, Prajval) and whose edges are labelled Friend, Relative or Colleague.]
Vertex properties (example): Name: Arun, Age: 25, Sex: M
Edge properties (example): Relation: Colleague
6. Facebook Recruiting Competition
Want an interview @ Facebook?
The challenge is to recommend missing links in a social network. Participants will be presented with an external anonymized, directed social graph (no, not Facebook, keep guessing) from which some edges have been deleted, and asked to make ranked predictions for each user in the test set of which other users they would want to follow.
What is Kaggle?
Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems - the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.
http://www.kaggle.com/c/FacebookRecruiting
7. A spatial graph
What is the shortest distance between Calcutta and Bangalore?
I would like to cover all the places, which is the shortest path?
[Figure: a spatial graph whose vertices are cities (New Delhi, Lucknow, Kolkata, Mumbai, Bangalore, Chennai) and whose edges are labelled with distances between 250 km and 850 km.]
Vertex properties (example): Name: Bangalore, Population: 25,00,000, Area: 35,000 sq km
Edge properties (example): Distance: 700 km
8. How to represent a Graph for computing?
.... as an adjacency list for a sparse graph
1 -> 2,4,5
2 -> 3
3 -> 5
4 -> 3,6
5 ->
6 -> 5
.... as an adjacency matrix for a dense graph
  1 2 3 4 5 6
1 0 1 0 1 1 0
2 0 0 1 0 0 0
3 0 0 0 0 1 0
4 0 0 1 0 0 1
5 0 0 0 0 0 0
6 0 0 0 0 1 0
A graph with few edges is sparse; one with many edges is dense.
The web, with billions of pages, cannot practically be represented as an adjacency matrix, because the matrix needs one cell for every pair of pages.
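The two representations can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up for this example, and the edge list is read off the adjacency list of the six-vertex graph (including the edge 4 -> 6).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of any framework mentioned in the slides.
class GraphRepresentations {
    // Directed edges of the six-vertex example, as {from, to} pairs.
    static final int[][] EDGES = {
        {1, 2}, {1, 4}, {1, 5}, {2, 3}, {3, 5}, {4, 3}, {4, 6}, {6, 5}
    };

    // Adjacency list: memory grows with the number of vertices plus edges.
    static Map<Integer, List<Integer>> adjacencyList(int vertexCount, int[][] edges) {
        Map<Integer, List<Integer>> list = new LinkedHashMap<>();
        for (int v = 1; v <= vertexCount; v++) {
            list.put(v, new ArrayList<>());
        }
        for (int[] e : edges) {
            list.get(e[0]).add(e[1]);
        }
        return list;
    }

    // Adjacency matrix: always vertexCount * vertexCount cells, however few edges exist.
    static int[][] adjacencyMatrix(int vertexCount, int[][] edges) {
        int[][] matrix = new int[vertexCount][vertexCount];
        for (int[] e : edges) {
            matrix[e[0] - 1][e[1] - 1] = 1; // vertices are numbered from 1
        }
        return matrix;
    }

    public static void main(String[] args) {
        System.out.println(adjacencyList(6, EDGES));
        for (int[] row : adjacencyMatrix(6, EDGES)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The list stores 8 entries for 8 edges, while the matrix stores 36 cells for the same graph, which is why the matrix only pays off when most cells are non-zero.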
9. Different Graphs
Social graph (Facebook, LinkedIn etc)
Spatial graph (Google Maps, MapQuest, FedEx etc)
Web graph (PageRank, Recommendations etc)
Computer network graph (Optimal network layout etc)
Financial graph (Fraud detection, Currency Flow etc)
Data representations (Lists etc)
Chemistry (to represent genomes/molecules)
And others
10. Some of the Graph Algorithms
Shortest path (finding the shortest path from A to B)
Minimum spanning tree (the cheapest way to connect objects so that every object is reachable from every other; can be used for internet, cable wiring etc)
Graph center (placing a warehouse or hospital in a city so that all the locations can be reached easily)
Bipartite matching (matching in a dating site, job to employee and others)
Finding a planar graph (as in the case of circuit designs)
http://www.graph-magics.com/practic_use.php
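As an illustration of the first algorithm in the list, here is a minimal shortest-path sketch using breadth-first search. It assumes every edge has the same cost; a weighted graph such as a road map would need something like Dijkstra's algorithm instead. The class name and the sample graph (six vertices, written as an adjacency list) are chosen for this example only.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical example class: unweighted shortest path via breadth-first search.
class ShortestPathSketch {
    static List<Integer> shortestPath(Map<Integer, List<Integer>> graph, int source, int target) {
        Map<Integer, Integer> parent = new HashMap<>(); // vertex -> vertex it was reached from
        Deque<Integer> queue = new ArrayDeque<>();
        parent.put(source, source);
        queue.add(source);
        while (!queue.isEmpty()) {
            int current = queue.remove();
            if (current == target) {
                break;
            }
            for (int next : graph.getOrDefault(current, Collections.emptyList())) {
                if (!parent.containsKey(next)) { // first visit is along a shortest route
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        if (!parent.containsKey(target)) {
            return Collections.emptyList(); // target is unreachable
        }
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = target; v != source; v = parent.get(v)) {
            path.addFirst(v); // walk back from target to source
        }
        path.addFirst(source);
        return path;
    }

    static Map<Integer, List<Integer>> exampleGraph() {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(1, Arrays.asList(2, 4, 5));
        g.put(2, Arrays.asList(3));
        g.put(3, Arrays.asList(5));
        g.put(4, Arrays.asList(3, 6));
        g.put(5, Collections.<Integer>emptyList());
        g.put(6, Arrays.asList(5));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(exampleGraph(), 1, 6)); // prints [1, 4, 6]
    }
}
```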
12. How to store a Graph?
Option 1 : In a flat file, as
1- 4,5,6
4- 2,5,6
where vertex 1 is connected to vertices 4, 5 and 6, and so on. (Simple, but not efficient and easy to maintain.)
Option 2 : In a relational database using referencing tables or join tables.
Option 3 : Using a specialized database designed only and only for graphs.
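To make Option 1 concrete, the following sketch parses lines in the `vertex- neighbour,neighbour` form into an in-memory adjacency list. The class name is hypothetical and the lines are passed in as strings; a real program would read them from the file first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for the flat-file format "vertex- n1,n2,n3".
class FlatFileGraph {
    static Map<Integer, List<Integer>> parse(List<String> lines) {
        Map<Integer, List<Integer>> graph = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split("-", 2); // left: vertex, right: neighbours
            int vertex = Integer.parseInt(parts[0].trim());
            List<Integer> neighbours = new ArrayList<>();
            if (parts.length == 2) {
                for (String n : parts[1].split(",")) {
                    if (!n.trim().isEmpty()) {
                        neighbours.add(Integer.parseInt(n.trim()));
                    }
                }
            }
            graph.put(vertex, neighbours);
        }
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("1- 4,5,6", "4- 2,5,6")));
    }
}
```

Every query has to load and parse the file like this before it can follow a single edge, which hints at why a flat file does not scale well.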
13. Comparing Graph with Relational DB
Which one would you prefer for storing graph data?
In a DB of 1,000,000 users, finding friends-of-friends for 1,000 users at various depths:
Depth   Execution time, MySQL (s)   Execution time, Neo4j (s)
2       0.016                       0.010
3       30.267                      0.168
4       1,543.505                   1.359
5       Not finished in 1 hour      2.132
http://www.neotechnology.com/2012/06/how-much-faster-is-a-graph-database-really/
14. So, what is a Graph DB?
A graph database is any storage system that provides `index free adjacency`.
[Figure: the six-vertex example graph, with each vertex annotated with the vertices it points to.]
Every element (node or edge) has a direct pointer to its adjacent element.
No index lookup: we can determine which vertex is adjacent to which other vertex without looking up an index tree.
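The difference can be illustrated in memory with plain Java objects. This is only a sketch of the idea, not how any particular graph database lays out its data on disk; the class, the `Node` type and the `TreeMap` standing in for an index tree are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of index-free adjacency versus index lookups.
class IndexFreeAdjacency {
    static final int[][] EDGES = {
        {1, 2}, {1, 4}, {1, 5}, {2, 3}, {3, 5}, {4, 3}, {4, 6}, {6, 5}
    };

    // Each node holds direct references to its neighbours: a hop is a pointer dereference.
    static final class Node {
        final int id;
        final List<Node> adjacent = new ArrayList<>();
        Node(int id) { this.id = id; }
    }

    static Node[] linkedNodes(int vertexCount, int[][] edges) {
        Node[] nodes = new Node[vertexCount + 1]; // slot 0 is unused
        for (int id = 1; id <= vertexCount; id++) {
            nodes[id] = new Node(id);
        }
        for (int[] e : edges) {
            nodes[e[0]].adjacent.add(nodes[e[1]]);
        }
        return nodes;
    }

    // Two hops by following references only; no index is consulted.
    static Set<Integer> twoHopsByReference(Node start) {
        Set<Integer> result = new TreeSet<>();
        for (Node first : start.adjacent) {
            for (Node second : first.adjacent) {
                result.add(second.id);
            }
        }
        return result;
    }

    // Index-based alternative: neighbours are ids, so every hop is a tree lookup.
    static TreeMap<Integer, List<Integer>> indexedStore(int[][] edges) {
        TreeMap<Integer, List<Integer>> index = new TreeMap<>();
        for (int[] e : edges) {
            index.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        return index;
    }

    static Set<Integer> twoHopsByIndex(TreeMap<Integer, List<Integer>> index, int startId) {
        Set<Integer> result = new TreeSet<>();
        List<Integer> none = Collections.emptyList();
        for (int first : index.getOrDefault(startId, none)) {
            for (int second : index.getOrDefault(first, none)) { // one lookup per hop
                result.add(second);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(twoHopsByReference(linkedNodes(6, EDGES)[1])); // prints [3, 6]
        System.out.println(twoHopsByIndex(indexedStore(EDGES), 1));       // prints [3, 6]
    }
}
```

Both traversals return the same vertices; the point is that the cost of each index lookup grows with the size of the whole data set, whereas following a reference does not.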
15. So, what is a Graph DB? (.....)
Graph DB is the option when persisting graphs.
16. So, what is a Graph DB? (.....)
[Figure: the NoSQL families plotted by data size (vertical axis) against data complexity (horizontal axis).]
Key-value stores like Amazon Dynamo.
Columnar databases like Cassandra, HBase.
Document databases like MongoDB, CouchDB.
Graph databases like Neo4j.
Graph databases are part of the NoSQL family.
17. Graph DB Bindings (~JDBC API)
// connect to the database (graphDb is the resulting GraphDatabaseService)
// begin transaction
Node firstNode;
Node secondNode;
Relationship relationship;
firstNode = graphDb.createNode();
firstNode.setProperty( "message", "Hello, " );
secondNode = graphDb.createNode();
secondNode.setProperty( "message", "World!" );
// RelTypes is an application-defined enum that implements RelationshipType
relationship = firstNode.createRelationshipTo( secondNode, RelTypes.KNOWS );
relationship.setProperty( "message", "brave Neo4j " );
// end the transaction
// close the connection to the database
http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded-hello-world.html
19. Different Graph Databases
FlockDB from Twitter
AllegroGraph
GraphBase
InfiniteGraph from Objectivity
http://en.wikipedia.org/wiki/Graph_database
20. What is a Graph Computing Engine?
[Figure: an Input Location and InputFormat feed the Graph Computing Engine, which runs Algorithms and writes to an Output Location through an OutputFormat.]
Graph engines come with some built-in graph processing algorithms, but also provide an easy-to-use API to build new algorithms and extend the framework.
http://incubator.apache.org/giraph/apidocs/index.html
http://incubator.apache.org/hama/docs/r0.3.0/api/index.html
21. Different Graph Computing Engines
Memory-based graphs (graph size < local machine RAM), like
- jung.sourceforge.net
- igraph.sourceforge.net
- networkx.lanl.gov
Disk-based graphs (graph size < local hard disk size), like
- Neo4j
- InfiniteGraph (objectivity.com)
- sparsity-technologies.com/dex
Cluster-based graphs (depends on the cluster specs), like
- Apache Hama
- Apache Giraph
- GoldenORB
The cluster-based engines are based on the Bulk Synchronous Parallel (BSP) model, in the spirit of Google Pregel.
22. Bulk Synchronous Parallel
Some quick facts
• An alternative computing model to MapReduce (not all problems can be solved efficiently with MapReduce). Also, any MR algorithm can be simulated on BSP and vice versa.
• Developed by Leslie Valiant during the 1980s. It was resurrected by Google in the Pregel paper (extensively used for PageRank).
• Good for
- Processing big data with complicated relationships, e.g., graphs and networks.
- Iterative and recursive scientific computations.
- Continuous Event Processing (CEP).
http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
http://arxiv.org/abs/1203.2081 – Comparing MR vs BSP
23. What is Bulk Synchronous Parallel?
[Figure: a BSP computation drawn as a sequence of supersteps: Super Step 1, Super Step 2, Super Step 3.]
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel/
http://blog.octo.com/en/introduction-to-large-scale-graph-processing/
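In the BSP model each superstep has three phases: local computation on every worker, communication (messages are sent but not yet delivered), and a barrier at which everyone waits before the next superstep starts. The sketch below simulates this on a single thread for the classic "propagate the maximum value" example, treating each vertex as its own worker. It is a toy illustration with made-up names, not the Hama or Giraph API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-threaded simulation of BSP supersteps.
class BspMaxValue {
    static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    // neighbours[v] lists the vertices that v sends its messages to.
    static int[] run(int[] initial, int[][] neighbours) {
        int n = initial.length;
        int[] value = initial.clone();
        List<List<Integer>> inbox = emptyBoxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean anyMessageSent = false;
            for (int v = 0; v < n; v++) {
                // Phase 1, local computation: a vertex sees only its own value and inbox.
                int best = value[v];
                for (int message : inbox.get(v)) {
                    best = Math.max(best, message);
                }
                boolean changed = best != value[v];
                value[v] = best;
                // Phase 2, communication: messages become visible in the next superstep.
                if (superstep == 0 || changed) {
                    for (int u : neighbours[v]) {
                        outbox.get(u).add(value[v]);
                        anyMessageSent = true;
                    }
                }
            }
            // Phase 3, barrier: all vertices are done, so messages can be delivered.
            if (!anyMessageSent) {
                return value; // every vertex is idle, the computation halts
            }
            inbox = outbox;
            superstep++;
        }
    }

    public static void main(String[] args) {
        int[] values = {3, 6, 2, 1};
        int[][] chain = {{1}, {0, 2}, {1, 3}, {2}}; // 0 - 1 - 2 - 3
        System.out.println(Arrays.toString(run(values, chain))); // prints [6, 6, 6, 6]
    }
}
```

Because messages are only delivered at the barrier, no vertex ever reads a value that is being written in the same superstep, which is what makes the model easy to distribute.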
24. Hama vs Giraph
[Figure: Hama and Giraph are both derived from Google Pregel **. Hama stack: Hama on BSP on HDFS. Giraph stack: Giraph on BSP on MapReduce on HDFS.]
** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
25. Hama vs Giraph (.....)
Hama                                            Giraph
Pure BSP engine.                                Uses BSP, but the BSP API is not exposed.
Matrix, graph, network and other processing.    Just for graph processing.
Jobs are run as a BSP job on HDFS.              Jobs are run as MapReduce on Hadoop.
Both of them are derived from the `Pregel : A System for Large-Scale Graph Processing` paper published by Google. Both have recently been promoted from Incubator to Apache Top Level Project.
Both of them have a few graph algorithms implemented and also provide a very easy API to implement new graph algorithms.
** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
26. Page Rank in Hama
The PageRank algorithm assigns a numerical weight to each element of a hyperlinked set of documents.
bin/hama jar ../hama-0.4.0-examples.jar pagerank <input path> <output path> [damping factor] [epsilon error] [tasks]
Input (tab-separated: a site followed by the sites it links to)
Site1\tSite2\tSite3
Site2\tSite3
Site3
Output (site and its rank)
Site1 0.5
Site2 1.3
Site3 1.2
http://wiki.apache.org/hama/PageRank
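To show what the damping factor and epsilon error parameters control, here is a minimal single-machine PageRank sketch. It is not the Hama implementation: the class name, the normalisation (ranks sum to 1) and the handling of pages without out-links (their rank is spread evenly over all pages) are assumptions of this example, so it does not reproduce the output values shown in the slide.

```java
import java.util.Arrays;

// Hypothetical power-iteration PageRank, for illustration only.
class PageRankSketch {
    // outLinks[p] lists the pages that page p links to.
    static double[] pageRank(int[][] outLinks, double damping, double epsilon, int maxIterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages without out-links
            for (int page = 0; page < n; page++) {
                if (outLinks[page].length == 0) {
                    dangling += rank[page];
                    continue;
                }
                double share = rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share; // each link passes on an equal share
                }
            }
            double delta = 0.0;
            for (int page = 0; page < n; page++) {
                next[page] = (1.0 - damping) / n + damping * (next[page] + dangling / n);
                delta += Math.abs(next[page] - rank[page]);
            }
            rank = next;
            if (delta < epsilon) {
                break; // ranks changed by less than the allowed error
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        // Site1 -> Site2, Site3; Site2 -> Site3; Site3 has no out-links.
        int[][] links = {{1, 2}, {2}, {}};
        System.out.println(Arrays.toString(pageRank(links, 0.85, 1e-9, 100)));
    }
}
```

With these links Site3 ends up with the highest rank and Site1 with the lowest, since Site3 is linked to by both other sites and Site1 by none.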
27. What's next?
Deep dive into
- Both graph databases and frameworks, with a demo.
- The Bulk Synchronous Parallel processing model.
Hadoop, Hive, Pig and others are too crowded. Graph frameworks and databases are emerging and are an easy entry point for contributing to Apache projects. I would suggest subscribing to the Apache mailing lists, getting familiar with the projects and contributing to them.