Graph Processing Applications @ HUG

Graph Processing
Applications
praveensripati@gmail.com

www.thecloudavenue.com

@praveensripati

Agenda

Introduction to Graphs

Representing graphs

Different types of graphs

Algorithms in graphs

What constitutes a graph application

Graph databases (examples and how they work)

Graph computing engines (examples and how they work)

Questions & Answers

What are/aren't Graphs in this context?

[Figure: two example images, one labeled YES and one labeled NO]

How is a graph represented?
[Figure: a sample graph with six vertices, numbered 1 to 6, connected by edges; one vertex and one edge are labeled.]

A collection of vertices connected to each other using edges, with both vertices and edges
having properties. A vertex can be a person, place, account or any item which needs to be
tracked.
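The definition above can be made concrete with a minimal sketch of a property graph held in plain Python dictionaries. The layout, the `neighbors` helper and the Friend edge between the two people are assumptions made for illustration, not a format used by any particular product.

```python
# Hypothetical in-memory property graph: structure and field names are illustrative.
vertices = {
    1: {"name": "Arun", "age": 25},
    2: {"name": "Bob"},
}

# Each edge is (source vertex id, target vertex id, edge properties).
edges = [
    (1, 2, {"type": "Friend"}),
]


def neighbors(vertex_id):
    """Return the ids of vertices reachable over one outgoing edge."""
    return [dst for src, dst, _props in edges if src == vertex_id]


for vertex_id in vertices:
    print(vertices[vertex_id]["name"], "->", neighbors(vertex_id))
```

Both vertices and edges carry their own property dictionaries, which is the point of the definition: the relationship itself is data, not just a link.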

A social graph

Who are Arun's friends?

Whom should I recommend Sheetal to be friends with?

[Figure: a social graph of six people (Arun, Bob, Tom, Deepak, Prajval, Sheetal). Vertices carry properties (Name: Arun, Age: 25, Sex: M). Edges are labeled Friend, Relative or Colleague and carry properties (Relation: Colleague).]

Facebook Recruiting Competition

Want an interview @ Facebook?

The challenge is to recommend missing links in a social network. Participants will be presented with an external anonymized, directed social graph (no, not Facebook, keep guessing) from which some edges have been deleted, and asked to make ranked predictions for each user in the test set of which other users they would want to follow.

What is Kaggle?

Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems - the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

http://www.kaggle.com/c/FacebookRecruiting

A spatial graph

What is the shortest distance between Bangalore and Calcutta?

I would like to cover all the places, which is the shortest path?

[Figure: a spatial graph of six cities (Bangalore, Mumbai, New Delhi, Lucknow, Kolkata, Chennai). Vertices carry properties (Name: Bangalore, Population: 25,00,000, Area: 35,000 SqKm). Edges carry a Distance property (for example Distance: 700 km); the edge labels shown are 250 km, 350 km, 450 km, 450 km, 600 km, 700 km, 800 km and 850 km.]

How to represent a Graph for computing?

.... as an adjacency list for a sparse graph

1 -> 2,4,5
2 -> 3
3 -> 5
4 -> 3,6
5 -> (no outgoing edges)
6 -> 5

.... as an adjacency matrix for a dense graph

    1 2 3 4 5 6
1   0 1 0 1 1 0
2   0 0 1 0 0 0
3   0 0 0 0 1 0
4   0 0 1 0 0 1
5   0 0 0 0 0 0
6   0 0 0 0 1 0

A graph with few edges is sparse; one with many edges is dense.

Obviously, the web with billions of pages cannot be represented as an adjacency matrix.

[Figure: the six-vertex example graph, each vertex annotated with its adjacency list.]
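As a rough sketch of the two representations, the snippet below stores the six-vertex example as an adjacency list and derives a matrix from it. The function name `to_matrix` is an assumption for illustration. Comparing the number of stored edges with the number of matrix cells shows why a matrix is wasteful for sparse graphs.

```python
# Adjacency list for the six-vertex example: vertex -> list of out-neighbors.
adjacency_list = {1: [2, 4, 5], 2: [3], 3: [5], 4: [3, 6], 5: [], 6: [5]}


def to_matrix(adj):
    """Build a 0/1 adjacency matrix (list of rows) from an adjacency list."""
    ids = sorted(adj)
    index = {vertex: position for position, vertex in enumerate(ids)}
    matrix = [[0] * len(ids) for _ in ids]
    for src, targets in adj.items():
        for dst in targets:
            matrix[index[src]][index[dst]] = 1
    return matrix


matrix = to_matrix(adjacency_list)
edge_count = sum(len(targets) for targets in adjacency_list.values())
cell_count = len(matrix) * len(matrix)

for row in matrix:
    print(" ".join(str(cell) for cell in row))
print(edge_count, "edges stored in the list;", cell_count, "cells in the matrix")
```

The list grows with the number of edges, while the matrix grows with the square of the number of vertices regardless of how many edges exist.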

Different Graphs

Social graph (Facebook, LinkedIn etc)

Spatial graph (Google Maps, MapQuest, FedEx etc)

Web graph (PageRank, Recommendations etc)

Computer network graph (Optimal network layout etc)

Financial graph (Fraud detection, Currency Flow etc)

Data representations (Lists etc)

Chemistry (to represent genomes/molecules)

And others

Some of the Graph Algorithms

Shortest path (Finding the shortest path from A to B)

Minimum Spanning Tree (Cheapest way to connect objects so that every object can be reached from every other; can be used in internet, cable wiring etc)

Graph center (placing a warehouse, hospital in a city, so that all the
locations can be reached easily)

Bipartite Matching (Matching in a dating site, job to employee and others)

Finding Planar Graph (as in the case of circuit designs).

http://www.graph-magics.com/practic_use.php
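Shortest path, the first algorithm in the list above, can be sketched with Dijkstra's algorithm on a weighted graph. This is one standard technique, chosen here for illustration; the `roads` graph and its distances are made up and do not come from the slides.

```python
import heapq


def shortest_distance(graph, source, target):
    """Dijkstra's algorithm; graph maps vertex -> {neighbor: non-negative weight}."""
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, vertex = heapq.heappop(queue)
        if vertex == target:
            return dist
        if dist > best.get(vertex, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in graph.get(vertex, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None  # target is unreachable from source


# Hypothetical undirected road graph; the distances are illustrative only.
roads = {
    "A": {"B": 4, "C": 1},
    "B": {"A": 4, "C": 2, "D": 5},
    "C": {"A": 1, "B": 2, "D": 8},
    "D": {"B": 5, "C": 8},
}

print(shortest_distance(roads, "A", "D"))
```

The same function answers questions such as the distance between two cities once the real edge weights are loaded.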

Graph Applications

[Figure: applications are built on graph databases and on graph processing frameworks such as Hama and Giraph.]

How to store a Graph?
Option 1 : In a flat file as

1- 4,5,6
4- 2,5,6

Where vertex 1 is connected to vertex 4,5,6 and so on

Simple, but not efficient and easy to maintain.
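A minimal sketch of reading the flat-file format of Option 1 into memory, assuming one vertex per line with the exact `vertex- neighbor,neighbor` layout shown. The function name `parse_adjacency_lines` is hypothetical.

```python
def parse_adjacency_lines(lines):
    """Parse lines such as '1- 4,5,6' into {vertex: [neighbors]}."""
    graph = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vertex, _separator, rest = line.partition("-")
        graph[int(vertex)] = [int(item) for item in rest.split(",") if item.strip()]
    return graph


graph = parse_adjacency_lines(["1- 4,5,6", "4- 2,5,6"])
print(graph)
```

Every query has to re-read and re-parse the file, which is one reason this option does not scale.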

Option 2 : In a relational database using referencing
tables or join tables.

Option 3 : Using a specialized database designed exclusively for graphs.

Comparing Graph with Relational DB
Which one would you prefer for storing Graph data?

In a DB of 1,000,000 users, finding friends-of-friends for 1,000 users at various depths:

Depth   Execution Time, MySQL (s)   Execution Time, Neo4j (s)
2       0.016                       0.010
3       30.267                      0.168
4       1,543.505                   1.359
5       Not finished in 1 hour      2.132

http://www.neotechnology.com/2012/06/how-much-faster-is-a-graph-database-really/
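The friends-of-friends query at a given depth is a bounded breadth-first traversal. The sketch below shows the idea on an in-memory graph; it is not how MySQL or Neo4j execute the query, and the names in `friends` are hypothetical.

```python
from collections import deque


def friends_within(graph, start, max_depth):
    """Return {person: depth} for everyone reachable in at most max_depth hops."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth_of[person] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, ()):
            if friend not in depth_of:
                depth_of[friend] = depth_of[person] + 1
                queue.append(friend)
    del depth_of[start]  # the starting person is not their own friend
    return depth_of


# Hypothetical friendship graph: person -> list of friends.
friends = {
    "John": ["Sara"],
    "Sara": ["Maria", "Steve"],
    "Maria": [],
    "Steve": ["Joe"],
    "Joe": [],
}

print(friends_within(friends, "John", 2))
```

Each extra level of depth multiplies the number of people to visit, which in a relational schema typically means one more self-join per level.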

So, what is a Graph DB?
A graph database is any storage system that provides `index free adjacency`.

[Figure: the six-vertex example graph, each vertex annotated with its adjacent vertices.]

Every element (node or edge) has a direct pointer to its adjacent element.

No index lookup: we can determine which vertex is adjacent to which other vertex without looking up an index tree.
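The idea of index-free adjacency can be illustrated with objects that hold direct references to their neighbors. This is only an in-memory analogy under that assumption; the `Node` class and `walk` function are hypothetical and say nothing about how any real graph database lays out its records.

```python
class Node:
    """A vertex that holds direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.adjacent = []  # direct object references, no global index involved

    def connect(self, other):
        self.adjacent.append(other)


def walk(start, hops):
    """Follow the first adjacent reference up to `hops` times, using only local pointers."""
    current = start
    for _ in range(hops):
        if not current.adjacent:
            break  # dead end, stop early
        current = current.adjacent[0]
    return current


a, b, c = Node("a"), Node("b"), Node("c")
a.connect(b)
b.connect(c)

print(walk(a, 2).name)
```

Each hop costs the same no matter how many nodes exist in total, whereas a lookup through a shared index grows more expensive as the data set grows.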

So, what is a Graph DB? (.....)

Graph DB is the option when persisting graphs.

So, what is a Graph DB? (.....)

Part of the NoSQL family

Key Value Store like Amazon Dynamo.

Columnar Databases like Cassandra, HBase.

Document Databases like MongoDB, CouchDB.

Graph Databases like Neo4J

[Figure: the four NoSQL families plotted with Data Size on the vertical axis and Data Complexity on the horizontal axis.]

Graph DB Bindings (~JDBC API)
//connect to the database
//begin transaction

Node firstNode;
Node secondNode;
Relationship relationship;

firstNode = graphDb.createNode();
firstNode.setProperty( "message", "Hello, " );
secondNode = graphDb.createNode();
secondNode.setProperty( "message", "World!" );

relationship = firstNode.createRelationshipTo( secondNode, RelTypes.KNOWS );
relationship.setProperty( "message", "brave Neo4j " );

//end the transaction
//close the connection to the database

http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded-hello-world.html

Graph Adhoc Query (~SQL)

START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof

john fof
Node[4]{name:"John"} Node[2]{name:"Maria"}
Node[4]{name:"John"} Node[3]{name:"Steve"}

http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html

Different Graph Databases
FlockDB (from Twitter)

AllegroGraph

GraphBase

InfiniteGraph (from Objectivity)

http://en.wikipedia.org/wiki/Graph_database

What is a Graph Computing Engine?

[Figure: a graph computing engine reads from an Input Location through an InputFormat, runs Algorithms, and writes to an Output Location through an OutputFormat.]

Graph engines come with some built-in graph processing algorithms, but also provide an easy-to-use API to build new algorithms and extend the framework.

http://incubator.apache.org/giraph/apidocs/index.html
http://incubator.apache.org/hama/docs/r0.3.0/api/index.html

Different Graph Computing Engines

Memory-based graphs (graph size < local machine RAM)
- jung.sourceforge.net
- igraph.sourceforge.net
- networkx.lanl.gov

Disk-based graphs (graph size < local hard disk size)
- Neo4j
- Infinite Graph (objectivity.com)
- sparsity-technologies.com/dex

Cluster-based graphs (depends on the cluster specs)
- Apache Hama
- Apache Giraph
- GoldenORB

Based on the BSP model (Bulk Synchronous Parallel), in the spirit of Google Pregel.

Bulk Synchronous Parallel

Some quick facts

• An alternate computing model to MapReduce (not all problems can be solved efficiently with MapReduce). Also, any MR algorithm can be simulated on BSP and vice versa.

• Developed by Leslie Valiant during the 1980s and revived by Google in the Pregel paper (extensively used for PageRank).

• Good for

- Processing big data with complicated relationships, e.g. graphs and networks.
- Iterative and recursive scientific computations
- Continuous Event Processing (CEP)

http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
http://arxiv.org/abs/1203.2081 – Comparing MR vs BSP

What is Bulk Synchronous Parallel?

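[Figure: a BSP computation drawn as a sequence of supersteps (Super Step 1, Super Step 2, Super Step 3). In each superstep the processors compute locally, exchange messages, and then wait at a barrier before the next superstep begins.]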

http://en.wikipedia.org/wiki/Bulk_synchronous_parallel/
http://blog.octo.com/en/introduction-to-large-scale-graph-processing/
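The superstep model can be sketched as a single-process simulation in the vertex-centric style popularized by Pregel. The example propagates the maximum value through a graph: in each superstep every vertex reads its incoming messages, updates its value, and sends messages that only become visible in the next superstep. The function `run_supersteps` and the `ring` graph are assumptions for illustration, not the Hama or Giraph API.

```python
def run_supersteps(graph, values):
    """Propagate the maximum value through the graph in BSP-style supersteps.

    graph maps vertex -> list of out-neighbors; values maps vertex -> number.
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # First superstep: every vertex sends its own value to its neighbors.
    inbox = {vertex: [] for vertex in graph}
    for vertex, targets in graph.items():
        for target in targets:
            inbox[target].append(values[vertex])
    supersteps = 1
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():  # local computation per vertex
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for target in graph[vertex]:  # communication
                    outbox[target].append(values[vertex])
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1
    return values, supersteps


# Hypothetical three-vertex ring.
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
final, steps = run_supersteps(ring, {"a": 3, "b": 6, "c": 1})
print(final, steps)
```

The computation halts when a superstep produces no messages, which mirrors the "vote to halt" idea: vertices whose value did not change stay silent.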

Hama vs Giraph
[Figure: Hama and Giraph are both derived from Google Pregel **. Hama stack: Hama on BSP on HDFS. Giraph stack: Giraph on BSP on MapReduce on HDFS.]

** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html

Hama vs Giraph (.....)

Hama                                           Giraph
Pure BSP engine.                               Uses BSP, but the BSP API is not exposed.
Matrix, graph, network and other processing.   Just for graph processing.
Jobs are run as a BSP job on HDFS.             Jobs are run as MapReduce on Hadoop.

Both of them are derived from the `Pregel: A System for Large-Scale Graph Processing` paper published by Google. Both have recently been promoted from Incubator to Apache Top Level Project.
Both of them have a few graph algorithms implemented and also provide a very easy API to implement new graph algorithms.

** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html

Page Rank in Hama

The PageRank algorithm assigns a numerical weight to each element of a hyperlinked set of documents.

bin/hama jar ../hama-0.4.0-examples.jar pagerank <input path> <output path> [damping factor] [epsilon error] [tasks]

Input (tab-separated: a site followed by the sites it links to)

Site1\tSite2\tSite3
Site2\tSite3
Site3

Output

Site1 0.5
Site2 1.3
Site3 1.2

http://wiki.apache.org/hama/PageRank
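To show what the damping factor and epsilon parameters control, here is a minimal single-machine sketch of PageRank by power iteration. It is not Hama's implementation: the handling of pages without outgoing links (spreading their rank evenly) and the normalization of ranks to sum to 1 are assumptions of this sketch, so its numbers will not match the sample output above.

```python
def pagerank(links, damping=0.85, epsilon=1e-9, max_iterations=100):
    """Power-iteration PageRank; links maps page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links[page])
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        for page in pages:
            for target in links[page]:
                new_rank[target] += damping * rank[page] / len(links[page])
        change = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if change < epsilon:
            break  # converged: total change fell below the epsilon error
    return rank


# The three-site link structure from the sample input.
links = {"Site1": ["Site2", "Site3"], "Site2": ["Site3"], "Site3": []}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

A higher damping factor gives more weight to link structure and less to the uniform baseline, while a smaller epsilon trades more iterations for a tighter result.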

What's next?
Deep dive into

- Both graph databases and frameworks, with a demo.
- The Bulk Synchronous Parallel processing model.

Hadoop, Hive, Pig and others are too crowded. Graph frameworks and databases are emerging and are an easy entry point for contributing to Apache.

I would suggest subscribing to and following the Apache mailing lists to get familiar with these projects and contribute to them.
