Machine Learning and GraphX

  1. Massive Graph Mining: Apache Spark’s GraphX and Data Mining
  2. Who we are. Andy: @Noootsab, @NextLab_be, @Wajug co-driver, @Devoxx4Kids organizer, Maths & CS, data lover (geo, open, massive), fool. Rand: @randhindi, @snips, entrepreneur, PhD in bioinformatics, etc., loves data & ML.
  3. Graph 101: A graph is a mathematical representation of linked data. It is defined in terms of its vertices and edges, G(V,E). A vertex is an entity that can carry a bag of data (generally small). An edge connects vertices and can also carry a bag of data.
  4. Graph 101: A graph represents data in a way that is less convenient for classical processing frameworks, because the burden is not on the observations themselves (rows) but on their linkage, and specifically on its density. Thus, the problem is often translated into a self-join.
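To make the self-join remark concrete, here is a minimal sketch in plain Python (not GraphX, and not code from the deck). The edge table and the `two_hop_pairs` helper are hypothetical; the point is only that a question about linkage ("who is two hops away?") becomes a join of the edge table with itself.

```python
from collections import defaultdict

# Hypothetical edge table: one row per directed link (src, dst).
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]

def two_hop_pairs(edge_rows):
    """Self-join the edge table on left.dst == right.src."""
    by_src = defaultdict(list)
    for src, dst in edge_rows:
        by_src[src].append(dst)
    # Every (x, mid) row is matched with every (mid, y) row.
    return sorted({(x, y) for x, mid in edge_rows for y in by_src[mid]})

pairs = two_hop_pairs(edges)
print(pairs)
```

The denser the linkage, the more rows each join step produces, which is why density rather than row count drives the cost.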
  5. Graph 101: A graph G(V,E) has a reverse representation, its dual. The dual is the graph G’(V’,E’), where ● a vertex is an edge in G, and ● an edge is a vertex in G that has at least one edge.
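A small sketch of this construction in plain Python, assuming an undirected graph given as a list of endpoint pairs. The construction described here is what is usually called the line graph; the `dual_graph` name and the sample street network are hypothetical.

```python
from itertools import combinations

def dual_graph(edges):
    """Return (vertices, edges) of the dual: one vertex per original edge,
    linked whenever the two original edges share an endpoint."""
    dual_vertices = list(edges)
    dual_edges = [
        (e1, e2)
        for e1, e2 in combinations(dual_vertices, 2)
        if set(e1) & set(e2)  # the shared endpoint becomes the dual edge
    ]
    return dual_vertices, dual_edges

# Hypothetical network: intersection 2 is connected to 1, 3 and 4.
streets = [(1, 2), (2, 3), (2, 4)]
sections, links = dual_graph(streets)
print(sections)
print(links)
```

Comparing every pair of edges is quadratic, so this is only meant to show the definition, not how one would build the dual of a large graph.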
  6. Graph 101: The classical way to store or share the connectivity of a graph is its tabular form, that is, its adjacency matrix. Ref: http://en.wikipedia.org/wiki/Adjacency_matrix
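As an illustration, the following plain-Python sketch builds an adjacency matrix from an edge list. The function name, the `directed` flag and the sample graph are assumptions made for the example.

```python
def adjacency_matrix(vertices, edges, directed=True):
    """Build a 0/1 matrix where entry [i][j] is 1 if there is an edge i -> j."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
        if not directed:
            # An undirected edge is stored in both directions.
            matrix[index[dst]][index[src]] = 1
    return matrix

vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
A = adjacency_matrix(vertices, edges)
for row in A:
    print(row)
```

The matrix has one cell per pair of vertices, so for sparse graphs an edge list is far more compact.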
  7. GraphX (Apache Spark): Spark 101
  8. GraphX (Apache Spark): Offers a Graph API on top of Spark. Enabling cross-world manipulations
  9. GraphX (Apache Spark): How it differs from other classical systems...
  10. GraphX (Apache Spark)
  11. GraphX (Apache Spark)
  12. GraphX (Apache Spark): Plenty of operators on both RDDs, but
  13. GraphX (Apache Spark): Plenty of operators on both RDDs, but
  14. GraphX (Apache Spark): 1. Sends messages to neighbors 2. Returns an RDD of aggregated messages
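The slide's code is not part of this transcript. GraphX exposes this send-then-aggregate pattern as `aggregateMessages` (and, in older releases, `mapReduceTriplets`). The sketch below imitates the two steps in plain Python on in-memory data; the `aggregate_messages` function and its signature are hypothetical and are not the GraphX API.

```python
def aggregate_messages(edges, vertex_attr, send, merge):
    """Step 1: for each edge triplet, `send` returns (target, message) pairs.
    Step 2: messages addressed to the same vertex are combined with `merge`."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send(src, vertex_attr[src], dst, vertex_attr[dst]):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox

# Example: in-degree, where every edge sends the message 1 to its destination.
edges = [("a", "b"), ("c", "b"), ("b", "c")]
attrs = {"a": None, "b": None, "c": None}
in_degree = aggregate_messages(
    edges,
    attrs,
    send=lambda src, src_attr, dst, dst_attr: [(dst, 1)],
    merge=lambda x, y: x + y,
)
print(in_degree)
```

Vertex `a` receives no message and is therefore absent from the result, which mirrors how GraphX leaves out vertices that received nothing.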
  15. GraphX (Apache Spark): Offers higher-level operators and algorithms, like
  16. GraphX (Apache Spark): This one rules them all (and more). More later...
  17. PageRank and Pregel: Everybody knows PageRank, right? If not: it’s our oil, our friend, our preferred black box… It’s why Google Search works so well!
  18. PageRank and Pregel: Essentially, PageRank is all about the importance of a node in a graph → link analysis. The bottom line is: ● In-links are votes ● In-links from important nodes count more → recursion
  19. PageRank and Pregel: https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
  20. PageRank and Pregel: TL;DR The importance of a node is the probability that a random (drunk) walker lands on that node. So, it depends on: 1. the probability that he lands on one of its neighbors 2. the probability that he crosses a link from that neighbor to it 3. an arbitrary probability of teleportation
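The three ingredients above map onto the usual PageRank equation. The notation is an assumption, since the deck does not name its symbols: $r_j$ is the rank of node $j$, $d_i$ the out-degree of node $i$, $N$ the number of nodes, and $\beta$ the probability of following a link rather than teleporting.

```latex
r_j \;=\; \sum_{i \to j} \beta \, \frac{r_i}{d_i} \;+\; (1 - \beta)\,\frac{1}{N}
```

Here $r_i$ is the probability of being on neighbor $i$, $1/d_i$ the probability of picking the link from $i$ to $j$, and $(1-\beta)/N$ the teleportation term.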
  21. PageRank and Pregel: Solution: power method/iteration (recursive): r_new = A x r_old. Matrix algebra is a pain in a distributed environment… But wait, the process is rather graph-oriented!
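A minimal single-machine sketch of the power iteration in plain Python. It assumes that A is the column-stochastic link matrix with teleportation folded in, that every node has at least one out-link, and a damping value of 0.85; none of these choices are stated in the deck, and the function names are hypothetical.

```python
def google_matrix(out_links, nodes, beta=0.85):
    """Column j spreads node j's rank over its out-links, plus teleportation."""
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    A = [[(1.0 - beta) / n] * n for _ in range(n)]
    for src, targets in out_links.items():
        for dst in targets:
            A[index[dst]][index[src]] += beta / len(targets)
    return A

def power_iteration(A, iterations=100):
    """Repeat r_new = A x r_old, starting from the uniform vector."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r

nodes = ["a", "b", "c"]
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = power_iteration(google_matrix(out_links, nodes))
print(dict(zip(nodes, r)))
```

The dense n-by-n matrix is what makes this form awkward to distribute: each row of the product needs the whole rank vector.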
  22. PageRank and Pregel: Pregel (Google again). Based on BSP, Bulk Synchronous Parallel. BSP works in a message-passing style.
  23. PageRank and Pregel: During superstep i, a vertex can: ● use messages received from superstep i-1 ● execute a function ● send messages ● vote to halt
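The superstep loop can be sketched in plain Python with a classic toy problem, propagating the largest value through the graph. This is an in-memory imitation of the BSP model, not Pregel or GraphX code; the `pregel_max` function and the sample graph are hypothetical.

```python
def pregel_max(neighbors, values):
    """Spread the maximum value to every reachable vertex, superstep by superstep."""
    values = dict(values)
    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(values[v])
    while any(inbox.values()):  # stop once no message is in flight
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # halted vertex with no incoming message stays asleep
            best = max(messages)  # use messages received from the previous superstep
            if best > values[v]:  # execute the vertex function
                values[v] = best
                for n in neighbors[v]:  # send messages for the next superstep
                    outbox[n].append(best)
            # otherwise the vertex sends nothing, i.e. it votes to halt
        inbox = outbox  # synchronization barrier between supersteps
    return values

neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = pregel_max(neighbors, {"a": 3, "b": 6, "c": 2})
print(result)
```

The computation ends when every vertex has voted to halt and no messages remain, which is the termination rule of the BSP model.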
  24. PageRank and Pregel
  25. PageRank and Pregel: In GraphX, as usual with Spark, it’s simple: mapReduceTriplets
  26. PageRank and Pregel: PageRank with Pregel:
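The slide's code is not part of this transcript. GraphX's Pregel operator is driven by a vertex program, a send-message function and a merge-message function; the sketch below mirrors those three roles in plain Python. It is not the authors' GraphX code, and it assumes a fixed number of supersteps, a damping value of 0.85, and that every node has at least one out-link.

```python
def pagerank_vertex_centric(out_links, beta=0.85, supersteps=100):
    """PageRank expressed as message passing instead of matrix algebra."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(supersteps):
        # Send: each vertex mails an equal share of its rank along every out-link.
        # Merge: shares arriving at the same vertex are summed.
        inbox = {v: 0.0 for v in out_links}
        for v, targets in out_links.items():
            for t in targets:
                inbox[t] += rank[v] / len(targets)
        # Vertex program: combine the merged message with the teleportation term.
        rank = {v: (1.0 - beta) / n + beta * inbox[v] for v in out_links}
    return rank

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_vertex_centric(out_links)
print(ranks)
```

Each vertex only needs the messages addressed to it, so no node ever has to hold the full matrix.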
  27. PageRank and Pregel: Applying it to our USA.csv file:
  28. OpenStreetMap: Founded by Steve Coast (UK, 2004). Aims to take geodata out of governments’ hands and give it to the crowd. Actually, the crowd has to create it...
  29. OSM
  30. OSM
  31. OSM: So it’s a graph! Node = vertex: a single point in space defined by its latitude, longitude and node id. Way = edge: a way can have between 2 and 2,000 nodes.
  32. OSM: The network is over-complex for what we need, thus: ● reducing cyclic ways such as roundabouts to a single one ● transforming the nodes into sections, i.e. pieces of street between 2 intersections
  33. OSM: Hence, OSM ~ G(Node, Way). Even if the match is not exact, we can still manipulate it as a graph. In our case, we don’t need the connectivity of an intersection, but the connectivity of a section. This is given by G’ (the dual of G).
  34. Dataset ● 80 cities ● 3M edges in total ● smallest city: 200 edges (Tempe) ● largest city: 200,000 edges (Los Angeles)
  35. Comparing Cities ● Hypothesis: cities with similar connectivity have similar PageRank distributions (figure panels: NYC, Chicago)
  36. Fort Worth = Philadelphia? Looks the same!
  37. Smells like spurious correlation
  38. Normalizing PageRank distributions ● Problem: PageRank is correlated with the size of the city ● size of city = number of sections (edges) in the graph ● Normalized PageRank = PageRank / size_of_city ● Now we can compare cities of different sizes!
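The normalization is a single division per section. A tiny plain-Python sketch, with made-up PageRank values and a made-up city size purely for illustration:

```python
def normalize_pagerank(ranks, size_of_city):
    """Normalized PageRank = PageRank / size_of_city, applied to every section."""
    return [r / size_of_city for r in ranks]

# Hypothetical values: four sections of a city with four sections in total.
raw = [2.0, 1.0, 0.5, 0.5]
normalized = normalize_pagerank(raw, size_of_city=4)
print(normalized)
```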
  39. Fort Worth != Philadelphia! Totally different!
  40. Fort Worth before and after: Note that the range of PageRank is preserved
  41. Distance between PageRank Distributions ● How to compare PageRank distributions? ● It’s not always a normal distribution! ● Can use the Kullback–Leibler divergence from information theory ● the Kullback–Leibler divergence of Q from P, denoted D_KL(P||Q), is a measure of the information lost when Q is used to approximate P
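For two discrete distributions $P$ and $Q$ over the same bins, the standard definition reads as follows. The discrete form is an assumption here, since the deck does not say how the PageRank values were binned.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{i} P(i) \, \ln \frac{P(i)}{Q(i)}
```

The divergence is zero when $P = Q$ and is not symmetric: in general $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$.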
  42. KL Divergence ● Easy to compute ● Units are nats (or bits if using log2 instead of ln)
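A minimal plain-Python sketch of the computation, assuming both distributions are given as histograms over the same bins and that Q is non-zero wherever P is non-zero. The sample histograms are made up.

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists.
    base=math.e gives nats, base=2 gives bits."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical two-bin histograms.
p = [0.5, 0.5]
q = [0.25, 0.75]
nats = kl_divergence(p, q)
bits = kl_divergence(p, q, base=2)
print(nats, bits)
```

Because the divergence is not symmetric, the order of the two cities matters when comparing them.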
  43. Very different cities: Dallas & Seattle ● KL divergence = 18.407 ● Dallas is irregular, Seattle is a perfect grid
  44. Very similar cities: Atlanta & Boston ● KL divergence = 0.36 ● Both are very irregular
  45. Next steps ● Using multiple street-topology indicators to measure the risk of car accidents
  46. Q.E.D. Thanks for keeping up! Question => Future[(Option[Response], Future[Question])]
