# Machine Learning and GraphX


1. Massive Graph Mining: Apache Spark’s GraphX and Data Mining
2. Who we are: Andy (@Noootsab): @NextLab_be, @Wajug co-driver, @Devoxx4Kids organizer, Maths & CS, data lover (geo, open, massive), fool. Rand (@randhindi): @snips, entrepreneur, PhD in bioinformatics, etc., loves data & ML.
3. Graph 101: A graph is a mathematical representation of linked data. It is defined in terms of its vertices and edges, G(V,E). A vertex is an entity that can carry a bag of data (generally small). An edge connects vertices and can also carry a bag of data.
4. Graph 101: A graph represents data in a way that is less convenient for classical processing frameworks, because the burden lies not on the observations themselves (rows) but on their linkage, and specifically on its density. The problem is therefore often translated into a self-join.
5. Graph 101: A graph G(V,E) has a reverse representation, its dual. The dual is simply the graph G’(V’, E’) where ● a vertex is an edge in G, and ● an edge is a vertex in G that has at least one edge.
6. Graph 101: The classical way to store or share the connectivity of a graph is its tabular form, the adjacency matrix. Ref: http://en.wikipedia.org/wiki/Adjacency_matrix
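
As a rough illustration of the adjacency matrix mentioned above, here is a minimal Python sketch that builds one from an edge list. The function name `adjacency_matrix` and the four-vertex example graph are made up for this illustration and do not come from the slides.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n adjacency matrix from (source, target) index pairs."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
        if not directed:
            # An undirected edge is stored symmetrically.
            matrix[dst][src] = 1
    return matrix


# Hypothetical graph: the path 0-1-2-3 plus the chord 0-2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
matrix = adjacency_matrix(4, edges)
for row in matrix:
    print(row)
```

The matrix needs n × n cells regardless of how many edges exist, which is one reason sparse graphs are usually kept as edge lists instead.
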
7. GraphX (Apache Spark): Spark 101
8. GraphX (Apache Spark): Offers a graph API on top of Spark, enabling manipulations across the graph and collection (RDD) worlds.
9. GraphX (Apache Spark): How it differs from other classical systems...
10. GraphX (Apache Spark)
11. GraphX (Apache Spark)
12. GraphX (Apache Spark): Plenty of operators on both RDDs, but
13. GraphX (Apache Spark): Plenty of operators on both RDDs, but
14. GraphX (Apache Spark): 1. Sends messages to neighbors. 2. Returns an RDD of aggregated messages.
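
The two steps on this slide (send messages along edges, then aggregate them per vertex) can be mimicked in plain Python. This is only a conceptual sketch: `aggregate_messages`, `send` and `merge` are hypothetical names chosen here, not the GraphX API, and a dictionary stands in for the resulting RDD.

```python
def aggregate_messages(edges, send, merge):
    """Apply `send` to every edge and merge the messages per target vertex.

    `send(src, dst)` returns a list of (target_vertex, message) pairs;
    `merge(a, b)` combines two messages addressed to the same vertex.
    """
    inbox = {}
    for src, dst in edges:
        for target, message in send(src, dst):
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox


# Hypothetical directed graph; each edge sends the message 1 to its destination,
# so the merged result is the in-degree of every vertex that has in-links.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
in_degree = aggregate_messages(edges, lambda s, d: [(d, 1)], lambda a, b: a + b)
print(in_degree)
```
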
15. GraphX (Apache Spark): Offers higher-level operators and algorithms, like
16. GraphX (Apache Spark): This one rules them all (and more). More later...
17. PageRank and Pregel: Everybody knows PageRank, right? If not: it’s our oil, our friend, our preferred black box… It’s why Google Search works so well!
18. PageRank and Pregel: Essentially, PageRank is all about the importance of a node in a graph → link analysis. The bottom line is: ● in-links are votes ● in-links from important nodes count for more → recursion
19. PageRank and Pregel: https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
20. PageRank and Pregel: TL;DR The importance of a node is the probability that a random (drunk) walker lands on it. So it depends on: 1. the probability that the walker lands on one of its neighbors 2. the probability that the walker crosses a link from that neighbor to it 3. an arbitrary probability of teleportation
21. PageRank and Pregel: Solution: power method/iteration (recursive): r_new = A × r_old. Matrix algebra is a pain in a distributed environment… But wait, the process is rather graph-oriented!
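
To make the power iteration r_new = A × r_old concrete, here is a small single-machine sketch in Python. The damping factor 0.85, the tolerance, the even redistribution of rank from nodes without out-links, and the three-node example are all assumptions of this sketch; the slides do not specify them.

```python
def pagerank_power_iteration(out_links, damping=0.85, tol=1e-9, max_iter=200):
    """Iterate r_new = A x r_old (with teleportation) until the ranks settle."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleportation: every node receives the same baseline share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a node without out-links spreads its rank evenly.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(links)
print(ranks)
```

Note that the inner loop only ever touches a node and its out-links, which is the graph-oriented view the slide hints at.
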
22. PageRank and Pregel: Pregel (Google again) is based on BSP, Bulk Synchronous Parallel. BSP works in a message-passing style.
23. PageRank and Pregel: During superstep i, a vertex can: ● use messages received from superstep i-1 ● execute a function ● send messages ● vote to halt
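
The superstep loop described above can be sketched in plain Python. This is a toy, single-process imitation of the BSP model applied to PageRank, not the GraphX or Pregel API; the function name, the fixed count of 50 supersteps and the assumption that every vertex has at least one out-link are choices made for this sketch.

```python
def pregel_pagerank(out_links, damping=0.85, supersteps=50):
    """Toy BSP loop: messages sent in superstep i are consumed in superstep i+1."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    inbox = {v: [] for v in out_links}
    active = set(out_links)
    step = 0
    while active:
        outbox = {v: [] for v in out_links}
        for v in list(active):
            if step > 0:
                # Use the messages received from the previous superstep.
                rank[v] = (1.0 - damping) / n + damping * sum(inbox[v])
            if step < supersteps:
                # Send an equal share of the current rank along each out-link.
                for t in out_links[v]:
                    outbox[t].append(rank[v] / len(out_links[v]))
            else:
                # Vote to halt once the superstep budget is used up.
                active.discard(v)
        # Synchronization barrier: the outbox becomes the next inbox.
        inbox = outbox
        step += 1
    return rank


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pregel_ranks = pregel_pagerank(links)
print(pregel_ranks)
```

A real Pregel system would also reactivate a halted vertex when a new message arrives; this sketch skips that because all vertices halt in the same superstep.
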
24. PageRank and Pregel
25. PageRank and Pregel: In GraphX, as usual with Spark, it’s simple: mapReduceTriplets
26. PageRank and Pregel: PageRank with Pregel:
27. PageRank and Pregel: Applying it to our USA.csv file:
28. OpenStreetMap: Founded by Steve Coast (UK, 2004). Aims to take geodata out of the governments’ hands and give it to the crowd. Actually, the crowd has to create it...
29. OSM
30. OSM
31. OSM: So it’s a graph! Node = vertex: a single point in space defined by its latitude, longitude and node id. Way = edge: a way can have between 2 and 2,000 nodes.
32. OSM: The network is more complex than what we need, thus: ● reducing cyclic ways like roundabouts to a single one ● transforming the nodes into sections, i.e. pieces of street between two intersections
33. OSM: Hence, OSM ~ G(Node, Way). Even if the match is not exact, we can still manipulate them. In our case, we don’t need the connectivity of an intersection but the connectivity of a section. This is given by G’ (the dual of G).
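
One way to read the dual G’ used here is as a line graph: each street section (edge of G) becomes a vertex, and two sections are linked when they meet at an intersection. The Python sketch below follows that reading; the function name `line_graph` and the three-street example are assumptions made for illustration, not the authors' code.

```python
from itertools import combinations


def line_graph(edges):
    """Return the edges of G', whose vertices are the edges (sections) of G."""
    incident = {}
    for edge in edges:
        for endpoint in edge:
            # Group sections by the intersection they touch.
            incident.setdefault(endpoint, []).append(edge)
    dual_edges = set()
    for sections in incident.values():
        # Any two sections meeting at the same intersection are connected in G'.
        for first, second in combinations(sections, 2):
            dual_edges.add((first, second))
    return sorted(dual_edges)


# Hypothetical street network: three sections meeting at intersection B.
streets = [("A", "B"), ("B", "C"), ("B", "D")]
dual = line_graph(streets)
for pair in dual:
    print(pair)
```
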
34. Dataset: ● 80 cities ● 3M edges in total ● smallest city: 200 edges (Tempe) ● largest city: 200,000 edges (Los Angeles)
35. Comparing Cities: ● Hypothesis: cities with similar connectivity have similar PageRank distributions (figure panels: NYC, Chicago)
36. Fort Worth = Philadelphia? Looks the same!
37. Smells like Spurious Correlation
38. Normalizing PageRank distributions: ● Problem: PageRank is correlated with the size of the city ● size of city = number of sections (edges) in the graph ● normalized PageRank = PageRank / size_of_city ● Now we can compare cities of different sizes!
39. Fort Worth != Philadelphia! Totally different!
40. Fort Worth before and after: Note that the range of PageRank is preserved.
41. Distance between PageRank Distributions: ● How to compare PageRank distributions? ● They are not always normally distributed! ● We can use the Kullback-Leibler divergence from information theory ● The Kullback–Leibler divergence of Q from P, denoted D_KL(P||Q), is a measure of the information lost when Q is used to approximate P
42. KL Divergence: ● Easy to compute ● Units are nats (or bits if using log2 instead of ln)
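
For discrete distributions the divergence is D_KL(P||Q) = Σ P(i) · ln(P(i) / Q(i)), which a few lines of Python can compute. How the authors binned the PageRank values into distributions is not stated in the slides, so the sketch below simply takes two ready-made probability lists; the example values are invented.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            # By convention, bins where P is zero contribute nothing.
            continue
        if qi == 0.0:
            # Q assigns zero probability to something P considers possible.
            return math.inf
        total += pi * math.log(pi / qi, base)
    return total


# Hypothetical two-bin distributions.
p = [0.5, 0.5]
q = [0.25, 0.75]
nats = kl_divergence(p, q)          # natural logarithm: result in nats
bits = kl_divergence(p, q, base=2)  # base-2 logarithm: result in bits
print(nats, bits)
```

The divergence is not symmetric, so D_KL(P||Q) and D_KL(Q||P) generally differ; comparing two cities therefore requires choosing a direction or combining both.
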
43. Very different cities: Dallas & Seattle ● KL divergence = 18.407 ● Dallas is irregular, Seattle is a perfect grid
44. Very similar cities: Atlanta & Boston ● KL divergence = 0.36 ● Both are very irregular
45. Next steps: ● Using multiple street topology indicators to measure the risk of car accidents
46. Q.E.D. Thanks for keeping up! Question => Future[(Option[Response], Future[Question])]