This document summarizes a presentation about analyzing graphs using Apache Spark's GraphFrames and GraphX libraries. It begins with an introduction of the speaker and their interests. It then discusses what graphs are and provides examples of graph analytics like node scoring and community detection. It introduces GraphX and GraphFrames, how they allow working with property graphs and integrating graph operations with DataFrames. It also provides an example of how financial institutions can use graph analytics to detect synthetic identity fraud by analyzing relationships between customer addresses.
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
1. Traversing our way through
Apache Spark GraphFrames
and
GraphX
Mo Patel
Data Day Texas 2017
2. A bit about me
• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing them
mopatel
3. What is this talk about?
• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?
4. What is a Graph?
[Figures: a natural graph and an artificial graph. Graphic Source: Wikipedia]
5. Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
6. Power of Graphs
• Good: Facebook, Twitter, WhatsApp… most popular social networks
• Bad: MySpace, Friendster, Orkut… "Nobody goes there anymore. It's too crowded" – Yogi Berra
7. • Data Growth: Recall Metcalfe's Law (n^2) and Reed's Law (2^n)
• Memory Intensive
• Processing Intensive
Graph Databases cost money,
Graph Analytics make money!
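As a reminder of what the two growth laws on this slide count, here is a short worked comparison. It uses the standard statements of the laws: Metcalfe's Law counts possible pairwise connections among n members, and Reed's Law counts possible subgroups of two or more members. The choice of n = 10 is purely illustrative.

```latex
\begin{align*}
\text{Metcalfe (pairwise connections):}\quad & \binom{n}{2} = \frac{n(n-1)}{2} \sim \frac{n^2}{2} \\
\text{Reed (subgroups of size} \ge 2\text{):}\quad & 2^n - n - 1 \sim 2^n \\[4pt]
n = 10:\quad & \frac{10 \cdot 9}{2} = 45 \qquad\text{vs.}\qquad 2^{10} - 10 - 1 = 1013
\end{align*}
```

Even at ten nodes the subgroup count is more than twenty times the pair count, which is why graph workloads become memory and processing intensive so quickly.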
9. Node Score in a Graph
• Use case: Find out how important an entity is in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
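To make the node-scoring idea concrete, here is a minimal plain-Scala sketch of PageRank by fixed-point iteration. It is not the Spark implementation. It assumes the unnormalized update rank(v) = resetProb + (1 - resetProb) * sum(rank(u) / outDegree(u)) over inbound neighbors u, which is the form described in the GraphX PageRank source comments. The object name, the toy vertex ids, and the iteration count are all illustrative assumptions.

```scala
object PageRankSketch {
  // Assumption: every edge endpoint appears in `vertices`.
  // Unnormalized PageRank: each vertex keeps a base score of `resetProb`
  // and receives an equal share of the rank of every vertex pointing to it.
  def pageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)],
      resetProb: Double = 0.15,
      iterations: Int = 20): Map[String, Double] = {
    // Number of outgoing edges per source vertex.
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.size }

    // Start every vertex at rank 1.0.
    var ranks: Map[String, Double] = vertices.map(_ -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Rank mass each destination receives from its inbound neighbors.
      val contributions: Map[String, Double] =
        edges
          .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
          .groupBy(_._1)
          .map { case (dst, cs) => dst -> cs.map(_._2).sum }

      ranks = vertices.map { v =>
        v -> (resetProb + (1 - resetProb) * contributions.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Toy graph: a three-vertex cycle plus one vertex with no edges.
    val vertices = Seq("a", "b", "c", "d")
    val edges = Seq("a" -> "c", "c" -> "b", "b" -> "a")
    pageRank(vertices, edges).toSeq.sortBy(_._1).foreach {
      case (id, rank) => println(f"$id%s $rank%.4f")
    }
  }
}
```

In this toy graph the three cycle vertices stay at roughly 1.0, while the unconnected vertex d ends at the reset probability 0.15, because nothing ever passes rank to it.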
14. GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin, ...
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
20. What is the impact of Synthetic Identity Fraud?
Graphic Source: Verafin
21. How can Graph Analytics help solve the Synthetic Identity problem?
Customer Address DataFrame (vertices)
// Known addresses on file for one customer; "id" is the vertex id.
val customerAddresses =
  sqlContext.createDataFrame(
    List(
      ("a1", "123 Main Street", "123abc456efg"),
      ("b2", "345 High Street", "123abc456efg"),
      ("c3", "789 Park Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")
Add Fake Address
// A new address claimed for the same customer, to be checked.
val fakeAddress = sqlContext.createDataFrame(
  List(
    ("d4", "999 Ocean Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")
val tempCustomerAddresses =
  customerAddresses.union(fakeAddress)
DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx
22. How can Graph Analytics help solve the Synthetic Identity problem?
Master Address Connection Edges DataFrame
val masterAddressConnections = sqlContext.createDataFrame(
  List(
    ("b2", "a1"),
    ("e5", "c3"),
    ("c3", "b2"),
    ("a1", "c3"),
    ("e5", "d4")
    // …
  )).toDF("src", "dst")
// Keep edges whose destination is one of the customer's known address ids.
val toEdgeMatches = masterAddressConnections.join(customerAddresses,
  masterAddressConnections("dst") ===
    customerAddresses("id")).select("src", "dst")
// Keep edges whose source is one of the customer's known address ids.
val fromEdgeMatches =
  masterAddressConnections.join(customerAddresses,
    masterAddressConnections("src") ===
      customerAddresses("id")).select("src", "dst")
val checkEdges = fromEdgeMatches.union(toEdgeMatches)
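Detection GraphFrame
PageRank
import org.graphframes.GraphFrame

// Vertices include the address under test; edges come from the master connections.
val detectionGraphFrame =
  GraphFrame(tempCustomerAddresses, checkEdges)
// PageRank: iterate until ranks change by less than the tolerance
val resultRanks =
  detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
// Personalized PageRank from source vertex d4
val d4Ranks =
  detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
resultRanks.vertices.select("id", "pagerank").show()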
23. How do we decide if this address is
fraud or not?
PageRank
id pagerank
a1 0.9463535901944437
b2 0.9463535901944437
c3 0.9463535901944437
d4 0.15
Personalized PageRank
sourceId: a1
id pagerank
a1 0.33343371928623045
c3 0.28341866139329586
b2 0.21580437563085933
d4 0.0
sourceId: b2
id pagerank
b2 0.33343371928623045
a1 0.28341866139329586
c3 0.21580437563085933
d4 0.0
sourceId: c3
id pagerank
c3 0.33343371928623045
b2 0.28341866139329586
a1 0.21580437563085933
d4 0.0
sourceId: d4
id pagerank
d4 0.15
a1 0.0
b2 0.0
c3 0.0
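One possible reading of these results, which the slide does not state as a rule, is that an address whose PageRank never rises above the reset probability has received no rank from any other address and so is disconnected from the customer's known addresses. The plain-Scala sketch below encodes that reading. The helper name `flagSuspicious` and the `epsilon` tolerance are hypothetical, and the input is the PageRank table from this slide copied by hand rather than read from a DataFrame.

```scala
object FlagIsolatedAddresses {
  // Hypothetical helper: returns the ids whose rank is no higher than the
  // reset probability, i.e. vertices that received no rank from any neighbor.
  def flagSuspicious(
      ranks: Seq[(String, Double)],
      resetProb: Double = 0.15,
      epsilon: Double = 1e-9): Seq[String] =
    ranks.collect { case (id, rank) if rank <= resetProb + epsilon => id }

  def main(args: Array[String]): Unit = {
    // PageRank values as shown on the slide.
    val ranks = Seq(
      "a1" -> 0.9463535901944437,
      "b2" -> 0.9463535901944437,
      "c3" -> 0.9463535901944437,
      "d4" -> 0.15
    )
    println(flagSuspicious(ranks)) // prints List(d4)
  }
}
```

A fixed cutoff like this is only a starting point. In practice the flag would likely be one signal among several rather than a verdict on its own.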
24. Future Directions and Thoughts
• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large-scale Graph Analytics is still not scalable
25. Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and
Analytics tweets