Traversing our way through Apache Spark GraphFrames and GraphX
Mo Patel
Data Day Texas 2017
A bit about me
• Currently Deep Learning Practice Director at Teradata
  – Road Object Detection & Scene Labeling
  – Visual Product Search
  – Chatbots
• Previously
  – Analytics @ Social Sharing Startup
  – Analytics @ Intelligence Community
  – Distributed Systems @ Satellite Operations Company
  – Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing those things
Twitter: @mopatel
What is this talk about?
• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?
What is a Graph?
[Figures: a natural graph and an artificial graph. Source: Wikipedia]
Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
Power of Graphs
• Good: Facebook, Twitter, WhatsApp → most popular social networks
• Bad: MySpace, Friendster, Orkut → "Nobody goes there anymore. It's too crowded." – Yogi Berra
Graph Databases cost money, Graph Analytics make money!
• Data Growth: recall Metcalfe's Law (n^2) and Reed's Law (2^n)
• Memory Intensive
• Processing Intensive
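To make the data-growth point concrete, here is a minimal Scala sketch that evaluates the two growth terms quoted on the slide, n^2 and 2^n, for a few network sizes. The object and function names are illustrative only, and the formulas are the simplified forms from the slide rather than exact statements of either law.

```scala
object NetworkGrowth {
  // Metcalfe's Law as written on the slide: value grows like n^2.
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's Law as written on the slide: value grows like 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(10, 20, 40))
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
}
```

Even at n = 40 the exponential term is already in the trillions, which is one way to see why graph workloads become memory and processing intensive so quickly.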
Graph Databases cost money, Graph Analytics make money!
• PageRank, EigenCentrality
• Modularity, Clustering Coefficient, Betweenness, Closeness
• Loopy Belief Propagation, SALSA
Node Score in a Graph
• Use case: Find out how important an entity is in a graph
  – Entity Fraud Detection
  – Influencers
  – Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
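As a rough illustration of how a node score such as PageRank can be computed, the sketch below runs a fixed number of power iterations over an edge list in plain Scala. It is not the Spark implementation. The update rule (reset probability plus damped inbound contributions, without normalization) is an assumption chosen to resemble the unnormalized scores shown later in this deck, and all names are hypothetical.

```scala
object PageRankSketch {
  // vertices: all vertex ids; edges: directed (src, dst) pairs.
  // Every src in `edges` is assumed to appear in `vertices`.
  def pageRank(vertices: Seq[String],
               edges: Seq[(String, String)],
               resetProb: Double = 0.15,
               iterations: Int = 20): Map[String, Double] = {
    val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    var ranks = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Each vertex sends rank / outDegree along every outgoing edge.
      val contribs = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).sum }
      // Vertices with no inbound edges keep only the reset probability.
      ranks = vertices
        .map(v => v -> (resetProb + (1 - resetProb) * contribs.getOrElse(v, 0.0)))
        .toMap
    }
    ranks
  }
}
```

For example, `PageRankSketch.pageRank(Seq("a", "b", "c", "d"), Seq("a" -> "b", "b" -> "c", "c" -> "a"))` gives the three vertices on the cycle equal scores, while the isolated vertex `d` stays at the reset probability.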
Communities in a Graph
• Use case: Detect similar nodes
  – Behavioral Segmentation
  – Crime Rings
  – Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient, Betweenness, Closeness
Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
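Of the methods listed, the local clustering coefficient is the simplest to show in a few lines: it is the fraction of a vertex's neighbor pairs that are themselves connected. The following plain Scala sketch treats the graph as undirected. The names are illustrative and this is not how Spark or iGraph implement it.

```scala
object ClusteringSketch {
  // Undirected neighbor sets built from an edge list (self-loops dropped).
  def neighbors(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> (ps.map(_._2).toSet - v) }

  // Fraction of a vertex's neighbor pairs that are linked to each other.
  def localClustering(v: String, adj: Map[String, Set[String]]): Double = {
    val ns = adj.getOrElse(v, Set.empty[String]).toSeq
    val k = ns.size
    if (k < 2) 0.0
    else {
      val linked = ns.combinations(2).count(p => adj(p(0)).contains(p(1)))
      linked.toDouble / (k * (k - 1) / 2)
    }
  }
}
```

A vertex inside a tight ring scores close to 1.0, while a vertex that only bridges otherwise unrelated neighbors scores close to 0.0.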
Growth in Graph
• Use case: Predict where the graph will grow, or suggest new edges
  – Event Prediction
  – Product Recommendation
• Methods: Loopy Belief Propagation, Belief Networks, SALSA
Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)
GraphX
• Apache Spark library for conducting Graph Analytics
• Graph Operations: numEdges, numVertices, degrees, collectNeighbors
• Graph Analytics:
  – PageRank
  – Connected Components
  – Triangle Count
http://spark.apache.org/graphx/
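To give a feel for what Connected Components computes, here is a small plain Scala sketch that labels every vertex with the smallest vertex id in its component, which is the labeling convention GraphX documents for its connected components algorithm. The loop below is a single-machine illustration with hypothetical names, not the distributed implementation.

```scala
object ComponentsSketch {
  // Returns, for each vertex, the smallest vertex id in its component.
  // Every endpoint in `edges` is assumed to appear in `vertices`.
  def connectedComponents(vertices: Seq[Long],
                          edges: Seq[(Long, Long)]): Map[Long, Long] = {
    var label = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        // Propagate the smaller label across the edge in both directions.
        val m = math.min(label(a), label(b))
        if (label(a) != m || label(b) != m) {
          label = label + (a -> m) + (b -> m)
          changed = true
        }
      }
    }
    label
  }
}
```

With vertices 1 to 5 and edges (1,2), (2,3), (4,5), vertices 1, 2 and 3 end up labeled 1 and vertices 4 and 5 end up labeled 4.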
Property Graph
GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin, ...
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
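GraphFrame
Vertices DataFrame
import org.graphframes.GraphFrame
// A GraphFrame expects the vertices DataFrame to have an "id" column.
val vertices =
  sqlContext.createDataFrame(
    List(
      ("a1", "Wine", "Beverage"),
      ("b2", "Beer", "Beverage"),
      ("c3", "Pretzel", "Snack"),
      ("d4", "Cheese", "Snack")
    )).toDF("id", "name", "type")
Edges DataFrame
// A GraphFrame expects the edges DataFrame to have "src" and "dst" columns.
val edges =
  sqlContext.createDataFrame(
    List(
      ("a1", "d4", 15455),
      ("b2", "c3", 4849),
      ("a1", "c3", 40),
      ("b2", "d4", 134)
    )).toDF("src", "dst", "count")
GraphFrame
val productsGraphFrame = GraphFrame(vertices, edges)
// String literals inside a SQL filter expression need quotes.
productsGraphFrame.vertices.filter("type == 'Snack'")
// Number of edges in the graph.
productsGraphFrame.edges.count()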
What is Synthetic Identity Fraud?
http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud
Why has Synthetic Identity Fraud emerged as a big problem?
Source: Verafin
How are Synthetic IDs created?
Source: Verafin
How are Financial Companies exploited?
Source: Verafin
What is the impact of Synthetic Identity Fraud?
Source: Verafin
How can Graph Analytics help solve the Synthetic Identity Problem?
Customer Address DataFrame (vertices)
val customerAddresses =
  sqlContext.createDataFrame(
    List(
      ("a1", "123 Main Street", "123abc456efg"),
      ("b2", "345 High Street", "123abc456efg"),
      ("c3", "789 Park Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")
Add Fake Address
val fakeAddress = sqlContext.createDataFrame(
  List(
    ("d4", "999 Ocean Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")
val tempCustomerAddresses =
  customerAddresses.union(fakeAddress)
Databricks Cloud Notebook: http://tiny.cc/ddtx17graphx
How can Graph Analytics help solve the Synthetic Identity Problem?
Master Address Connection Edges DataFrame
val masterAddressConnections = sqlContext.createDataFrame(
  List(
    ("b2", "a1"),
    ("e5", "c3"),
    ("c3", "b2"),
    ("a1", "c3"),
    ("e5", "d4")
  )).toDF("src", "dst")
// Keep edges whose destination is a known customer address.
// Assumption: edges hold address ids, so the joins use "id" and the "src"/"dst" columns defined above.
val toEdgeMatches = masterAddressConnections.join(customerAddresses,
  masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")
// Keep edges whose source is a known customer address.
val fromEdgeMatches = masterAddressConnections.join(customerAddresses,
  masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")
val checkEdges = fromEdgeMatches.union(toEdgeMatches)
Detection GraphFrame
val detectionGraphFrame =
  GraphFrame(tempCustomerAddresses, checkEdges)
PageRank
// PageRank, iterating until ranks change by less than the tolerance
val resultRanks =
  detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
// Personalized PageRank from source vertex d4
val d4Ranks =
  detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()
resultRanks.vertices.select("id", "pagerank").show()
How do we decide if this address is fraudulent or not?
PageRank
id    pagerank
a1    0.9463535901944437
b2    0.9463535901944437
c3    0.9463535901944437
d4    0.15
Personalized PageRank
Source vertex a1
id    pagerank
a1    0.33343371928623045
c3    0.28341866139329586
b2    0.21580437563085933
d4    0.0
Source vertex b2
id    pagerank
b2    0.33343371928623045
a1    0.28341866139329586
c3    0.21580437563085933
d4    0.0
Source vertex c3
id    pagerank
c3    0.33343371928623045
b2    0.28341866139329586
a1    0.21580437563085933
d4    0.0
Source vertex d4
id    pagerank
d4    0.15
a1    0.0
b2    0.0
c3    0.0
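One possible reading of these numbers: the score of d4 never rises above the reset probability of 0.15, and no personalized walk from a known address reaches it, which suggests it has no links into the known address graph. The sketch below turns that reading into a simple flagging rule in plain Scala. The rule, the tolerance, and all names are assumptions for illustration, not a method stated in the talk.

```scala
object FraudFlagSketch {
  // PageRank values copied from the PageRank table above.
  val ranks: Map[String, Double] = Map(
    "a1" -> 0.9463535901944437,
    "b2" -> 0.9463535901944437,
    "c3" -> 0.9463535901944437,
    "d4" -> 0.15
  )

  // Hypothetical rule: flag ids whose score never rose above the reset probability.
  def suspicious(ranks: Map[String, Double],
                 resetProb: Double = 0.15,
                 eps: Double = 1e-9): Set[String] =
    ranks.collect { case (id, r) if r <= resetProb + eps => id }.toSet
}
```

Calling `FraudFlagSketch.suspicious(FraudFlagSketch.ranks)` flags only `d4`. In practice such a rule would be one signal among several rather than a verdict on its own.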
Future Directions and Thoughts
• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large-scale Graph Analytics still does not scale well
Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets
