Spark's graph capabilities are well suited to analyzing networks for use cases such as fraud detection, illicit network detection, and supply chain risk analysis. However, before a data scientist can run analytics on a network (e.g., PageRank or community detection), they often end up spending most of their time fighting a mountain of data integration challenges. This talk focuses on one specific challenge: connecting entities in a network within and across data domains. We will explore how you can leverage the Spark ecosystem's graph capabilities to perform massive-scale entity resolution (ER). As a result, your data scientists will be able to perform graph analytics that drive business and mission value more quickly and effectively.
Key takeaways:
1) The Spark ecosystem enables you to quickly get started with graph analytics use cases at scale
2) Complementing traditional ER techniques with the context of graph relationships allows you to connect entities that you could not easily connect before
2. Max Melnick, Deloitte Consulting LLP
Massive-Scale Entity Resolution
Using Spark + Graph
#UnifiedAnalytics #SparkAISummit
3. About Me
• Passion for building tech products
• Engineering Lead / Architect / Developer
• Spark Certified Developer
• Based in Washington, DC
• UVA Systems Engineering
• Love sports, travel, cooking/eating, and listening to podcasts
maxmelnick.com
maxmelnick@gmail.com
linkedin.com/in/maxmelnick
13. ER is hard
• Difficult to scale algorithms vertically (more of the same data) or horizontally (new types of data)
• Prohibitively expensive to compare each record with every other record
• Heterogeneous datasets
• Data lacks strong keys
• Difficult to manage changes over time
• Similarity varies significantly across types of entities, languages, etc.
• Data quality issues
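One common way around the all-pairs comparison cost listed above is blocking: only records that share a cheap key are compared. The sketch below is plain Python rather than Spark, and the sample records, the `blocking_key` function, and its choice of key are all hypothetical, picked only to show how the pair count shrinks.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; names and fields are illustrative only.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "20001"},
    {"id": 2, "name": "ACME Corporation", "zip": "20001"},
    {"id": 3, "name": "Globex LLC", "zip": "10001"},
    {"id": 4, "name": "Globex", "zip": "10001"},
    {"id": 5, "name": "Initech", "zip": "73301"},
]


def blocking_key(record):
    """Cheap key (an assumption): first three letters of the name plus the zip."""
    return (record["name"][:3].lower(), record["zip"])


def candidate_pairs(rows):
    """Return id pairs that share a blocking key instead of all possible pairs."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[blocking_key(row)].append(row["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


all_pairs = len(records) * (len(records) - 1) // 2
pairs = candidate_pairs(records)
print(f"all pairs: {all_pairs}, candidate pairs: {len(pairs)}")
```

The trade-off is recall: two records for the same entity that fall into different blocks are never compared, so the key has to be chosen with the data's quality issues in mind.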
14. Improve ER with Spark + Graph
Spark + Graph = Better ER
16. Flexible graph candidate selection
The flexibility of graph enables you to easily add new attributes to your candidate selection query
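To illustrate the idea, here is a minimal sketch in plain Python (not the Spark or GraphFrames API) in which records point at shared attribute nodes, and any two records pointing at the same node become candidates. The records, field names, and the helpers `build_edges` and `candidates` are hypothetical. The point is that adding an attribute only adds edges; the selection logic itself does not change.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names are illustrative only.
records = {
    "r1": {"phone": "555-0100", "email": "a@example.com", "address": "1 Main St"},
    "r2": {"phone": "555-0100", "email": "b@example.com", "address": "9 Oak Ave"},
    "r3": {"phone": "555-0199", "email": "c@example.com", "address": "1 Main St"},
}


def build_edges(rows, attributes):
    """Create (record, attribute-node) edges for the chosen attribute types."""
    edges = []
    for record_id, fields in rows.items():
        for attr in attributes:
            value = fields.get(attr)
            if value is not None:
                edges.append((record_id, (attr, value)))
    return edges


def candidates(edges):
    """Records that point at the same attribute node become candidate pairs."""
    by_node = defaultdict(set)
    for record_id, node in edges:
        by_node[node].add(record_id)
    pairs = set()
    for ids in by_node.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


phone_only = candidates(build_edges(records, ["phone"]))
# Adding an attribute type only adds edges; candidates() is untouched.
phone_and_address = candidates(build_edges(records, ["phone", "address"]))
print(phone_only, phone_and_address)
```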
27. Graph context helps when data is limited (cont.)
28. Graph context helps when data is limited (cont.)
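As a rough illustration of graph context, the sketch below scores two records by how many neighbors they share in the graph, which can provide evidence when the records' own attributes are too sparse to compare. This is plain Python under stated assumptions: the names, the neighbor sets, and the use of Jaccard overlap as the score are all hypothetical and are not taken from the talk.

```python
def jaccard(a, b):
    """Share of neighbors two vertices have in common (0.0 when both are isolated)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical neighbor sets: each person vertex and the vertices it connects to.
neighbors = {
    "J. Smith": {"Acme Corp", "555-0100"},
    "John Smith": {"Acme Corp", "555-0100", "1 Main St"},
    "Jane Smith": {"Globex LLC"},
}

likely_same = jaccard(neighbors["J. Smith"], neighbors["John Smith"])
likely_different = jaccard(neighbors["J. Smith"], neighbors["Jane Smith"])
print(likely_same, likely_different)
```

In practice a score like this would be combined with traditional attribute similarity rather than used on its own.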
29. Graph gotchas
• Supernodes
• Graph adoption learning curve
• Not a silver bullet
• Less streaming support than traditional SQL-based workflows
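A supernode is a vertex with a very large number of edges, for example a shared corporate email address that many unrelated records point at. A minimal sketch of spotting and pruning such vertices before candidate selection follows, in plain Python. The edge list, the helper names, and the degree threshold are hypothetical; a sensible threshold depends on the data.

```python
from collections import Counter

# Hypothetical edge list: (record, shared attribute value).
edges = [
    ("r1", "phone:555-0100"),
    ("r2", "phone:555-0100"),
    ("r1", "email:info@example.com"),
    ("r2", "email:info@example.com"),
    ("r3", "email:info@example.com"),
    ("r4", "email:info@example.com"),
    ("r5", "email:info@example.com"),
]


def find_supernodes(edge_list, max_degree):
    """Return attribute nodes with more than max_degree incoming edges."""
    degree = Counter(dst for _, dst in edge_list)
    return {node for node, count in degree.items() if count > max_degree}


def prune(edge_list, supernodes):
    """Drop edges that touch a supernode so it cannot link unrelated records."""
    return [(src, dst) for src, dst in edge_list if dst not in supernodes]


supernodes = find_supernodes(edges, max_degree=3)  # threshold is an assumption
pruned = prune(edges, supernodes)
print(supernodes, len(pruned))
```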
30. Graph tip #1: Persist graph at scale
31. Graph tip #2: Debug visually
.show() GraphFrame vertex and edge DataFrames
vs
View in DSE Studio (must be persisted in DSE Graph)
Easier to understand visually
32. Graph tip #3: Is it a graph problem?
Graph is great for…
• Connecting many different types of data
• Performing analysis over an indeterminate number of hops
Alternatives to consider
• Fuzzy search / programmable indexes -> search engine
• Simple, static joins on homogeneous data -> SQL
• Hybrid (graph + SQL/search/etc)
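The "indeterminate number of hops" case can be sketched with a breadth-first traversal, which keeps following edges until nothing new is reachable, without the caller fixing the hop count in advance. With fixed joins, each extra hop typically means another self-join. The example below is plain Python, and the edge list and the `hops_from` helper are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical undirected edges between entities.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]


def hops_from(start, edge_list):
    """Breadth-first search: hop distance from start to every reachable vertex."""
    adjacency = defaultdict(set)
    for src, dst in edge_list:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


reachable = hops_from("a", edges)
print(reachable)
```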
33. Code for this presentation
https://github.com/maxmelnick/spark-graph-er
34. Recap
• ER enables many analytics use-cases
• ER is hard, but Spark + Graph = Improved ER