Massive-Scale Entity Resolution Using the Power of Apache Spark and Graph

Max Melnick, Deloitte Consulting LLP
#UnifiedAnalytics #SparkAISummit

About Me
• Passion for building tech products
• Engineering Lead / Architect / Developer
• Spark Certified Developer
• Based in Washington, DC
• UVA Systems Engineering
• Love sports, travel, cooking/eating, and listening to podcasts
maxmelnick.com
maxmelnick@gmail.com
linkedin.com/in/maxmelnick

MissionGraph™ is an open architecture, data integration, enhancement, and exploration platform that powers massive-scale analysis.

Agenda
• Entity Resolution (ER) Overview
• Spark + Graph ER Solution Walkthrough
– Technical Architecture
– Example Patterns
• Graph gotchas and tips

ER enables analytics

ER Use-Cases
• Customer 360
• Fraud Detection
• Network Analysis
• Recommendation Engines

Logical ER Flow

Simple ER Example

Simple ER Example (cont.)

ER is hard
• Difficult to scale algorithms vertically (more of the same data) or horizontally (new types of data)
• Prohibitively expensive to compare each record with every other record
• Heterogeneous datasets
• Data lacks strong keys
• Difficult to manage changes over time
• Similarity varies significantly across types of entities, languages, etc.
• Data quality issues
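
The cost of comparing every record with every other record can be made concrete with a small sketch. The plain-Python example below is an illustration only, not code from the talk: the record fields, the `all_pairs_count` and `blocked_pairs` helpers, and the choice of phone number as a blocking key are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Return candidate pairs only among records that share a value for `key`."""
    blocks = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value is not None:  # records without the key produce no candidates
            blocks[value].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample records (invented for illustration).
records = [
    {"id": 1, "name": "Jon Smith", "phone": "555-0100"},
    {"id": 2, "name": "John Smith", "phone": "555-0100"},
    {"id": 3, "name": "Ann Lee", "phone": "555-0199"},
    {"id": 4, "name": "Anne Lee", "phone": "555-0199"},
    {"id": 5, "name": "Bo Chen", "phone": None},
]

print(all_pairs_count(len(records)))            # 10
print(sorted(blocked_pairs(records, "phone")))  # [(1, 2), (3, 4)]
print(all_pairs_count(1_000_000))               # 499999500000
```

Five records need 10 comparisons, but one million records need roughly 5 × 10^11, which is why a candidate selection step that narrows the pairs is needed before any detailed matching.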

Improve ER with Spark + Graph
Spark + Graph = Better ER

Technical Architecture

Flexible graph candidate selection
The flexibility of graph enables you to easily add new attributes to your candidate selection query

Flexible graph candidate selection – Spark GraphFrames query

query by phone
GraphFrames vs SparkSQL

query by phone or address
• GraphFrames: same candidate selection query
• SparkSQL: candidate selection query changes

query by phone or address or email
• GraphFrames: same candidate selection query
• SparkSQL: candidate selection query changes
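
The slides' actual GraphFrames and SparkSQL queries are images and are not reproduced in this text. As a rough stand-in, the plain-Python sketch below shows why a graph-shaped candidate query can stay the same as attributes are added: each attribute value becomes a vertex, and two records are candidates when they point at the same vertex. The record fields and the `build_edges` and `candidate_pairs` helpers are hypothetical; in GraphFrames the equivalent would presumably be a motif query via `find`, whereas a SQL join would need a new condition for each added attribute.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(records, attributes):
    """Turn each non-empty attribute value into an edge: record -> (type, value) vertex."""
    edges = []
    for rec in records:
        for attr in attributes:
            value = rec.get(attr)
            if value:
                edges.append((rec["id"], (attr, value)))
    return edges


def candidate_pairs(edges):
    """Records are candidates when they share an attribute vertex.

    The query never names a specific attribute type, so it does not
    change when new kinds of attribute vertices are loaded.
    """
    by_vertex = defaultdict(set)
    for record_id, vertex in edges:
        by_vertex[vertex].add(record_id)
    pairs = set()
    for ids in by_vertex.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample records (invented for illustration).
records = [
    {"id": "a", "phone": "555-0100", "address": "1 Main St", "email": "js@example.com"},
    {"id": "b", "phone": "555-0100", "address": "9 Oak Ave", "email": None},
    {"id": "c", "phone": "555-0142", "address": "1 Main St", "email": None},
    {"id": "d", "phone": "555-0177", "address": "4 Elm Rd", "email": "js@example.com"},
]

by_phone = candidate_pairs(build_edges(records, ["phone"]))
by_phone_address = candidate_pairs(build_edges(records, ["phone", "address"]))
by_all = candidate_pairs(build_edges(records, ["phone", "address", "email"]))

print(sorted(by_phone))          # [('a', 'b')]
print(sorted(by_phone_address))  # [('a', 'b'), ('a', 'c')]
print(sorted(by_all))            # [('a', 'b'), ('a', 'c'), ('a', 'd')]
```

Only the data loaded into the graph changes between the three calls; `candidate_pairs` itself is untouched.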

Simplify entity canonicalization

Simplify entity canonicalization (cont.)

Graph context helps when data is limited

Graph context helps when data is limited (cont.)

Graph gotchas
• Supernodes
• Graph adoption learning curve
• Not a silver bullet
• Less streaming support than traditional SQL-based workflows
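
To illustrate the supernode gotcha: an attribute vertex shared by d records yields d(d-1)/2 candidate pairs, so a placeholder value such as a dummy phone number can dominate the workload. The slides do not say how the author handles this; the sketch below shows one common mitigation, dropping attribute vertices above a degree threshold before candidate selection. The `drop_supernodes` helper, the threshold, and the sample edges are assumptions.

```python
from collections import Counter


def drop_supernodes(edges, max_degree):
    """Remove edges that touch an attribute vertex linked to more than max_degree records."""
    degree = Counter(vertex for _, vertex in edges)
    return [(rec, vertex) for rec, vertex in edges if degree[vertex] <= max_degree]


# Hypothetical edges: four records share a placeholder phone number.
edges = [
    ("r1", ("phone", "000-0000")),
    ("r2", ("phone", "000-0000")),
    ("r3", ("phone", "000-0000")),
    ("r4", ("phone", "000-0000")),
    ("r1", ("email", "a@example.com")),
    ("r2", ("email", "a@example.com")),
]

kept = drop_supernodes(edges, max_degree=3)
print(kept)
# [('r1', ('email', 'a@example.com')), ('r2', ('email', 'a@example.com'))]
```

The right threshold depends on the data; a value that is too low discards legitimate shared attributes such as a household address.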

Graph tip #1: Persist graph at scale

Graph tip #2: Debug visually
• .show() the GraphFrame vertex and edge DataFrames
vs
• View in DSE Studio (must be persisted in DSE Graph), which is easier to understand visually

Graph tip #3: Is it a graph problem?
Graph is great for…
• Connecting many different types of data
• Analysis that requires an indeterminate number of hops
Alternatives to consider
• Fuzzy search / programmable indexes -> search engine
• Simple, static joins on homogeneous data -> SQL
• Hybrid (graph + SQL/search/etc)
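
An example of an "indeterminate number of hops" problem in ER is grouping records that match transitively: if r1 matches r2 and r2 matches r3, all three belong to one entity, however long the chain. The sketch below uses a plain-Python union-find as a stand-in; it is not the talk's code, and the record IDs and match edges are hypothetical. GraphFrames provides a `connectedComponents` method for the same purpose at scale.

```python
def connected_components(nodes, edges):
    """Group nodes into clusters connected by any number of match edges (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller for a deterministic result.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise matches produced by an earlier comparison step.
nodes = ["r1", "r2", "r3", "r4", "r5"]
matches = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]

print(connected_components(nodes, matches))
# [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

Expressing this with static SQL joins would require knowing the maximum chain length in advance, which is the case where a graph approach tends to fit better.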

Code for this presentation
https://github.com/maxmelnick/spark-graph-er

Recap
• ER enables many analytics use-cases
• ER is hard, but Spark + Graph = Improved ER

Thank You!
maxmelnick.com
maxmelnick@gmail.com
linkedin.com/in/maxmelnick

This publication contains general information only, and none of the member firms of Deloitte Touche Tohmatsu Limited, its member firms, or their related entities (collectively, the “Deloitte
Network”) is, by means of this publication, rendering professional advice or services. Before making any decision or taking any action that may affect your business, you should consult a
qualified professional adviser. No entity in the Deloitte Network shall be responsible for any loss whatsoever sustained by any person who relies on this publication.
As used in this document, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of the legal structure of
Deloitte USA LLP, Deloitte LLP and their respective subsidiaries. Certain services may not be available to attest clients under the rules and regulations of public accounting.
Copyright © 2019 Deloitte Development LLC.
All rights reserved. Member of Deloitte Touche Tohmatsu Limited
