6. Entity Resolution
Suppose we have the following data:
ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
7. Entity Resolution
Suppose we have the following data:
ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   Joes Cookies   facebook.com     San Francisco, CA
8. Entity Resolution
Suppose we have the following data:
ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   Joes Cookies   facebook.com     San Francisco, CA
E   Joes Cookies   NULL             New York, NY
14. Think Like a Graph
[Figure: graph with one node per entity: A, B, C, D, E.]
ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   Joes Cookies   facebook.com     San Francisco, CA
E   Joes Cookies   NULL             New York, NY
15. Think Like a Graph
[Figure: graph over entities A, B, C, D, E with a score on each edge. Edge scores shown: 150, 50, -100, -100, 50, 50, 50, 50, -150, -150.]
22. Overlapping Cliques
An entity can’t belong to more than one clique.
When we choose a clique, we must ensure no other cliques use any of those entities.
25. Recap
• Given a dataset of entities…
• Take the powerset of those entities => every possible clique
• Score all the cliques
• In descending score order, choose each clique whose elements have not already been claimed by an earlier choice
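The recap above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the `pair_score` rule and its weights (150, -100, 50, -50) are made up for the example, and a clique's score is assumed to be the sum of its pairwise scores. The records are the five rows from the example table.

```python
from itertools import combinations

# Records from the example table; None stands in for NULL.
ENTITIES = {
    "A": {"name": "Facebook", "website": "facebook.com"},
    "B": {"name": "FB", "website": "facebook.com"},
    "C": {"name": "Joe's Cookies", "website": "joescookies.com"},
    "D": {"name": "Joes Cookies", "website": "facebook.com"},
    "E": {"name": "Joes Cookies", "website": None},
}


def normalize(name):
    """Lowercase and drop apostrophes so "Joe's" and "Joes" compare equal."""
    return name.replace("'", "").lower()


def pair_score(a, b):
    """Hypothetical pairwise score: reward agreement, penalize conflict."""
    score = 0
    if a["website"] and b["website"]:  # a missing website is neutral
        score += 150 if a["website"] == b["website"] else -100
    score += 50 if normalize(a["name"]) == normalize(b["name"]) else -50
    return score


def clique_score(ids, entities):
    """Assumed clique score: the sum of all pairwise scores inside it."""
    return sum(pair_score(entities[x], entities[y])
               for x, y in combinations(ids, 2))


def all_cliques(ids):
    """Every subset with two or more members (the powerset minus trivial sets)."""
    ids = sorted(ids)
    for size in range(2, len(ids) + 1):
        yield from combinations(ids, size)


def choose_cliques(entities):
    """Greedy pass in descending score order; skip cliques touching used entities."""
    scored = [(clique_score(c, entities), c) for c in all_cliques(entities)]
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    used, chosen = set(), []
    for score, clique in scored:
        if score <= 0:
            break  # remaining cliques would not improve the result
        if used.isdisjoint(clique):
            chosen.append((clique, score))
            used.update(clique)
    return chosen


if __name__ == "__main__":
    for clique, score in choose_cliques(ENTITIES):
        print(clique, score)
```

The powerset has 2^n members, so this brute-force form is only workable for a handful of entities.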
34. [Figure: entities A, B, C, with names “Joe’s Cookies” and “Joe’s Cookie’s” and website joescookies.com. LSH on “name” produces the keys “Joe Cookie”, “Joe Cookie”, “”. LSH on “website” produces the keys “joescookies.com”, “joescookies.com”. The resulting candidate groups are labeled Clique #1, Clique #2, Clique #3.]
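The idea on the slide above, routing records that share a per-field key to the same place, can be sketched as follows. This is a simplified stand-in, not a true locality-sensitive hash such as MinHash: `name_key` just normalizes the name so near-duplicates collide. The three records and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical records; an empty name or missing website yields no key.
RECORDS = {
    "A": {"name": "Joe's Cookies", "website": None},
    "B": {"name": "Joe's Cookie's", "website": "joescookies.com"},
    "C": {"name": "", "website": "joescookies.com"},
}


def name_key(name):
    """Crude normalizing key: lowercase, drop apostrophes, strip a trailing 's'."""
    if not name:
        return ""
    cleaned = name.lower().replace("'", "").replace("\u2019", "")
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return " ".join(t[:-1] if len(t) > 1 and t.endswith("s") else t
                    for t in tokens)


def website_key(website):
    """Lowercase the domain and drop a leading 'www.'."""
    key = (website or "").strip().lower()
    return key[4:] if key.startswith("www.") else key


def bucket(records, field, key_fn):
    """Group record IDs by key; each bucket is one candidate group."""
    buckets = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec.get(field))
        if key:  # a record with an empty key cannot be routed by this field
            buckets[key].append(rec_id)
    return dict(buckets)


if __name__ == "__main__":
    print(bucket(RECORDS, "name", name_key))
    print(bucket(RECORDS, "website", website_key))
```

Keying on more than one field matters here: A and C never share a bucket directly, but each shares one with B, so all three can still end up in overlapping candidate groups.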
35. Clique Choosing
• We now have all potential cliques, spread across the cluster
• We now need to choose the best cliques
• Remember: choosing one clique invalidates every other clique that shares an entity with it
• Fundamentally a serial algorithm!
43. Recap
• Challenge: Get data to the right machine.
Solution: Use Locality-Sensitive Hashing
• Challenge: Choose the best cliques.
Solution: Use a serial iterator and Bloom filters to keep memory low
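The second solution above can be sketched as a single pass over cliques that are already sorted best-first, with a Bloom filter remembering which entities have been claimed. This is a minimal sketch under assumptions: the `BloomFilter` class, its sizing, and the sample stream are illustrative, not taken from the talk.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: fixed memory, no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def choose_serially(scored_cliques, claimed):
    """Yield cliques from a best-first stream whose entities are all unclaimed."""
    for score, clique in scored_cliques:
        if any(entity in claimed for entity in clique):
            continue  # overlaps an earlier choice (or a rare false positive)
        for entity in clique:
            claimed.add(entity)
        yield score, clique


if __name__ == "__main__":
    stream = [
        (300, ("A", "B", "D")),
        (250, ("A", "B", "D", "E")),
        (50, ("C", "E")),
        (50, ("D", "E")),
    ]
    for score, clique in choose_serially(stream, BloomFilter()):
        print(score, clique)
```

A false positive can only cause a valid clique to be skipped, never two overlapping cliques to be chosen, which is the safe direction for this use.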
47. Temporal Entity Resolution
[Figure: three entities A, B, C. One is Zen Payroll (zenpayroll.com), one is Gusto (gusto.com), and C records the rename: Zen Payroll <=> Gusto, zenpayroll.com <=> gusto.com. Edge scores shown: +100, +100, -1000.]
48. Iterative Poison Pills
• Basic Idea: Use ER techniques we’ve already
established
• Introduce “poison pills” that can break up cliques if
temporal properties don’t match
• Iteratively use the poison pills to match on
increasingly temporally-aware entities
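One way to picture a poison pill is as a rule attached to a field value that ejects records whose dates fall outside the allowed range. The sketch below shows a single "kick out" step only. The `Record` and `PoisonPill` types, the year cutoffs, and the record IDs are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Record:
    rec_id: str
    website: str
    year: int


@dataclass(frozen=True)
class PoisonPill:
    """Hypothetical rule: records on `website` must satisfy `year_ok`."""
    website: str
    year_ok: Callable[[int], bool]


def apply_pills(clique: List[Record],
                pills: List[PoisonPill]) -> Tuple[List[Record], List[Record]]:
    """Split a clique into records that survive every pill and records kicked out."""
    kept, kicked = [], []
    for rec in clique:
        violates = any(p.website == rec.website and not p.year_ok(rec.year)
                       for p in pills)
        (kicked if violates else kept).append(rec)
    return kept, kicked


if __name__ == "__main__":
    # A clique produced by regular ER that mixes two different companies.
    clique = [
        Record("gusto-travel", "gusto.com", 2010),
        Record("gusto-payroll", "gusto.com", 2016),
        Record("zenpayroll", "zenpayroll.com", 2014),
    ]
    # Assumed cutoffs: the payroll company used gusto.com only after 2015.
    pills = [
        PoisonPill("gusto.com", lambda year: year > 2015),
        PoisonPill("zenpayroll.com", lambda year: year <= 2015),
    ]
    kept, kicked = apply_pills(clique, pills)
    print([r.rec_id for r in kept], [r.rec_id for r in kicked])
```

In an iterative scheme, the kept and kicked groups would each go back through regular ER, now carrying the temporal fields the pills exposed.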
49. Temporal Poison Pills
[Figure: five entities A, B, C, D, E, drawn from records including gusto.com (Travel) 2010, zenpayroll.com (Payroll) 2014, and gusto.com (Payroll) 2016. Poison pills shown: gusto.com < 2015; gusto.com, zenpayroll.com > 2015.]
Perform Regular ER: A, C, D, E and B, E
Kick Out Entities That Don’t Match Temporal Requirements: A, D (gusto.com < 2015); B, E (gusto.com > 2015, zenpayroll < 2014); C, E (gusto, 2016)
Perform Regular ER (now with more temporal fields available): A, C, D and B, C, E
50. Temporal Entity Resolution
• Very computationally expensive
• Requires significant tuning and tweaking to keep tractable
• Considered one of the Holy Grails of ER
53. Supervised Learning ER
• Basic Idea: Use a training set to learn the weights
in our scoring functions
• Disclaimer: Only proceed with this if you have very
complex scoring properties
56. More Learning Opts
• Gradient Descent: What if we viewed the system as having an overall “error”? We can then use Gradient Descent to find an optimal solution.
• Very, very computationally intensive
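As a small-scale illustration of learning scoring weights by gradient descent, the sketch below fits one weight per pairwise feature on a labeled training set. Everything here is an assumption made for the example: the "overall error" is taken to be mean log loss over labeled pairs, the features are 0/1 agreement flags (same website, same name), and the four training pairs are invented.

```python
import math

# Hypothetical training pairs: (same_website, same_name) -> 1 if same entity.
TRAINING = [
    ((1, 1), 1),
    ((1, 0), 1),
    ((0, 1), 0),
    ((0, 0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that a pair refers to the same entity."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def log_loss(weights, bias, examples):
    """Mean log loss: the 'overall error' minimized in this sketch."""
    total = 0.0
    for features, label in examples:
        p = predict(weights, bias, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(examples)


def train_weights(examples, learning_rate=0.5, epochs=500):
    """Full-batch gradient descent on the mean log loss."""
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    m = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for features, label in examples:
            err = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += err * x
            grad_b += err
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / m
    return weights, bias


if __name__ == "__main__":
    weights, bias = train_weights(TRAINING)
    print("weights:", weights, "bias:", bias)
    print("error:", log_loss(weights, bias, TRAINING))
```

The learned weights can then replace hand-tuned constants in a pairwise scoring function. Each epoch touches every training pair, which hints at why this gets expensive on real datasets.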