Beyond Kaggle: Solving Data Science Challenges at Scale
1. 1
Think Big, Start Smart, Scale Fast
Dato Conference
Data Matching and Deduplication using Dato Toolkits
July 21st, 2015
Guillermo Breto Rangel, PhD
2. 2
Entity Resolution: Multiple Definitions
Entity Resolution (ER): extract, match and disambiguate entity records in data.
3. 3
Entity Resolution: Real World Entity
Extract, match and disambiguate entity records in data.
Matching real-world entities with profiles, mentions...
You:
Facebook account(s)
LinkedIn profile(s)
Tweets
Google Searches
Many records → ER → Unique identities
4. 4
Entity Resolution: Use Cases
◆ Network Analysis
◆ Vocabulary Normalization: different organizations report different names for the same entities
◆ Network Security: finding user actions/intents
◆ Data Cleaning: removing duplicated records
◆ Metadata Enrichment: when records are matched, their metadata is appended to the entity
5. 5
Entity Resolution: Challenges
◆ Missing Values
◆ Data entry errors
◆ Abbreviations and formatting
◆ Data volume
◆ Variety of raw data sources
o free text, semi-structured, streaming
◆ Data integration from multiple sources
◆ Preprocessing
◆ Normalization
◆ Choosing similarity metrics
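The normalization and similarity-metric challenges above can be illustrated with a small sketch. It uses only the Python standard library; the helper names `normalize` and `jaccard_distance` and the abbreviation table are hypothetical choices made for illustration, not part of any Dato toolkit.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real project would build its own.
ABBREVIATIONS = {"corp": "corporation", "inc": "incorporated", "intl": "international"}


def normalize(text):
    """Lowercase, strip accents and punctuation, expand known abbreviations."""
    if text is None:  # treat a missing value as an empty string
        return ""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit or space with a space.
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the word tokens of two normalized strings."""
    set_a, set_b = set(normalize(a).split()), set(normalize(b).split())
    if not set_a and not set_b:
        return 0.0  # assumption: two empty values are treated as identical
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)
```

Normalizing before comparing means that formatting differences such as "Acme Corp." versus "ACME Corporation" no longer count as a mismatch, so the metric only has to absorb genuine differences.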
6. 6
Datasets: DBpedia / Amazon-Google Products
DBpedia:
Putting a schema to Wikipedia
Crowd-sourced community project
Queries against Wikipedia
Match data sets on the Web to Wikipedia data
A set of triples → <dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>
Amazon-Google Products:
Matching Amazon products and Google products
Deich Library and
8. 8
Algorithm: Nearest Neighbors
● The entity resolution problem is approached as a network problem
○ Nodes: entity records
○ Edges: similarity measures
● Define a distance between entities to find the nearest neighbors. Composite distances can be built from Euclidean, squared Euclidean, Levenshtein, Jaccard, Manhattan, cosine and dot-product distances
● Compute the distance between all entities and find the nearest neighbors
● Duplicates are the connected components of the graph, each of which is labeled as one entity
● Some parameters to keep in mind are:
○ grouping_features
○ k (number of neighbors to compare)
○ Radius (the distance threshold)
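The steps above can be sketched in plain Python. This is a minimal illustration, not the Dato toolkit's implementation: the function names `levenshtein` and `deduplicate` are hypothetical, the all-pairs loop is brute force, and the reading of `k` as "link each record to at most k in-radius neighbors" is an assumption made for this sketch.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def deduplicate(records, distance, k=2, radius=2):
    """Return one entity label per record.

    Each record is linked to at most k nearest neighbors that lie within
    radius; the connected components of that graph are the entities.
    """
    n = len(records)
    # Brute-force all-pairs distances: fine for a sketch, O(n^2) at scale.
    neighbors = {i: [] for i in range(n)}
    for i, j in combinations(range(n), 2):
        d = distance(records[i], records[j])
        if d <= radius:
            neighbors[i].append((d, j))
            neighbors[j].append((d, i))

    # Union-find over the k closest in-radius neighbors of every node.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, candidates in neighbors.items():
        for _, j in sorted(candidates)[:k]:
            parent[find(i)] = find(j)

    # Relabel component roots as consecutive entity ids 0, 1, 2, ...
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


records = ["milla jovovich", "mila jovovich", "luc besson", "luc beson", "dato"]
print(deduplicate(records, levenshtein, k=2, radius=2))
```

With these five strings the two misspelled pairs fall within the radius and are merged, while the last record stays on its own. At real data volumes the all-pairs loop would need to be restricted, for example by comparing only records that share a grouping feature.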
10. 10
Lessons Learned:
◆ Most of the time is spent on preprocessing
◆ It is hard to define the distance threshold
◆ Weighting the composite distance
◆ Data volume
◆ Dealing with missing values
◆ Tuning the parameters
◆ Finding exact matches
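Two of these lessons, weighting the composite distance and dealing with missing values, interact, and a small sketch may make that concrete. Everything here is an assumption for illustration: the function names, the product records, the weights, and the choice to skip a field when either value is missing and renormalize over the remaining weights. It is one possible policy, not the approach used in the talk.

```python
def composite_distance(rec_a, rec_b, components):
    """Weighted average of per-field distances, skipping missing fields.

    components: list of (field, distance_fn, weight) triples.
    Returns None when no field is present in both records.
    """
    total, weight_sum = 0.0, 0.0
    for field, distance_fn, weight in components:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # missing value: this field does not get a vote
        total += weight * distance_fn(a, b)
        weight_sum += weight
    if weight_sum == 0:
        return None
    return total / weight_sum


def token_jaccard(a, b):
    """Jaccard distance over lowercase word tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def relative_difference(a, b):
    """Absolute difference scaled by the larger magnitude, in [0, 1] for same-sign values."""
    largest = max(abs(a), abs(b))
    return abs(a - b) / largest if largest else 0.0


# Hypothetical weights: the name counts three times as much as the price.
components = [("name", token_jaccard, 3.0), ("price", relative_difference, 1.0)]

# Made-up product records, for illustration only.
listing_a = {"name": "Acme USB Cable 2m", "price": 10.0}
listing_b = {"name": "Acme USB Cable", "price": 8.0}
listing_c = {"name": "Acme USB Cable", "price": None}

print(composite_distance(listing_a, listing_b, components))
print(composite_distance(listing_a, listing_c, components))
```

Because the result is renormalized by the weights actually used, a pair with a missing price is scored on the name alone rather than being penalized or rewarded for the gap. Whether that is desirable depends on the data, which is part of why the threshold and the weights need tuning together.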
11. 11
Some Resources/Bibliography
◆ Ricardo Vasquez Sierra, PhD: Senior Data Scientist from Ooyala
◆ Kevin Glynn, MS: Data Scientist and Khan Academy Instructor
◆ Vince Gonzalez: MapR Software Engineer
◆ Alexey Svyatkovskiy, PhD: Big Data Scientist, Princeton University
◆ Ashwin Machanavajjhala, PhD: Professor of Computer Science, Duke University
◆ Lise Getoor, PhD: Professor of Computer Science, UC Santa Cruz
o KDD Tutorial on Entity Resolution in Big Data
o Deduplication and Group Detection using Links, Indrajit Bhattacharya and Lise Getoor, The 10th ACM SIGKDD Workshop on Link Analysis and Group Detection (LinkKDD-04).
o Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery from Data (ACM-TKDD), 2007
◆ The Dato Team
◆ My colleagues at Think Big