3. The problem
• How to tell if two different records represent
the same real-world entity?
DBPEDIA                                              MONDIAL
Id             http://dbpedia.org/resource/Samoa     Id            17019
Name           Samoa                                 Name          Western Samoa
Founding date  1962-01-01                            Independence  01 01 1962
Capital        Apia                                  Capital       Apia, Samoa
Currency       Tala                                  Population    214384
Area           2831                                  Area          2860
Leader name    Tuilaepa Aiono Sailele Malielegaoi    GDP           415
4. Linking across datasets
(Diagram: is a record in A the same entity as a record in B?)
• Independent applications, for example
– even when unique identifiers exist, they tend to be
unreliable, or missing
5. Within datasets
(Diagram: are records A and B in the same dataset the same entity?)
• Duplicate records in the same application
– in different tables (poor modelling)
– in the same table (poor UI, poor logic, sloppy entry)
6. In data interchange
(Diagram: XML data exchanged between A and B; is the incoming record already present?)
• Receiving data from third parties
– do we have this record already?
7. A difficult problem
• It requires O(n²) comparisons for n records
– a million comparisons for 1000 records, 100 million
for 10,000, ...
• Exact string comparison is not enough
– must handle misspellings, name variants, etc
• Interpreting the data can be difficult even for a
human being
– is the address different because there are two
different people, or because the person moved?
• ...
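The comparison counts quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function names are made up for illustration. The slide counts every record against every record (n²); counting each unordered pair once gives roughly half that.

```python
def naive_comparisons(n: int) -> int:
    """Comparisons when every record is compared with every record (n squared)."""
    return n * n


def distinct_pairs(n: int) -> int:
    """Unordered pairs of distinct records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (1_000, 10_000):
    print(f"{n:>6,} records: {naive_comparisons(n):,} comparisons, "
          f"{distinct_pairs(n):,} distinct pairs")
```

Either way the growth is quadratic, which is the point of the slide.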
8. Record linkage
• Statisticians very often must connect data sets
from different sources
• They call it “record linkage”
– term coined in 1946 [1]
– mathematical foundations laid in 1959 [2]
– formalized in 1969 as “Fellegi-Sunter” model [3]
• A whole subject area has been developed with
well-known techniques, methods, and tools
– these can of course be applied outside of statistics
[1] http://ajph.aphapublications.org/cgi/reprint/36/12/1412
[2] http://www.sciencemag.org/content/130/3381/954.citation
[3] http://www.jstor.org/pss/2286061
9. The basic idea
Field Record 1 Record 2 Probability
Name acme inc acme inc 0.9
Assoc no 177477707 0.5
Zip code 9161 9161 0.6
Country norway norway 0.51
Address 1 mb 113 mailbox 113 0.49
Address 2 0.5
0.931
10. Standard algorithm
• n² comparisons for n records is unacceptable
– must reduce number of direct comparisons
• Solution
– produce a key from field values,
– sort records by the key,
– for each record, compare with its w nearest neighbours in sort order (w is a small, fixed window, not the record count n)
– sometimes several different keys are produced, to increase
chances of finding matches
• Downsides
– requires coming up with a key
– difficult to apply incrementally
– sorting is expensive
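The key-sort-and-window approach described above can be sketched as follows. The key function, the field names (`name`, `zip`), and the sample records are hypothetical and chosen only for illustration; real keys need far more care.

```python
def blocking_key(record: dict) -> str:
    # Hypothetical key: first three letters of the name plus the zip code.
    return record["name"][:3].lower() + record["zip"]


def candidate_pairs(records: list, window: int = 2):
    """Yield id pairs of records at most `window` positions apart after sorting by key."""
    ordered = sorted(records, key=blocking_key)
    for i, record in enumerate(ordered):
        for neighbour in ordered[i + 1 : i + 1 + window]:
            yield record["id"], neighbour["id"]


records = [
    {"id": 1, "name": "Acme Inc", "zip": "9161"},
    {"id": 2, "name": "Bolt AS", "zip": "0150"},
    {"id": 3, "name": "ACME Incorporated", "zip": "9161"},
    {"id": 4, "name": "Zeta Ltd", "zip": "5003"},
]
pairs = list(candidate_pairs(records, window=1))
print(pairs)
```

Only neighbouring records are compared, so the number of comparisons grows with n times the window size rather than n². The price is that a bad key can sort true duplicates far apart.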
11. String comparison
• Measure “distance” between strings
– must handle spelling errors and name variants
– must estimate probability that values belong to same entity
despite differences
• Examples
– John Smith ≈ Jonh Smith
– J. Random Hacker ≈ James Random Hacker
• Many well-known algorithms have been developed
– no one best choice, unfortunately
– some are eager to merge, others less so
• Many are computationally expensive
– O(n²) where n is length of string is common
– this gets very costly when number of record pairs to compare is
already high
12. Existing tools
• Commercial tools
– big, sophisticated, and expensive
– seem to follow the same principles
– presumably also effective
• Open source tools
– generally made by and for statisticians
– nice user interfaces and rich configurability
– advanced maths
– architecture often not as flexible as it could be
14. Context
• Doing a project for Hafslund where we
integrate data from many sources
• Entities are duplicated both inside systems and
across systems
(Diagram: the ERP, CRM, and Billing systems each hold Customers; Suppliers and Companies records also appear)
15. Requirements
• Must be flexible and configurable
– no way to know in advance exactly what data we will need
to deduplicate
• Must scale to large data sets
– CRM alone has 1.4 million customer records
– that's 2 trillion comparisons with naïve approach
• Must have an API
– project uses SDshare protocol everywhere
– must therefore be able to build connectors
• Must be able to work incrementally
– process data as it changes and update conclusions on the
fly
17. Duke
• Java deduplication engine
– released as open source
– http://code.google.com/p/duke/
• Does not use key approach
– instead indexes data with Lucene
– does Lucene searches to find potential matches
• Still a work in progress, but
– high performance,
– several data sources and comparators,
– being used for real in real projects,
– flexible architecture
18. How it works
DataSource → Extract → Map → Clean   (one such chain per data source; three shown)
Index → Search → Compare → Link      (applied to the cleaned records)
Lucene index                         (used by the Index and Search steps)
19. Cleaning of data
Côte D’Ivoire → cote d’ivoire
Main St 113 → 113 main street
Dec 25 1973 → 1973-12-25
• Used to normalize data from sources
– necessary to get rid of differences that don't matter
• Makes comparing much more effective
– pluggable in Duke
– utilities and components already provided
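A minimal sketch of what such cleaners might do, covering the first and third examples above (case and accent folding, and date normalization). The function names are hypothetical and this is not Duke's cleaner API; the address example would need more logic and is left out.

```python
import re
import unicodedata

MONTHS = {name: number for number, name in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}


def clean_text(value: str) -> str:
    """Lower-case, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFD", value)
    without_accents = "".join(ch for ch in decomposed
                              if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", without_accents).strip().lower()


def clean_date(value: str) -> str:
    """Turn a date written like 'Dec 25 1973' into ISO form."""
    month, day, year = value.split()
    return f"{int(year):04d}-{MONTHS[month[:3].lower()]:02d}-{int(day):02d}"


print(clean_text("Côte  D'Ivoire"))
print(clean_date("Dec 25 1973"))
```

Cleaning both sides the same way means the comparators only see differences that might actually matter.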
20. Components
Data sources
• CSV
• JDBC
• Sparql
• NTriples
• <plug in your own>
Comparators
• ExactComparator
• NumericComparator
• SoundexComparator
• TokenizedComparator
• Levenshtein
• JaroWinkler
• Dice coefficient
• <plug in your own>
21. Features
• XML syntax for configuration
• Fairly complete command-line tool
– with debugging support
• API for embedding
• Pluggable data sources, comparators, and
cleaners
• Framework for composing cleaners
• Incremental processing
• High performance
23. Why probabilities?
• Some tools use rules instead
– if field1 and field2 match, then ...
– if field1 and (field3 or field4) match, then ...
• This is easy to understand, but
– there are many, many combinations
– doesn't take inexact matching into account
– adding another field becomes a real pain
24. Probabilities weigh all the evidence
Field Record 1 Record 2 Probability Accum.
Name acme inc acme inc 0.9 0.9
Assoc no 177477707 0.5 0.9
Zip code 9161 9161 0.6 0.931
Country norway norway 0.51 0.934
Address 1 mb 113 mailbox 113 0.3 0.857
Address 2 0.5 0.857
Probabilities combined with Bayes Theorem
Probabilities reduced if values don’t match exactly
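The accumulation in the table can be sketched with the usual two-hypothesis form of Bayes' theorem. That this is exactly the formula Duke uses is an assumption; it does reproduce the Accum. column to within rounding.

```python
def combine(p1: float, p2: float) -> float:
    """Combine two independent match probabilities with Bayes' theorem."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def accumulate(probabilities):
    """Return the running combined probability after each field."""
    accumulated = 0.5          # neutral starting point: no evidence either way
    history = []
    for p in probabilities:
        accumulated = combine(accumulated, p)
        history.append(accumulated)
    return history


# Per-field probabilities from the table: name, assoc no, zip, country, address 1, address 2
history = accumulate([0.9, 0.5, 0.6, 0.51, 0.3, 0.5])
print([round(p, 3) for p in history])
```

A probability of 0.5 leaves the running value unchanged, which is why the missing values (Assoc no, Address 2) have no effect; values above 0.5 pull the result up and values below pull it down.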
25. Linking Mondial and DBpedia
A real-world example
http://code.google.com/p/duke/wiki/LinkingCountries
26. Finding properties to match
• Need properties providing identity evidence
• Matching on the Name, Capital, and Area properties below
• Extracted data to CSV for ease of use
DBPEDIA                                              MONDIAL
Id             http://dbpedia.org/resource/Samoa     Id            17019
Name           Samoa                                 Name          Western Samoa
Founding date  1962-01-01                            Independence  01 01 1962
Capital        Apia                                  Capital       Apia, Samoa
Currency       Tala                                  Population    214384
Area           2831                                  Area          2860
Leader name    Tuilaepa Aiono Sailele Malielegaoi    GDP           415
27. Configuration – data sources
<group>
  <csv>
    <param name="input-file" value="dbpedia.csv"/>
    <param name="header-line" value="false"/>
    <column name="1" property="ID"/>
    <column name="2"
            cleaner="no.priv...CountryNameCleaner"
            property="NAME"/>
    <column name="3"
            property="AREA"/>
    <column name="4"
            cleaner="no.priv...CapitalCleaner"
            property="CAPITAL"/>
  </csv>
</group>
<group>
  <csv>
    <param name="input-file" value="mondial.csv"/>
    <column name="id" property="ID"/>
    <column name="country"
            cleaner="no.priv...examples.CountryNameCleaner"
            property="NAME"/>
    <column name="capital"
            cleaner="no.priv...LowerCaseNormalizeCleaner"
            property="CAPITAL"/>
    <column name="area"
            property="AREA"/>
  </csv>
</group>
Using groups tells Duke that we are linking across two data sets,
not deduplicating by comparing all records against all others
28. Configuration – matching
<object class="no.priv.garshol.duke.NumericComparator"
        name="AreaComparator">
  <param name="min-ratio" value="0.7"/>
</object>
<schema>
  <threshold>0.65</threshold>
  <property type="id">
    <name>ID</name>
  </property>
  <property>
    <name>NAME</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.3</low>
    <high>0.88</high>
  </property>
  <property>
    <name>AREA</name>
    <comparator>AreaComparator</comparator>
    <low>0.2</low>
    <high>0.6</high>
  </property>
  <property>
    <name>CAPITAL</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.4</low>
    <high>0.88</high>
  </property>
</schema>
Duke analyzes this setup and decides only NAME and CAPITAL need to be
searched on in Lucene.
29. Result
• Correct links found: 206 / 217 (94.9%)
• Wrong links found: 0 / 12 (0.0%)
• Unknown links found: 0
• Percent of links correct 100.0%, wrong 0.0%,
unknown 0.0%
• Records with no link: 25
• Precision 100.0%, recall 94.9%, f-number
0.974
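The figures above follow from the standard definitions of precision, recall, and F-measure. A small sketch, using the counts from this slide (206 correct links, 0 wrong links, 217 expected links); the function name is made up for illustration.

```python
def evaluate(correct: int, wrong: int, expected: int):
    """Return (precision, recall, F-measure) for a set of found links."""
    found = correct + wrong
    precision = correct / found if found else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


precision, recall, f_measure = evaluate(correct=206, wrong=0, expected=217)
print(f"precision {precision:.1%}, recall {recall:.1%}, f {f_measure:.3f}")
```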
30. Examples
Name (DBpedia / Mondial)        Area (DBpedia / Mondial)   Capital (DBpedia / Mondial)    Probability
albania / albania               28748 / 28750              tirana / tirane                0.980
kazakhstan / kazakstan          2724900 / 2717300          astana / almaty                0.838
côte d'ivoire / cote divoire    322460 / 322460            yamoussoukro / yamoussoukro    0.975
grande comore / comoros         1148 / 2170                moroni / moroní                0.440
samoa / western samoa           2831 / 2860                apia / apia                    0.824
serbia / serbia and mont        102350 / 88361             sarajevo / sarajevo            0.440
31. Western Samoa or American Samoa?
Field DBpedia Mondial Probability
Name samoa western samoa 0.3
Area 2831 2860 0.6
Capital apia apia 0.88
Probability 0.824
Field DBpedia Mondial Probability
Name samoa american samoa 0.3
Area 2831 199 0.2
Capital apia pago pago 0.4
Probability 0.067
32. An example of failure
Field DBpedia Mondial
Name kazakhstan kazakstan
Area 2724900 2717300
Capital astana almaty
Probability 0.838
• Duke doesn't find this match
– no tokens matching exactly
– Lucene search finds nothing
• Detailed comparison gives correct result
– the problem is Lucene search
• Lucene does have Levenshtein search, but
– in Lucene 3.x it's very slow
– therefore not enabled now
– thinking of adding option to enable where needed
– Lucene 4.x will fix the performance problem
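Why the candidate search misses this pair can be shown with a toy inverted index over exact tokens. This is a plain-Python stand-in for the idea, not Lucene and not Duke's code; the record ids are hypothetical, and only NAME and CAPITAL are indexed, as in the configuration.

```python
from collections import defaultdict


def tokens(record: dict, fields=("name", "capital")) -> set:
    """Exact tokens from the searched fields."""
    return {token for field in fields for token in record[field].split()}


def build_index(records):
    """Map each token to the ids of the records containing it."""
    index = defaultdict(set)
    for record in records:
        for token in tokens(record):
            index[token].add(record["id"])
    return index


def candidates(index, record) -> set:
    """Ids of indexed records sharing at least one exact token with `record`."""
    found = set()
    for token in tokens(record):
        found |= index.get(token, set())
    return found


mondial = [
    {"id": "m-albania", "name": "albania", "capital": "tirane"},
    {"id": "m-kazakstan", "name": "kazakstan", "capital": "almaty"},
]
index = build_index(mondial)
print(candidates(index, {"name": "albania", "capital": "tirana"}))
print(candidates(index, {"name": "kazakhstan", "capital": "astana"}))
```

Albania is found through its name token even though the capital is spelled differently. The Kazakhstan record shares no exact token with its counterpart, so it never reaches the detailed comparison.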
33. The effects of value matching
Case                        Precision  Recall  F
Optimal                     100%       94.9%   97.4%
Cleaning, exact compare     99.3%      73.3%   84.4%
No cleaning, good compare   100%       93.4%   96.6%
No cleaning, exact compare  99.3%      70.9%   82.8%
Cleaning, JaroWinkler       97.6%      96.7%   97.2%
Cleaning, Soundex           95.9%      97.2%   96.6%
35. The big picture
(Architecture diagram. Components: ERP, CRM, Billing, Virtuoso (RDF), search engine, Public 360, Duke, and LinkDB, connected by SDshare feeds.)
Labels on the diagram: “DUPLICATES!”; “contains owl:sameAs and haf:possiblySameAs”; “Can override here” (next to LinkDB)
36. Experiences so far
• Incremental processing works fine
– links added and retracted as data changes
• Performance not an issue at all
– despite there being ~1.4 million records
– requires a little bit of tuning, however
• Matching works well, but not perfect
– data are very noisy and messy
– biggest issue is clusters caused by generic values
– also, matching is hard
38. Levenshtein
• Also known as edit distance
• Measures the number of edit operations
necessary to turn s1 into s2
• Edit operations are
– insert a character
– remove a character
– substitute a character
• Complexity
– O(n²) – naïve algorithm
– O(n + d²) – where d = distance (with optimizations)
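The naïve quadratic algorithm is short enough to show in full. This sketch keeps only two rows of the table at a time. The `similarity` normalization (one minus distance over the longer length) is one common convention and an assumption here, not necessarily what Duke does.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Number of inserts, removals, and substitutions needed to turn s1 into s2."""
    previous = list(range(len(s2) + 1))      # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # remove c1
                current[j - 1] + 1,          # insert c2
                previous[j - 1] + (c1 != c2) # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def similarity(s1: str, s2: str) -> float:
    """Scale the distance to 0.0-1.0, where 1.0 means identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s1, s2) / longest


print(levenshtein("kazakhstan", "kazakstan"), similarity("kazakhstan", "kazakstan"))
```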
40. Weighted Levenshtein
• Not all edit operations are equal
– substituting “i” for “e” is a smaller edit than
substituting “o” for “k”
• Weighted Levenshtein evaluates each edit
operation as a number 0.0-1.0
• Weights depend on context
– language, for example
• Slower, because some Levenshtein
optimizations do not apply with weights
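A sketch of the weighted variant, assuming a lookup table of substitution costs. The single weight below (0.3 for "i" versus "e") is a made-up value for illustration; real weights would have to be tuned per language.

```python
# Hypothetical weights: unordered character pair -> substitution cost.
SUBSTITUTION_COST = {frozenset("ie"): 0.3}


def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUBSTITUTION_COST.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s1: str, s2: str) -> float:
    """Edit distance where each substitution costs between 0.0 and 1.0."""
    previous = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1.0,                          # remove
                current[j - 1] + 1.0,                       # insert
                previous[j - 1] + substitution_cost(c1, c2) # weighted substitute
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("kristin", "kristen"))
print(weighted_levenshtein("o", "k"))
```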
41. Jaro-Winkler
• Developed at the US Bureau of the Census
• For name comparisons
– not well suited to long strings
– best if given name/surname are separated
• Exists in a few variants
– originally proposed by Jaro
– then modified by Winkler
– a few different versions of modifications etc
42. Jaro-Winkler definition
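• Formula: jaro(s1, s2) = (m/|s1| + m/|s2| + (m − t)/m) / 3
– m = number of matching characters
– t = number of transpositions (half the number of matching characters that appear in a different order)
• A character from string s1 matches s2 if the
same character is found in s2 less than half the
length of the string away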
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783
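A sketch of the basic Jaro measure plus Winkler's common-prefix boost. It assumes the usual match window of floor(max length / 2) − 1 and a prefix scaling factor of 0.1. Variants differ in these details, so published values are not always reproduced exactly. With these assumptions, plain Jaro gives the Axel/Aksel figure above.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len(s2), i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2   # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


print(round(jaro("Axel", "Aksel"), 3))
```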
44. Soundex
• A coarse scheme for matching names by sound
– produces a key from the name
– names match if key is the same
• In common use in many places
– Nav's person register uses it for search
– built-in in many databases
– ...
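A sketch of the key generation, following the widely used American Soundex rules (first letter kept, consonants mapped to digits, repeats collapsed, vowels dropped, padded to four characters). Which exact variant the systems mentioned above use is not stated, so treat this as one common version.

```python
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                      # h and w do not separate equal codes
        code = CODES.get(ch, "")          # vowels (and unknown letters) have no code
        if code and code != previous:
            key += code
            if len(key) == 4:
                break
        previous = code                   # a vowel resets the previous code
    return key.ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"))
```

Two names match when their keys are equal, which is why it is cheap enough to use as a search key in a database.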
47. Metaphone
• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
– both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but only available
commercially
49. Norphone
• I'm working on a Norwegian alternative to
Metaphone
– designed to handle phonetic rules in Norwegian
names
– for example, “ch” = “k”
• Work in progress
– http://code.google.com/p/duke/source/browse/src/main/java/no/priv/garshol/duke/comparators/NorphoneComparator.java
50. Dice coefficient
• A similarity measure for sets
– set can be tokens in a string
– or characters in a string
• Formula: dice(A, B) = 2 |A ∩ B| / (|A| + |B|)
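A sketch of the Dice coefficient applied to the tokens of two strings. Treating two empty sets as identical is a choice made here for illustration.

```python
def dice(a: set, b: set) -> float:
    """Twice the size of the overlap divided by the sum of the set sizes."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def token_dice(s1: str, s2: str) -> float:
    return dice(set(s1.split()), set(s2.split()))


print(token_dice("samoa", "western samoa"))
```

The same `dice` function works on sets of characters by passing `set(s1)` and `set(s2)` instead of token sets.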
51. TFIDF
• Compares strings as sets of tokens
– a la Dice coefficient
• However, takes frequency of tokens in corpus
into account
– this matches how we evaluate matches mentally
• Has done well in evaluations
– however, can be difficult to evaluate
– results will change as corpus changes
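A minimal sketch of TF-IDF weighting with cosine similarity. The smoothed IDF formula and the tiny corpus are assumptions made for illustration. The point is only that sharing a rare token ("acme") counts for more than sharing a common one ("inc").

```python
import math
from collections import Counter


def idf_weights(corpus):
    """Return a function giving a smoothed inverse document frequency per token."""
    n = len(corpus)
    document_frequency = Counter(tok for doc in corpus for tok in set(doc.split()))

    def idf(token: str) -> float:
        return math.log((1 + n) / (1 + document_frequency[token])) + 1.0

    return idf


def tfidf_similarity(s1: str, s2: str, idf) -> float:
    """Cosine similarity between the TF-IDF vectors of two strings."""
    v1 = {tok: count * idf(tok) for tok, count in Counter(s1.split()).items()}
    v2 = {tok: count * idf(tok) for tok, count in Counter(s2.split()).items()}
    dot = sum(weight * v2.get(tok, 0.0) for tok, weight in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0


corpus = ["acme inc", "bolt inc", "cargo inc", "delta inc"]
idf = idf_weights(corpus)
rare_shared = tfidf_similarity("acme inc", "acme ltd", idf)
common_shared = tfidf_similarity("acme inc", "bolt inc", idf)
print(rare_shared, common_shared)
```

Because the weights come from the corpus, adding or removing records changes the scores, which is the instability noted above.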
52. More comparators
• Smith-Waterman
– originated in DNA sequencing
• Q-grams distance
– breaks string into sets of pieces of q characters
– then does set similarity comparison
• Monge-Elkan
– similar to Smith-Waterman, but with affine gap distances
– has done very well in evaluations
– costly to evaluate
• Many, many more
– ...
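Of the comparators listed, the q-gram approach is the simplest to sketch. The version below uses q = 2 and Jaccard overlap as the set similarity; both are assumptions, since the slide does not say which set measure is meant.

```python
def qgrams(s: str, q: int = 2) -> set:
    """All substrings of length q (empty if the string is shorter than q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        return 1.0 if s1 == s2 else 0.0
    return len(a & b) / len(a | b)


print(qgram_similarity("kazakhstan", "kazakstan"))
```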
54. Why performance matters
• Performance really is a major issue
– O(n²) behaviour
– slow comparators
• An example
– linking 840,000 authors from Deichmann library
with 200,000 person records from DBpedia
– with naïve in-memory backend this takes ~4 secs per
record in second group
– that's 9.5 days to do the whole job...
55. Bigger data
So... 9.5 days for ~400,000 records
Time to process increases with square of record count
These guys can expect to spend 585 years...
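The quadratic growth can be written as a one-line extrapolation. The baseline (9.5 days for about 400,000 records) is taken from this slide; the doubled record count in the example is hypothetical.

```python
def projected_time(base_time: float, base_records: int, records: int) -> float:
    """Scale a measured run time with the square of the record count."""
    return base_time * (records / base_records) ** 2


# Doubling the record count quadruples the run time.
print(projected_time(9.5, 400_000, 800_000), "days")
```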
56. Obvious solution #1
• Where is the program spending its time?
– mostly in Levenshtein string comparisons
– these actually take a while
• Solution
– switch from n x n array to single array of n² elements (30% off)
– break off comparison if we're going to go over
difference limit, anyway (another 30% off)
• Overall speed improvement ~50%
– so, from 9.5 days to about 4.75 days
– or, from 585 years to 293 years
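Both optimizations can be sketched together: the table is one flat array indexed by row offset, and the comparison stops as soon as every value in a row exceeds the limit, since the final distance can never be smaller than the minimum of any row. Two savings of 30% compound to 0.7 × 0.7 = 0.49 of the original time, which matches the "~50%" above. This is an illustration in Python, not Duke's Java code, and the flat array matters far less for speed in Python than in Java.

```python
def bounded_levenshtein(s1: str, s2: str, limit: int) -> int:
    """Return the edit distance, or limit + 1 as soon as it must exceed limit."""
    if abs(len(s1) - len(s2)) > limit:
        return limit + 1                      # length difference alone is too big
    width = len(s2) + 1
    table = [0] * ((len(s1) + 1) * width)     # one flat array instead of rows
    for j in range(width):
        table[j] = j
    for i in range(1, len(s1) + 1):
        row = i * width
        table[row] = i
        row_min = i
        for j in range(1, width):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            value = min(table[row - width + j] + 1,        # remove
                        table[row + j - 1] + 1,            # insert
                        table[row - width + j - 1] + cost) # substitute
            table[row + j] = value
            row_min = min(row_min, value)
        if row_min > limit:
            return limit + 1                  # cannot get back under the limit
    return min(table[-1], limit + 1)


print(bounded_levenshtein("kitten", "sitting", 5))
print(bounded_levenshtein("abcdef", "uvwxyz", 2))
```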
57. Obvious solution #2
• Parallel computing
– indexing takes just a few minutes (for small case)
– record comparisons are independent of each other
– so, spin up different computers and let them work in
parallel
• Does it work?
– well, let's say we use 20 nodes
– enough to cost a bit, and give us some admin work
– not too many to cause communication issues
– expect speed-up by a factor of 10
• That gives us
– from about 4.75 days to about 11 hours (didn't actually do this...)
– from 293 years to 29 years
58. Obvious solution #3
• Improve the algorithm!
– that is, use the Lucene backend
– now we only compare records matched by searches
– dramatically cuts number of record pairs
• But does it work?
– Deichmann case now runs in 2 hours (single machine)
– performance still not linear
– Mexican runtime therefore probably still too long
(hence their email)
59. Final solution
• Searches can still return huge numbers of
potential candidates
– most relevant ones at the top
– so, we can configure cutoffs (by relevance or number)
• Setting
– min-relevance: 0.99
– max-search-hits: 10
– runs Deichmann in 3 minutes (98.2% accuracy)
– performance now close to linear
– should work for the Mexicans, too
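How the two cutoffs might act on a ranked result list can be sketched as below. The semantics are an assumption (keep hits in rank order until one falls below `min_relevance` or `max_search_hits` have been kept), and the hit list with its scores is made up; this is not Duke's or Lucene's actual implementation.

```python
def apply_cutoffs(hits, min_relevance: float, max_search_hits: int):
    """Keep the top hits from a list of (record_id, relevance) sorted best first."""
    kept = []
    for record_id, relevance in hits:
        if relevance < min_relevance or len(kept) >= max_search_hits:
            break
        kept.append(record_id)
    return kept


# Hypothetical search result, most relevant first.
hits = [("a", 3.2), ("b", 1.4), ("c", 0.7), ("d", 0.2)]
print(apply_cutoffs(hits, min_relevance=0.99, max_search_hits=10))
```

Each record is then compared in detail with at most `max_search_hits` candidates, which is what brings the total work close to linear in the number of records.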
60. Next step
• We can do some more tweaking with Lucene
– can use boosting to improve relevance
– then can cut more records without losing recall
– there are limits, however
• So next step may well be parallel computation
– should be fun