Data Cleaning and Transformation – Record Linkage

  Helena Galhardas
  DEI IST


  (based on the slides: “A Survey of Data Quality Issues in
  Cooperative Information Systems”, Carlo Batini, Tiziana
  Catarci, Monica Scannapieco, 23rd International Conference
  on Conceptual Modelling (ER 2004) and “Searching and
  Integrating Information on the Web, Seminar 3”, Chen Li,
  Univ. Irvine)
Agenda

   Motivation
   Introduction
   Details on specific steps
       Pre-processing
       Comparison functions
       Types of blocking methods
           Sorted Neighborhood Method
           Approximate String Joins over RDBMS
Motivation

   Correlate data from different data sources
       Data is often dirty
       Needs to be cleansed before being used
   Example:
       A hospital needs to merge patient records from
        different data sources
       They have different formats, typos, and
        abbreviations
Example

   Table R                                           Table S

   Name            SSN            Addr               Name           SSN            Addr
   Jack Lemmon     430-871-8294   Maple St           Ton Hanks      234-162-1234   Main Street
   Harrison Ford   292-918-2913   Culver Blvd        Kevin Spacey   928-184-2813   Frost Blvd
   Tom Hanks       234-762-1234   Main St            Jack Lemon     430-817-8294   Maple Street
   …               …              …                  …              …              …

            Find records from different datasets that could be
             the same entity
Another Example

   P. Bernstein, D. Chiu: Using Semi-Joins to Solve
    Relational Queries. JACM 28(1): 25-40(1981)
   Philip A. Bernstein, Dah-Ming W. Chiu,
    Using Semi-Joins to Solve Relational
    Queries, Journal of the ACM (JACM), v.28
    n.1, p.25-40, Jan. 1981
Type of data considered
   (Two) formatted tables
     Homogeneous in common attributes
     Heterogeneous

         Format (Full name vs Acronym)
         Semantics (e.g. Age vs Date of Birth)

         Errors

   (Two) groups of tables
       Dimensional hierarchy
   (Two) XML documents
Record linkage and its evolution towards
object identification in databases ….
  Record linkage
  Do the first record and the second record represent the same aspect of reality?

       Crlo      Batini    55              Carlo    Btini       54


 Object identification in databases
 Do the first group of records and the second group of records represent the same
 aspect of reality?

        Crlo      Batini   Pescara         Pscara    Itly       Europe

        Carlo     Batini   Pscara          Pscara    Italy      Europe
…and object identification in
semistructured documents
   <country>
      <name> United States of America </name>
      <cities> New York, Los Angeles, Chicago </cities>
      <lakes>
           <name> Lake Michigan </name>
      </lakes>
   </country>

   <country>
      United States
      <city> New York </city>
      <city> Los Angeles </city>
      <lakes>
           <lake> Lake Michigan </lake>
      </lakes>
   </country>

   Are the two documents above the same object?
Record linkage

Problem statement:
“Given two relations, identify the potentially
  matched records
       Efficiently and
       Effectively”
   Also known as: approximate duplicate
    detection, entity resolution/identification,
    etc.
Challenges
   How to define good similarity functions?
       Many functions proposed (edit distance, cosine similarity, …)
       Domain knowledge is critical
            Names: “Wall Street Journal” and “LA Times”
            Address: “Main Street” versus “Main St”
        Record-oriented: a pair of records with multiple fields is
         considered at a time
    How to do the matching efficiently?
        Offline join version
        Online (interactive) search
             Nearest-neighbor search
             Range search
        Record-set oriented: a potentially large set (or two sets) of
         records needs to be compared
Introduction
Relevant steps of record linkage techniques

   Input files A and B define the initial search space A x B.
   A search space reduction step produces a reduced search space S’ ⊆ A x B.
   A decision model then assigns each pair in S’ to Match, Possible match, or
    Unmatch.
General strategy
0. Preprocessing
     Standardize fields to compare and correct simple errors

1. Establish a search space reduction method (also called a
   blocking method)
     Given the search space S = A x B of the two files, find a new
      search space S’ contained in S, to apply further steps.
2. Choose comparison function
     Choose the function/set of rules that express the distance
      between pairs of records in S’
3. Choose decision model
     Choose the method for assigning pairs in S’ to M, the set of
      matching records, U, the set of non-matching records, and P, the
      set of possible matches
4. Check effectiveness of method
Phases
    - [Preprocessing]
    - Establish a blocking method
    - Compute comparison function
    - Apply decision model
    - [Check effectiveness of method]
Details on specific steps
Data preprocessing (preparation)
   Parsing: locates, identifies, and isolates individual
    fields in the source files
        Ex: parsing of name and address components into
         consistent packets of information
   Data transformation: simple conversions applied to
    the data in order for them to conform to the data types
    of the corresponding domains
        Ex: data type conversions, range checking
   Data standardization: normalization of the
    information represented in certain fields to a specific
    content format
        Ex: address information, date and time formatting, name
         and title formatting
Comparison functions
   It is not easy to match duplicate records even after
    parsing, data standardization, and identification of
    similar fields
       Misspellings and different conventions for recording the
        same information still result in different, multiple
        representations of a unique object in the database.


Field matching techniques: for measuring the
  similarity of individual fields

Detection of duplicate records: for measuring the
 similarity of entire records
(String) field matching functions
Character-based similarity metrics: handle
 typographical errors well

Token-based similarity metrics: handle
  typographical conventions that lead to
  rearrangement of words
     Ex: “John Smith” vs “Smith, John”

Phonetic similarity measures: handle similarity
 between strings that are not similar at a character
 or token level
     Ex: “Kageonne” and “Cajun”
Character-based similarity metrics (1)

 Hamming distance - counts the number of mismatches
   between two numbers or text strings (fixed length strings)
 Ex: The Hamming distance between “00185” and “00155” is
   1, because there is one mismatch

 Edit distance - the minimum cost to convert one of the
   strings to the other by a sequence of character insertions,
   deletions and replacements. Each one of these
   modifications is assigned a cost value.
 Ex: the edit distance between “Smith” and “Sitch” is 2, since
   “Smith” is obtained by adding “m” and deleting “c” from
   “Sitch”.
 Several variations/improvements: Affine Gap Distance,
   Smith-Waterman distance, etc.
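
As an illustration only (not part of the original slides), a minimal Python sketch of the classic dynamic-programming edit distance; the function name edit_distance is ours:

def edit_distance(s1, s2):
    # Classic dynamic-programming (Levenshtein) edit distance:
    # each insertion, deletion, or replacement costs 1.
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # replacement
        prev = curr
    return prev[n]

print(edit_distance("Smith", "Sitch"))   # 2, as in the example above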
Character-based similarity metrics (2)
Jaro’s algorithm or distance finds the number of common
  characters and the number of transposed characters in the two
  strings. Jaro distance accounts for insertions, deletions, and
  transpositions.
 Common characters are all characters s1[i] and s2[j], for which
  s1[i] =s2[j] and |i-j|<= 1/2min(|s1|, |s2|) , which means
  characters that appear in both strings within a distance of half
  the length of the shorter string. A transposed character is a
  common character that appears in different positions.
Ex: Comparing “Smith” and “Simth”, there are five common
  characters, two of which are transposed
Jaro(s1, s2) = 1/3(Nc/|s1|+Nc/|s2|+(Nc-Nt/2)/Nc)
where Nc is the number of common characters between the two strings
  and Nt is the number of transpositions.
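
A hedged Python sketch of the Jaro measure as defined on this slide (common characters must agree within half the shorter string's length); the helper names are ours:

def jaro(s1, s2):
    # Jaro similarity following the definition above: common characters
    # s1[i] = s2[j] with |i - j| <= floor(min(|s1|, |s2|) / 2).
    if not s1 or not s2:
        return 0.0
    window = min(len(s1), len(s2)) // 2
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                break
    m1 = [c for i, c in enumerate(s1) if matched1[i]]
    m2 = [c for j, c in enumerate(s2) if matched2[j]]
    nc = len(m1)
    if nc == 0:
        return 0.0
    nt = sum(a != b for a, b in zip(m1, m2))   # out-of-order common characters
    return (nc / len(s1) + nc / len(s2) + (nc - nt / 2) / nc) / 3

print(round(jaro("Smith", "Simth"), 3))   # 0.933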
Character-based similarity metrics (3)
  N-grams comparison function forms a vector as the set
    of all the substrings of length n for each string. The
    n-gram representation of a string has a non-zero vector
    component for each n-letter substring it contains, where
    the magnitude is equal to the number of times the
    substring occurs. The distance between the two strings is
    defined as the Euclidean distance (square root of the sum
    of squared component differences) between the two vectors.
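
A short sketch of this n-gram profile distance in Python; the function names are ours and no padding is assumed here:

from collections import Counter

def ngram_profile(s, n=2):
    # Count vector over all substrings of length n.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_distance(s1, s2, n=2):
    # Euclidean distance between the two n-gram count vectors.
    p1, p2 = ngram_profile(s1, n), ngram_profile(s2, n)
    return sum((p1[g] - p2[g]) ** 2 for g in set(p1) | set(p2)) ** 0.5

print(ngram_distance("Smith", "Sitch", n=2))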
Token-based similarity metrics (1)
TF-IDF (Term Frequency-Inverse Document Frequency) or cosine
  similarity assigns higher weights to tokens appearing frequently in a
  document (TF weight) and lower weights to tokens that appear
  frequently in the whole set of documents (IDF weight).
  TFw is the number of times that w appears in the field
  IDFw = |D|/nw,
  where nw is the number of records in the database D that contain the
  word w.
  Each string s is separated into words and each word w is assigned the
  weight:
  vs(w) = log(TFw + 1) * log(IDFw)
The cosine similarity of s1 and s2 is:
  sim(s1, s2) = Σ j=1..|D| vs1(j) * vs2(j) / (||vs1|| * ||vs2||)
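
A minimal sketch of these formulas in Python, treating each record's field value as one "document"; the names tfidf_vectors and cosine are ours, and the example records are illustrative assumptions:

import math
from collections import Counter

def tfidf_vectors(records):
    # records: list of field values; each one plays the role of a document in D.
    docs = [r.lower().split() for r in records]
    df = Counter(w for doc in docs for w in set(doc))   # n_w: records containing w
    D = len(docs)
    return [{w: math.log(tf + 1) * math.log(D / df[w])
             for w, tf in Counter(doc).items()} for doc in docs]

def cosine(v1, v2):
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vecs = tfidf_vectors(["john smith", "smith john", "robert brown", "ann lee"])
print(cosine(vecs[0], vecs[1]))   # 1.0: word order does not matter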
Token-based similarity metrics (2)

   Widely used for matching strings in documents
   The cosine similarity works well for a large variety of
    entries and is insensitive to the location of words
       Ex: “John Smith” is equivalent to “Smith, John”
   The introduction of frequent words only minimally
    affects the similarity of two strings
       Ex: “John Smith” and “Mr John Smith” have similarity
        close to one.
   Does not capture word spelling errors
       Ex: “Compter Science Departament” and “Deprtment of
        Computer Scence” will have similarity equal to zero
Phonetic similarity metrics

Soundex: the most common phonetic coding scheme. It is
  based on the assignment of identical code digits to
  phonetically similar groups of consonants and is used
  mainly to match surnames.
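
A sketch of classic American Soundex in Python, written from the usual published rules (our own implementation, not from the slides):

def soundex(name):
    # Keep the first letter; map the remaining consonants to digits,
    # ignore H/W, treat vowels as separators, collapse repeats, pad to 4.
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "0" if c in "AEIOUY" else "")
        if code == "":                       # H, W: transparent
            continue
        if code != "0" and code != prev:
            result += code
        prev = "" if code == "0" else code   # vowels reset the previous code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163: same code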
Detection of duplicate records

   Two categories of methods for matching records
    with multiple fields:
       Approaches that rely on training data to learn how to
        match the records
           Some probabilistic approaches
               Ex: [Fellegi and Sunter 67]
           Supervised machine learning techniques
               Ex: [Bilenko et al 03]
       Approaches that rely on domain knowledge or on generic
        distance metrics to match records
           Approaches that use declarative languages for matching
               Ex: AJAX
           Approaches that devise distance metrics suitable for duplicate
            detection
               Ex: [Monge and Elkan 96]
General strategy
0. Preprocessing
     Standardize fields to compare and correct simple errors

   1. Establish a search space reduction method (also
    called a blocking method)
     Given the search space S = A x B of the two files, find a new
      search space S’ contained in S, to apply further steps.
2. Choose comparison function
     Choose the function/set of rules that express the distance
      between pairs of records in S’
3. Choose decision model
     Choose the method for assigning pairs in S’ to M, the set of
      matching records, U the set of unmatching records, and P the
      set of possible matches
4. Check effectiveness of method
Types of Blocking/Searching
    methods (1)
   Blocking - partition of the file into mutually
    exclusive blocks.
        Comparisons are restricted to records within each
         block.
   Blocking can be implemented by:
        Sorting the file according to a block key (chosen
         by an expert).
        Hashing - a record is hashed according to its
         block key in a hash block. Only records in the
         same hash block are considered for comparison.
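
A minimal sketch of hash-based blocking in Python; the records, the blocking key (first three letters of the last name plus ZIP code), and all names are illustrative assumptions:

from collections import defaultdict

def block_by_key(records, block_key):
    # Hash each record into the bucket of its blocking key; only records
    # that fall into the same bucket are compared pairwise later.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    return blocks

records = [{"last": "Stolfo", "zip": "10027"}, {"last": "Stolfo", "zip": "10027"},
           {"last": "Smith", "zip": "90210"}]
blocks = block_by_key(records, lambda r: (r["last"][:3].upper(), r["zip"]))
for key, group in blocks.items():
    candidate_pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]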
Types of Blocking/Searching methods
(2)
   Sorted neighbour or Windowing – moving a
    window of a specific size w over the data file,
    comparing only the records that belong to this
    window
   Multiple windowing - Several scans, each of
    which uses a different sorting key may be
    applied to increase the possibility of combining
    matched records
   Windowing with choice of key based on the
    quality of the key
Sorted Neighbour searching method
1. Create Key: Compute a key K for each record in the
  list by extracting relevant fields or portions of fields.
      Relevance is decided by experts.
2. Sort Data: Sort the records in the data list using K
3. Merge: Move a fixed size window through the
  sequential list of records limiting the comparisons for
  matching records to those records in the window. If
  the size of the window is w records, then every new
  record entering the window is compared with the
  previous w – 1 records to find “matching” records
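
The three steps can be sketched in a few lines of Python; make_key and matches stand for the expert-chosen key function and the comparison/decision rule, and are assumptions of this sketch:

def sorted_neighborhood(records, make_key, matches, w=10):
    # 1. Create a key per record, 2. sort by it, 3. slide a window of w
    # records and compare each record with the previous w - 1 records.
    ordered = sorted(records, key=make_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for prev in ordered[max(0, i - (w - 1)):i]:
            if matches(prev, rec):
                pairs.append((prev, rec))
    return pairs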
SNM: Create key

    Compute a key for each record by extracting
     relevant fields or portions of fields
    Example:



    First   Last        Address            ID            Key

    Sal     Stolfo   123 First Street   45678987   STLSAL123FRST456
SNM: Sort Data

   Sort the records in the data list using the key
    in step 1
   This can be very time consuming
        O(N log N) for a good algorithm,
        O(N²) for a bad algorithm
SNM: Merge records

   Move a fixed size
    window through the
    sequential list of
    records.
   This limits the
    comparisons to the
    records in the window
SNM: Considerations

    What is the optimal window size, trading off
        Maximizing accuracy
        Minimizing computational cost
   Execution time for large DB will be bound by
       Disk I/O
       Number of passes over the data set
Selection of Keys

   The effectiveness of the SNM highly depends
    on the key selected to sort the records
   A key is defined to be a sequence of a subset
    of attributes
   Keys must provide sufficient discriminating
    power
Example of Records and Keys


 First    Last         Address            ID            Key

 Sal     Stolfo    123 First Street    45678987   STLSAL123FRST456

 Sal     Stolfo    123 First Street    45678987   STLSAL123FRST456

 Sal     Stolpho   123 First Street    45678987   STLSAL123FRST456

 Sal     Stiles    123 Forest Street   45654321   STLSAL123FRST456
Equational Theory

   The comparison during the merge phase is
    an inferential process
   Compares much more information than
    simply the key
   The more information there is, the better
    inferences can be made
Equational Theory - Example

   Two names are spelled nearly identically and
    have the same address
       It may be inferred that they are the same person
   Two social security numbers are the same
    but the names and addresses are totally
    different
       Could be the same person who moved
       Could be two different people and there is an
        error in the social security number
A simplified rule in English

Given two records, r1 and r2
IF the last name of r1 equals the last name
  of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of
  r2
THEN
  r1 is equivalent to r2
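
A hedged sketch of this rule as a Python predicate; the 0.8 threshold and the use of difflib's ratio for "differ slightly" are illustrative choices, not part of the original rule:

from difflib import SequenceMatcher

def same_person(r1, r2):
    # IF last names are equal AND first names differ only slightly
    # AND addresses are equal THEN treat r1 and r2 as equivalent.
    def similar(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8
    return (r1["last"] == r2["last"]
            and similar(r1["first"], r2["first"])
            and r1["address"] == r2["address"])

print(same_person({"first": "Kathi", "last": "Kason", "address": "48 North St."},
                  {"first": "Kathy", "last": "Kason", "address": "48 North St."}))  # True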
The distance function

   A “distance function” is used to compare
    pieces of data (usually text)
   Apply “distance function” to data that “differ
    slightly”
   Select a threshold to capture obvious
    typographical errors.
       Impacts number of successful matches and
        number of false positives
Examples of matched records

  SSN       Name (First, Initial, Last)        Address
334600443        Lisa Boardman               144 Wars St.
334600443          Lisa Brown                144 Ward St.
525520001        Ramon Bonilla               38 Ward St.
525520001       Raymond Bonilla              38 Ward St.
    0        Diana D. Ambrosion           40 Brik Church Av.
    0        Diana A. Dambrosion          40 Brick Church Av.
789912345         Kathi Kason                48 North St.
879912345         Kathy Kason                48 North St.
879912345         Kathy Kason                48 North St.
879912345         Kathy Smith                48 North St.
Building an equational theory

   The process of creating a good equational
    theory is similar to the process of creating a
    good knowledge-base for an expert system
   In complex problems, an expert’s assistance
    is needed to write the equational theory
Transitive Closure

   In general, no single pass (i.e. no single key)
    will be sufficient to catch all matching records
   An attribute that appears first in the key has
    higher discriminating power than those
    appearing after it
       If an employee has two records in a DB with SSN
        193456782 and 913456782, it’s unlikely they will
        fall under the same window
Transitive Closure

   To increase the number of similar records
    merged
       Widen the scanning window size, w
       Execute several independent runs of the SNM
            Use a different key each time
            Use a relatively small window
            Call this the Multi-Pass approach
Multi-pass approach

   Each independent run of the Multi-Pass
    approach will produce a set of pairs of
    records
   Although one field in a record may be in
    error, another field may not
   Transitive closure can then be applied to the union of
    those pairs to decide which records should be merged
Multi-pass Matches

Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345 )
KSNKAT48NRTH879 (Kathy Kason 879912345 )
Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345 )
KATKSN48NRTH879 (Kathy Kason 879912345 )
Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345 )
48NRTH879SMTKAT (Kathy Smith 879912345 )
Transitive Equality Example

IF A implies B
  AND B implies C
      THEN A implies C
From example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
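
A small union-find sketch in Python showing how the pairs produced by the independent passes can be merged by transitive closure; the function names are ours:

def transitive_closure(pairs):
    # Union-find over matched pairs: if A~B and B~C, then A, B and C
    # end up in the same cluster.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

print(transitive_closure([("A", "B"), ("B", "C")]))   # [{'A', 'B', 'C'}]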
Another approach: Approximate
    String Joins [Gravano 2001]
   We want to join tuples with “similar” string fields
   Similarity measure: Edit Distance
   Each Insertion, Deletion, Replacement increases distance by
    one
           Service A                       Service B
           Jenny Stamatopoulou             Panos Ipirotis        K=1
           John Paul McDougal              Jonh Smith            K=2
           Aldridge Rodriguez              Jenny Stamatopulou    K=1
           Panos Ipeirotis                 John P. McDougal      K=3
           John Smith                      Al Dridge Rodriguez   K=1
           …                               …
     (K is the edit distance between each Service B entry and its
      counterpart in Service A, e.g. “Panos Ipirotis” vs. “Panos Ipeirotis”.)
Focus: Approximate String Joins
over Relational DBMSs

   Join two tables on string attributes and keep all
    pairs of strings with Edit Distance ≤ K

   Solve the problem in a database-friendly way
    (if possible with an existing RDBMS)
Current Approaches for Processing
Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:
 Join data outside of DBMS
 Join data via user-defined functions (UDFs) inside
  the DBMS
Approximate String Joins outside
 of a DBMS
1.   Export data
2.   Join outside of DBMS
3.   Import the result

Main advantage:
   We can exploit any state-of-the-art string-matching
   algorithm, without restrictions from DBMS functionality
Disadvantages:
  Substantial amounts of data to be exported/imported
  Cannot be easily integrated with further processing
   steps in the DBMS
Approximate String Joins with UDFs
1.    Write a UDF to check if two strings match
     within distance K
2.    Write an SQL statement that applies the UDF
     to the string pairs

SELECT R.stringAttr, S.stringAttr
FROM R, S
WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage:
     Ease of implementation
Main disadvantage:
     UDF applied to entire cross-product of relations
Our Approach: Approximate String
 Joins over an Unmodified RDBMS
1.    Preprocess data and generate auxiliary tables
2.    Perform join exploiting standard RDBMS
     capabilities

Advantages
 No modification of underlying RDBMS needed.

 Can leverage the RDBMS query optimizer.

 Much more efficient than the approach based on
  naive UDFs
Intuition and Roadmap
   Intuition:
     Similar strings have many common substrings
     Use exact joins to perform approximate joins
      (current DBMSs are good for exact joins)
     A good candidate set can be verified for false
      positives
   Roadmap:
     Break strings into substrings of length q (q-grams)
     Perform an exact join on the q-grams

     Find candidate string pairs based on the results

     Check only candidate pairs with a UDF to obtain
      final answer
What is a “Q-gram”?
   Q-gram: A sequence of q characters of the original
    string

                      Example for q=3
                          vacations
     {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

          String with length L → L + q - 1 q-grams

   Similar strings have many common q-grams
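
A short sketch in Python of the padded q-gram decomposition used above (padding with # on the left and $ on the right, as in the example):

def qgrams(s, q=3):
    # Pad with q-1 '#' on the left and q-1 '$' on the right, then slide.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("vacations"))
# ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']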
Q-grams and Edit Distance Operations
     With no edits: L + q - 1 common q-grams

     Replacement: (L + q – 1) - q common q-grams
       Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
       Vacalions:  {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}


     Insertion: (Lmax + q – 1) - q common q-grams
       Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
       Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}


     Deletion: (Lmax + q – 1) - q common q-grams
       Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
       Vacaions:   {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}
Number of Common Q-grams and
Edit Distance
   For Edit Distance = K, there could be at most K
    replacements, insertions, deletions

   Two strings S1 and S2 with Edit Distance ≤ K have
    at least [max(S1.len, S2.len) + q - 1] – Kq q-grams
    in common

   Useful filter: eliminate all string pairs without
    "enough" common q-grams (no false dismissals)
Using a DBMS for Q-gram Joins

   If we have the q-grams in the DBMS, we can
    perform this counting efficiently.

   Create auxiliary tables with tuples of the form:
                        <sid, qgram>
    and join these tables

   A GROUP BY – HAVING COUNT clause can
    perform the counting / filtering
Eliminating Candidate Pairs: COUNT
  FILTERING
SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) –
  K*q

The result is the set of string pairs with enough common
  q-grams to ensure that we will not have any false
  negatives.
Eliminating Candidate Pairs
    Further: LENGTH FILTERING
Strings with length difference larger than K cannot
  be within Edit Distance K

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) –
  K*q


       We refer to this filter as LENGTH FILTERING
Exploiting Q-gram Positions for Filtering

     Consider strings aabbzzaacczz and aacczzaabbzz
     Strings are at edit distance 4
     Strings have identical q-grams for q=3

  Problem: Matching q-grams that are at different
    positions in both strings
       Either the matching q-grams do not "originate" from the
        same part of the string, or
       too many edit operations "caused" spurious q-grams
        at various parts of the strings to match
Example of positional q-grams
Positional q-grams of length q=3 for
                john_smith
{(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn_), (6,n_s), (7,_sm), (8,smi),
   (9,mit), (10, ith), (11,th$), (12, h$$)}

Positional q-grams of length q=3 for
                john_a_smith
{(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn_), (6,n_a), (7,_a_), (8,a_s),
   (9,_sm), (10,smi), (11,mit), (12,ith), (13,th$), (14,h$$)}

    Ignoring positional information, the two strings have 11
     q-grams in common
   Only the five first positional q-grams are common in
    both strings
   An additional six positional q-grams in the two strings
    differ in their position by just two positions
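
A sketch of positional q-grams in Python; the pair-counting function mirrors the WHERE clause of the POSITION FILTERING query on the next slide (function names are ours):

def positional_qgrams(s, q=3):
    # Each q-gram keeps its 1-based starting position in the padded string.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def common_positional_pairs(s1, s2, q=3, k=1):
    # Count joining q-gram pairs whose positions differ by at most k.
    return sum(1
               for p1, g1 in positional_qgrams(s1, q)
               for p2, g2 in positional_qgrams(s2, q)
               if g1 == g2 and abs(p1 - p2) <= k)

print(common_positional_pairs("john_smith", "john_a_smith", q=3, k=0))   # 5
print(common_positional_pairs("john_smith", "john_a_smith", q=3, k=2))   # 11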
POSITION FILTERING -
    Filtering using positions
   Keep the position of the q-grams
    <sid, strlen, pos, qgram>
   Do not match q-grams that are more than K positions away

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
      AND abs(R.strlen - S.strlen)<=K
      AND abs(R.pos - S.pos)<=K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q


         We refer to this filter as POSITION FILTERING
The Actual, Complete SQL Statement
 SELECT R1.string, S1.string, R1.sid, S1.sid
 FROM R1, S1, R, S
 WHERE R1.sid=R.sid
       AND S1.sid=S.sid
       AND R.qgram=S.qgram
       AND abs(strlen(R1.string)–strlen(S1.string))<=K
       AND abs(R.pos - S.pos)<=K
 GROUP BY R1.sid, S1.sid, R1.string, S1.string
 HAVING COUNT(*) >=
       (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q
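
A hedged end-to-end sketch of the approach using SQLite from Python: it builds the auxiliary <sid, strlen, pos, qgram> tables and runs essentially the query above. SQLite's relaxed GROUP BY rules allow the bare strlen references in the HAVING clause; a stricter engine would need them in the GROUP BY, and the surviving candidate pairs would still be verified with a real edit-distance check (the UDF of the paper):

import sqlite3

def qgram_rows(strings, q):
    # (sid, strlen, pos, qgram) rows for the auxiliary table of one relation.
    for sid, s in enumerate(strings):
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        for i in range(len(padded) - q + 1):
            yield sid, len(s), i + 1, padded[i:i + q]

def candidate_pairs(strings_r, strings_s, q=3, k=1):
    con = sqlite3.connect(":memory:")
    for name in ("R1", "S1"):
        con.execute(f"CREATE TABLE {name} (sid INTEGER, string TEXT)")
    for name in ("R", "S"):
        con.execute(f"CREATE TABLE {name} (sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT)")
    con.executemany("INSERT INTO R1 VALUES (?, ?)", enumerate(strings_r))
    con.executemany("INSERT INTO S1 VALUES (?, ?)", enumerate(strings_s))
    con.executemany("INSERT INTO R VALUES (?, ?, ?, ?)", qgram_rows(strings_r, q))
    con.executemany("INSERT INTO S VALUES (?, ?, ?, ?)", qgram_rows(strings_s, q))
    return con.execute("""
        SELECT R1.string, S1.string
        FROM R1, S1, R, S
        WHERE R1.sid = R.sid AND S1.sid = S.sid AND R.qgram = S.qgram
          AND abs(R.strlen - S.strlen) <= :k AND abs(R.pos - S.pos) <= :k
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING COUNT(*) >= max(R.strlen, S.strlen) + :q - 1 - :k * :q
        """, {"q": q, "k": k}).fetchall()

print(candidate_pairs(["vacations"], ["vacalions"]))   # [('vacations', 'vacalions')]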
References

    “Data Quality: Concepts, Methodologies and
     Techniques”, C. Batini and M. Scannapieco,
     Springer-Verlag, 2006.
    “Duplicate Record Detection: A Survey”, A.
     Elmagarmid, P. Ipeirotis, V. Verykios, IEEE
     Transactions on Knowledge and Data Engineering,
     Vol. 19, No. 1, Jan. 2007.
    “Approximate String Joins in a Database (Almost) for
     Free”, L. Gravano et al., VLDB 2001.
    “The Merge/Purge Problem for Large Databases”
     (efficient merge and purge), M. Hernández and S. Stolfo,
     SIGMOD 1995.
