Data Cleaning and Transformation – Record Linkage

  Helena Galhardas
  DEI IST


  (based on the slides: “A Survey of Data Quality Issues in
  Cooperative Information Systems”, Carlo Batini, Tiziana
  Catarci, Monica Scannapieco, 23rd International Conference
  on Conceptual Modelling (ER 2004) and “Searching and
  Integrating Information on the Web, Seminar 3”, Chen Li,
  Univ. Irvine)
Agenda

   Motivation
   Introduction
   Details on specific steps
       Pre-processing
       Comparison functions
       Types of blocking methods
           Sorted Neighborhood Method
           Approximate String Joins over RDBMS
Motivation

   Correlate data from different data sources
       Data is often dirty
       Needs to be cleansed before being used
   Example:
       A hospital needs to merge patient records from
        different data sources
       They have different formats, typos, and
        abbreviations
Example

   Table R                                           Table S

   Name            SSN            Addr               Name           SSN            Addr
   Jack Lemmon     430-871-8294   Maple St           Ton Hanks      234-162-1234   Main Street
   Harrison Ford   292-918-2913   Culver Blvd        Kevin Spacey   928-184-2813   Frost Blvd
   Tom Hanks       234-762-1234   Main St            Jack Lemon     430-817-8294   Maple Street
   …               …              …                  …              …              …

            Find records from different datasets that could be
             the same entity
Another Example

   P. Bernstein, D. Chiu: Using Semi-Joins to Solve
    Relational Queries. JACM 28(1): 25-40(1981)
   Philip A. Bernstein, Dah-Ming W. Chiu,
    Using Semi-Joins to Solve Relational
    Queries, Journal of the ACM (JACM), v.28
    n.1, p.25-40, Jan. 1981
Type of data considered
   (Two) formatted tables
     Homogeneous in common attributes
     Heterogeneous

         Format (Full name vs Acronym)
         Semantics (e.g. Age vs Date of Birth)

         Errors

   (Two) groups of tables
       Dimensional hierarchy
   (Two) XML documents
Record linkage and its evolution towards
object identification in databases ….
  Record linkage
  Do the first record and the second record represent the same aspect of reality?

       Crlo      Batini    55              Carlo    Btini       54


 Object identification in databases
 Do the first group of records and the second group of records represent the same
 aspect of reality?

        Crlo      Batini   Pescara         Pscara    Itly       Europe

        Carlo     Batini   Pscara          Pscara    Italy      Europe
…and object identification in
semistructured documents
   <country>
      <name> United States of America </name>
      <cities> New York, Los Angeles, Chicago </cities>
      <lakes>
           <name> Lake Michigan </name>
      </lakes>
   </country>

   <country>
      United States
      <city> New York </city>
      <city> Los Angeles </city>
      <lakes>
           <lake> Lake Michigan </lake>
      </lakes>
   </country>

   Are the two documents above the same object?
Record linkage

Problem statement:
“Given two relations, identify the potentially
  matched records
       Efficiently and
       Effectively”
   Also known as: approximate duplicate
    detection, entity resolution/identification,
    etc.
Challenges
   How to define good similarity functions?
       Many functions proposed (edit distance, cosine similarity, …)
       Domain knowledge is critical
            Names: “Wall Street Journal” and “LA Times”
            Address: “Main Street” versus “Main St”
        Record-oriented: a pair of records with multiple fields is
         considered at a time
    How to do the matching efficiently?
        Offline join version
        Online (interactive) search
             Nearest-neighbor search
             Range search
        Record-set oriented: a potentially large set (or two sets) of
         records needs to be compared
Introduction
Relevant steps of record linkage techniques

   Input files A and B define the initial search space A x B.
   A search space reduction step produces a reduced search space S’ ⊆ A x B.
   A decision model then assigns each pair in S’ to Match, Possible match, or
    Unmatch.
General strategy
0. Preprocessing
     Standardize fields to compare and correct simple errors

1. Establish a search space reduction method (also called a
   blocking method)
     Given the search space S = A x B of the two files, find a new
      search space S’ contained in S, to apply further steps.
2. Choose comparison function
     Choose the function/set of rules that express the distance
      between pairs of records in S’
3. Choose decision model
     Choose the method for assigning pairs in S’ to M, the set of
      matching records, U, the set of non-matching records, and P, the
      set of possible matches
4. Check effectiveness of method
Phases
    - [Preprocessing]
    - Establish a blocking method
    - Compute comparison function
    - Apply decision model
    - [Check effectiveness of method]
Details on specific steps
Data preprocessing (preparation)
   Parsing: locates, identifies, and isolates individual
    fields in the source files
        Ex: parsing of name and address components into
         consistent packets of information
   Data transformation: simple conversions applied to
    the data in order for them to conform to the data types
    of the corresponding domains
        Ex: data type conversions, range checking
   Data standardization: normalization of the
    information represented in certain fields to a specific
    content format
        Ex: address information, date and time formatting, name
         and title formatting
Comparison functions
   It is not easy to match duplicate records even after
    parsing, data standardization, and identification of
    similar fields
       Misspellings and different conventions for recording the
        same information still result in different, multiple
        representations of a unique object in the database.


Field matching techniques: for measuring the
  similarity of individual fields

Detection of duplicate records: for measuring the
 similarity of entire records
(String) field matching functions
Character-based similarity metrics: handle
 typographical errors well

Token-based similarity metrics: handle
  typographical conventions that lead to
  rearrangement of words
     Ex: “John Smith” vs “Smith, John”

Phonetic similarity measures: handle similarity
 between strings that are not similar at a character
 or token level
     Ex: “Kageonne” and “Cajun”
Character-based similarity metrics (1)

 Hamming distance - counts the number of mismatches
   between two numbers or text strings (fixed length strings)
 Ex: The Hamming distance between “00185” and “00155” is
   1, because there is one mismatch

 Edit distance - the minimum cost to convert one of the
   strings to the other by a sequence of character insertions,
   deletions and replacements. Each one of these
   modifications is assigned a cost value.
 Ex: the edit distance between “Smith” and “Sitch” is 2, since
   “Smith” is obtained by adding “m” and deleting “c” from
   “Sitch”.
 Several variations/improvements: Affine Gap Distance,
   Smith-Waterman distance, etc.
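
As an illustration only (not part of the original slides), a minimal Python sketch of the classic dynamic-programming edit distance; the function name edit_distance is ours:

def edit_distance(s1, s2):
    # Classic dynamic-programming (Levenshtein) edit distance:
    # each insertion, deletion, or replacement costs 1.
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # replacement
        prev = curr
    return prev[n]

print(edit_distance("Smith", "Sitch"))   # 2, as in the example above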
Character-based similarity metrics (2)
Jaro’s algorithm or distance finds the number of common
  characters and the number of transposed characters in the two
  strings. Jaro distance accounts for insertions, deletions, and
  transpositions.
 Common characters are all characters s1[i] and s2[j], for which
  s1[i] =s2[j] and |i-j|<= 1/2min(|s1|, |s2|) , which means
  characters that appear in both strings within a distance of half
  the length of the shorter string. A transposed character is a
  common character that appears in different positions.
Ex: Comparing “Smith” and “Simth”, there are five common
  characters, two of which are transposed
Jaro(s1, s2) = 1/3(Nc/|s1|+Nc/|s2|+(Nc-Nt/2)/Nc)
where Nc is the number of common characters between the two strings
  and Nt is the number of transpositions.
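
A hedged Python sketch of the Jaro measure as defined on this slide (common characters must agree within half the shorter string's length); the helper names are ours:

def jaro(s1, s2):
    # Jaro similarity following the definition above: common characters
    # s1[i] = s2[j] with |i - j| <= floor(min(|s1|, |s2|) / 2).
    if not s1 or not s2:
        return 0.0
    window = min(len(s1), len(s2)) // 2
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                break
    m1 = [c for i, c in enumerate(s1) if matched1[i]]
    m2 = [c for j, c in enumerate(s2) if matched2[j]]
    nc = len(m1)
    if nc == 0:
        return 0.0
    nt = sum(a != b for a, b in zip(m1, m2))   # out-of-order common characters
    return (nc / len(s1) + nc / len(s2) + (nc - nt / 2) / nc) / 3

print(round(jaro("Smith", "Simth"), 3))   # 0.933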
Character-based similarity metrics (3)
  N-grams comparison function forms a vector as the set
    of all the substrings of length n for each string. The
    n-gram representation of a string has a non-zero vector
    component for each n-letter substring it contains, where
    the magnitude is equal to the number of times the
    substring occurs. The distance between the two strings is
    defined as the Euclidean distance (square root of the sum
    of squared component differences) between the two vectors.
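
A short sketch of this n-gram profile distance in Python; the function names are ours and no padding is assumed here:

from collections import Counter

def ngram_profile(s, n=2):
    # Count vector over all substrings of length n.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_distance(s1, s2, n=2):
    # Euclidean distance between the two n-gram count vectors.
    p1, p2 = ngram_profile(s1, n), ngram_profile(s2, n)
    return sum((p1[g] - p2[g]) ** 2 for g in set(p1) | set(p2)) ** 0.5

print(ngram_distance("Smith", "Sitch", n=2))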
Token-based similarity metrics (1)
TF-IDF (Term Frequency-Inverse Document Frequency) or cosine
  similarity assigns higher weights to tokens appearing frequently in a
  document (TF weight) and lower weights to tokens that appear
  frequently in the whole set of documents (IDF weight).
  TFw is the number of times that w appears in the field
  IDFw = |D|/nw,
  where nw is the number of records in the database D that contain the
  word w.
  Each string s is separated into words and each word w is assigned the
  weight:
  vs(w) = log(TFw + 1) * log(IDFw)
The cosine similarity of s1 and s2 is:
  sim(s1, s2) = Σ j=1..|D| vs1(j) * vs2(j) / (||vs1|| * ||vs2||)
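
A minimal sketch of these formulas in Python, treating each record's field value as one "document"; the names tfidf_vectors and cosine are ours, and the example records are illustrative assumptions:

import math
from collections import Counter

def tfidf_vectors(records):
    # records: list of field values; each one plays the role of a document in D.
    docs = [r.lower().split() for r in records]
    df = Counter(w for doc in docs for w in set(doc))   # n_w: records containing w
    D = len(docs)
    return [{w: math.log(tf + 1) * math.log(D / df[w])
             for w, tf in Counter(doc).items()} for doc in docs]

def cosine(v1, v2):
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vecs = tfidf_vectors(["john smith", "smith john", "robert brown", "ann lee"])
print(cosine(vecs[0], vecs[1]))   # 1.0: word order does not matter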
Token-based similarity metrics (2)

   Widely used for matching strings in documents
   The cosine similarity works well for a large variety of
    entries and is insensitive to the location of words
       Ex: “John Smith” is equivalent to “Smith, John”
   The introduction of frequent words only minimally
    affects the similarity of two strings
       Ex: “John Smith” and “Mr John Smith” have similarity
        close to one.
   Does not capture word spelling errors
       Ex: “Compter Science Departament” and “Deprtment of
        Computer Scence” will have similarity equal to zero
Phonetic similarity metrics

Soundex: the most common phonetic coding scheme. It is
  based on the assignment of identical code digits to
  phonetically similar groups of consonants and is used
  mainly to match surnames.
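
A sketch of classic American Soundex in Python, written from the usual published rules (our own implementation, not from the slides):

def soundex(name):
    # Keep the first letter; map the remaining consonants to digits,
    # ignore H/W, treat vowels as separators, collapse repeats, pad to 4.
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "0" if c in "AEIOUY" else "")
        if code == "":                       # H, W: transparent
            continue
        if code != "0" and code != prev:
            result += code
        prev = "" if code == "0" else code   # vowels reset the previous code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163: same code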
Detection of duplicate records

   Two categories of methods for matching records
    with multiple fields:
       Approaches that rely on training data to learn how to
        match the records
           Some probabilistic approaches
               Ex: [Fellegi and Sunter 67]
           Supervised machine learning techniques
               Ex: [Bilenko et al 03]
       Approaches that rely on domain knowledge or on generic
        distance metrics to match records
           Approaches that use declarative languages for matching
               Ex: AJAX
           Approaches that devise distance metrics suitable for duplicate
            detection
               Ex: [Monge and Elkan 96]
General strategy
0. Preprocessing
     Standardize fields to compare and correct simple errors

   1. Establish a search space reduction method (also
    called a blocking method)
     Given the search space S = A x B of the two files, find a new
      search space S’ contained in S, to apply further steps.
2. Choose comparison function
     Choose the function/set of rules that express the distance
      between pairs of records in S’
3. Choose decision model
     Choose the method for assigning pairs in S’ to M, the set of
      matching records, U the set of unmatching records, and P the
      set of possible matches
4. Check effectiveness of method
Types of Blocking/Searching
    methods (1)
   Blocking - partition of the file into mutually
    exclusive blocks.
        Comparisons are restricted to records within each
         block.
   Blocking can be implemented by:
        Sorting the file according to a block key (chosen
         by an expert).
        Hashing - a record is hashed according to its
         block key in a hash block. Only records in the
         same hash block are considered for comparison.
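
A minimal sketch of hash-based blocking in Python; the records, the blocking key (first three letters of the last name plus ZIP code), and all names are illustrative assumptions:

from collections import defaultdict

def block_by_key(records, block_key):
    # Hash each record into the bucket of its blocking key; only records
    # that fall into the same bucket are compared pairwise later.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    return blocks

records = [{"last": "Stolfo", "zip": "10027"}, {"last": "Stolfo", "zip": "10027"},
           {"last": "Smith", "zip": "90210"}]
blocks = block_by_key(records, lambda r: (r["last"][:3].upper(), r["zip"]))
for key, group in blocks.items():
    candidate_pairs = [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]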
Types of Blocking/Searching methods
(2)
   Sorted neighbour or Windowing – moving a
    window of a specific size w over the data file,
    comparing only the records that belong to this
    window
   Multiple windowing - Several scans, each of
    which uses a different sorting key may be
    applied to increase the possibility of combining
    matched records
   Windowing with choice of key based on the
    quality of the key
Sorted Neighbour searching method
1. Create Key: Compute a key K for each record in the
  list by extracting relevant fields or portions of fields.
      Relevance is decided by experts.
2. Sort Data: Sort the records in the data list using K
3. Merge: Move a fixed size window through the
  sequential list of records limiting the comparisons for
  matching records to those records in the window. If
  the size of the window is w records, then every new
  record entering the window is compared with the
  previous w – 1 records to find “matching” records
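
The three steps can be sketched in a few lines of Python; make_key and matches stand for the expert-chosen key function and the comparison/decision rule, and are assumptions of this sketch:

def sorted_neighborhood(records, make_key, matches, w=10):
    # 1. Create a key per record, 2. sort by it, 3. slide a window of w
    # records and compare each record with the previous w - 1 records.
    ordered = sorted(records, key=make_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for prev in ordered[max(0, i - (w - 1)):i]:
            if matches(prev, rec):
                pairs.append((prev, rec))
    return pairs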
SNM: Create key

    Compute a key for each record by extracting
     relevant fields or portions of fields
    Example:



    First   Last        Address            ID            Key

    Sal     Stolfo   123 First Street   45678987   STLSAL123FRST456
SNM: Sort Data

   Sort the records in the data list using the key
    in step 1
   This can be very time consuming
        O(N log N) for a good algorithm,
        O(N²) for a bad algorithm
SNM: Merge records

   Move a fixed size
    window through the
    sequential list of
    records.
   This limits the
    comparisons to the
    records in the window
SNM: Considerations

    What is the optimal window size, trading off
        Maximizing accuracy
        Minimizing computational cost
   Execution time for large DB will be bound by
       Disk I/O
       Number of passes over the data set
Selection of Keys

   The effectiveness of the SNM highly depends
    on the key selected to sort the records
   A key is defined to be a sequence of a subset
    of attributes
   Keys must provide sufficient discriminating
    power
Example of Records and Keys


 First    Last         Address            ID            Key

 Sal     Stolfo    123 First Street    45678987   STLSAL123FRST456

 Sal     Stolfo    123 First Street    45678987   STLSAL123FRST456

 Sal     Stolpho   123 First Street    45678987   STLSAL123FRST456

 Sal     Stiles    123 Forest Street   45654321   STLSAL123FRST456
Equational Theory

   The comparison during the merge phase is
    an inferential process
   Compares much more information than
    simply the key
   The more information there is, the better
    inferences can be made
Equational Theory - Example

   Two names are spelled nearly identically and
    have the same address
       It may be inferred that they are the same person
   Two social security numbers are the same
    but the names and addresses are totally
    different
       Could be the same person who moved
       Could be two different people and there is an
        error in the social security number
A simplified rule in English

Given two records, r1 and r2
IF the last name of r1 equals the last name
  of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of
  r2
THEN
  r1 is equivalent to r2
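
A hedged sketch of this rule as a Python predicate; the 0.8 threshold and the use of difflib's ratio for "differ slightly" are illustrative choices, not part of the original rule:

from difflib import SequenceMatcher

def same_person(r1, r2):
    # IF last names are equal AND first names differ only slightly
    # AND addresses are equal THEN treat r1 and r2 as equivalent.
    def similar(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8
    return (r1["last"] == r2["last"]
            and similar(r1["first"], r2["first"])
            and r1["address"] == r2["address"])

print(same_person({"first": "Kathi", "last": "Kason", "address": "48 North St."},
                  {"first": "Kathy", "last": "Kason", "address": "48 North St."}))  # True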
The distance function

   A “distance function” is used to compare
    pieces of data (usually text)
   Apply “distance function” to data that “differ
    slightly”
   Select a threshold to capture obvious
    typographical errors.
       Impacts number of successful matches and
        number of false positives
Examples of matched records

  SSN       Name (First, Initial, Last)        Address
334600443        Lisa Boardman               144 Wars St.
334600443          Lisa Brown                144 Ward St.
525520001        Ramon Bonilla               38 Ward St.
525520001       Raymond Bonilla              38 Ward St.
    0        Diana D. Ambrosion           40 Brik Church Av.
    0        Diana A. Dambrosion          40 Brick Church Av.
789912345         Kathi Kason                48 North St.
879912345         Kathy Kason                48 North St.
879912345         Kathy Kason                48 North St.
879912345         Kathy Smith                48 North St.
Building an equational theory

   The process of creating a good equational
    theory is similar to the process of creating a
    good knowledge-base for an expert system
   In complex problems, an expert’s assistance
    is needed to write the equational theory
Transitive Closure

   In general, no single pass (i.e. no single key)
    will be sufficient to catch all matching records
   An attribute that appears first in the key has
    higher discriminating power than those
    appearing after it
       If an employee has two records in a DB with SSN
        193456782 and 913456782, it’s unlikely they will
        fall under the same window
Transitive Closure

   To increase the number of similar records
    merged
       Widen the scanning window size, w
       Execute several independent runs of the SNM
            Use a different key each time
            Use a relatively small window
            Call this the Multi-Pass approach
Multi-pass approach

   Each independent run of the Multi-Pass
    approach will produce a set of pairs of
    records
   Although one field in a record may be in
    error, another field may not
   Transitive closure can then be applied to the union of
    those pairs to decide which records should be merged
Multi-pass Matches

Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345 )
KSNKAT48NRTH879 (Kathy Kason 879912345 )
Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345 )
KATKSN48NRTH879 (Kathy Kason 879912345 )
Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345 )
48NRTH879SMTKAT (Kathy Smith 879912345 )
Transitive Equality Example

IF A implies B
  AND B implies C
      THEN A implies C
From example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
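
A small union-find sketch in Python showing how the pairs produced by the independent passes can be merged by transitive closure; the function names are ours:

def transitive_closure(pairs):
    # Union-find over matched pairs: if A~B and B~C, then A, B and C
    # end up in the same cluster.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

print(transitive_closure([("A", "B"), ("B", "C")]))   # [{'A', 'B', 'C'}]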
Another approach: Approximate
    String Joins [Gravano 2001]
   We want to join tuples with “similar” string fields
   Similarity measure: Edit Distance
   Each Insertion, Deletion, Replacement increases distance by
    one
           Service A                       Service B
           Jenny Stamatopoulou             Panos Ipirotis        K=1
           John Paul McDougal              Jonh Smith            K=2
           Aldridge Rodriguez              Jenny Stamatopulou    K=1
           Panos Ipeirotis                 John P. McDougal      K=3
           John Smith                      Al Dridge Rodriguez   K=1
           …                               …
     (K is the edit distance between each Service B entry and its
      counterpart in Service A, e.g. “Panos Ipirotis” vs. “Panos Ipeirotis”.)
Focus: Approximate String Joins
over Relational DBMSs

   Join two tables on string attributes and keep all
    pairs of strings with Edit Distance ≤ K

   Solve the problem in a database-friendly way
    (if possible with an existing RDBMS)
Current Approaches for Processing
Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions:
 Join data outside of DBMS
 Join data via user-defined functions (UDFs) inside
  the DBMS
Approximate String Joins outside
 of a DBMS
1.   Export data
2.   Join outside of DBMS
3.   Import the result

Main advantage:
   We can exploit any state-of-the-art string-matching
   algorithm, without restrictions from DBMS functionality
Disadvantages:
  Substantial amounts of data to be exported/imported
  Cannot be easily integrated with further processing
   steps in the DBMS
Approximate String Joins with UDFs
1.    Write a UDF to check if two strings match
     within distance K
2.    Write an SQL statement that applies the UDF
     to the string pairs

SELECT R.stringAttr, S.stringAttr
FROM R, S
WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage:
     Ease of implementation
Main disadvantage:
     UDF applied to entire cross-product of relations
Our Approach: Approximate String
 Joins over an Unmodified RDBMS
1.    Preprocess data and generate auxiliary tables
2.    Perform join exploiting standard RDBMS
     capabilities

Advantages
 No modification of underlying RDBMS needed.

 Can leverage the RDBMS query optimizer.

 Much more efficient than the approach based on
  naive UDFs
Intuition and Roadmap
   Intuition:
     Similar strings have many common substrings
     Use exact joins to perform approximate joins
      (current DBMSs are good for exact joins)
     A good candidate set can be verified for false
      positives
   Roadmap:
     Break strings into substrings of length q (q-grams)
     Perform an exact join on the q-grams

     Find candidate string pairs based on the results

     Check only candidate pairs with a UDF to obtain
      final answer
What is a “Q-gram”?
   Q-gram: A sequence of q characters of the original
    string

                      Example for q=3
                          vacations
     {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

          String with length L → L + q - 1 q-grams

   Similar strings have many common q-grams
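
A short sketch in Python of the padded q-gram decomposition used above (padding with # on the left and $ on the right, as in the example):

def qgrams(s, q=3):
    # Pad with q-1 '#' on the left and q-1 '$' on the right, then slide.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgrams("vacations"))
# ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']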
Q-grams and Edit Distance Operations
     With no edits: L + q - 1 common q-grams

     Replacement: (L + q – 1) - q common q-grams
       Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
       Vacalions:  {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}


     Insertion: (Lmax + q – 1) - q common q-grams
       Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
       Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}


     Deletion: (Lmax + q – 1) - q common q-grams
       Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
       Vacaions:   {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}
Number of Common Q-grams and
Edit Distance
   For Edit Distance = K, there could be at most K
    replacements, insertions, deletions

   Two strings S1 and S2 with Edit Distance ≤ K have
    at least [max(S1.len, S2.len) + q - 1] – Kq q-grams
    in common

   Useful filter: eliminate all string pairs without
    "enough" common q-grams (no false dismissals)
Using a DBMS for Q-gram Joins

   If we have the q-grams in the DBMS, we can
    perform this counting efficiently.

   Create auxiliary tables with tuples of the form:
                        <sid, qgram>
    and join these tables

   A GROUP BY – HAVING COUNT clause can
    perform the counting / filtering
Eliminating Candidate Pairs: COUNT
  FILTERING
SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) –
  K*q

The result is the set of string pairs with enough common
  q-grams to ensure that we will not have any false
  negatives.
Eliminating Candidate Pairs
    Further: LENGTH FILTERING
Strings with length difference larger than K cannot
  be within Edit Distance K

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) –
  K*q


       We refer to this filter as LENGTH FILTERING
Exploiting Q-gram Positions for Filtering

     Consider strings aabbzzaacczz and aacczzaabbzz
     Strings are at edit distance 4
     Strings have identical q-grams for q=3

  Problem: Matching q-grams that are at different
    positions in both strings
       Either the matching q-grams do not "originate" from the
        same part of the string, or
       too many edit operations "caused" spurious q-grams
        at various parts of the strings to match
Example of positional q-grams
Positional q-grams of length q=3 for
                john_smith
{(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn_), (6,n_s), (7,_sm), (8,smi),
   (9,mit), (10, ith), (11,th$), (12, h$$)}

Positional q-grams of length q=3 for
                john_a_smith
{(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn_), (6,n_a), (7,_a_), (8,a_s),
   (9,_sm), (10,smi), (11,mit), (12,ith), (13,th$), (14,h$$)}

    Ignoring positional information, the two strings have 11
     q-grams in common
   Only the five first positional q-grams are common in
    both strings
   An additional six positional q-grams in the two strings
    differ in their position by just two positions
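
A sketch of positional q-grams in Python; the pair-counting function mirrors the WHERE clause of the POSITION FILTERING query on the next slide (function names are ours):

def positional_qgrams(s, q=3):
    # Each q-gram keeps its 1-based starting position in the padded string.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def common_positional_pairs(s1, s2, q=3, k=1):
    # Count joining q-gram pairs whose positions differ by at most k.
    return sum(1
               for p1, g1 in positional_qgrams(s1, q)
               for p2, g2 in positional_qgrams(s2, q)
               if g1 == g2 and abs(p1 - p2) <= k)

print(common_positional_pairs("john_smith", "john_a_smith", q=3, k=0))   # 5
print(common_positional_pairs("john_smith", "john_a_smith", q=3, k=2))   # 11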
POSITION FILTERING -
    Filtering using positions
   Keep the position of the q-grams
    <sid, strlen, pos, qgram>
   Do not match q-grams that are more than K positions away

SELECT R.sid, S.sid
FROM R, S
WHERE R.qgram=S.qgram
      AND abs(R.strlen - S.strlen)<=K
      AND abs(R.pos - S.pos)<=K
GROUP BY R.sid, S.sid
HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q


         We refer to this filter as POSITION FILTERING
The Actual, Complete SQL Statement
 SELECT R1.string, S1.string, R1.sid, S1.sid
 FROM R1, S1, R, S
 WHERE R1.sid=R.sid
       AND S1.sid=S.sid
       AND R.qgram=S.qgram
       AND abs(strlen(R1.string)–strlen(S1.string))<=K
       AND abs(R.pos - S.pos)<=K
 GROUP BY R1.sid, S1.sid, R1.string, S1.string
 HAVING COUNT(*) >=
       (max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q
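
A hedged end-to-end sketch of the approach using SQLite from Python: it builds the auxiliary <sid, strlen, pos, qgram> tables and runs essentially the query above. SQLite's relaxed GROUP BY rules allow the bare strlen references in the HAVING clause; a stricter engine would need them in the GROUP BY, and the surviving candidate pairs would still be verified with a real edit-distance check (the UDF of the paper):

import sqlite3

def qgram_rows(strings, q):
    # (sid, strlen, pos, qgram) rows for the auxiliary table of one relation.
    for sid, s in enumerate(strings):
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        for i in range(len(padded) - q + 1):
            yield sid, len(s), i + 1, padded[i:i + q]

def candidate_pairs(strings_r, strings_s, q=3, k=1):
    con = sqlite3.connect(":memory:")
    for name in ("R1", "S1"):
        con.execute(f"CREATE TABLE {name} (sid INTEGER, string TEXT)")
    for name in ("R", "S"):
        con.execute(f"CREATE TABLE {name} (sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT)")
    con.executemany("INSERT INTO R1 VALUES (?, ?)", enumerate(strings_r))
    con.executemany("INSERT INTO S1 VALUES (?, ?)", enumerate(strings_s))
    con.executemany("INSERT INTO R VALUES (?, ?, ?, ?)", qgram_rows(strings_r, q))
    con.executemany("INSERT INTO S VALUES (?, ?, ?, ?)", qgram_rows(strings_s, q))
    return con.execute("""
        SELECT R1.string, S1.string
        FROM R1, S1, R, S
        WHERE R1.sid = R.sid AND S1.sid = S.sid AND R.qgram = S.qgram
          AND abs(R.strlen - S.strlen) <= :k AND abs(R.pos - S.pos) <= :k
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING COUNT(*) >= max(R.strlen, S.strlen) + :q - 1 - :k * :q
        """, {"q": q, "k": k}).fetchall()

print(candidate_pairs(["vacations"], ["vacalions"]))   # [('vacations', 'vacalions')]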
References

    “Data Quality: Concepts, Methodologies and
     Techniques”, C. Batini and M. Scannapieco,
     Springer-Verlag, 2006.
    “Duplicate Record Detection: A Survey”, A.
     Elmagarmid, P. Ipeirotis, V. Verykios, IEEE
     Transactions on Knowledge and Data Engineering,
     Vol. 19, No. 1, Jan. 2007.
    “Approximate String Joins in a Database (Almost) for
     Free”, L. Gravano et al., VLDB 2001.
    “The Merge/Purge Problem for Large Databases”
     (efficient merge and purge), M. Hernández and S. Stolfo,
     SIGMOD 1995.
