SlideShare uma empresa Scribd logo
1 de 29
Baixar para ler offline
Time-completeness trade-offs in record linkage
            using Adaptive query Processing
         Paolo Missier, Alvaro A. A. Fernandes
    School of Computer Science, University of Manchester

            Roald Lengu, Giovanna Guerrini
              DISI, Universita' di Genova, Italy

                       Marco Mesiti
              DiCo, Universita' di Milano, Italy


                SEBD 2009, Camogli, Italy
                      June, 2009
Conclusions
• Optimistic approach to online record linkage
     – Based on implicit referential integrity assumption
     – When assumption not true, goes back to pessimistic


• Technical approach based on autonomic computing
     – Adaptive query processing
     – Mix of exact / approximate physical join operators


• Applications: on-the-fly integration scenarios (mashups,
  personal dataspaces, sensor data streams)

• Need to relax the referential integrity assumption -- use
  other sources for result size estimation


P. Missier, SEBD, June 2009, Camogli, Italy
Context: On-the-fly data integration
 • “Situational Applications”
 • Data mashups and personal dataspaces
 • slight mismatches in records lead to incomplete
   integration
      – due to different encodings, conventions
      – due to errors in data
                          R       A=B         S
     C        D       A                           B        A

                                          Madchester


     Y      X     Manchester
                                                               A case of record linkage
                                          Manchester       Z




          Y       X   Manchester        Manchester     Z

                                                                                   3
P. Missier, SEBD, June 2009, Camogli, Italy
Offline vs online linkage
  • Offline record linkage:
       – performed once before queries involving joins
       – 1. reconcile R and S on joining attributes R.A, S.B using
         your favourite record linkage technique
                                         R,S →    R’,S’

       – 2. perform regular equijoin on the transformed tables:

                                              R   S
       ➡ ok for tables that can be analysed ahead of the join
       ➡ ok when multiple queries issued on integrated tables




                                                                             4
P. Missier, SEBD, June 2009, Camogli, Italy
Offline vs online linkage
  • Offline record linkage:
       – performed once before queries involving joins
       – 1. reconcile R and S on joining attributes R.A, S.B using
         your favourite record linkage technique
                                         R,S →    R’,S’

       – 2. perform regular equijoin on the transformed tables:

                                              R   S
       ➡ ok for tables that can be analysed ahead of the join
       ➡ ok when multiple queries issued on integrated tables


             • Online linkage:
                  – performed while answering a query
                  – exact join similarity (or approximate)              join
                                                                               4
P. Missier, SEBD, June 2009, Camogli, Italy
Record linkage and similarity joins
     Historical timeline:




     from:
     N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and
     algorithms. Tutorial in SIGMOD '06.
                                                                                      5
P. Missier, SEBD, June 2009, Camogli, Italy
Record linkage and similarity joins
     Historical timeline:




     from:
     N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and
     algorithms. Tutorial in SIGMOD '06.
                                                                                      5
P. Missier, SEBD, June 2009, Camogli, Italy
Measuring string similarity using q-grams
• q-grams map string s to a set q(s) of substrings of length q:
   Ex.: 3-grams:

     q(“Manchester”) =
      {‘Man’, ‘anc’, ‘nch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}.

     q(“Madchester”) =
      {‘Mad’, ‘adc’, ‘dch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}.
                        |q(s1 ) ∩ q(s2 )|
        sim(s1 , s2 ) =                               (Jaccard coefficient)
                        |q(s1 ) ∪ q(s2 )|


                    sim(r1 , r2 ) < θ1 → not match
        θ2 < sim(r1 , r2 ) → match


P. Missier, SEBD, June 2009, Camogli, Italy
Similarity symmetric hash join
                 R                            S

            insert                                insert
  R hash table
                           probe                  S hash table
                                                                 set intersection computed
        m       x                                                using a variation of the
                           probe
        n        y
                                              y      r           symmetric hash join
                                              x      s
                                                                 new primitive join
                                                                 operator [CGK06]
                         [R.m,S.s]
                         [R.n, S.r]


    • Main sources of complexity:
         – overhead for storing and indexing q-grams
         – cost of computing set intersection

                                      Efficient relational representation:
                               [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik,
                     “A primitive operator for similarity joins in data cleaning” (ICDE’06)‫‏‬   7
P. Missier, SEBD, June 2009, Camogli, Italy
Is full similarity join always necessary?
• Pessimistic: always pays full complexity cost
• Typical mismatch rate in real datasets around 5%

   Research Goal: explore optimistic approach
 • detect mismatches and react as you go
 • requires estimates of incremental join result size
 • statistical + reactive
           • expect to sacrifice join result completeness for
               faster execution

     Approach:
     - combine exact and approximate join operators
     using Adaptive Query Processing techniques

                                                                        8
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework




    [KC03] J. O. Kephart and D. M. Chess. The vision of
    autonomic computing. IEEE Computer, 36(1):41–50, 2003.
                                                                      9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework


                                                    monitor




                                         respond              assess




                                                                           9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework

                                                                monitor incremental
                                                                  join result size
                                                    monitor




                                         respond              assess




                                                                                      9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework

                                                                monitor incremental
                                                                  join result size
                                                    monitor


                                                                               estimate
                                                                           join result size

                                         respond              assess

                                                                            compute
                                                                           divergence
                                                                            from exp




                                                                                              9
P. Missier, SEBD, June 2009, Camogli, Italy
Autonomic computing framework

                                                                monitor incremental
                                                                  join result size
                                                    monitor


                                                                               estimate
                                                                           join result size

                                         respond              assess

                                                                            compute
               swap                                                        divergence
               join                                                         from exp
             operators



   • when using exact join:
     • if observed/estimated sizes diverge “too much”, then switch to
     approximate join
   • when using approximate join:
     • if observed/estimated converge, then switch to exact join
                                                                                              9
P. Missier, SEBD, June 2009, Camogli, Italy
Technical approach and challenges


  • Assess:
        – Need additional assumption for result size estimation
        – Estimating result size at specific points during join execution


  • Respond:
        – When and how can physical join operators switched?
        – Can we avoid loss of work?
           • operator replacement in pipelined query plans [EFP06]




  [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
  replacement of pipelined physical join operators in adaptive query processing. In EDBT
  Workshops 2006, LNCS 4254
                                                                                     10
P. Missier, SEBD, June 2009, Camogli, Italy
Assessment
• Expectation of matching records
• Used to derive simple result size estimation model
                      R (parent)              S (child)
                                                                        symmetric hash
                  c       x                                     n       join, again
                                                y         b
                  d       y
                                                x         a



• When there are no mismatches:
  after scanning n < |S| tuples on S:
    P(a=x in |S| has been matched) = P(tuple c=x is in top n of R) = n/|R|

    Thus, join result size On after n tuples is a binomial random
    variable:                           n
                                     On ∼ bin(n,                    )
                                                              |R|
                                                                                    11
P. Missier, SEBD, June 2009, Camogli, Italy
Detecting divergent observed result size
             ¯
Observation On is outlier wrt expected result size On
=> divergence             ¯
                Pn,p(n) (On ≤ O) ≤ θout
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)




                                                          12
Detecting divergent observed result size
             ¯
Observation On is outlier wrt expected result size On
=> divergence             ¯
                Pn,p(n) (On ≤ O) ≤ θout
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)




     outlier detection on experimental datasets with various
                        mismatch patterns



                                                               12
Detecting divergent observed result size
             ¯
Observation On is outlier wrt expected result size On
=> divergence             ¯
                Pn,p(n) (On ≤ O) ≤ θout
where Pn,p(n) (.) is the binomial cdf with parameters n, p(n)




     outlier detection on experimental datasets with various
                        mismatch patterns

        Note: when the data does not follow our referential
         constraint hypothesis, the model leads to a pure
                          similarity join                      12
Condition for operator replacement
• Goal of AQP:
   – switch operators without loss of intermediate work
• sufficient condition: switch when the operator reaches
  a quiescent state [EPF06]




[EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the
replacement of pipelined physical join operators in adaptive query processing. In EDBT
                                                                                   13
Workshops 2006, LNCS 4254
Adaptivity: responder state machine
         Each state corresponds to a combination of physical
                           join operators


              left: exact                                         left: approximate
             right: exact                                        right: approximate




          left: exact                                            left: approximate
     right: approximate                                              right: exact



                                                                                      14
P. Missier, SEBD, June 2009, Camogli, Italy
Rationale for state transitions


                                                      lex /
                                                      rex


  evidence that                               lex /           lap /   evidence that left
left and /or right                            rap             rex     and /or right input
 input perturbed                                                          no longer
                                                                          perturbed
                                                      lap /
                                                      rap



     this is formalized in terms of predicates on the
     observable variables
     - outlier detection and more (omitted)

P. Missier, SEBD, June 2009, Camogli, Italy
Experimental evaluation
• Marginal gain of hybrid algorithm:
     – level of completeness
                                       • R: result size for approx join only
                                       • r: result size for exact only
                                       • rabs: result size actually observed
                                                 grel = (rabs – r) / (R – r)‫‏‬




                                                                                16
P. Missier, SEBD, June 2009, Camogli, Italy
Experimental evaluation
• Marginal gain of hybrid algorithm:
     – level of completeness
                                       • R: result size for approx join only
                                       • r: result size for exact only
                                       • rabs: result size actually observed
                                                 grel = (rabs – r) / (R – r)‫‏‬
  • Marginal Cost:
       – baseline: exact join throughout
          • model marginal cost of hybrid algorithm
                                       unit cost of executing one step in one state
                                              – (experimental)
                                       • number of steps in each state
                                       • unit state transition cost (experimental)
                                       • number of state transitions over entire join
                                                                               16
P. Missier, SEBD, June 2009, Camogli, Italy
Test datasets
    Datasets chosen as representative of 4 distinct patterns




       we expect our results to vary:
   • uniform perturbation: evidence grows slowly => slow reaction
   • bursty perturbation: strong evidence => timely reaction
P. Missier, SEBD, June 2009, Camogli, Italy
Results




P. Missier, SEBD, June 2009, Camogli, Italy
Results




P. Missier, SEBD, June 2009, Camogli, Italy
Conclusions
• Optimistic approach to online record linkage
     – Based on implicit referential integrity assumption
     – When assumption not true, goes back to pessimistic


• Technical approach based on autonomic computing
     – Adaptive query processing
     – Mix of exact / approximate physical join operators


• Applications: on-the-fly integration scenarios (mashups,
  personal dataspaces, sensor data streams)

• Need to relax the referential integrity assumption -- use
  other sources for result size estimation


P. Missier, SEBD, June 2009, Camogli, Italy

Mais conteúdo relacionado

Mais de Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 

Mais de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Último

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Último (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Paper presentation @ SEBD '09

  • 1. Time-completeness trade-offs in record linkage using Adaptive query Processing Paolo Missier, Alvaro A. A. Fernandes School of Computer Science, University of Manchester Roald Lengu, Giovanna Guerrini DISI, Universita' di Genova, Italy Marco Mesiti DiCo, Universita' di Milano, Italy SEBD 2009, Camogli, Italy June, 2009
  • 2. Conclusions • Optimistic approach to online record linkage – Based on implicit referential integrity assumption – When assumption not true, goes back to pessimistic • Technical approach based on autonomic computing – Adaptive query processing – Mix of exact / approximate physical join operators • Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams) • Need to relax the referential integrity assumption -- use other sources for result size estimation P. Missier, SEBD, June 2009, Camogli, Italy
  • 3. Context: On-the-fly data integration • “Situational Applications” • Data mashups and personal dataspaces • slight mismatches in records lead to incomplete integration – due to different encodings, conventions – due to errors in data R A=B S C D A B A Madchester Y X Manchester A case of record linkage Manchester Z Y X Manchester Manchester Z 3 P. Missier, SEBD, June 2009, Camogli, Italy
  • 4. Offline vs online linkage • Offline record linkage: – performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique R,S → R’,S’ – 2. perform regular equijoin on the transformed tables: R S ➡ ok for tables that can be analysed ahead of the join ➡ ok when multiple queries issued on integrated tables 4 P. Missier, SEBD, June 2009, Camogli, Italy
  • 5. Offline vs online linkage • Offline record linkage: – performed once before queries involving joins – 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique R,S → R’,S’ – 2. perform regular equijoin on the transformed tables: R S ➡ ok for tables that can be analysed ahead of the join ➡ ok when multiple queries issued on integrated tables • Online linkage: – performed while answering a query – exact join similarity (or approximate) join 4 P. Missier, SEBD, June 2009, Camogli, Italy
  • 6. Record linkage and similarity joins Historical timeline: from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06. 5 P. Missier, SEBD, June 2009, Camogli, Italy
  • 7. Record linkage and similarity joins Historical timeline: from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06. 5 P. Missier, SEBD, June 2009, Camogli, Italy
  • 8. Measuring string similarity using q-grams • q-grams map string s to a set q(s) of substrings of length q: Ex.: 3-grams: q(“Manchester”) = {‘Man’, ‘anc’, ‘nch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}. q(“Madchester”) = {‘Mad’, ‘adc’, ‘dch’, ‘che’, ‘hes’, ‘est ’, ‘ste’, ‘ter’}. |q(s1 ) ∩ q(s2 )| sim(s1 , s2 ) = (Jaccard coefficient) |q(s1 ) ∪ q(s2 )| sim(r1 , r2 ) < θ1 → not match θ2 < sim(r1 , r2 ) → match P. Missier, SEBD, June 2009, Camogli, Italy
  • 9. Similarity symmetric hash join R S insert insert R hash table probe S hash table set intersection computed m x using a variation of the probe n y y r symmetric hash join x s new primitive join operator [CGK06] [R.m,S.s] [R.n, S.r] • Main sources of complexity: – overhead for storing and indexing q-grams – cost of computing set intersection Efficient relational representation: [CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, “A primitive operator for similarity joins in data cleaning” (ICDE’06)‫‏‬ 7 P. Missier, SEBD, June 2009, Camogli, Italy
  • 10. Is full similarity join always necessary? • Pessimistic: always pays full complexity cost • Typical mismatch rate in real datasets around 5% Research Goal: explore optimistic approach • detect mismatches and react as you go • requires estimates of incremental join result size • statistical + reactive • expect to sacrifice join result completeness for faster execution Approach: - combine exact and approximate join operators using Adaptive Query Processing techniques 8 P. Missier, SEBD, June 2009, Camogli, Italy
  • 11. Autonomic computing framework [KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003. 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 12. Autonomic computing framework monitor respond assess 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 13. Autonomic computing framework monitor incremental join result size monitor respond assess 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 14. Autonomic computing framework monitor incremental join result size monitor estimate join result size respond assess compute divergence from exp 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 15. Autonomic computing framework monitor incremental join result size monitor estimate join result size respond assess compute swap divergence join from exp operators • when using exact join: • if observed/estimated sizes diverge “too much”, then switch to approximate join • when using approximate join: • if observed/estimated converge, then switch to exact join 9 P. Missier, SEBD, June 2009, Camogli, Italy
  • 16. Technical approach and challenges • Assess: – Need additional assumption for result size estimation – Estimating result size at specific points during join execution • Respond: – When and how can physical join operators switched? – Can we avoid loss of work? • operator replacement in pipelined query plans [EFP06] [EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254 10 P. Missier, SEBD, June 2009, Camogli, Italy
  • 17. Assessment • Expectation of matching records • Used to derive simple result size estimation model R (parent) S (child) symmetric hash c x n join, again y b d y x a • When there are no mismatches: after scanning n < |S| tuples on S: P(a=x in |S| has been matched) = P(tuple c=x is in top n of R) = n/|R| Thus, join result size On after n tuples is a binomial random variable: n On ∼ bin(n, ) |R| 11 P. Missier, SEBD, June 2009, Camogli, Italy
  • 18. Detecting divergent observed result size ¯ Observation On is outlier wrt expected result size On => divergence ¯ Pn,p(n) (On ≤ O) ≤ θout where Pn,p(n) (.) is the binomial cdf with parameters n, p(n) 12
  • 19. Detecting divergent observed result size ¯ Observation On is outlier wrt expected result size On => divergence ¯ Pn,p(n) (On ≤ O) ≤ θout where Pn,p(n) (.) is the binomial cdf with parameters n, p(n) outlier detection on experimental datasets with various mismatch patterns 12
  • 20. Detecting divergent observed result size ¯ Observation On is outlier wrt expected result size On => divergence ¯ Pn,p(n) (On ≤ O) ≤ θout where Pn,p(n) (.) is the binomial cdf with parameters n, p(n) outlier detection on experimental datasets with various mismatch patterns Note: when the data does not follow our referential constraint hypothesis, the model leads to a pure similarity join 12
  • 21. Condition for operator replacement • Goal of AQP: – switch operators without loss of intermediate work • sufficient condition: switch when the operator reaches a quiescent state [EPF06] [EPF06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT 13 Workshops 2006, LNCS 4254
  • 22. Adaptivity: responder state machine Each state corresponds to a combination of physical join operators left: exact left: approximate right: exact right: approximate left: exact left: approximate right: approximate right: exact 14 P. Missier, SEBD, June 2009, Camogli, Italy
  • 23. Rationale for state transitions lex / rex evidence that lex / lap / evidence that left left and /or right rap rex and /or right input input perturbed no longer perturbed lap / rap this is formalized in terms of predicates on the observable variables - outlier detection and more (omitted) P. Missier, SEBD, June 2009, Camogli, Italy
  • 24. Experimental evaluation • Marginal gain of hybrid algorithm: – level of completeness • R: result size for approx join only • r: result size for exact only • rabs: result size actually observed grel = (rabs – r) / (R – r)‫‏‬ 16 P. Missier, SEBD, June 2009, Camogli, Italy
  • 25. Experimental evaluation • Marginal gain of hybrid algorithm: – level of completeness • R: result size for approx join only • r: result size for exact only • rabs: result size actually observed grel = (rabs – r) / (R – r)‫‏‬ • Marginal Cost: – baseline: exact join throughout • model marginal cost of hybrid algorithm unit cost of executing one step in one state – (experimental) • number of steps in each state • unit state transition cost (experimental) • number of state transitions over entire join 16 P. Missier, SEBD, June 2009, Camogli, Italy
  • 26. Test datasets Datasets chosen as representative of 4 distinct patterns we expect our results to vary: • uniform perturbation: evidence grows slowly => slow reaction • bursty perturbation: strong evidence => timely reaction P. Missier, SEBD, June 2009, Camogli, Italy
  • 27. Results P. Missier, SEBD, June 2009, Camogli, Italy
  • 28. Results P. Missier, SEBD, June 2009, Camogli, Italy
  • 29. Conclusions • Optimistic approach to online record linkage – Based on implicit referential integrity assumption – When assumption not true, goes back to pessimistic • Technical approach based on autonomic computing – Adaptive query processing – Mix of exact / approximate physical join operators • Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams) • Need to relax the referential integrity assumption -- use other sources for result size estimation P. Missier, SEBD, June 2009, Camogli, Italy