Duplicate Detection on the Web
Finding Similar Looking Needles... ...In Very Large Haystacks

Sergei Vassilvitskii
Yahoo! Research, New York

November 15, 2007
Duplicate Detection

        • Why detect duplicates?
        • Conserve resources
                 – reduced index size - less memory, faster computations, etc.

        • User Experience
                 – Diversity in Search Results
                 – 25-40% of the web is duplicate
                      •   Identical reviews with different boilerplate
                      •   Mirrors - e.g. unix man pages
                      •   Spam sites
                      •   etc.




What’s a Duplicate?

        • Easy: exact duplicates




Exact Duplicates

What’s a Duplicate?

        • Easy: exact duplicates
        • Still easy: Different dates / signatures




Almost Duplicates

What’s a Duplicate?

        • Easy: exact duplicates
        • Still easy: Different dates / signatures
        • Harder: Slight edits, modifications




is this a duplicate?

What’s a Duplicate?

        • Easy: exact duplicates
        • Still easy: Different dates / signatures
        • Harder: Slight edits, modifications
        • Hardest: Different versions, updates, etc.




Similarity

       • Tokens - trigrams in a document:
             Once upon a midnight dreary, while I pondered weak and weary
                 Once upon a
                 upon a midnight
                 a midnight dreary
                 midnight dreary while
                 ...

       • Represent a document as a set:
             {Once upon a, upon a midnight, a midnight dreary, ... }

       • Similarity(A,B) = |A ∩ B| / |A ∪ B|


Similarity

       • A: “Once upon a midnight dreary, while I pondered”
             { Once upon a, upon a midnight, a midnight dreary, midnight
             dreary while, dreary while I, while I pondered }

       • B: “Once upon a time, while I pondered”
             { Once upon a, upon a time, a time while, time while I, while
             I pondered }

       • Overlap: |A ∩ B| = 2
       • Total: |A ∪ B| = 9
       • Similarity: 2/9 ≈ 0.22

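To make the computation above concrete, here is a minimal Python sketch (not part of the original slides) that builds word-trigram shingles and computes the Jaccard similarity. The tokenizer (lowercasing, dropping punctuation) is an assumption, since the slides do not specify one.

```python
import re

def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in a document."""
    words = re.findall(r"[a-z]+", text.lower())              # crude tokenizer (assumption)
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

A = shingles("Once upon a midnight dreary, while I pondered")
B = shingles("Once upon a time, while I pondered")
print(len(A & B), len(A | B), round(jaccard(A, B), 2))       # 2 9 0.22
```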
Similarity

        • Recall Sim(A,B) = |A ∩ B| / |A ∪ B|
                 – Also known as the Jaccard similarity

        • Is this a good similarity measure?
                 – Yes: Simple to describe, easy to compute
                 – No: Ignores repetition of trigrams:
                      •   e.g. “a rose is a rose” and “a rose is a rose is a rose” have
                         100% similarity.
                     •   Bad at detecting small but semantically important edits




Outline

        • Motivation
        • Algorithms
        • Evaluation
        • Open Problems




Algorithms

        • Hashing:
                 – Perfect for detecting exact duplicates. Doesn’t work for near
                   dupes.

        • Edit distance
                 – Perfect for detecting near dupes, does not scale.




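As a rough illustration of the first point, exact-duplicate hashing can be as simple as fingerprinting the raw page content. This is a hypothetical sketch, not the production approach, and it fails as soon as a single byte (a date, a visitor counter) changes - which is exactly why near-duplicate detection needs more machinery.

```python
import hashlib

seen = set()

def content_fingerprint(html_text):
    """Exact-duplicate fingerprint: identical pages give identical digests."""
    return hashlib.sha1(html_text.encode("utf-8")).hexdigest()

def is_exact_duplicate(html_text):
    """True if an identical page was already seen; misses near-duplicates entirely."""
    fp = content_fingerprint(html_text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```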
Scale

        • The size of the index is claimed to be 4B-16B
          pages


        • Exact estimation is an active research topic
                 – Also, bigger doesn’t always mean better



        • Safe to assume at least 1B pages in index.




Algorithms

        • Hashing:
                 – Perfect for detecting exact duplicates. Doesn’t work for near
                   dupes.

        • Edit distance
                 – Perfect for detecting near dupes, does not scale.
                  – Requires a full pairwise comparison every time.

        • Shingling
                 – We will store 48 bytes per page, detect near duplicates
                   without examining all 1B pages.




Shingling Idea

        • Computing Jaccard similarity is still expensive.
        • Idea: summarize each document in a short sketch.
        • Estimate the similarity based on the sketches.


        • Algorithm due to Broder et al. (WWW ’97), used in the
          AltaVista search engine and all search engines since.




Algorithm

       • Take a hashing function, H. Hash each shingle:
             { Once upon a, upon a midnight, a midnight dreary, midnight
             dreary while, dreary while I, while I pondered }
             { 357, 192, 755, 123, 987, 345 }

       • Store the minimum hash: {192}
             { Once upon a, upon a time, a time while, time while I, while
             I pondered }
             { 357, 143, 986, 743, 345 }

       • Store the minimum hash: {143}




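The hash values on the slide (357, 192, ...) are illustrative. Here is a minimal Python sketch of the same idea, using a salted MD5 as a stand-in family of hash functions (an assumption; any well-mixed hash family works):

```python
import hashlib

def h(shingle, seed):
    """One member of a family of hash functions, indexed by `seed` (illustrative choice)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")   # 8-byte value

def min_hash(shingle_set, seed):
    """Minimum hash value of a document's shingle set under hash function `seed`."""
    return min(h(s, seed) for s in shingle_set)

A = {"Once upon a", "upon a midnight", "a midnight dreary",
     "midnight dreary while", "dreary while I", "while I pondered"}
B = {"Once upon a", "upon a time", "a time while",
     "time while I", "while I pondered"}
print(min_hash(A, seed=0), min_hash(B, seed=0))
```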
Algorithm

       • Repeat many times with different hash functions

                      hash-1   hash-2     hash-3   hash-4

             doc 1:   192      155        187       255

             doc 2:    143     179        187       155



       • Similarity - Percentage of times hashes agree


       • SIM(doc 1, doc 2) = 1/4



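Continuing the sketch above (reusing min_hash and the shingle sets A and B), the similarity estimate is just the fraction of hash functions on which the two documents' minima agree:

```python
def sketch(shingle_set, num_hashes=84):
    """Min-hash sketch: one minimum per hash function (84, as on a later slide)."""
    return [min_hash(shingle_set, seed) for seed in range(num_hashes)]

def estimated_similarity(sk1, sk2):
    """Fraction of positions where two sketches agree ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

# Expected value is 2/9 ≈ 0.22; any single run fluctuates around it.
print(round(estimated_similarity(sketch(A), sketch(B)), 2))
```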
Why Min-Hash

        • Theorem:
                     Prob[ Min-Hash(A) = Min-Hash(B) ] = |A ∩ B| / |A ∪ B|

        • Therefore:
                 – % of time the hashes agree ~ Sim(A,B)




Why Min-Hash

       • Proof: Look at the elements:
             { Once upon a, upon a midnight, a midnight dreary, midnight
             dreary while, dreary while I, while I pondered, upon a time, a
             time while, time while I}

       • One of these will have the minimum hash value.
       • The two minima are the same if the corresponding trigram
         appears in both documents.
        • Min-hashes are equal if either of { Once upon a, while I
          pondered } hashes to the minimum value.




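The same argument in one line (a restatement of the slide, not an addition to it): for a random hash function, every element of A ∪ B is equally likely to attain the minimum, so

```latex
\Pr\bigl[\text{Min-Hash}(A) = \text{Min-Hash}(B)\bigr]
  \;=\; \Pr\Bigl[\arg\min_{x \in A \cup B} h(x) \,\in\, A \cap B\Bigr]
  \;=\; \frac{|A \cap B|}{|A \cup B|}
```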
Sketches

        • So a sketch for a document is a collection of
          minimum hashes.
                 – In practice, use 84 hash functions.

        • sketch =          { 192, 155, 187, 255, ..., 101 }



        • Summarize each page in 672 bytes (84 × 8-byte
          values).




Sketches

        • To compare two documents, look at the
          percentage of min-hashes that agree.


        • Problem: Full Pairwise comparison
                  – 10^9 pages × 10^9 pages × 10^2 hashes = 10^20 operations.



        • But we have many fast computers!
                  – 10^9 operations / second × 10^4 machines would still require
                    10^7 seconds - roughly 4 months.




Super Shingles

        • Problem: Doing all pairwise comparisons still too expensive.
        • Solution: Since we care about only high similarity items,
          recurse:
        • sketch =   { 192, 155, 187, 255, 345, 171, 877, ... , 101 }

        • Group into non-overlapping super-shingles:
                      {192, 155, 187, 255}, {345, 171, 877, ...}, ..., {..., 101}

        • Hash each super-shingle:   {1011, 6543, ..., 7327}

        • Only compare documents that agree on super-shingles.



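A sketch of the grouping step, building on the helpers above (h and sketch). The slides show 6 super-shingles per document but do not state the group size; 14 min-hashes per group (84 / 6) is an assumption chosen to match that.

```python
def super_shingles(sketch_values, group_size=14):
    """Hash non-overlapping runs of min-hashes into super-shingles.
    group_size=14 is an assumption (84 min-hashes / 6 super-shingles)."""
    groups = [sketch_values[i:i + group_size]
              for i in range(0, len(sketch_values), group_size)]
    return [h(",".join(map(str, g)), seed=-1) for g in groups]   # reuse h() from above

print(super_shingles(sketch(A)))   # 6 super-shingle values for document A
```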
Super Shingles

                        S-Shingle 1    S-Shingle 2    S-Shingle 3

             Doc 1         1011          6543            7327

             Doc 2         4523          5498            8754

             Doc 3         5487          5498            8754



       •     Declare Doc2 and Doc3 to be 2-similar.
       •     In practice - store the above table sorted by different
             columns. Only compare against neighboring rows.




Super Shingles

       • Store the super-shingle table sorted by each column in turn (below: sorted by S-Shingle 2)

                     S-Shingle 1   S-Shingle 2   S-Shingle 3

             Doc 1     1011         6543           7327
             Doc 2     4523         5498           8754
             Doc 3     5487         5498           8754
             Doc 4     8766         1258           6255




                     S-Shingle 1   S-Shingle 2   S-Shingle 3

             Doc 4     8766         1258           6255
             Doc 2     4523         5498           8754
             Doc 3     5487         5498           8754
             Doc 1     1011         6543           7327



       • Only compare adjacent rows instead of all pairs

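The sorted-column trick can equivalently be phrased as bucketing: place each document into a bucket keyed by (column, super-shingle value) and only compare documents that share a bucket. A self-contained sketch using the table above (bucketing rather than explicit sorting is a simplification of the slide's adjacent-row comparison):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(doc_super_shingles):
    """Group documents that share a super-shingle value in the same column;
    only these pairs need a full sketch comparison."""
    buckets = defaultdict(list)
    for doc_id, values in doc_super_shingles.items():
        for col, value in enumerate(values):
            buckets[(col, value)].append(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

table = {"Doc 1": [1011, 6543, 7327],
         "Doc 2": [4523, 5498, 8754],
         "Doc 3": [5487, 5498, 8754],
         "Doc 4": [8766, 1258, 6255]}
print(candidate_pairs(table))   # {('Doc 2', 'Doc 3')}
```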
Summary

        •        Text = Once upon a midnight dreary, while I pondered ...
        •        Trigrams = { Once upon a, upon a midnight, a midnight dreary,
                 midnight dreary while, dreary while I, while I
                 pondered, ... }
        •        Hashes(1) = { 357, 192, 755, 123, 987, 345, ... }
        •        Min Hash(1) = { 192 }
        •        Hashes(2) = { 132, 345, 487, 564, 778, 120, ... }
        •        Min Hash(2) = { 120 }
        •        ..
        •        Min Hash(84) = { 101 }
        •        Sketch = { 192, 120, ..., 101 }
        •        Super Shingles = {1011, 6543, 7327, 5422, 8764, 2344}
        •        Similar if exact overlap on 2 or more super shingles.




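Putting the pieces together (reusing shingles, sketch, super_shingles and candidate_pairs from the sketches above; the 2-super-shingle threshold is the one stated on this slide):

```python
def near_duplicates(corpus, min_matching_super_shingles=2):
    """End-to-end pipeline over a {doc_id: text} corpus, using the helpers above."""
    supers = {doc_id: super_shingles(sketch(shingles(text)))
              for doc_id, text in corpus.items()}
    dupes = []
    for d1, d2 in candidate_pairs(supers):
        matches = sum(a == b for a, b in zip(supers[d1], supers[d2]))
        if matches >= min_matching_super_shingles:
            dupes.append((d1, d2))
    return dupes
```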
Evaluation

        • Due to Henzinger, SIGIR ‘06
        • Start with 1.6B webpages - 46M hosts, on average 35
          pages per host.
        • Remove exact duplicates (around 25%)


        • Impossible to check recall of the algorithms (Why?)
        • To check precision: sample roughly 2000 pairs returned as
          duplicates and evaluate




Evaluation

        • Pairs are judged to be near duplicates if:
                 – Text differs by timestamp, visitor count, etc.
                 – Difference is invisible to the visitor
                 – Entry pages to the same site
        • Not near duplicate if:
                 – Main items are different (e.g. shopping page for two
                   different items, but with identical boilerplate text)




Evaluation

        • Undecided:
                 – Prefilled forms with different values
                 – A different ‘minor’ item - e.g. small text box
                  – Pairs that could not be evaluated (e.g. an English-speaking
                    evaluator looking at two Korean pages)




Results


          Precision on the ‘duplicate’ set.

                            Pairs    Correct   Incorrect   Undecided

                  All        1910      0.38       0.53        0.09

                  2-sim      1032      0.24       0.68        0.08

                  3-sim       389      0.42       0.48        0.10

                  4-sim       240      0.55       0.36        0.09

                  5-sim       143      0.71       0.25        0.06

                  6-sim       106      0.85       0.05        0.10




Open Problems

        • The devil is always in the details:
                 – Obtaining text from raw HTML is not as easy as it sounds
                     •   What to do with IMG ALT text? Targets of links? etc.

                 – Large boilerplate text with few seemingly minor differences

        • New Challenges
                 – Dynamic content?
                 – Flash, Ajax, and other not easily indexable content




References

        • Broder, Glassman, Manasse, Zweig. Syntactic clustering of
          the web. WWW ’97.
        • Fetterly, Manasse, Najork. Detecting phrase-level
          duplication on the World Wide Web. SIGIR ’05.
        • Henzinger. Finding near-duplicate Web pages: a large-scale
          evaluation of algorithms. SIGIR ’06.




Thank You!
