Identifying similar text documents

            André Santos
         andrefs@cpan.org




            November 2011
What we get




Duplicated versions




Candidate pairs




What this is really about




                    similarity



It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)
      Proper names (e.g. “Sherlock Holmes”)
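
As a rough illustration of the definition above, LIEs could be collected with two regular expressions. This is only a sketch under assumptions that are not in the slides: a year reference is taken to be a standalone four-digit number between 1000 and 2099, a proper name is taken to be a run of two or more capitalized ASCII words, and `extract_lies` is a hypothetical helper, not the pairbooks implementation.

```python
import re

# Assumption: a "year reference" is a standalone four-digit number, 1000-2099.
YEAR_RE = re.compile(r"\b(?:1\d{3}|20\d{2})\b")
# Assumption: a "proper name" is a run of two or more capitalized words.
# Only ASCII letters are handled here; accented names would need a wider class.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_lies(text):
    """Return the set of candidate LIEs (years and proper names) in text."""
    return set(YEAR_RE.findall(text)) | set(NAME_RE.findall(text))


english = "In 1977 a letter reached Sherlock Holmes at his rooms."
portuguese = "Em 1977 uma carta chegou a Sherlock Holmes nos seus aposentos."

# Both sentences yield the same two elements despite the different languages.
print(extract_lies(english))
print(extract_lies(portuguese))
```

The two sentences share no ordinary words, yet both produce the year and the name, which is the property the similarity measure relies on.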




Measuring similarity




  \[
    \mathrm{similarity}(A, B) = \frac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}
  \]
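
The ratio above is the Jaccard index of the two LIE sets. A minimal sketch follows, assuming each document has already been reduced to a set of strings. The function name and the handling of two empty sets are illustrative choices, not taken from pairbooks.

```python
def similarity(a_lies, b_lies):
    """Jaccard index of two LIE sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_lies), set(b_lies)
    union = a | b
    if not union:
        # Assumption: two documents without any LIEs are scored as unrelated.
        return 0.0
    return len(a & b) / len(union)


a = {"1977", "Sherlock Holmes", "Baker Street"}
b = {"1977", "Sherlock Holmes", "Londres"}

# 2 shared elements out of 4 distinct ones.
print(similarity(a, b))  # 0.5
```

The score is symmetric and always falls between 0 and 1, so the same thresholds can be applied regardless of which document is listed first.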




pairbooks

  Similarity values
     < 0.2    Documents are not related
     > 0.4    Documents are candidate pairs
     > 0.9    Documents are near duplicates
       1.0    Documents are duplicates

  Languages
  High similarity, same language: (Near) duplicates
  High similarity, different language: Candidate pairs
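
The two rules on this slide can be combined into a single decision function. The sketch below is one possible reading, not the pairbooks logic. The slide does not label scores between 0.2 and 0.4, so the "undecided" label is an assumption, as is the function name `classify`.

```python
def classify(score, same_language):
    """Map a similarity score and a language flag to a label from the slide."""
    if score < 0.2:
        return "not related"
    if score <= 0.4:
        # Assumption: the slide leaves the 0.2-0.4 band unlabeled.
        return "undecided"
    if not same_language:
        # High similarity across languages suggests a translation pair.
        return "candidate pair"
    if score == 1.0:
        return "duplicate"
    if score > 0.9:
        return "near duplicate"
    return "candidate pair"


print(classify(0.95, same_language=True))   # near duplicate
print(classify(0.95, same_language=False))  # candidate pair
```

In the sample run on the "Behold, pairbooks!" slide, the matching Umberto Eco titles score 0.227 and 0.391, both inside the unlabeled band, so in practice the ranking within each list may matter more than the fixed cut-offs.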

Behold, pairbooks!

  ~ $ pairbooks PT_list.txt ES_list.txt
  PTBR__Umberto_EcoO_nome_da_rosa.txt
    (0.227) [6954,7382]   ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)
    (0.018) [6954,11408]  ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.018) [6954,5604]   ES__Umberto_EcoDiario_Minimo__2.txt(...)

  PTBR__Umberto_EcoO_Pendulo_de_Focault.txt
    (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.042) [11276,6024]  ES__Umberto_EcoLa_busqueda_de_la_Le(...)
    (0.035) [11276,5604]  ES__Umberto_EcoDiario_Minimo__2.txt
  (...)




Perfect LIEs do not exist
  Year references
      Can be confused with page numbers
      Headers/footers can contain them (publishing year, copyright, ...)
  Proper names
      Sometimes are translated (e.g. “São Tomé”, “Judas Tomé”, etc.)
      Some languages use different scripts (e.g. Russian)
      Some languages have declensions
  ...
How to improve LIEs (future work)



     accept a list of equivalent words
     accept a list of stop words
     ...
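
One way the two proposed lists might be applied is as a normalization pass over each LIE set before the similarity is computed. The sketch below is hypothetical throughout: the helper `normalize_lies`, the mapping format, and the sample entries (including the Spanish form and the canonical "Saint Thomas") are assumptions made for illustration, since the slides only name the idea.

```python
def normalize_lies(lies, equivalents=None, stop_words=None):
    """Map each LIE to its canonical form, then drop any stop words."""
    equivalents = equivalents or {}
    stop_words = set(stop_words or ())
    normalized = set()
    for lie in lies:
        # Replace a known variant with its canonical spelling, if listed.
        canonical = equivalents.get(lie, lie)
        if canonical not in stop_words:
            normalized.add(canonical)
    return normalized


# Hypothetical lists: translated names collapse to one form, and a term
# that typically comes from headers or footers is ignored.
equivalents = {"São Tomé": "Saint Thomas", "Santo Tomás": "Saint Thomas"}
stop_words = {"Copyright"}

pt = normalize_lies({"São Tomé", "1977", "Copyright"}, equivalents, stop_words)
es = normalize_lies({"Santo Tomás", "1977"}, equivalents, stop_words)
print(pt == es)  # True
```

Normalizing before comparison keeps the similarity formula itself unchanged, so the existing thresholds would still apply.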




Give me one of those!


  CPAN
   http://search.cpan.org/perldoc?pairbooks

     Developer version
     requires Linux, Perl
     Incomplete documentation




