SlideShare uma empresa Scribd logo
1 de 39
Baixar para ler offline
Contributions for building a
  Corpora-Flow system

         Andr´ Santos
             e
      andrefs@cpan.org


       Informatics Engineering MSc
            University of Minho




          December 2011
Concepts
    Aligned parallel corpus: Set of parallel texts in
             which correspondences have been marked
             between blocks (paragraphs, sentences,
             words, . . . ) from each text.
    Corpora-flow: Adaptation of the concept of
             workflow to the several tasks, decisions
             and sequences of steps involved in the
             process of building a corpus.




1           Andr´ Santos, andrefs@cpan.org
                e                            Contributions for building a Corpora-Flow system
Concepts
    Aligned parallel corpus: Set of parallel texts in
             which correspondences have been marked
             between blocks (paragraphs, sentences,
             words, . . . ) from each text.
    Corpora-flow: Adaptation of the concept of
             workflow to the several tasks, decisions
             and sequences of steps involved in the
             process of building a corpus.

    This presentation and the underlying master thesis
    describe the implementation of several tools to be
    used in typical corpus building activities.
1            Andr´ Santos, andrefs@cpan.org
                 e                            Contributions for building a Corpora-Flow system
Context


    The work developed in the context of this master
    thesis was motivated and supported by
    Project Per-fide, an undergoing project in
    University of Minho which aims to build large
    parallel corpora between Portuguese and other six
    languages.




2            Andr´ Santos, andrefs@cpan.org
                 e                            Contributions for building a Corpora-Flow system
Corpora building challenges


     file format and format conversion
     finding duplicated files
     text encoding format
     structural residues
     section delimiters
     unpaired sections (parallel corpora)
     ...



3        Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Corpora building challenges


     Severe problems which often lead to bad results
     Many (most?) of them are hard/impossible to
     solve completely
     Find the problem and report it when it is not
     solvable automatically
     Provide intelligent ways of describing what was
     found and done



4        Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
5 key issues


      Book cleaning
      Duplicates and candidate pairs detection
      Book synchronization
      Alignment evaluation
      Corpora-flow system



5        Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Book processing problems – Motivation
    (...) d <92>’ entr´e, donnant acc`s dans la salle commune.
                      e              e
    Une l´g`re v´randa, qui en prot´-
         e e    e                   e
    M
                          <96>- 86 <96>-
     ^L geait la partie ant´rieure contre l <92>’ action
                             e
    des rayons solaires, reposait sur de sveltes bambous. (...)
                                           La Jangada, Jules Verne




6             Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Book processing problems – Motivation
    (...) d <92>’ entr´e, donnant acc`s dans la salle commune.
                      e              e
    Une l´g`re v´randa, qui en prot´-
         e e    e                   e
    M
                           <96>- 86 <96>-
     ^L geait la partie ant´rieure contre l <92>’ action
                             e
    des rayons solaires, reposait sur de sveltes bambous. (...)
                                           La Jangada, Jules Verne


                  <92>’ : right single quot. mark (CP1252)
                  <96>- : en dash (CP1252)
                      ^L : page break (0xC)

        prot´-(...)geait : transpagination
            e


6              Andr´ Santos, andrefs@cpan.org
                   e                            Contributions for building a Corpora-Flow system
Book processing problems – Motivation
    (...) d <92>’ entr´e, donnant acc`s dans la salle commune.
                      e              e
    Une l´g`re v´randa, qui en prot´-
         e e    e                   e
    M
                          <96>- 86 <96>-
     ^L geait la partie ant´rieure contre l <92>’ action
                             e
    des rayons solaires, reposait sur de sveltes bambous. (...)
                                           La Jangada, Jules Verne



    (...) d ’ entr´e, donnant acc`s dans la salle commune.
                   e             e
    Une l´g`re v´randa, qui en prot´geait _pb1_
         e e     e                  e
    la partie ant´rieure contre l ’ action
                 e
    des rayons solaires, reposait sur de sveltes bambous. (...)


6             Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Book cleaning
    Subdivided in several steps:




7            Andr´ Santos, andrefs@cpan.org
                 e                            Contributions for building a Corpora-Flow system
Sections ontology
                                                                chap
                                                                PT cap´tulo,
                                                                      ı
    contains common section types                                  cap, capitulo
                                                                FR chapitre, chap
    used to automatically generate                              EN chapter, chap
    the code to recognize section                               NT sec

    delimiters                                                  end
                                                                PT fim
    allows discussion/cooperation                               FR fin
                                                                EN the_end
    with people with no                                         BT _alone
    programming knowledge                                       scene
    code becomes more simple and                                PT cena
                                                                FR sc`ne
                                                                     e
    clean                                                       EN scene
                                                                RU глава
                                                                BT act



8        Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Duplicates and pairs detection
    Motivation
        Duplicates can result in a biased corpus
        Finding candidate pairs for alignment

    Language independent elements (LIEs)
        terms which are usually kept untranslated
               year references – “1973”
               proper names – “Hamlet”

    Measuring similarity                                     Thresholds
                                                                        < 0.2: unrelated
                        |ALIEs ∩ BLIEs |                                > 0.4: pair
    similarity (A, B) =
                        |ALIEs ∪ BLIEs |                                > 0.9: duplicates

9              Andr´ Santos, andrefs@cpan.org
                   e                            Contributions for building a Corpora-Flow system
Book synchronization
     Definition
     Structural alignment at section level, based on
     previously added section delimiting marks.

     Motivation
         Some aligners cannot handle large documents
         Section delimiters can act as anchor points
         Unpaired sections can be discarded

     Implementation
         match similar section delimiters
         synchronization points

10            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Output


     pair of files with
     synchronization
     marks
     pair of files divided
     into smaller pairs
     of chunks
     text report
     synchronization
     matrix


11            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Output


     pair of files with
     synchronization
     marks
     pair of files divided
     into smaller pairs
     of chunks
     text report
     synchronization
     matrix


11            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Alignment evaluation
     Motivation
         compare alignments of the same documents
         (performed by different tools, with different options, . . . )
         determine if an alignment was successful




12            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Alignment evaluation
     Motivation
         compare alignments of the same documents
         (performed by different tools, with different options, . . . )
         determine if an alignment was successful

     Comparing alignments
        parse TMX files and output the total number
        correspondences of each type
         0:1/1:0, 1:1, 2:1/1:2 and 2:2
         evaluate the other tools developed
         compare the performance of the available
         alignment tools
12            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Alignment evaluation


     Determine if an alignment was successful
         Summarize a TMX by sampling. Sampling can
         be performed based on:
             number of samples desired
             explicit sampling points
             translation units which match a given regular
             expression
         Output is a (much?) smaller TMX file



13           Andr´ Santos, andrefs@cpan.org
                 e                            Contributions for building a Corpora-Flow system
Alignment evaluation




                                             The Name of the Rose, Umberto Eco

14      Andr´ Santos, andrefs@cpan.org
            e                            Contributions for building a Corpora-Flow system
Alignment evaluation
                     AdsonDE = АдсоRU




                                             The Name of the Rose, Umberto Eco

14      Andr´ Santos, andrefs@cpan.org
            e                            Contributions for building a Corpora-Flow system
Alignment evaluation
                     AdsonDE = АдсоRU




                                             The Name of the Rose, Umberto Eco

14      Andr´ Santos, andrefs@cpan.org
            e                            Contributions for building a Corpora-Flow system
Alignment evaluation
                     AdsonDE = АдсоRU




                                             The Name of the Rose, Umberto Eco

14      Andr´ Santos, andrefs@cpan.org
            e                            Contributions for building a Corpora-Flow system
Distribution

      All the tools implemented as Perl modules:
          Text::Perfide::BookCleaner
          Text::Perfide::BookPairs
          Text::Perfide::BookSync
          Text::Perfide::TMX::Utils
      publicly available on CPAN
      including tests and documentation
      additional effort required to make code
      installable and usable by other people


15        Andr´ Santos, andrefs@cpan.org
              e                            Contributions for building a Corpora-Flow system
Corpora-flow
     Motivation
         building a corpus is a complex task
         linear pipeline is not powerful enough


     Workflow                     Makefiles
         states                     file-oriented
         actions                    timestamps and
         conditions                 dependencies
         context                    fail-fast and resumable
                                    execution
                                    parallelization
16            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Corpora-flow

            workflow + Makefiles = corpora-flow

     DSL (→ Slay::Makefile)
     workflow:         rule*
     rule:             pre-condition* action post-condition*
     action:           targets dependencies function
     condition:        filename function
     target:           pattern*
     dependencies:     pattern*
     function:         Perl code


17           Andr´ Santos, andrefs@cpan.org
                 e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Conclusions
     Evaluation of the tools has shown that they do
     help to solve problems
     Most of the methods devised can be applied in
     other contexts
     Working within a larger project:
         provides requirements and resources
         specific needs and priorities
     making code available to other people:
         requires additional effort
         gives meaning to the work
         external contributions
     Higher level objects help to organize and
     discuss
18       Andr´ Santos, andrefs@cpan.org
             e                            Contributions for building a Corpora-Flow system
Future work
     Document cleaners
         other types of documents (e.g. scientific
         articles)
         algorithm for finding section delimiters with
         notion of hierarchy
         create ebooks/bilingual books

     Duplicates and pair detection
         list of correspondences (e.g. Adson → Адсо,
         London → Londres)
         calculate best threshold values in real time
19            Andr´ Santos, andrefs@cpan.org
                  e                            Contributions for building a Corpora-Flow system
Future work

     Document synchronization
         interactive mode
         improvements on synchronization matrix and
         metrics
         hierarchical sections
         other section alignment algorithms

     Corpora-flow
         finish specification and implementation
         implement a corpora-flow for Project Per-fide

20           Andr´ Santos, andrefs@cpan.org
                 e                            Contributions for building a Corpora-Flow system
Contributions for building a
  Corpora-Flow system

         Andr´ Santos
             e
      andrefs@cpan.org


       Informatics Engineering MSc
            University of Minho




          December 2011

Mais conteúdo relacionado

Destaque (8)

Dibujo tecnico i
Dibujo tecnico iDibujo tecnico i
Dibujo tecnico i
 
Universal Design August Workshop
Universal Design August Workshop Universal Design August Workshop
Universal Design August Workshop
 
Sassycacuss
SassycacussSassycacuss
Sassycacuss
 
Pp infoo
Pp infooPp infoo
Pp infoo
 
Colchon flotable a luz solar
Colchon flotable a luz solarColchon flotable a luz solar
Colchon flotable a luz solar
 
Set a featured image of a page in WordPress
Set a featured image of a page in WordPressSet a featured image of a page in WordPress
Set a featured image of a page in WordPress
 
Business Lending
Business LendingBusiness Lending
Business Lending
 
Texto base
Texto baseTexto base
Texto base
 

Semelhante a Slides

A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment andrefsantos
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningRoberto Pereira Silveira
 
The Bash Love
The Bash LoveThe Bash Love
The Bash Loveishwon
 
David Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AIDavid Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AIBayes Nets meetup London
 
A Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentA Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentFaculty of Computer Science
 
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...Facultad de Informática UCM
 
Tg noh jeju_workshop
Tg noh jeju_workshopTg noh jeju_workshop
Tg noh jeju_workshopTae-Gil Noh
 
"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017Neeran Karnik
 
Design Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databasesDesign Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databaseslovingprince58
 
Bash shell programming in linux
Bash shell programming in linuxBash shell programming in linux
Bash shell programming in linuxNorberto Angulo
 

Semelhante a Slides (11)

A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
Sequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learningSequence to sequence (encoder-decoder) learning
Sequence to sequence (encoder-decoder) learning
 
The Bash Love
The Bash LoveThe Bash Love
The Bash Love
 
David Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AIDavid Barber - Deep Nets, Bayes and the story of AI
David Barber - Deep Nets, Bayes and the story of AI
 
A Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentA Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual Entailment
 
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
 
Tg noh jeju_workshop
Tg noh jeju_workshopTg noh jeju_workshop
Tg noh jeju_workshop
 
"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017"Hints" talk at Walchand College Sangli, March 2017
"Hints" talk at Walchand College Sangli, March 2017
 
Introduction to post_gis
Introduction to post_gisIntroduction to post_gis
Introduction to post_gis
 
Design Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databasesDesign Patterns For Distributed NO-reational databases
Design Patterns For Distributed NO-reational databases
 
Bash shell programming in linux
Bash shell programming in linuxBash shell programming in linux
Bash shell programming in linux
 

Mais de andrefsantos

Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pintoandrefsantos
 
Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documentsandrefsantos
 
Cleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerCleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerandrefsantos
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesandrefsantos
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...andrefsantos
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatosandrefsantos
 
Bigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesBigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesandrefsantos
 

Mais de andrefsantos (9)

Elasto Mania
Elasto ManiaElasto Mania
Elasto Mania
 
Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pinto
 
Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documents
 
Cleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerCleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleaner
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challenges
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
 
Bigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challengesBigorna - a toolkit for orthography migration challenges
Bigorna - a toolkit for orthography migration challenges
 
Bigorna
BigornaBigorna
Bigorna
 

Último

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

Slides

  • 1. Contributions for building a Corpora-Flow system Andr´ Santos e andrefs@cpan.org Informatics Engineering MSc University of Minho December 2011
  • 2. Concepts Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, . . . ) from each text. Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus. 1 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 3. Concepts Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, . . . ) from each text. Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus. This presentation and the underlying master thesis describe the implementation of several tools to be used in typical corpus building activities. 1 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 4. Context The work developed in the context of this master thesis was motivated and supported by Project Per-fide, an undergoing project in University of Minho which aims to build large parallel corpora between Portuguese and other six languages. 2 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 5. Corpora building challenges file format and format conversion finding duplicated files text encoding format structural residues section delimiters unpaired sections (parallel corpora) ... 3 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 6. Corpora building challenges Severe problems which often lead to bad results Many (most?) of them are hard/impossible to solve completely Find the problem and report it when it is not solvable automatically Provide intelligent ways of describing what was found and done 4 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 7. 5 key issues Book cleaning Duplicates and candidate pairs detection Book synchronization Alignment evaluation Corpora-flow system 5 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 8. Book processing problems – Motivation (...) d <92>’ entr´e, donnant acc`s dans la salle commune. e e Une l´g`re v´randa, qui en prot´- e e e e M <96>- 86 <96>- ^L geait la partie ant´rieure contre l <92>’ action e des rayons solaires, reposait sur de sveltes bambous. (...) La Jangada, Jules Verne 6 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 9. Book processing problems – Motivation (...) d <92>’ entr´e, donnant acc`s dans la salle commune. e e Une l´g`re v´randa, qui en prot´- e e e e M <96>- 86 <96>- ^L geait la partie ant´rieure contre l <92>’ action e des rayons solaires, reposait sur de sveltes bambous. (...) La Jangada, Jules Verne <92>’ : right single quot. mark (CP1252) <96>- : en dash (CP1252) ^L : page break (0xC) prot´-(...)geait : transpagination e 6 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 10. Book processing problems – Motivation (...) d <92>’ entr´e, donnant acc`s dans la salle commune. e e Une l´g`re v´randa, qui en prot´- e e e e M <96>- 86 <96>- ^L geait la partie ant´rieure contre l <92>’ action e des rayons solaires, reposait sur de sveltes bambous. (...) La Jangada, Jules Verne (...) d ’ entr´e, donnant acc`s dans la salle commune. e e Une l´g`re v´randa, qui en prot´geait _pb1_ e e e e la partie ant´rieure contre l ’ action e des rayons solaires, reposait sur de sveltes bambous. (...) 6 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 11. Book cleaning Subdivided in several steps: 7 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 12. Sections ontology chap PT cap´tulo, ı contains common section types cap, capitulo FR chapitre, chap used to automatically generate EN chapter, chap the code to recognize section NT sec delimiters end PT fim allows discussion/cooperation FR fin EN the_end with people with no BT _alone programming knowledge scene code becomes more simple and PT cena FR sc`ne e clean EN scene RU глава BT act 8 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 13. Duplicates and pairs detection Motivation Duplicates can result in a biased corpus Finding candidate pairs for alignment Language independent elements (LIEs) terms which are usually kept untranslated year references – “1973” proper names – “Hamlet” Measuring similarity Thresholds < 0.2: unrelated |ALIEs ∩ BLIEs | > 0.4: pair similarity (A, B) = |ALIEs ∪ BLIEs | > 0.9: duplicates 9 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 14. Book synchronization Definition Structural alignment at section level, based on previously added section delimiting marks. Motivation Some aligners cannot handle large documents Section delimiters can act as anchor points Unpaired sections can be discarded Implementation match similar section delimiters synchronization points 10 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 15. Output pair of files with synchronization marks pair of files divided into smaller pairs of chunks text report synchronization matrix 11 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 16. Output pair of files with synchronization marks pair of files divided into smaller pairs of chunks text report synchronization matrix 11 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 17. Alignment evaluation Motivation compare alignments of the same documents (performed by different tools, with different options, . . . ) determine if an alignment was successful 12 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 18. Alignment evaluation Motivation compare alignments of the same documents (performed by different tools, with different options, . . . ) determine if an alignment was successful Comparing alignments parse TMX files and output the total number correspondences of each type 0:1/1:0, 1:1, 2:1/1:2 and 2:2 evaluate the other tools developed compare the performance of the available alignment tools 12 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 19. Alignment evaluation Determine if an alignment was successful Summarize a TMX by sampling. Sampling can be performed based on: number of samples desired explicit sampling points translation units which match a given regular expression Output is a (much?) smaller TMX file 13 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 20. Alignment evaluation The Name of the Rose, Umberto Eco 14 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 21. Alignment evaluation AdsonDE = АдсоRU The Name of the Rose, Umberto Eco 14 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 22. Alignment evaluation AdsonDE = АдсоRU The Name of the Rose, Umberto Eco 14 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 23. Alignment evaluation AdsonDE = АдсоRU The Name of the Rose, Umberto Eco 14 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 24. Distribution All the tools implemented as Perl modules: Text::Perfide::BookCleaner Text::Perfide::BookPairs Text::Perfide::BookSync Text::Perfide::TMX::Utils publicly available on CPAN including tests and documentation additional effort required to make code installable and usable by other people 15 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 25. Corpora-flow Motivation building a corpus is a complex task linear pipeline is not powerful enough Workflow Makefiles states file-oriented actions timestamps and conditions dependencies context fail-fast and resumable execution parallelization 16 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 26. Corpora-flow workflow + Makefiles = corpora-flow DSL (→ Slay::Makefile) workflow: rule* rule: pre-condition* action post-condition* action: targets dependencies function condition: filename function target: pattern* dependencies: pattern* function: Perl code 17 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 27. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 28. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 29. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 30. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 31. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 32. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 33. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 34. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 35. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 36. Conclusions Evaluation of the tools has shown that they do help to solve problems Most of the methods devised can be applied in other contexts Working within a larger project: provides requirements and resources specific needs and priorities making code available to other people: requires additional effort gives meaning to the work external contributions Higher level objects help to organize and discuss 18 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 37. Future work Document cleaners other types of documents (e.g. scientific articles) algorithm for finding section delimiters with notion of hierarchy create ebooks/bilingual books Duplicates and pair detection list of correspondences (e.g. Adson → Адсо, London → Londres) calculate best threshold values in real time 19 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 38. Future work Document synchronization interactive mode improvements on synchronization matrix and metrics hierarchical sections other section alignment algorithms Corpora-flow finish specification and implementation implement a corpora-flow for Project Per-fide 20 Andr´ Santos, andrefs@cpan.org e Contributions for building a Corpora-Flow system
  • 39. Contributions for building a Corpora-Flow system Andr´ Santos e andrefs@cpan.org Informatics Engineering MSc University of Minho December 2011