Bigorna
A Toolkit for Orthography Migration Challenges
José João Almeida, André Santos, Alberto Simões


 Abstract
   Languages are born, evolve and, eventually, die. During this evolution their spelling
 rules (and sometimes the syntactic and semantic ones) change, leaving old documents
 out of date. In Portugal, a pair of political agreements with Brazil forced relevant
 changes in the way the Portuguese language is written, the most recent one being the
 Portuguese Language Orthographic Agreement (PLOA), signed in 1990.
   Bigorna is a toolkit for classifying language variants, comparing them, and converting
 texts between different language versions. As Bigorna relies on a set of conversion
 rules, we will also discuss how to infer such rules from a set of documents (texts of
 different ages).

 Conversion examples
 $ pt2ptao
 A adopção do acordo implica a actualização de ferramentas.
                 ↓
 A adoção do acordo implica a atualização de ferramentas.

 Conversion examples
 $ br2brao
 Ele fez um vôo rasante sobre a aréia.
                 ↓
 Ele fez um voo rasante sobre a areia.

 Contents
 1 Compiling OAKB
 2 Updating the dictionary vocabulary
 3 Language conversion tools
 4 Variant classifier tool
 5 Lexical comparison tools

1      Compiling OAKB
  OAKB is a table containing all the information about the word changes. It was built from
previously existing resources and proved crucial to the subsequent tasks.

4      Variant classifier tool
  Once the language converters had been created, the need for a language classifier became
clear: a tool capable of detecting the variant of Portuguese in which a given text was
written, so that texts can be told apart automatically (possibly for further conversion).
  To build this tool, two lists were generated: one with European Portuguese-only words and
another with Brazilian Portuguese-only ones.

 Calculating the lists
 Function calc_whichpt_lsts(dicpt,dicbr,oakb)
   for ( x ∈ dom(oakb)
      ∧ oakb[x].type = (normal or accent)
      ∧ oakb[x].pt_pt ≠ oakb[x].pt_br
      ∧ oakb[x].pt_pt ∈ dom(dicpt)
      ∧ oakb[x].pt_br ∈ dom(dicbr))
    {
     wpt ← oakb[x].pt_pt
     wbr ← oakb[x].pt_br
     justpt ← justpt ∪ {w ∈ deriv(wpt,dicpt) | w ∉ dicbr}
     justbr ← justbr ∪ {w ∈ deriv(wbr,dicbr) | w ∉ dicpt}
    }

 Language classifier definition
 Function classify_pt(text)
   for ( x ← text )
      if ( x ∈ justpt ) PTcount++
      if ( x ∈ justbr ) BRcount++
   compare ( PTcount, BRcount)
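
The classifier definition above can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular-expression tokeniser, the returned labels and the handling of ties are not taken from Bigorna, and the two tiny word sets stand in for the real justpt and justbr lists.

```python
import re

def classify_pt(text, just_pt, just_br):
    """Count variant-exclusive words and return 'pt-PT', 'pt-BR' or 'unknown'."""
    pt_count = br_count = 0
    for word in re.findall(r"\w+", text.lower()):
        if word in just_pt:
            pt_count += 1
        if word in just_br:
            br_count += 1
    if pt_count > br_count:
        return "pt-PT"
    if br_count > pt_count:
        return "pt-BR"
    return "unknown"  # tie or no evidence (assumed behaviour)

# Toy word lists, for illustration only.
just_pt = {"académico", "génio", "reflectiu"}
just_br = {"acadêmico", "gênio", "refletiu"}

print(classify_pt("O gênio acadêmico refletiu.", just_pt, just_br))
```

Using sets keeps each membership test constant-time, which matters when the lists are derived from thousands of dictionary forms.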
 OAKB structure
 oakb = entry*
 entry =
   pt_pt                : word
   pt_br                : word
   pt_oa                : word *
   preferencial_pt      : word
   preferencial_br      : word
   type    : Capit      | Hyphen | Accent | Normal | Excep

 adenóide :: adenóide :: adenoide :: adenoide :: adenoide :: Accent
 adjecção :: adjeção  :: adjeção  :: adjeção  :: adjeção  :: Normal
 Março    :: março    :: março    :: março    :: março    :: Capit

5      Lexical comparison tools
  In other situations no word lists are available, only documents in different
orthographic versions.
  lexdiff compares two versions of a text with different spellings and detects
(linguistic) differences. Its output may be used to help build tools such as those
described in the other sections.

 Lexdiff example: word level changes
 $ lexdiff -wa AmPerd.ptBR AmPerd.ptPT
    32 acadêmico → académico
    14 idéia     → ideia
    12 redargüiu → redarguiu
     7 gênio     → génio
     4 refletiu  → reflectiu
    ...
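
A word-level comparison of this kind can be approximated with a few lines of Python. The sketch assumes both versions tokenise to the same number of words, so they can be aligned position by position; how lexdiff itself aligns the two texts is not described here, and `word_changes` is a hypothetical name.

```python
import re
from collections import Counter

def word_changes(text_a, text_b):
    """Count (old, new) word pairs that differ between two aligned texts."""
    words_a = re.findall(r"\w+", text_a.lower())
    words_b = re.findall(r"\w+", text_b.lower())
    if len(words_a) != len(words_b):
        raise ValueError("texts do not align word by word")
    return Counter((a, b) for a, b in zip(words_a, words_b) if a != b)

# Toy sentences built from words in the example output above.
br = "O gênio teve uma idéia. Outra idéia surgiu."
pt = "O génio teve uma ideia. Outra ideia surgiu."
for (old, new), n in word_changes(br, pt).most_common():
    print(f"{n:4d} {old} → {new}")
```

Texts that differ by inserted or deleted words would need a proper sequence alignment before counting.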

 Lexdiff example: char level changes
 $ lexdiff -cctx AmPerd.ptBR AmPerd.ptPT
      changed   PT→BR     (unchanged)    changed   BR→PT     (unchanged)  Concl
   !      36    ect→et    (9)                36    et→ect    (206)        BR →? PT
   !      34    dém→dêm   (1)         !!     34    dêm→dém                BR ?↔ PT
          18    dei→déi   (164)       !!     18    déi→dei                BR → PT
          17    gui→güi   (88)        !!     17    güi→gui                BR → PT
          15    que→qüe   (2417)      !!     15    qüe→que                BR → PT
   !!     11    gén→gên               !      11    gên→gén   (6)          BR → PT
   !!      9    món→môn               !!      9    môn→món                BR ↔ PT
   !       8    act→at    (1)                 8    at→act    (456)        BR ← PT
   !!      7    ecç→eç                        7    eç→ecç    (77)         BR ← PT
   !!      6    acç→aç                        6    aç→acç    (431)        BR ← PT
   !!      6    tón→tôn               !!      6    tôn→tón                BR ↔ PT

2      Updating the dictionary vocabulary
  An existing European Portuguese spellchecker dictionary (jSpell) was updated. This
dictionary was later used to generate the word lists for both the language conversion
tools and the language classifier.
  Of the 2 600 words in OAKB, just 960 were directly related to a lemma in jSpell’s
dictionary. From these 960 lemmas jSpell generates a total of 11 500 words.

 Update function
 Function newdic(oakb,dicjs)
   for ( x ∈ dom(oakb) ∧ oakb[x].type = normal
        ∧ x ≠ oakb[x].preferencial_pt ∧ x ∈ dom(dicjs))
     {
       neww ← oakb[x].preferencial_pt
       dicjs[neww] ← dicjs[x]
       delete dicjs[x]
     }
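
The update function can be rendered in Python roughly as follows. This is a sketch under assumptions: OAKB is modelled as a dict keyed by the current European Portuguese spelling, each value holding `type` and `preferencial_pt` fields, and the jSpell dictionary as a plain dict from lemma to an opaque entry. The real jSpell file format is not modelled. Only lemmas whose current spelling differs from the preferred one are renamed.

```python
def newdic(oakb, dicjs):
    """Rename dictionary lemmas to their preferred post-agreement spelling."""
    for old, entry in oakb.items():
        new = entry["preferencial_pt"]
        if entry["type"] == "Normal" and old != new and old in dicjs:
            # Move the dictionary entry under the new lemma.
            dicjs[new] = dicjs.pop(old)
    return dicjs

# Toy data; the "<entry>" strings are made-up placeholders.
oakb = {"adjecção": {"type": "Normal", "preferencial_pt": "adjeção"}}
dicjs = {"adjecção": "<entry>", "casa": "<entry>"}
print(newdic(oakb, dicjs))
```

Because the morphological information stays attached to the renamed lemma, the derived forms would follow the new spelling without being listed one by one.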
3      Language conversion tools
  As many texts will need to be updated to the new spelling, automated conversion tools
were needed. Given the multiple spelling cases involved, two versions were created: a
European Portuguese converter and a Brazilian Portuguese one.

 Confusion matrix pt-br → pt-pt (from the lexdiff char-level comparison)
    et  → { et → 206, ect → 36 },
    déi → { dei → 18 },
    güi → { gui → 17 },
    at  → { at → 456, act → 8, apt → 1, apt → 1 },
    eç  → { eç → 77, ecç → 7, eaç → 2, epç → 2 },
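
A confusion matrix like the one above can be accumulated from counted word pairs. The following Python sketch is an assumption-laden illustration, not lexdiff's algorithm: it extracts the differing character spans with the standard `difflib.SequenceMatcher`, adds one character of context on each side (an arbitrary choice), and counts only the changed cells, not the unchanged ones. The function name `char_changes` is hypothetical; the two pairs and their counts are taken from the word-level lexdiff example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def char_changes(pairs, context=1):
    """Map each changed source span (with context) to its target spans."""
    matrix = defaultdict(lambda: defaultdict(int))
    for (src_word, dst_word), count in pairs.items():
        ops = SequenceMatcher(None, src_word, dst_word).get_opcodes()
        for tag, i1, i2, j1, j2 in ops:
            if tag == "equal":
                continue
            # Widen the changed span by `context` characters on each side.
            src = src_word[max(i1 - context, 0):i2 + context]
            dst = dst_word[max(j1 - context, 0):j2 + context]
            matrix[src][dst] += count
    return matrix

pairs = {("idéia", "ideia"): 14, ("refletiu", "reflectiu"): 4}
for src, targets in char_changes(pairs).items():
    print(src, "→", dict(targets))
```

Rows with several competing targets indicate contexts where a conversion rule cannot be applied blindly.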

                                                                           http://natura.di.uminho.pt/
