SlideShare uma empresa Scribd logo
1 de 6
Baixar para ler offline
Bigorna – a toolkit for orthography migration challenges

                               José João Almeida1 , André Santos1 , Alberto Simões2
                                1
                              Departamento de Informática, Universidade do Minho, Portugal
                2
                 Escola Superior de Estudos Industriais e de Gestão, Instituto Politécnico do Porto, Portugal
              jj@di.uminho.pt, pg15973@alunos.uminho.pt, alberto.simoes@eu.ipp.pt

                                                                Abstract
Languages are born, evolve and, eventually, die. During this evolution their spelling rules (and sometimes the syntactic and semantic
ones) change, putting old documents out of use. In Portugal, a pair of political agreements with Brazil forced relevant changes on the
way the Portuguese language is written.
In this article we will detail these two Orthographic Agreements (one in the thirties and the other more recently, in the nineties), and the
challenges present on the automatic migration of old documents spelling to their actual one.
We will reveal Bigorna, a toolkit for the classification of language variants, their comparison and the conversion of texts in different
language versions. These tools will be explained together with examples of migration issues. As Birgorna relies on a set of conversion
rules we will also discuss how to infer conversion rules from a set of documents (texts with different ages).
The document concludes with a brief evaluation on the conversion and classification tool results and their relevance in the current
Portuguese language scenario.


                     1.    Introduction                                 impact. In this article we are interested into the latest, that
                                                                        can be summarized as:
Languages evolve. This evolution can be natural and grad-
ual, when speakers change habits during decades, or can be                 • 1931: This was the first orthographic agreement be-
forced and drastic, when some kind of regulatory institu-                    tween Portugal and Brazil trying to approximate both
tion defines some rules (or heuristics) on how the language                   languages. This change abolished the silent s from
orthography will change.                                                     words such as sciência, scena, scéptico. It also
The case of the recent evolution on the Portuguese language                  changed the way compound verbs were written, from
is the latest. This change was mainly a political issue to                   dir-se há and amar-te hei to dir-se-á and amar-te-ei.
approximate the different countries that speak Portuguese.
Details on the story of this evolution can be found on sec-                • 1945: This new reform just hit the language in Por-
tion 1.1..                                                                   tugal, cleaning a set of different accents, like remov-
In any of the two presented evolution cases, documents lose                  ing completely the umlaut (trema, in Portuguese), and
importance and relevance as soon as their orthography is                     removing the circumflex accent in homograph words
outdated.                                                                    such as, acêrto and acerto, cêrca and cerca, côr and
With the latest changes on the Portuguese language, and                      cor, fôra and fora, dêsse and desse, and so on.
the amount of information that is typical of this era, it be-
came necessary to develop automatic methods for language                   • 1973: While not a political agreement, Portugal fol-
modernization.                                                               lowed Brazil and abolished accent marks in secondary
This document discusses Bigorna, a toolkit developed to                      stressed syllables (tètralogia and tetralogia).
help on the update of documents orthography. It was devel-
oped to be as language independent as possible, relying in                 • 1990/2009: On December the 16th of 1990 the Por-
a set of rules learned by the comparison of old and actual                   tuguese Language Orthographic Agreement (PLOA)
corpora orthographic differences.                                            (da República, 1991) was approved in Lisbon by An-
This section will present the main steps of Portuguese lan-                  gola, Cape Verde, Guinea-Bissau, Mozambique, Sao
guage evolution, the motivation for our work, what was                       Tome and Principe, Brazil, Portugal and a delegation
the initial state before the project beginnings, and Bigorna                 of galician observers. Its goal is to unify, as widely
main goals. It will be followed by a section on language                     as possible, the general vocabulary of the Portuguese
resources construction and tools development. Section 3.                     language.
presents a brief evaluation of the developed tools. It is fol-                The urge to bring closer all variants of Portuguese im-
lowed by some conclusions and future work issues.                             plies some changes on the structure and content of the
                                                                              language itself. On those content changes, the pho-
1.1.   Portuguese Language Evolution                                          netic criteria (or pronunciation one) was chosen in-
The Portuguese language evolved during the last century                       stead of the etymological one. This led to some cases
through successive orthographic agreements, especially be-                    of multiple spelling (whenever a word is pronounced
tween Portugal and Brazil. A full time-line of changes can                    differently according to the variant in use) like facto
be consulted in (Wikipedia:Reforms, 2010). Some of the                        and fato, or suntuoso and sumptuoso. The lack of con-
described changes are just political, some other had real                     sensus about whether some words are to be updated or
not, and the absence of a common orthographic vocab-              • therefore, it is important to create and maintain an Or-
       ulary (as foreseen in article 2 of the PLOA) gets this              thographic Agreement’s Knowledge Base — OAKB, a
       process even more complex.                                          table containing lemmas, changes and rules based on
                                                                           the currently existing lists, new examples found and
While this latest political change was accepted and should                 the results of processing texts. This table will be up-
be put in practice soon, real users of the language will con-              dated and evolve (one might never know when it is
tinue to write as they learned for some time, leading to a                 complete);
double orthographic form. This means that the two forms
(before-1990 and after-1990) are relevant at the moment.                 • the main problem is on how to determine which words
                                                                           are candidates to integrate the OAKB (what are the
1.2.    Motivation                                                         changing words).
The motivation to this work is two fold: to update old                 It was clear the need of a project dedicated to the migration
documents in the public domain (with more than 70 years                process. It was Bigorna project birth.
after author dead) to current used orthographic form, and
the need to update current documents to the future1 ortho-             1.4. Bigorna’s design goals
graphic form, as defined by PLOA:                                       In order to help in the migration process, it is necessary to
                                                                       have some resources and tools. The first set of requirements
  • With the actual tendency to public accessible repos-               included:
    itories or libraries in the Internet, and as a special               • collect all the resources related to PLOA that are al-
    case Project Gutenberg2 , a lot of books that are cur-                 ready available in the Internet;
    rently under public domain are being scanned and
    transcribed. While it is crucial to keep the original                • create a spell-checker dictionary for the new Por-
    orthographic form in the archive for linguistic stud-                  tuguese language;
    ies (for example), their modernization is also relevant,
                                                                         • build a tool to convert text from the pre-agreement
    namely for reading purposes.
                                                                           Portuguese to the updated Portuguese language;
       This leads to the necessity to develop a tool to per-
       form the systematic migration of language from before             • build a classification tool, to identify the variant of
       1930 to the currently used (before 1990/2009 reform).               Portuguese used in a given text;
                                                                         • make the previous tools work with as few dependen-
  • We are in the information era. A lot of different doc-                 cies as possible.
    uments are currently available and there will be the
    need to update them to the new orthographic form.                  We soon understood that the set of changing words was be-
                                                                       ing updated frequently as PLOA is not defined by exhaus-
       The PLOA implementation demands a revision and                  tion (it does not list all words, just defines a few heuristics).
       an update of the existing linguistic tools and the cre-         This lead to the following decisions:
       ation of new ones, and the complexity of these tasks
       increase due to the multiple spelling cases (see sec-             • center all the information about the word changes in
       tion 1.3.).                                                         an unique table (OAKB);
                                                                         • build a system, that given that table, would generate:
A great help to both processes would be the creation of a
set of tools capable of inferring migration rules through the                  – the new spell-checker dictionary;
comparison of texts. Tools like those would be useful to de-                   – the language migration tool;
velop applications that people could use to face the issues
                                                                               – the Portuguese language variant classifier
of PLOA. While this document will mainly focus PLOA
implementation, the development tools also have a role in                • develop tools to help in the construction and manipu-
updating old documents spelling, or in future language up-                 lation of OAKB:
dates.
                                                                               – tools to find lemma changing candidates from
1.3.    Summarizing                                                              texts;
The current scenario in the Portuguese language, as created                    – a tool to extract lexical differences from pairs of
by the PLOA:                                                                     texts;
                                                                         • make the previous tools as generic and reusable as pos-
  • the spelling changes dictated by the agreement cannot                  sible.
    be determined automatically, as they rely on phonetic
    criteria and may sometimes be ambiguous (different                               2.   Bigorna Development
    accents, for instance);                                            This section describes the Bigorna approach and its devel-
                                                                       opment details. We will first focus how resources were
    1
      We will use “current orthographic form” to refer to the ortho-   compiled into OAKB and then how the spell checker dic-
graphic form before 1990 agreement, and the “future form” for          tionary was updated. Follows a description about the cre-
the orthographic form after 1990 agreement.                            ated tools: conversion tools, the language classifier and the
    2
      http://www.gutenberg.org/                                        lexical comparison tools.
2.1.   Compiling OAKB                                              Function newdic(oakdb,dicjs)
                                                                     for ( x ∈ dom(oakb)
To start the compilation of resources to integrate the knowl-
                                                                          ∧ oakb[x].type = normal
edge base we started by searching for free resources related              ∧ x = oakb[x].preferpt
to the PLOA. We found several online dictionaries, spell-                 ∧ x ∈ dom(dicjs))
checkers, vocabularies and a growing number of publica-                {
tions about the subject.                                                 neww ← oakb[x].preferpt
From the vocabularies found, one stands out, a list of                   dicjs[neww] ← dicjs[x]
changed words, containing over 9 000 entries, published                  delete dicjs[x]
by Instituto de Linguística Teórica e Computacional (IL-               }
TEC, 2008). Another list, not as complete, was compiled
                                                                   From the 2 600 words in OAKB, just 960 were related di-
by Projecto Quintus (Quintus, 2008).
                                                                   rectly with a lemma in jSpell’s dictionary. Note that from
Each entry on this list contained both the European and
                                                                   these 960 lemmas jSpell generates a total of 11 500 words.
Brazilian Portuguese versions of a given word together with
that word’s updated form and a brief comment indicating            2.3. Language Conversion Tools
the preferential new form for each Portuguese variant.             Our second task was the development of a conversion tool,
All lines where there was no change accordingly to PLOA            to migrate texts from the pre-agreement spelling to the new
were deleted and the remaining ones were used to com-              one.
pile our first OAKB version. This resulted in approximately         Due to the multiple spelling cases which are, moreover, de-
2 600 words.                                                       pendent on the variant of Portuguese being used, it was not
After the reorganization process, the structure of OAKB is:        possible to create a single conversion tool. Therefore, we
                                                                   decided to create two different tools: one capable of up-
oakb = entry*
                                                                   dating European Portuguese (pt2ptao) and one other ca-
                                                                   pable of updating Brazilian Portuguese (br2brao). Note
entry =
  pt_pt                   :   word                                 that these tools convert the original variant to the after-
  pt_br                   :   word                                 agreement variant.
  pt_oa                   :   word *                               Once again, we started with OAKB and jSpell as our base
  preferencial_pt         :   word                                 tools. Typical entries in the jSpell’s dictionary include the
  preferencial_br         :   word                                 lemma, its morphologic properties, and the derivation rules
  type    : Capit         |   Hyphen | Accent | Normal             valid for that lemma:
                                                                                 acalentar/#vt/XYPLD/
The field type (of change) was used in order to make pos-
                                                                                 coiote/#nm/p/
sible to define different ways of dealing with different sit-
                                                                                 laico/#a/fidp/
uations (remotion of capitalization depends on context, re-
                                                                                 zinco/#nm//
motion of accent marks depends not). Presently we only
differentiate the hyphenation changes, but in the future the       OAKB was used to extract a list of the changing lem-
other types will be discriminated as well.                         mas. This list includes the European Portuguese lem-
                                                                   mas and their updated form (lemma_pt/lemma_ptao),
2.2.   Updating the dictionary vocabulary                          and from jSpell dictionary were extracted the deriva-
To prepare a spellchecker dictionary we decided to use             tion rules valid for that same Portuguese lemma
jSpell, an open-source morphological analyzer (Almeida             (lemma_pt/rules).
and Pinto, 1995; Simões and Almeida, 2002). jSpell has             By expanding each lemma_pt and each lemma_ptao in
the advantage that its updates can be easily propagated (us-       parallel (using the corresponding jSpell rules) we obtained
ing Chuveiro de Dicionários (dos Santos Vilela, 2009))             a new table where each entry is structured as follows:
to dictionaries for other spell-checkers engines (aspell,                           word_pt/word_ptao
ispell, hunspell and myspell), some of which are
used by well known open source applications like Firefox,          As we did not have the derivation rules for Brazilian Por-
Thunderbird or OpenOffice.                                          tuguese words (because jSpell’s Portuguese dictionary is
JSpell also has the ability to, when provided with a list of       European Portuguese specific), we decided to use the rules
lemmas and their derivation rules, generate an exhaustive          present on OAKB to derivate the corresponding Brazilian
list of all derived words. The relevance of this feature is that   Portuguese words, and their updated form. Then, by a pro-
the language updating can affect just the lemma and/or the         cess similar to the above, we obtained a table with the fol-
generation rules, and a set of words will be automatically         lowing structure:
fixed. Also, the generated lists can be compared with the
                                                                                    word_br/word_brao
entries into OAKB easily.
The update of the dictionary was performed by searching            The conversion tool was developed in Perl and we tried
the words from OAKB in jSpell’s European Portuguese dic-           to minimize its dependencies to make it easier to install
tionary and replacing them by their updated version. When          and use. For the very same reason, we decided to include
there was more than one possible form, we chose the ver-           the word table itself in the script file, resulting in a self-
sion recommended for European Portuguese.                          contained single file.
The tool performs the conversion by sweeping a given text        This tool was tested with two variants from the book Amor
looking for word_pt words and replacing them by the              de Perdição, from the Portuguese author Camilo Castelo-
matching word_ptao (word_br and word_brao in                     Branco, one written in European Portuguese and the other
the Brazilian version). We included some additional op-          in Brazilian Portuguese. The results were the expected.
tions which protect XML tags and LTEX commands within
                                  A
the text.                                                        $ whichPT AmorPerd.ptPT AmorPerd.ptBR
Follows some examples of the tools interaction.                  AmorPerd.ptPT     pt
                                                                 AmorPerd.ptBR     br
$ pt2ptao
A adopção do acordo implica a actualização                       2.5.    Lexical Comparison Tools
de ferramentas.                                                  The lists of words compiled above were based on previous
             ↓
                                                                 existing lists. But there are other situations where there are
A adoção do acordo implica a atualização
                                                                 no available lists of words. Although there are no lists, there
de ferramentas.
                                                                 are documents with different orthographic versions.
$ br2brao                                                        With that in mind, we built a tool to help on comparing two
Ele fez um vôo rasante sobre a aréia.                            versions of a text and detect (linguistic) differences between
             ↓                                                   two versions of a text with different spelling.
Ele fez um voo rasante sobre a areia.                            This script, named lexdiff, receives two files and com-
                                                                 pares them with the help of the Unix commands egrep
The different character encoding systems were the source         and diff.
of several problems and the target of a special care during      The result of lexdiff applied to previously mentioned
this project. Both versions of the conversion tool, as well      versions of Amor de Perdição is:
as the other programs developed by Bigorna, accept both
ISO-8859-1 and UTF-8 encoded text.                               $ lexdiff -ac AmPerd.ptBR AmPerd.ptPT
                                                                    32 acadêmico => académico
2.4.   Variant Classifier                                            14 idéia     => ideia
A relevant goal on Bigorna is the identification of the Por-         12 redargüiu => redarguiu
tuguese variant present on some text. This feature is crucial       7 gênio      => génio
                                                                    4 refletiu   => reflectiu
so we can use one of the tools presented above automati-
                                                                     ...
cally without human intervention.
Another tool, named whichPT, was developed to identify
                                                                 For each changed word lexdiff outputs the number of
whether the European or the Brazilian Portuguese variant is
                                                                 times that change ocurred on the given texts.
present on a document.
                                                                 The script is also able to calculate narrower differences.
The tool incorporates a list of (just) European Portuguese       This option allows us to detect changing patterns (se-
words and another of (just) Brazilian Portuguese ones,           quences of characters that get usually updated), indepen-
compiled with a subset of the conversion tools lists.            dently of the word. Below is the result of this narrower
                                                                 approach to the same pair of texts:
Function calc_whichpt_lsts(dicpt,dicbr,oakb)
  for ( x ∈ dom(oakb)
                                                                 $ lexdiff -m -ac AmPerd.ptBR AmPerd.ptPT
      ∧ oakb[x].type = (normal or accent)
                                                                     36 et => ect
      ∧ oakb[x].pt_pt = oakb[x].pt_br
                                                                     18 déi => dei
      ∧ oakb[x].pt_pt ∈ dom(dicpt)
                                                                     17 güi => gui
      ∧ oakb[x].pt_br ∈ dom(dicbr))
                                                                     8 at => act
   {
                                                                     7 eç => ecç
     wpt ← oakb[x].pt_pt
                                                                      ...
     wbr ← oakb[x].pt_br
     justpt ← justpt ∪
          {x ∈ deriv(wpt,dicpt)| x ∈ dicbr}
                                    /                            The option shown above creates a confusion matrix (CM)
     justbr ← justbr ∪                                           which relates each word with all the possible words it can
          {x ∈ deriv(wbr,dicbr)| x ∈ dicpt}
                                    /                            be. Follows an extract of the CM created in the previous
   }                                                             example:

The classification is performed by counting the number of                et    =>   {   ect   =>   36   },
words from the text that have a match in each of the lists; at          déi   =>   {   dei   =>   18   },
                                                                        güi   =>   {   gui   =>   17   },
the end, the result is the variant with the highest number of
                                                                        at    =>   {   aat   =>   1,
matches.
                                                                                       apt   =>   1,
This implementation allows to easily expand the script to                              act   =>   8    },
other variants (for example, ancient Portuguese). It is also            eç    => {     eaç   =>   2,
possible to perform the classification ignoring the words in-                           ecç   =>   7,
side XML tags or LTEX commands, to print at the end the
                     A                                                                 epç   =>   2    },
exact number of matches in each variant, or to output all
the words matching each variant.                                 There are many uses for lexdiff:
• it can be used to search for words that could be in-                   NrWords     pt2ptao   Priberam      PE      PdLP
    serted into OAKB and, consequently, in the updated                       Total        69        80          78       48
    dictionary, the converters and the classifier, or to see                Different      49        60          58       33
    how similar two given texts are;                                       Precision     1.0        1.0         1.0      1.0
                                                                            Recall       0.80      0.98        0.95     0.54
  • it can even be used as a spell-checker, although its
    main application will be the generation of tools like           Table 2: Comparison of the number of changed words using
    the ones previously described in this article.                  the four different conversion systems.
At the moment we are developing a lexpatch tool that,
given a changes file produced by lexdiff can update                  After some analysis of the words not updated by pt2ptao
texts according to those changes.                                   we noted that it is not yet dealing with the task of lowercas-
The combined action of both of these tools will allow the           ing the months and days of the week.
automatic generation of converters after a learning process
                                                                    We tried a similar approach to evaluate the Brazilian Por-
made from manual adaptations of texts.
                                                                    tuguese version of this tool (br2brao). The systems
                                                                    chosen for the comparison were Priberam (they also have
                     3.    Evaluation                               a conversion tool for Brazilian Portuguese) and Interney7
In order to evaluate the tools previously described, we used        conversion tools.
different strategies and resources.                                 We tested these systems with a Brazilian Portuguese ver-
In this section we will discuss the process used and the re-        sion of “Amor de Perdição”, but we could not interpret the
sults obtained in the evaluation of pt2ptao, br2brao                results; this task must be performed by someone more ex-
and whichPT.                                                        perienced with Brazilian Portuguese, as it requires a native-
Given that lexdiff is not dependent of specific language             level knowledge of the language which none of the authors
resources, its evaluation deals with a different problem: if        possesses.
the algorithm is correct, if its results are useful, and if it      During this process of evaluation we found some complex
can be used for other relevant tasks. While lexdiff was             aspects that need a careful approach:
used in the tests descried here, we will not deal with its
evaluation in this document.                                          • One may not want to convert words which are part
                                                                        of composite proper names (names of streets, family
3.1.   Evaluating the Conversion Tools                                  names, etc);
In order to evaluate both versions of the conversion tool
we decided to test other similar tools in order to make a             • The remotion of capitalization should not be applied
common evaluation and compare the results obtained using                to words in the beginning of sentences;
the different systems.
The systems compared to evaluate the European Portuguese              • The PLOA document is not clear on how/when to ap-
version (pt2ptao) were Priberam3 , Porto Editora4 (PE)                  ply the hyphenation rules.
and Portal da Língua Portuguesa5 (PdLP) conversion tools.
Results of the ortographic updating of “Amor de Perdição”           Therefore, a conversion tool needs to be able to identify
are summarized in Table 1 and Table 2.                              phrases and entities in the text to perform accurate conver-
Given the lack of work power to select manually the chang-          sions.
ing words from the list of words present on the book, we            This process of comparative evaluation, which was made
compiled the set of words correctly changed in the four sys-        possible with the cooperation of the teams responsible by
tems and used this set to compute the recall measure6 .             the other conversion tools, has already lead to some im-
                                                                    provements in some of these tools.
        NrWords       Amor de Perdição        Changed
         Total            48 011                81                  3.2.     Evaluating the Language Classifier
        Different          8 717                61
                                                                    In order to evaluate the classifier (whichPT) we used a set
Table 1: Comparison of the number of words in Amor de               of 100 books in different Portuguese dialects (pt-pt, pt-br).
Perdição and the number of changed words.                           The classifier gave the correct answer in all the cases. One
                                                                    of the books was difficult to evaluate as it had a mixure of
                                                                    both dialects.
   3
     http://www.priberam.pt/                                        We then decided to stop the classification after obtaining a
   4
     http://portoeditora.pt/                                        certain number of language clues (20). With this limitation
   5                                                                we obtained 2 ambigous books.
     http://www.portaldalinguaportuguesa.org/
Due to timing reasons, Portal da Lingua Portuguesa results were     This evaluation was also relevant as, after analyzing the de-
gathered using the online version of their conversion tool, which   tected clues, a type was found on our knowledge base. Cor-
may have affected their results.                                    recting it the classified answered correctly for all books.
   6
     The real number of words to be changed should be higher
(probably not all the words were detected by the four systems),
                                                                       7
what would make the recall value lower.                                    http://www.interney.net/
4.   Conclusions                             Projecto Quintus. 2008. Lista total (?) de palavras
We are now replicating the updates on jSpell’s dictionary to       da norma culta lusoafricana que são modificadas pelo
other spell-checkers, using Chuveiro de Dicionários. The           acordo ortográfico de 1990. Inclui lista de palavras que
resulting dictionaries will be included as the official dictio-     mudam, Projecto Quintus.
nary for Firefox, Thunderbird and OpenOffice in the near          Alberto M. Simões and J.J. Almeida. 2002. Jspell.pm –
future.                                                            um módulo de análise morfológica para uso em proces-
Tools construction was only possible after the construction        samento de linguagem natural. In Actas do XVII Encon-
of OAKB. For this purpose the resources already available          tro da Associação Portuguesa de Linguística, pages 485–
on the Internet were crucial. We would not have resources          495, Lisboa 2001.
for the hand construction of this list.                          Alberto Simões and Rita Farinha. 2010. Dicionário
With a proper knowledge base the identification of variant          Aberto: Um novo recurso para PLN. Vice-Versa, (16),
is simple and comparable to common language identifica-             April. forthcomming.
tion tasks.                                                      Wikipedia:Reforms.       2010.      Reforms of por-
The conversion tool requires, as already explained, some           tuguese orthography.     Wikipedia article, http:
more linguistic knowledge that is not yet present on our           //en.wikipedia.org/wiki/Reforms_of_
tools, like the mentioned entity recognition. Nevertheless,        Portuguese_orthography. [Online; accessed
current approach is already performing comparatively well.         10-March-2010].
The migration tools are being tested and should be imple-
mented as web-services and released as stand-alone tools.
The lexical difference tool is already being used for
other projects like Dicionário-Aberto (Simões and Farinha,
2010), where a definitions dictionary from 1913 was tran-
scribed and is being migrated to the actual orthography.
While the full orthographic actualization is almost impossi-
ble to perform in an automatic way (mostly given the abun-
dance of different words present on a dictionary), current
experiments show that it is possible to update 89.39% of
the words automatically.

                    5.   Future work
The converters will be tested and improved with the help of
lexdiff and lexpatch.
The lexdiff script and subsequent work could be used to
improve the tools built in the previous stages of this project
(for example, new converters can be developed merely by
the analysis of handmade transcripts).

               6.    Acknowledgments
We would like to thank both the Priberam and the Porto
Editora teams, who applied their commercial tools on our
test cases for evaluation purposes.

                    7.    References
J.J. Almeida and Ulisses Pinto. 1995. Jspell – um módulo
   para análise léxica genérica de linguagem natural. In Ac-
   tas do X Encontro da Associação Portuguesa de Linguís-
   tica, pages 1–15, Évora 1994.
Diário da República.          1991.     Acordo ortográfico
   da língua portuguesa, 1990.              Technical Re-
   port 193, série I-A, 23 de Agosto.               http:
   //www.portaldalinguaportuguesa.org/
   index.php?action=acordo&version=1990.
Rui Miguel Rodrigues dos Santos Vilela. 2009. Geração
   de dicionários para correcção ortográfica do português.
   Master’s thesis, Escola de Engenharia, Universidade do
   Minho, Braga, Outubro.
ILTEC. 2008. Vocabulário de mudança. Inclui listas
   de palavras que mudam, Portal da Língua Porguesa.
   http://www.portaldalinguaportuguesa.
   org/index.php?action=novoacordo.

Mais conteúdo relacionado

Destaque

Kwf 6 7 Newfeatures En
Kwf 6 7 Newfeatures EnKwf 6 7 Newfeatures En
Kwf 6 7 Newfeatures Ensrrm7
 
iPhone iOS6 Email Setup Guide
iPhone iOS6 Email Setup GuideiPhone iOS6 Email Setup Guide
iPhone iOS6 Email Setup GuideJeremy Dawes
 
Outlook express imap settings
Outlook express imap settingsOutlook express imap settings
Outlook express imap settingsJeremy Dawes
 
12 Linked In Starter Tips
12 Linked In Starter Tips12 Linked In Starter Tips
12 Linked In Starter Tipswendypgreene
 
Outlook 2010 imap settings
Outlook 2010 imap settingsOutlook 2010 imap settings
Outlook 2010 imap settingsJeremy Dawes
 
Dibujo tecnico i
Dibujo tecnico iDibujo tecnico i
Dibujo tecnico iDiego030809
 

Destaque (13)

Kwf 6 7 Newfeatures En
Kwf 6 7 Newfeatures EnKwf 6 7 Newfeatures En
Kwf 6 7 Newfeatures En
 
York residence learning plan apr192012
York residence learning plan apr192012York residence learning plan apr192012
York residence learning plan apr192012
 
iPhone iOS6 Email Setup Guide
iPhone iOS6 Email Setup GuideiPhone iOS6 Email Setup Guide
iPhone iOS6 Email Setup Guide
 
La Excepción
La ExcepciónLa Excepción
La Excepción
 
Outlook express imap settings
Outlook express imap settingsOutlook express imap settings
Outlook express imap settings
 
12 Linked In Starter Tips
12 Linked In Starter Tips12 Linked In Starter Tips
12 Linked In Starter Tips
 
Loan Wize - Who wants to be a Millionaire?
Loan Wize - Who wants to be a Millionaire?Loan Wize - Who wants to be a Millionaire?
Loan Wize - Who wants to be a Millionaire?
 
Outlook 2010 imap settings
Outlook 2010 imap settingsOutlook 2010 imap settings
Outlook 2010 imap settings
 
The CACUSS Identity Project
The CACUSS Identity ProjectThe CACUSS Identity Project
The CACUSS Identity Project
 
Photo mix aj
Photo mix ajPhoto mix aj
Photo mix aj
 
Alf Lizzio 5 senses theory presentation
Alf Lizzio 5 senses theory presentationAlf Lizzio 5 senses theory presentation
Alf Lizzio 5 senses theory presentation
 
Dr Natalia Haraszkiewicz-Birkemeier_KWF Kankerbestrijding
Dr Natalia Haraszkiewicz-Birkemeier_KWF KankerbestrijdingDr Natalia Haraszkiewicz-Birkemeier_KWF Kankerbestrijding
Dr Natalia Haraszkiewicz-Birkemeier_KWF Kankerbestrijding
 
Dibujo tecnico i
Dibujo tecnico iDibujo tecnico i
Dibujo tecnico i
 

Semelhante a Bigorna - a toolkit for orthography migration challenges

Top-down and bottom-up approaches to language planning.
Top-down and bottom-up approaches to language planning. Top-down and bottom-up approaches to language planning.
Top-down and bottom-up approaches to language planning. Saad Yaseen
 
A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...
A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...
A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...Emily Smith
 
Language planning
Language planningLanguage planning
Language planningAyesha Mir
 
Virtual research project presentation 2013
Virtual research project presentation 2013Virtual research project presentation 2013
Virtual research project presentation 2013Estefii Cabrera Morales
 
From_Translation_to_Audiovisual_Translat.pdf
From_Translation_to_Audiovisual_Translat.pdfFrom_Translation_to_Audiovisual_Translat.pdf
From_Translation_to_Audiovisual_Translat.pdfCarles Uarg
 
Language Planning and Policy (LPP).pptx
Language Planning and Policy (LPP).pptxLanguage Planning and Policy (LPP).pptx
Language Planning and Policy (LPP).pptxsuhanazabri
 
358 644-1-pb
358 644-1-pb358 644-1-pb
358 644-1-pbdadacles
 
Language Planning and Policy
Language Planning and PolicyLanguage Planning and Policy
Language Planning and PolicyAssala Mihoubi
 
Langauge planning and policies
Langauge planning and policiesLangauge planning and policies
Langauge planning and policiesShehnaz Mehboob
 
Oral histories projects for l2 learners
Oral histories projects for l2 learnersOral histories projects for l2 learners
Oral histories projects for l2 learnersBrent Jones
 
Computational linguistics
Computational linguistics Computational linguistics
Computational linguistics kashmasardar
 
Language as a renewable resource: Import, dissipation, and absorption of inno...
Language as a renewable resource: Import, dissipation, and absorption of inno...Language as a renewable resource: Import, dissipation, and absorption of inno...
Language as a renewable resource: Import, dissipation, and absorption of inno...Waqas Tariq
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramAnnalina Caputo
 

Semelhante a Bigorna - a toolkit for orthography migration challenges (20)

Language planning
Language planning Language planning
Language planning
 
Final year project3
Final year project3Final year project3
Final year project3
 
Top-down and bottom-up approaches to language planning.
Top-down and bottom-up approaches to language planning. Top-down and bottom-up approaches to language planning.
Top-down and bottom-up approaches to language planning.
 
A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...
A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...
A Design Proposal Of An Online Corpus-Driven Dictionary Of Portuguese For Uni...
 
Language planning
Language planningLanguage planning
Language planning
 
Final proyect estefania_cabrera
Final proyect estefania_cabreraFinal proyect estefania_cabrera
Final proyect estefania_cabrera
 
Virtual research project presentation 2013
Virtual research project presentation 2013Virtual research project presentation 2013
Virtual research project presentation 2013
 
From_Translation_to_Audiovisual_Translat.pdf
From_Translation_to_Audiovisual_Translat.pdfFrom_Translation_to_Audiovisual_Translat.pdf
From_Translation_to_Audiovisual_Translat.pdf
 
Language Planning and Policy (LPP).pptx
Language Planning and Policy (LPP).pptxLanguage Planning and Policy (LPP).pptx
Language Planning and Policy (LPP).pptx
 
358 644-1-pb
358 644-1-pb358 644-1-pb
358 644-1-pb
 
Language
LanguageLanguage
Language
 
Pedagogical uses of translation
Pedagogical uses of translationPedagogical uses of translation
Pedagogical uses of translation
 
Language Planning and Policy
Language Planning and PolicyLanguage Planning and Policy
Language Planning and Policy
 
First Stages and challenges of LibreOffice Translation in Hausa Language
First Stages and challenges  of LibreOffice Translation  in Hausa LanguageFirst Stages and challenges  of LibreOffice Translation  in Hausa Language
First Stages and challenges of LibreOffice Translation in Hausa Language
 
Langauge planning and policies
Langauge planning and policiesLangauge planning and policies
Langauge planning and policies
 
Oral histories projects for l2 learners
Oral histories projects for l2 learnersOral histories projects for l2 learners
Oral histories projects for l2 learners
 
Corpus Linguistics
Corpus LinguisticsCorpus Linguistics
Corpus Linguistics
 
Computational linguistics
Computational linguistics Computational linguistics
Computational linguistics
 
Language as a renewable resource: Import, dissipation, and absorption of inno...
Language as a renewable resource: Import, dissipation, and absorption of inno...Language as a renewable resource: Import, dissipation, and absorption of inno...
Language as a renewable resource: Import, dissipation, and absorption of inno...
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google Ngram
 

Mais de andrefsantos

Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pintoandrefsantos
 
Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documentsandrefsantos
 
Cleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerCleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerandrefsantos
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesandrefsantos
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...andrefsantos
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment andrefsantos
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatosandrefsantos
 

Mais de andrefsantos (11)

Elasto Mania
Elasto ManiaElasto Mania
Elasto Mania
 
Building your own CPAN with Pinto
Building your own CPAN with PintoBuilding your own CPAN with Pinto
Building your own CPAN with Pinto
 
Slides
SlidesSlides
Slides
 
Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documents
 
Cleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleanerCleaning plain text books with Text::Perfide::BookCleaner
Cleaning plain text books with Text::Perfide::BookCleaner
 
Poster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challengesPoster - Bigorna, a toolkit for orthography migration challenges
Poster - Bigorna, a toolkit for orthography migration challenges
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
 
Bigorna
BigornaBigorna
Bigorna
 
Mojolicious lite
Mojolicious liteMojolicious lite
Mojolicious lite
 

Último

Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataSafe Software
 

Último (20)

Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
 

Bigorna - a toolkit for orthography migration challenges

  • 1. Bigorna – a toolkit for orthography migration challenges José João Almeida1 , André Santos1 , Alberto Simões2 1 Departamento de Informática, Universidade do Minho, Portugal 2 Escola Superior de Estudos Industriais e de Gestão, Instituto Politécnico do Porto, Portugal jj@di.uminho.pt, pg15973@alunos.uminho.pt, alberto.simoes@eu.ipp.pt Abstract Languages are born, evolve and, eventually, die. During this evolution their spelling rules (and sometimes the syntactic and semantic ones) change, putting old documents out of use. In Portugal, a pair of political agreements with Brazil forced relevant changes on the way the Portuguese language is written. In this article we will detail these two Orthographic Agreements (one in the thirties and the other more recently, in the nineties), and the challenges present on the automatic migration of old documents spelling to their actual one. We will reveal Bigorna, a toolkit for the classification of language variants, their comparison and the conversion of texts in different language versions. These tools will be explained together with examples of migration issues. As Birgorna relies on a set of conversion rules we will also discuss how to infer conversion rules from a set of documents (texts with different ages). The document concludes with a brief evaluation on the conversion and classification tool results and their relevance in the current Portuguese language scenario. 1. Introduction impact. In this article we are interested into the latest, that can be summarized as: Languages evolve. This evolution can be natural and grad- ual, when speakers change habits during decades, or can be • 1931: This was the first orthographic agreement be- forced and drastic, when some kind of regulatory institu- tween Portugal and Brazil trying to approximate both tion defines some rules (or heuristics) on how the language languages. This change abolished the silent s from orthography will change. words such as sciência, scena, scéptico. It also The case of the recent evolution on the Portuguese language changed the way compound verbs were written, from is the latest. This change was mainly a political issue to dir-se há and amar-te hei to dir-se-á and amar-te-ei. approximate the different countries that speak Portuguese. Details on the story of this evolution can be found on sec- • 1945: This new reform just hit the language in Por- tion 1.1.. tugal, cleaning a set of different accents, like remov- In any of the two presented evolution cases, documents lose ing completely the umlaut (trema, in Portuguese), and importance and relevance as soon as their orthography is removing the circumflex accent in homograph words outdated. such as, acêrto and acerto, cêrca and cerca, côr and With the latest changes on the Portuguese language, and cor, fôra and fora, dêsse and desse, and so on. the amount of information that is typical of this era, it be- came necessary to develop automatic methods for language • 1973: While not a political agreement, Portugal fol- modernization. lowed Brazil and abolished accent marks in secondary This document discusses Bigorna, a toolkit developed to stressed syllables (tètralogia and tetralogia). help on the update of documents orthography. It was devel- oped to be as language independent as possible, relying in • 1990/2009: On December the 16th of 1990 the Por- a set of rules learned by the comparison of old and actual tuguese Language Orthographic Agreement (PLOA) corpora orthographic differences. (da República, 1991) was approved in Lisbon by An- This section will present the main steps of Portuguese lan- gola, Cape Verde, Guinea-Bissau, Mozambique, Sao guage evolution, the motivation for our work, what was Tome and Principe, Brazil, Portugal and a delegation the initial state before the project beginnings, and Bigorna of galician observers. Its goal is to unify, as widely main goals. It will be followed by a section on language as possible, the general vocabulary of the Portuguese resources construction and tools development. Section 3. language. presents a brief evaluation of the developed tools. It is fol- The urge to bring closer all variants of Portuguese im- lowed by some conclusions and future work issues. plies some changes on the structure and content of the language itself. On those content changes, the pho- 1.1. Portuguese Language Evolution netic criteria (or pronunciation one) was chosen in- The Portuguese language evolved during the last century stead of the etymological one. This led to some cases through successive orthographic agreements, especially be- of multiple spelling (whenever a word is pronounced tween Portugal and Brazil. A full time-line of changes can differently according to the variant in use) like facto be consulted in (Wikipedia:Reforms, 2010). Some of the and fato, or suntuoso and sumptuoso. The lack of con- described changes are just political, some other had real sensus about whether some words are to be updated or
  • 2. not, and the absence of a common orthographic vocab- • therefore, it is important to create and maintain an Or- ulary (as foreseen in article 2 of the PLOA) gets this thographic Agreement’s Knowledge Base — OAKB, a process even more complex. table containing lemmas, changes and rules based on the currently existing lists, new examples found and While this latest political change was accepted and should the results of processing texts. This table will be up- be put in practice soon, real users of the language will con- dated and evolve (one might never know when it is tinue to write as they learned for some time, leading to a complete); double orthographic form. This means that the two forms (before-1990 and after-1990) are relevant at the moment. • the main problem is on how to determine which words are candidates to integrate the OAKB (what are the 1.2. Motivation changing words). The motivation to this work is two fold: to update old It was clear the need of a project dedicated to the migration documents in the public domain (with more than 70 years process. It was Bigorna project birth. after author dead) to current used orthographic form, and the need to update current documents to the future1 ortho- 1.4. Bigorna’s design goals graphic form, as defined by PLOA: In order to help in the migration process, it is necessary to have some resources and tools. The first set of requirements • With the actual tendency to public accessible repos- included: itories or libraries in the Internet, and as a special • collect all the resources related to PLOA that are al- case Project Gutenberg2 , a lot of books that are cur- ready available in the Internet; rently under public domain are being scanned and transcribed. While it is crucial to keep the original • create a spell-checker dictionary for the new Por- orthographic form in the archive for linguistic stud- tuguese language; ies (for example), their modernization is also relevant, • build a tool to convert text from the pre-agreement namely for reading purposes. Portuguese to the updated Portuguese language; This leads to the necessity to develop a tool to per- form the systematic migration of language from before • build a classification tool, to identify the variant of 1930 to the currently used (before 1990/2009 reform). Portuguese used in a given text; • make the previous tools work with as few dependen- • We are in the information era. A lot of different doc- cies as possible. uments are currently available and there will be the need to update them to the new orthographic form. We soon understood that the set of changing words was be- ing updated frequently as PLOA is not defined by exhaus- The PLOA implementation demands a revision and tion (it does not list all words, just defines a few heuristics). an update of the existing linguistic tools and the cre- This lead to the following decisions: ation of new ones, and the complexity of these tasks increase due to the multiple spelling cases (see sec- • center all the information about the word changes in tion 1.3.). an unique table (OAKB); • build a system, that given that table, would generate: A great help to both processes would be the creation of a set of tools capable of inferring migration rules through the – the new spell-checker dictionary; comparison of texts. Tools like those would be useful to de- – the language migration tool; velop applications that people could use to face the issues – the Portuguese language variant classifier of PLOA. While this document will mainly focus PLOA implementation, the development tools also have a role in • develop tools to help in the construction and manipu- updating old documents spelling, or in future language up- lation of OAKB: dates. – tools to find lemma changing candidates from 1.3. Summarizing texts; The current scenario in the Portuguese language, as created – a tool to extract lexical differences from pairs of by the PLOA: texts; • make the previous tools as generic and reusable as pos- • the spelling changes dictated by the agreement cannot sible. be determined automatically, as they rely on phonetic criteria and may sometimes be ambiguous (different 2. Bigorna Development accents, for instance); This section describes the Bigorna approach and its devel- opment details. We will first focus how resources were 1 We will use “current orthographic form” to refer to the ortho- compiled into OAKB and then how the spell checker dic- graphic form before 1990 agreement, and the “future form” for tionary was updated. Follows a description about the cre- the orthographic form after 1990 agreement. ated tools: conversion tools, the language classifier and the 2 http://www.gutenberg.org/ lexical comparison tools.
  • 3. 2.1. Compiling OAKB Function newdic(oakdb,dicjs) for ( x ∈ dom(oakb) To start the compilation of resources to integrate the knowl- ∧ oakb[x].type = normal edge base we started by searching for free resources related ∧ x = oakb[x].preferpt to the PLOA. We found several online dictionaries, spell- ∧ x ∈ dom(dicjs)) checkers, vocabularies and a growing number of publica- { tions about the subject. neww ← oakb[x].preferpt From the vocabularies found, one stands out, a list of dicjs[neww] ← dicjs[x] changed words, containing over 9 000 entries, published delete dicjs[x] by Instituto de Linguística Teórica e Computacional (IL- } TEC, 2008). Another list, not as complete, was compiled From the 2 600 words in OAKB, just 960 were related di- by Projecto Quintus (Quintus, 2008). rectly with a lemma in jSpell’s dictionary. Note that from Each entry on this list contained both the European and these 960 lemmas jSpell generates a total of 11 500 words. Brazilian Portuguese versions of a given word together with that word’s updated form and a brief comment indicating 2.3. Language Conversion Tools the preferential new form for each Portuguese variant. Our second task was the development of a conversion tool, All lines where there was no change accordingly to PLOA to migrate texts from the pre-agreement spelling to the new were deleted and the remaining ones were used to com- one. pile our first OAKB version. This resulted in approximately Due to the multiple spelling cases which are, moreover, de- 2 600 words. pendent on the variant of Portuguese being used, it was not After the reorganization process, the structure of OAKB is: possible to create a single conversion tool. Therefore, we decided to create two different tools: one capable of up- oakb = entry* dating European Portuguese (pt2ptao) and one other ca- pable of updating Brazilian Portuguese (br2brao). Note entry = pt_pt : word that these tools convert the original variant to the after- pt_br : word agreement variant. pt_oa : word * Once again, we started with OAKB and jSpell as our base preferencial_pt : word tools. Typical entries in the jSpell’s dictionary include the preferencial_br : word lemma, its morphologic properties, and the derivation rules type : Capit | Hyphen | Accent | Normal valid for that lemma: acalentar/#vt/XYPLD/ The field type (of change) was used in order to make pos- coiote/#nm/p/ sible to define different ways of dealing with different sit- laico/#a/fidp/ uations (remotion of capitalization depends on context, re- zinco/#nm// motion of accent marks depends not). Presently we only differentiate the hyphenation changes, but in the future the OAKB was used to extract a list of the changing lem- other types will be discriminated as well. mas. This list includes the European Portuguese lem- mas and their updated form (lemma_pt/lemma_ptao), 2.2. Updating the dictionary vocabulary and from jSpell dictionary were extracted the deriva- To prepare a spellchecker dictionary we decided to use tion rules valid for that same Portuguese lemma jSpell, an open-source morphological analyzer (Almeida (lemma_pt/rules). and Pinto, 1995; Simões and Almeida, 2002). jSpell has By expanding each lemma_pt and each lemma_ptao in the advantage that its updates can be easily propagated (us- parallel (using the corresponding jSpell rules) we obtained ing Chuveiro de Dicionários (dos Santos Vilela, 2009)) a new table where each entry is structured as follows: to dictionaries for other spell-checkers engines (aspell, word_pt/word_ptao ispell, hunspell and myspell), some of which are used by well known open source applications like Firefox, As we did not have the derivation rules for Brazilian Por- Thunderbird or OpenOffice. tuguese words (because jSpell’s Portuguese dictionary is JSpell also has the ability to, when provided with a list of European Portuguese specific), we decided to use the rules lemmas and their derivation rules, generate an exhaustive present on OAKB to derivate the corresponding Brazilian list of all derived words. The relevance of this feature is that Portuguese words, and their updated form. Then, by a pro- the language updating can affect just the lemma and/or the cess similar to the above, we obtained a table with the fol- generation rules, and a set of words will be automatically lowing structure: fixed. Also, the generated lists can be compared with the word_br/word_brao entries into OAKB easily. The update of the dictionary was performed by searching The conversion tool was developed in Perl and we tried the words from OAKB in jSpell’s European Portuguese dic- to minimize its dependencies to make it easier to install tionary and replacing them by their updated version. When and use. For the very same reason, we decided to include there was more than one possible form, we chose the ver- the word table itself in the script file, resulting in a self- sion recommended for European Portuguese. contained single file.
  • 4. The tool performs the conversion by sweeping a given text This tool was tested with two variants from the book Amor looking for word_pt words and replacing them by the de Perdição, from the Portuguese author Camilo Castelo- matching word_ptao (word_br and word_brao in Branco, one written in European Portuguese and the other the Brazilian version). We included some additional op- in Brazilian Portuguese. The results were the expected. tions which protect XML tags and LTEX commands within A the text. $ whichPT AmorPerd.ptPT AmorPerd.ptBR Follows some examples of the tools interaction. AmorPerd.ptPT pt AmorPerd.ptBR br $ pt2ptao A adopção do acordo implica a actualização 2.5. Lexical Comparison Tools de ferramentas. The lists of words compiled above were based on previous ↓ existing lists. But there are other situations where there are A adoção do acordo implica a atualização no available lists of words. Although there are no lists, there de ferramentas. are documents with different orthographic versions. $ br2brao With that in mind, we built a tool to help on comparing two Ele fez um vôo rasante sobre a aréia. versions of a text and detect (linguistic) differences between ↓ two versions of a text with different spelling. Ele fez um voo rasante sobre a areia. This script, named lexdiff, receives two files and com- pares them with the help of the Unix commands egrep The different character encoding systems were the source and diff. of several problems and the target of a special care during The result of lexdiff applied to previously mentioned this project. Both versions of the conversion tool, as well versions of Amor de Perdição is: as the other programs developed by Bigorna, accept both ISO-8859-1 and UTF-8 encoded text. $ lexdiff -ac AmPerd.ptBR AmPerd.ptPT 32 acadêmico => académico 2.4. Variant Classifier 14 idéia => ideia A relevant goal on Bigorna is the identification of the Por- 12 redargüiu => redarguiu tuguese variant present on some text. This feature is crucial 7 gênio => génio 4 refletiu => reflectiu so we can use one of the tools presented above automati- ... cally without human intervention. Another tool, named whichPT, was developed to identify For each changed word lexdiff outputs the number of whether the European or the Brazilian Portuguese variant is times that change ocurred on the given texts. present on a document. The script is also able to calculate narrower differences. The tool incorporates a list of (just) European Portuguese This option allows us to detect changing patterns (se- words and another of (just) Brazilian Portuguese ones, quences of characters that get usually updated), indepen- compiled with a subset of the conversion tools lists. dently of the word. Below is the result of this narrower approach to the same pair of texts: Function calc_whichpt_lsts(dicpt,dicbr,oakb) for ( x ∈ dom(oakb) $ lexdiff -m -ac AmPerd.ptBR AmPerd.ptPT ∧ oakb[x].type = (normal or accent) 36 et => ect ∧ oakb[x].pt_pt = oakb[x].pt_br 18 déi => dei ∧ oakb[x].pt_pt ∈ dom(dicpt) 17 güi => gui ∧ oakb[x].pt_br ∈ dom(dicbr)) 8 at => act { 7 eç => ecç wpt ← oakb[x].pt_pt ... wbr ← oakb[x].pt_br justpt ← justpt ∪ {x ∈ deriv(wpt,dicpt)| x ∈ dicbr} / The option shown above creates a confusion matrix (CM) justbr ← justbr ∪ which relates each word with all the possible words it can {x ∈ deriv(wbr,dicbr)| x ∈ dicpt} / be. Follows an extract of the CM created in the previous } example: The classification is performed by counting the number of et => { ect => 36 }, words from the text that have a match in each of the lists; at déi => { dei => 18 }, güi => { gui => 17 }, the end, the result is the variant with the highest number of at => { aat => 1, matches. apt => 1, This implementation allows to easily expand the script to act => 8 }, other variants (for example, ancient Portuguese). It is also eç => { eaç => 2, possible to perform the classification ignoring the words in- ecç => 7, side XML tags or LTEX commands, to print at the end the A epç => 2 }, exact number of matches in each variant, or to output all the words matching each variant. There are many uses for lexdiff:
  • 5. • it can be used to search for words that could be in- NrWords pt2ptao Priberam PE PdLP serted into OAKB and, consequently, in the updated Total 69 80 78 48 dictionary, the converters and the classifier, or to see Different 49 60 58 33 how similar two given texts are; Precision 1.0 1.0 1.0 1.0 Recall 0.80 0.98 0.95 0.54 • it can even be used as a spell-checker, although its main application will be the generation of tools like Table 2: Comparison of the number of changed words using the ones previously described in this article. the four different conversion systems. At the moment we are developing a lexpatch tool that, given a changes file produced by lexdiff can update After some analysis of the words not updated by pt2ptao texts according to those changes. we noted that it is not yet dealing with the task of lowercas- The combined action of both of these tools will allow the ing the months and days of the week. automatic generation of converters after a learning process We tried a similar approach to evaluate the Brazilian Por- made from manual adaptations of texts. tuguese version of this tool (br2brao). The systems chosen for the comparison were Priberam (they also have 3. Evaluation a conversion tool for Brazilian Portuguese) and Interney7 In order to evaluate the tools previously described, we used conversion tools. different strategies and resources. We tested these systems with a Brazilian Portuguese ver- In this section we will discuss the process used and the re- sion of “Amor de Perdição”, but we could not interpret the sults obtained in the evaluation of pt2ptao, br2brao results; this task must be performed by someone more ex- and whichPT. perienced with Brazilian Portuguese, as it requires a native- Given that lexdiff is not dependent of specific language level knowledge of the language which none of the authors resources, its evaluation deals with a different problem: if possesses. the algorithm is correct, if its results are useful, and if it During this process of evaluation we found some complex can be used for other relevant tasks. While lexdiff was aspects that need a careful approach: used in the tests descried here, we will not deal with its evaluation in this document. • One may not want to convert words which are part of composite proper names (names of streets, family 3.1. Evaluating the Conversion Tools names, etc); In order to evaluate both versions of the conversion tool we decided to test other similar tools in order to make a • The remotion of capitalization should not be applied common evaluation and compare the results obtained using to words in the beginning of sentences; the different systems. The systems compared to evaluate the European Portuguese • The PLOA document is not clear on how/when to ap- version (pt2ptao) were Priberam3 , Porto Editora4 (PE) ply the hyphenation rules. and Portal da Língua Portuguesa5 (PdLP) conversion tools. Results of the ortographic updating of “Amor de Perdição” Therefore, a conversion tool needs to be able to identify are summarized in Table 1 and Table 2. phrases and entities in the text to perform accurate conver- Given the lack of work power to select manually the chang- sions. ing words from the list of words present on the book, we This process of comparative evaluation, which was made compiled the set of words correctly changed in the four sys- possible with the cooperation of the teams responsible by tems and used this set to compute the recall measure6 . the other conversion tools, has already lead to some im- provements in some of these tools. NrWords Amor de Perdição Changed Total 48 011 81 3.2. Evaluating the Language Classifier Different 8 717 61 In order to evaluate the classifier (whichPT) we used a set Table 1: Comparison of the number of words in Amor de of 100 books in different Portuguese dialects (pt-pt, pt-br). Perdição and the number of changed words. The classifier gave the correct answer in all the cases. One of the books was difficult to evaluate as it had a mixure of both dialects. 3 http://www.priberam.pt/ We then decided to stop the classification after obtaining a 4 http://portoeditora.pt/ certain number of language clues (20). With this limitation 5 we obtained 2 ambigous books. http://www.portaldalinguaportuguesa.org/ Due to timing reasons, Portal da Lingua Portuguesa results were This evaluation was also relevant as, after analyzing the de- gathered using the online version of their conversion tool, which tected clues, a type was found on our knowledge base. Cor- may have affected their results. recting it the classified answered correctly for all books. 6 The real number of words to be changed should be higher (probably not all the words were detected by the four systems), 7 what would make the recall value lower. http://www.interney.net/
  • 6. 4. Conclusions Projecto Quintus. 2008. Lista total (?) de palavras We are now replicating the updates on jSpell’s dictionary to da norma culta lusoafricana que são modificadas pelo other spell-checkers, using Chuveiro de Dicionários. The acordo ortográfico de 1990. Inclui lista de palavras que resulting dictionaries will be included as the official dictio- mudam, Projecto Quintus. nary for Firefox, Thunderbird and OpenOffice in the near Alberto M. Simões and J.J. Almeida. 2002. Jspell.pm – future. um módulo de análise morfológica para uso em proces- Tools construction was only possible after the construction samento de linguagem natural. In Actas do XVII Encon- of OAKB. For this purpose the resources already available tro da Associação Portuguesa de Linguística, pages 485– on the Internet were crucial. We would not have resources 495, Lisboa 2001. for the hand construction of this list. Alberto Simões and Rita Farinha. 2010. Dicionário With a proper knowledge base the identification of variant Aberto: Um novo recurso para PLN. Vice-Versa, (16), is simple and comparable to common language identifica- April. forthcomming. tion tasks. Wikipedia:Reforms. 2010. Reforms of por- The conversion tool requires, as already explained, some tuguese orthography. Wikipedia article, http: more linguistic knowledge that is not yet present on our //en.wikipedia.org/wiki/Reforms_of_ tools, like the mentioned entity recognition. Nevertheless, Portuguese_orthography. [Online; accessed current approach is already performing comparatively well. 10-March-2010]. The migration tools are being tested and should be imple- mented as web-services and released as stand-alone tools. The lexical difference tool is already being used for other projects like Dicionário-Aberto (Simões and Farinha, 2010), where a definitions dictionary from 1913 was tran- scribed and is being migrated to the actual orthography. While the full orthographic actualization is almost impossi- ble to perform in an automatic way (mostly given the abun- dance of different words present on a dictionary), current experiments show that it is possible to update 89.39% of the words automatically. 5. Future work The converters will be tested and improved with the help of lexdiff and lexpatch. The lexdiff script and subsequent work could be used to improve the tools built in the previous stages of this project (for example, new converters can be developed merely by the analysis of handmade transcripts). 6. Acknowledgments We would like to thank both the Priberam and the Porto Editora teams, who applied their commercial tools on our test cases for evaluation purposes. 7. References J.J. Almeida and Ulisses Pinto. 1995. Jspell – um módulo para análise léxica genérica de linguagem natural. In Ac- tas do X Encontro da Associação Portuguesa de Linguís- tica, pages 1–15, Évora 1994. Diário da República. 1991. Acordo ortográfico da língua portuguesa, 1990. Technical Re- port 193, série I-A, 23 de Agosto. http: //www.portaldalinguaportuguesa.org/ index.php?action=acordo&version=1990. Rui Miguel Rodrigues dos Santos Vilela. 2009. Geração de dicionários para correcção ortográfica do português. Master’s thesis, Escola de Engenharia, Universidade do Minho, Braga, Outubro. ILTEC. 2008. Vocabulário de mudança. Inclui listas de palavras que mudam, Portal da Língua Porguesa. http://www.portaldalinguaportuguesa. org/index.php?action=novoacordo.