Apertium: a unique free/open-source MT system for related languages [but not only]

Presentation for the talk given at LocWorld34 Barcelona by Mikel L. Forcada and Gema Ramírez-Sánchez

  1. 1. #LocWorld34 Apertium: a Unique Free/Open-Source MT System for Related Languages [but not only] Gema Ramírez-Sánchez (1), Mikel L. Forcada (1, 2). (1) Prompsit Language Engineering, Elx, Spain. (2) Universitat d’Alacant, Alacant, Spain
  2. 2. #LocWorld34 Outline ● Apertium components ● Ready-to-use Apertium products ● Machine translation — but not only! ● Licensing — free/open-source ● The Apertium community ● Research and business with Apertium ● Languages and language pairs ● Success cases ● Funding
  3. 3. #LocWorld34 Apertium components Since 2005, Apertium has provided the three key components of machine translation: ● An engine ● Data ● Tools
  4. 4. #LocWorld34 Apertium components: the engine /1 ● A fast, free/open-source, modular, shallow-transfer, language-independent machine translation engine with: ○ text format management, ○ translation memory querying, ○ finite-state lexical processing, ○ statistical and constraint-based lexical disambiguation, and ○ shallow structural transfer based on finite-state pattern matching
  5. 5. #LocWorld34 Apertium components: the engine /2 ● Most of the engine was developed inside the Apertium project but some external technologies are used: ○ Helsinki Finite-State toolkit (for some morphologically-rich languages), ○ VISL CG-3 (constraint grammars for rule-based lexical disambiguation).
  6. 6. #LocWorld34 Apertium components: the data ● Free/open-source language data in well-specified XML formats for a variety of languages and language pairs.
  7. 7. #LocWorld34 Apertium components: the data. A typical language pair
     Language pair organization:
     ● 2 monolingual packages (A, B), each with:
       ▪ 1 monolingual dictionary (monodix)
       ▪ 1 tagset + probabilities
       ▪ 1 plain/tagged corpus
       ▪ 1 postgeneration “dictionary”
     ● 1 bilingual package (A–B), with:
       ▪ 1 bilingual dictionary (bidix)
       ▪ 2 sets of structural transfer (grammar) rules (levels 1–3)
     Format: typically XML-based (sometimes text-based) files
     Sizes:
     ● Monodixes: 10k–90k lemmata; 100k–23M surface forms; 85–97% coverage
     ● Bidixes: 8k–90k bilingual lemma correspondences
     ● Rules: 100 (one level) to 300 (three levels) per translation direction
  8. 8. #LocWorld34 Apertium components: the tools ● Free/open-source tools: ○ compilers to turn linguistic data into a fast and compact form used by the engine and ○ software to learn disambiguation or translation rules from corpora.
  9. 9. #LocWorld34 Ready-to-use Apertium products ● A stand-alone Java application for the desktop: apertium-caffeine. ● An Android version for handhelds. ● A stand-alone version (Apertium Simpleton) for Windows and MacOS. ● Plug-ins and support for CAT platforms: OmegaT, MateCat, MemoQ, Trados Studio. ● Available as a PPA repository for GNU/Linux users.
  10. 10. #LocWorld34 Apertium extras: mobile app Full offline mode! Over 60 translation directions! On Android!
  11. 11. #LocWorld34 No need to install: web access www.apertium.org
  12. 12. #LocWorld34 No need to install: web access www.apertium.org ● Text box: short plain texts ● Document translation: ○ plain text ○ HTML, XML (.xliff) ○ OpenDocument (.odt, .odp, .ods) ○ Office “-x” formats: .docx, .xlsx, .pptx ○ LaTeX ● A nice feature: with/without marks for unknown words
  13. 13. #LocWorld34 No need to install: web/API access ● Other portals with all Apertium languages: ○ Prompsit’s portal: + TMX + navigate&translate ○ iTranslate4.eu portal: multiengine ● Other portals with some Apertium languages: ○ UOC, UPV, UA (+ TMX + terminology support + more formats) ○ GiellaTekno portal ○ etc. ● Also API access and connectors to translation tools are marketed
  14. 14. #LocWorld34 Machine translation — but not only! /1
  15. 15. #LocWorld34 Machine translation — but not only! /2 [Diagram: the Apertium pipeline. Modules (together, full MT): morphological analyser, PoS tagger, lexical transfer, structural transfer, morphological generator, post-generator. Linguistic data compiled by the tools: monodix, tagset + probabilities, bidix, rules, monodix, post-dix.]
  16. 16. #LocWorld34 Machine translation — but not only! /3 ● Apertium is a rule-based machine translation system but the pipeline contains many monolingual modules that can be used for other human-language technology tasks (such as anonymization or factored output) ● Most modules are based on finite-state technology; HMMs are used for part-of-speech tagging and an interpreted language is used to write structural transfer rules.
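The modular pipeline described above can be sketched as a chain of small functions. This is only a toy illustration in Python, not Apertium code: the dictionaries, tag names and the single reordering rule are all hypothetical, the input has no ambiguity (so the PoS tagging stage is omitted), and real Apertium modules are compiled finite-state transducers driven by XML data.

```python
from typing import NamedTuple


class LexicalUnit(NamedTuple):
    lemma: str
    tags: tuple  # first tag is the category, e.g. ("n", "f", "pl")


# Hypothetical toy data, standing in for a monodix, a bidix and a generator.
MONODIX_EN = {
    "the": LexicalUnit("the", ("det",)),
    "red": LexicalUnit("red", ("adj",)),
    "house": LexicalUnit("house", ("n", "sg")),
    "houses": LexicalUnit("house", ("n", "pl")),
}
BIDIX_EN_ES = {
    "the": LexicalUnit("el", ("det",)),
    "red": LexicalUnit("rojo", ("adj",)),
    "house": LexicalUnit("casa", ("n", "f")),  # the bidix adds target gender
}
GENERATOR_ES = {
    ("el", ("det", "f", "sg")): "la",
    ("el", ("det", "f", "pl")): "las",
    ("rojo", ("adj", "f", "sg")): "roja",
    ("rojo", ("adj", "f", "pl")): "rojas",
    ("casa", ("n", "f", "sg")): "casa",
    ("casa", ("n", "f", "pl")): "casas",
}


def analyse(text):
    """Morphological analysis: surface form -> lexical unit (KeyError if unknown)."""
    return [MONODIX_EN[word] for word in text.lower().split()]


def lexical_transfer(units):
    """Replace each source lemma by its target lemma, keeping inflection tags."""
    out = []
    for unit in units:
        target = BIDIX_EN_ES[unit.lemma]
        out.append(LexicalUnit(target.lemma, target.tags + unit.tags[1:]))
    return out


def structural_transfer(units):
    """One shallow rule: det adj n -> det n adj, with gender/number agreement."""
    out = list(units)
    for i in range(len(out) - 2):
        det, adj, noun = out[i:i + 3]
        if (det.tags[0], adj.tags[0], noun.tags[0]) == ("det", "adj", "n"):
            agreement = noun.tags[1:]  # (gender, number) copied from the noun
            out[i:i + 3] = [
                LexicalUnit(det.lemma, ("det",) + agreement),
                noun,
                LexicalUnit(adj.lemma, ("adj",) + agreement),
            ]
    return out


def generate(units):
    """Morphological generation: lexical unit -> target surface form."""
    return " ".join(GENERATOR_ES[(u.lemma, u.tags)] for u in units)


def translate(text):
    return generate(structural_transfer(lexical_transfer(analyse(text))))
```

Because each stage only consumes the previous stage's output, a single stage such as `analyse` can be reused on its own for monolingual tasks, which is the point the slide makes.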
  17. 17. #LocWorld34 Licensing: free/open-source /1 Apertium language data and code are both licensed under the GNU General Public License: ● a free/open-source license allowing free distribution of unmodified and modified versions ● a copylefted license: it avoids private appropriation and encourages giving improvements back to the project (it creates a software commons).
  18. 18. #LocWorld34 Licensing: free/open-source /2 ● The free/open-source model creates a community which effectively connects researchers, developers, vendors, and users in a continuum.
  19. 19. #LocWorld34 The Apertium community ● Very active group of hundreds of developers ● Contributions to Apertium at SourceForge ● Wiki documentation (wiki.apertium.org) ● Easy entry: Apertium linguistic modelling is simple, no need to program. ● IRC channel #apertium on freenode.net ● Mailing lists: apertium-stuff@lists.sf.net and other lists
  20. 20. #LocWorld34 The Apertium community [A search for Apertium faces in Google Images]
  21. 21. #LocWorld34 The Apertium community [A search for Apertium faces in Google Images] This is Francis Tyers (spectie)!
  22. 22. #LocWorld34 The Apertium community Community on SourceForge (May 2017) ● Contributors: 7 admins, 428 developers ● Contributions: +10k from May ‘16 to May ‘17; +78k commits altogether
  23. 23. #LocWorld34 The Apertium community: activities ● President and project management committee election according to bylaws ● Support: mail, chat, online meetings ● Maintenance: pairs, web, mobile app ● Manuals & documentation: wiki, manuals, how-to’s, training materials ● Organization of Google Summer of Code and Google Code-In activity ● Outreach activities: conferences, workshops ● Language-related groups
  24. 24. #LocWorld34 Research and business with Apertium Apertium is already an active research and business platform: ● Research: 40+ publications, 2 PhD theses, 4 master's theses. ● Business: companies (Prompsit, Eleka, Imaxin Software, etc.) offering services to customers such as Autodesk, Adobe, the Government of Catalonia, 2 daily newspapers in Spain, freelancers and LSPs
  25. 25. #LocWorld34 Languages and language pairs /1 ● Language data is encoded mostly in XML, but some language pairs contain data encoded in other text-based formats. ● There are currently more than 40 stable language pairs (bilingual data).
  26. 26. #LocWorld34 Languages and language pairs /2
  27. 27. #LocWorld34 Languages and language pairs /3
  28. 28. #LocWorld34 Languages and language pairs /4
      Year | Milestone | Language pairs
      2004 | The Spanish Ministry of Industry funds a consortium to build FOSS MT for the languages of Spain | ----
      2005 | Apertium RBMT platform is launched, providing engine, tools and data under free licenses | 3 pairs: es–ca, es–gl and es–pt
      2005-2009 | Language pair-driven innovation, still very European-focused language pairs | +19: fr, en, eo, ro, eu, oc, cy, nn, nb, sv, da, is, mk, bg, ast, br
      2010 | Five years on! | 22 pairs!!!
      2011-2015 | Consolidated community, support for non-European languages, new tools and reorganisation of data | +19: af, nl, hr, sr, mt, sl, arg, sme, urd, hin, kaz, tat, id, ms, ar
      2017 | Twelve years on! | 43 pairs!!!
  29. 29. #LocWorld34 Apertium loves small languages ● Breton→French ● Aragonese↔Spanish/Catalan ● Occitan↔Catalan/Spanish ● Italian→Sardinian ● North Sámi↔Norwegian ● Icelandic↔Swedish ● Spanish→Spanish Sign Language
  30. 30. #LocWorld34 Language pairs with approx. 95% text coverage
      Language   | Lemmata | Inflection models | Surface forms
      HBS        | 97,445  | 1,429             | 23,348,650
      English    | 60,543  | 312               | 108,119
      Spanish    | 46,003  | 442               | 4,737,777
      Catalan    | 41,116  | 559               | 7,088,585
      Galician   | 29,818  | 333               | 14,247,591
      Asturian   | 46,550  | 443               | 18,541,752
      Occitan    | 21,602  | 527               | 6,084,575
      Aragonese  | 26,068  | 544               | 12,870,976
      Portuguese | 14,436  | 316               | 10,514,672
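A coverage figure such as the 95% above can be read as the share of running words in a text that the dictionary recognises. The slides do not say how the figure was measured, so the following Python sketch rests on an assumption: coverage is counted per token against a set of known surface forms. The function name and the sample data are hypothetical.

```python
def naive_coverage(tokens, known_forms):
    """Share of tokens whose lower-cased surface form is in the dictionary.

    Assumption: coverage is counted over running words (tokens), not over
    distinct word types, and without any morphological guessing.
    """
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return known / len(tokens)


# Hypothetical toy data: a five-token text and a three-form dictionary.
sample_tokens = "The cat saw the zyzzyva".split()
sample_dictionary = {"the", "cat", "saw"}
coverage = naive_coverage(sample_tokens, sample_dictionary)
```

Counting per token means frequent words weigh more, which is why a dictionary with a few tens of thousands of lemmata can still cover most of a running text.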
  31. 31. #LocWorld34 Apertium language-pair life cycles ● For new pairs: ○ resource compilation ○ basic system creation (85% coverage, most frequent structural phenomena) ○ evaluation ○ typically takes 3–6 months ● For existing pairs: ○ testing, enhancement, evaluation ○ typically takes 1–3 months
  32. 32. #LocWorld34 Performance of a related-language pair: apertium-es-pt From Masselot et al., 2010 (Using the Apertium Spanish–Brazilian Portuguese MT system for localization): ● Post-editing effort (word error rate): 20% ● Post-editing speed: average 4,500 words/day Updated 2017 (also for software localisation): ● Post-editing effort (word error rate): 14% ● Post-editing speed: average 6,500 words/day
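Word error rate, used above as the measure of post-editing effort, is commonly computed as the word-level edit distance between the MT output and the post-edited text, divided by the length of the post-edited text. The Python sketch below assumes that definition and plain whitespace tokenisation. The slides do not specify either, so the exact figures reported there may rest on different choices.

```python
def word_error_rate(mt_output, post_edited):
    """Word-level Levenshtein distance divided by the post-edited length.

    Assumptions: whitespace tokenisation, case-sensitive comparison, and
    equal cost (1) for insertions, deletions and substitutions.
    """
    hypothesis = mt_output.split()
    reference = post_edited.split()
    if not reference:
        raise ValueError("post-edited text must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)
```

Under this definition, a WER of 14% means that roughly one word in seven of the final text had to be inserted, deleted or replaced by the post-editor.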
  33. 33. #LocWorld34 Related language-pair post-editing experience /1
      Original Spanish | MT output | Portuguese final
      Completa documentación 2D. | Completa documentação 2D. | Documentação 2D abrangente.
  34. 34. #LocWorld34 Related language-pair post-editing experience /2
      Original Spanish | Apertium output | Portuguese final
      Cree documentación y dibujos 2D con un completo conjunto de herramientas de dibujo, edición y anotación. | Crê documentação e desenhos 2D com um completo conjunto de ferramentas de desenho, edição e anotação. | Produza desenhos e documentação 2D com um conjunto abrangente de ferramentas de desenho, edição e anotação.
      Apertium output for closely-related languages is: ● Easy and fast to post-edit ● Rather mechanical, but reliable ● Predictable
  35. 35. #LocWorld34 Some success cases /1 Nearby LocWorld Barcelona... ● Apertium makes two daily newspapers bilingual: Levante (Catalan) and La Voz de Galicia (Galician). ● Universities in the Catalan-speaking area use Apertium to help generate courseware and academic information. ● Apertium is used in PLATA, the Spanish government platform for webpage translation.
  36. 36. #LocWorld34 Some success cases /2 Also by-products: ● Same-language machine translation for local flavours/flavors, based on Apertium: AltLang.net ○ available for English, Spanish, French and Portuguese varieties ○ performs spelling, lexical, grammar and style changes
  37. 37. #LocWorld34 Some other success cases /3 In Wikimedia Content Translation, Apertium translates Wikipedia content
  38. 38. #LocWorld34 Wikimedia Content Translation into Norwegian Nynorsk [Figure annotation: start of a co-funded project on MT for Scandinavian languages, including community outreach.] Most of the translations are from Norwegian Bokmål; 85% are done using Apertium.
  39. 39. #LocWorld34 Before Content Translation: main use for Bokmål–Nynorsk was “homework”
  40. 40. #LocWorld34 Other success cases involving interaction with other communities ● Translators Without Borders develop crisis-specific, portable machine translation from English to Kurdish languages (Kurmanji, Sorani) on Apertium. ● Apertium and language experts help promote a unified standard for Occitan by defining and selecting it for Spanish→Occitan and Catalan→Occitan MT.
  41. 41. #LocWorld34 Funding /1 ● The Ministry of Industry, Tourism and Commerce of Spain (also, the Ministries of Education and Science and of Science and Technology of Spain) ● The Secretariat for Technology and the Information Society of the Government of Catalonia ● The European Commission (DGT training and Abu-Matran project) ● The Ministry of Foreign Affairs of Romania
  42. 42. #LocWorld34 Funding /2 ● Universitat d'Alacant and Universitat Oberta de Catalunya ● Ofis Publik ar Brezhoneg (Breton Language Board) ● Ministry of Education and Science of the Republic of Kazakhstan ● Google Summer of Code scholarships (2009–2014, 2016, 2017) and Google Code-In donations (2010–2016). ● And many other private companies
  43. 43. #LocWorld34 ● If you want to build, integrate, or customize fast, reliable, predictable machine translation for your application. ● If you’d rather understand application-oriented dictionaries and rules than deal with the “magic” of embeddings, decoders, phrase tables, convolutions, or probabilities. ● If there’s no way you can amass and curate millions of translated words to train a system for your language or application. Then come and talk to us (we are at booth 121). You can be part of it!
  44. 44. #LocWorld34 Sharing © 2017 Mikel L. Forcada and Gema Ramírez-Sánchez. This work may be distributed under the terms of either of these two licenses: ● Creative Commons Attribution–Share Alike: http://creativecommons.org/licenses/by-sa/3.0/deed.en ● GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html