Accentuate Us!
Kevin Scannell and Michael Schade
November 10, 2010
Language Death
About 7000 languages spoken in the world
More than 90% are expected to disappear before 2100
Every language is a repository of the culture, traditions, and
world view of its community
The death of a language is an irrevocable loss, comparable
to the extinction of a plant or animal species
Many endangered language communities are looking to the
internet and technology to help revitalize their language
Endangered Language Technology
Only about 50 languages (0.71%) have fully-localized
desktop computer systems
Firefox 3.5 available in 68 languages (0.97%)
Spellcheckers for 117 languages (1.67%)
For this talk, we're looking at something even more basic:
keyboard input
Keyboard Input
The majority of languages are oral - no written tradition
Good news: among those that have writing systems, almost
all scripts are available in Unicode (but not Maldivian,
Khamti, ...)
Yet even Unicode-encoded languages often lack
appropriate input methods, or free fonts
When electronic texts do exist, they are often entered as
plain ASCII, either by transliteration (Cherokee, ᎦᎸᏉᏗᏳ →
galvquodiyu), omitting diacritics (Lingala, likɔngá → likonga),
or ad hoc approaches (Irish, béal → be/al)
Omitted diacritics matter! They lead to ambiguities and
misunderstandings (Irish leite vs. léite).
Diacritic Restoration
Software that takes plain ASCII text in some language as
input, and outputs the text with all diacritics or extended
characters in place
Examples
Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan →
Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan
Uwe setin suen gha gu emwa ni sike rue ghae emwi esi ne uwe rue →
Uwẹ sẹtin suẹn gha gu emwa ni sikẹ ruẹ ghae emwi esi ne uwẹ ruẹ
Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao →
Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo
Tout moun gen dwa a libete lide yo ak lapawol yo →
Tout moun gen dwa a libète lide yo ak lapawòl yo
Moi nguoi deu co quyen tu do ngon luan va bay to quan diem →
Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm
Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade →
Ẹnì kọ̀ọ̀kan ló ní ẹ̀tọ́ sí òmì nira láti ní ìmọ̀ràn tí ó wù ú, kí ó sì sọ irú ìmọ̀ràn bẹ́ẹ̀ jáde
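The examples above run from ASCII to the accented form. The opposite direction is easy to sketch, and it shows why restoration is ambiguous: several accented spellings can collapse to the same ASCII string. The following is a minimal Python sketch, not the authors' code. The function name `asciify` and the `FALLBACK` table are assumptions made for illustration, and the table covers only letters that appear in the examples on these slides.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit fallback.
# Hypothetical, illustrative table: it only covers characters that
# appear in the slide examples (Faroese, Vietnamese, Lingala, Hawaiian).
FALLBACK = {
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "ð": "d",
    "đ": "d",
    "ɔ": "o",
    "\u02bb": "",  # U+02BB, the Hawaiian ʻokina, is simply dropped
}


def asciify(text):
    """Strip diacritics: decompose, drop combining marks, map leftovers."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # accent, dot below, horn, macron, ...
        out.append(FALLBACK.get(ch, ch))
    return "".join(out)


if __name__ == "__main__":
    print(asciify("Øll skulu vera fræls at hava sínar áskoðanir"))
    # Two distinct Irish words collapse to the same ASCII form.
    print(asciify("leite") == asciify("léite"))
```

Running a function like this over text that already has its diacritics gives pairs of (ASCII, accented) strings, which is one plausible way to prepare data for a restoration model.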
Statistical Machine Learning
Given an ASCII input, every character that allows a diacritic
or an extended form represents a "classification problem"
We use a machine learning approach; the program learns
where the diacritics belong by gathering statistics from a
"training corpus" of texts with the diacritics in place
Remembers words seen in training data; uses statistics on
co-occurring words to deal with ambiguous cases (Irish "leite"
vs. "léite" or even English "resume" vs. "résumé")
For never-before-seen words, uses statistics of 3-character
sequences in a neighborhood of the character in question
(French initial "cera" vs. "cerc", "cabl" vs. "cabo"). This is
the generic case for under-resourced languages
Training texts crawled from the web; 114 in all!
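The two levels described on this slide (remembered words first, character context for unseen words) can be sketched in a few lines of Python. This is a simplified illustration, not the Accentuate.us implementation: the class name `Restorer` and its methods are hypothetical, ambiguous words are resolved by picking the most frequent form rather than by co-occurring words, and the character model looks only at the trigram centered on each letter.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_marks(text):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


class Restorer:
    """Hypothetical two-level restorer: word lookup, then trigram fallback."""

    def __init__(self):
        # plain word -> counts of accented spellings seen in training
        self.word_forms = defaultdict(Counter)
        # plain trigram around a letter -> counts of the accented letter
        self.char_forms = defaultdict(Counter)

    def train(self, corpus):
        for word in corpus.split():
            word = unicodedata.normalize("NFC", word)
            plain = strip_marks(word)
            self.word_forms[plain][word] += 1
            # The character model assumes one output letter per input letter.
            if len(plain) != len(word):
                continue
            padded = "^" + plain + "$"  # mark word boundaries
            for i, ch in enumerate(word):
                self.char_forms[padded[i:i + 3]][ch] += 1

    def restore_word(self, plain):
        if plain in self.word_forms:
            # Simplification: most frequent form, no context words.
            return self.word_forms[plain].most_common(1)[0][0]
        padded = "^" + plain + "$"
        out = []
        for i, ch in enumerate(plain):
            seen = self.char_forms.get(padded[i:i + 3])
            out.append(seen.most_common(1)[0][0] if seen else ch)
        return "".join(out)

    def restore(self, text):
        return " ".join(self.restore_word(w) for w in text.split())


if __name__ == "__main__":
    r = Restorer()
    r.train("Is í an Ghaeilge an chéad teanga oifigiúil")
    print(r.restore("Is i an Ghaeilge an chead teanga oifigiuil"))
```

Each letter position is treated as its own small classification problem, as the slide describes; a real system would need far more training text and a better way to handle ties and context.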
API
Protocol: JSON
Calls
   langs
   lift
   feedback
Sample Call
   { "call": "charlifter.lift"
     , "lang": "ht"
     , "text": "Bon, la fe sa apre demen pito, le la we mwen andey."
     , "locale": "ht"
   }
Full documentation at http://accentuate.us/api
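As a rough illustration, the sample call can be assembled in Python with the standard `json` module. The function name `build_lift_request` is an assumption made for this sketch; only the four field names and the `charlifter.lift` call name come from the slide. Sending the request is left out because the slides give neither the endpoint path nor the shape of the response.

```python
import json


def build_lift_request(lang, text, locale):
    """Serialize a charlifter.lift call as compact UTF-8 JSON."""
    payload = {
        "call": "charlifter.lift",
        "lang": lang,      # language of the text to restore
        "text": text,      # plain ASCII input
        "locale": locale,  # locale field as shown in the sample call
    }
    # Compact separators reproduce the single-line form sent on the wire.
    return json.dumps(
        payload, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")


if __name__ == "__main__":
    body = build_lift_request(
        "ht", "Bon, la fe sa apre demen pito, le la we mwen andey.", "ht"
    )
    print(body.decode("utf-8"))
    print(len(body))
```

For the sample text the encoded body is 113 bytes long, which agrees with the Content-Length value in the HTTP dumps later in the deck.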
Service Architecture
Geographically Distributed API Servers
Load-Balancing Proxy
Clients
HTTP Communication (Proxy)
Cache-Control: no-cache
Connection: keep-alive
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding: gzip,deflate
Accept-Language: en-us,en;q=0.5
Host: ht.api.accentuate.us:8080
User-Agent: Accentuate.us/0.9b3 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.1
Content-Length: 113
Content-Type: application/json; charset=utf-8
Keep-Alive: 115

{"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la we mwen andey.","locale":"ht"}
HTTP Communication (API)
Cache-Control: no-cache
Connection: close
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding: gzip,deflate
Accept-Language: en-us,en;q=0.5
Host: ht
User-Agent: Accentuate.us/distribution
Content-Length: 113
Content-Type: application/json; charset=utf-8

{"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la we mwen andey.","locale":"ht"}
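Comparing the two dumps suggests what the proxy changes before forwarding a request: the Host header shrinks to the language code, the connection is closed rather than kept alive, the client's User-Agent is replaced, and Keep-Alive disappears. The Python sketch below only mirrors those visible differences. The function name `rewrite_for_api` is hypothetical, and the slides do not describe how the real proxy is implemented.

```python
def rewrite_for_api(client_headers):
    """Mirror the header differences between the proxy and API dumps."""
    headers = dict(client_headers)  # leave the caller's dict untouched

    # "ht.api.accentuate.us:8080" -> "ht": the leading label of the
    # host name appears to carry the language code.
    host = headers.get("Host", "")
    headers["Host"] = host.split(".", 1)[0]

    # The API-side dump shows a non-persistent connection.
    headers["Connection"] = "close"
    headers.pop("Keep-Alive", None)

    # The client's User-Agent is replaced with a fixed string.
    headers["User-Agent"] = "Accentuate.us/distribution"
    return headers


if __name__ == "__main__":
    incoming = {
        "Host": "ht.api.accentuate.us:8080",
        "Connection": "keep-alive",
        "Keep-Alive": "115",
        "User-Agent": "Accentuate.us/0.9b3 Mozilla/5.0",
        "Content-Length": "113",
        "Content-Type": "application/json; charset=utf-8",
    }
    for name, value in rewrite_for_api(incoming).items():
        print(f"{name}: {value}")
```

The body and the content headers pass through unchanged in both dumps, so the sketch leaves them alone.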
Clients
Perl
$ ./sf-client.pl -r -l ga -i "Is i an Ghaeilge an chead teanga oifigiuil."
Is í an Ghaeilge an chéad teanga oifigiúil.
Python
Vim
OS X Service
OpenOffice.org
Mozilla Firefox
UI and Localization Decisions
Implement API calls
   Langs
        Silent
   Feedback
        Opt-in
        Improve language models
   Lift
        Complicated!
Looking ahead
Demos
Thank You!
