Building a 3-gram model for Language Identification
Kepa J. Rodriguez
RDD Colloquium
05.06.2013
Anlass der Präsentation (Fußzeile)
Outline
• Motivation
• N-grams: a short formal introduction
• Description of the model
• Experiments and results
• Conclusions
Motivation: why do we need language identification?

The EHRI project aims to integrate text information in different
languages
– Around 30 different languages?
• We can find pieces of text in different languages inside the same
collection
– As citations in the description
– File descriptions and documents in different languages
• Language identification is needed for
– Learning of statistical models of content
– Use of machine translation applications
– Information retrieval tasks
Our task today

Task: learn and evaluate a corpus-based language model for
language identification

Learn: data in different languages from text corpora

26 languages

4 alphabets: Latin, Cyrillic, Semitic and Greek.

Evaluate: test the model using examples of different sizes

10, 20, 30... 100 words.
An introduction to n-grams (1)

An n-gram is a contiguous sequence of n items from a given
sequence

Sequence of text, speech, biological material, etc.

n-grams are used in:

Computational linguistics

Statistical language modelling

Bio-informatics:

protein sequencing

DNA sequencing...

etc

n is a natural number:

1-gram, 2-gram, 3-gram, 4-gram....
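
As a minimal sketch of this definition (the function name `ngrams` is an illustrative assumption, not something from the slides), contiguous n-grams can be read off any sequence with a sliding window:

```python
def ngrams(sequence, n):
    """Return all contiguous windows of n items from `sequence`, as tuples."""
    if n < 1:
        raise ValueError("n must be a natural number (>= 1)")
    # A sequence shorter than n has no n-grams, so the range is empty.
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Character 2-grams of a word, and word 3-grams of a short sentence.
char_bigrams = ngrams("text", 2)
word_trigrams = ngrams("we need language identification".split(), 3)
```

The same function works for characters and for words because it only relies on slicing.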
An introduction to n-grams (2)

We can build the model using n-grams of:

Words

Characters

Advantages of the use of characters

Reduction of the complexity keeping the information:

There are fewer possible combinations of 3 letters than words in all the
languages

We can extract the 3-grams from running text or from individual words

We extract them from words after a pre-processing step

Otherwise we would have to deal with punctuation marks

But extracting from text should be more precise; if needed, it will be
tested in further experiments
Example of 3-grams in our model

Words with more than 2 characters: what

#wh, wha, hat, at#

1-character words: a

#a#

2-character words: or

*or
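
The three cases above can be sketched as follows. This is a hedged reading of the examples, not the author's code: the function name `word_trigrams` is hypothetical, and the rule for 2-character words (a single 3-gram prefixed with `*`) is inferred from the one example `or`.

```python
def word_trigrams(word):
    """Character 3-grams of a single word, following the three slide examples."""
    if not word:
        return []
    if len(word) == 2:
        # Assumption: a 2-character word yields exactly one 3-gram, marked with '*'.
        return ["*" + word]
    # '#' marks the word boundary on both sides; a 1-character word gives '#a#'.
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


examples = {w: word_trigrams(w) for w in ("what", "a", "or")}
```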
Construction of the model

Extract all words for each language

Extract all 3-grams from the words and count them

Select for each language the 2,000 most frequent 3-grams

Compute Term Frequency and normalize it to a number between 0 and 1

Build a vector space model (18,717 dimensions)
nl    en#:1    *de:0.446570282194783    an#:0.363387472486255    et#:0.352653410717282    
#he:0.294256413293457    #va:0.273426164460632    van:0.273130052411833    
ing:0.216783630925942    oor:0.207505453396899    er#:0.201139044347715    
ver:0.19740679873264    het:0.191538844965602    ie#:0.18158331112493    
at#:0.181232911867184    #ge:0.180348277121396    #be:0.178972589894683    
een:0.176327322258743    gen:0.169800519183126    *en:0.165320590644833    
nde:0.158940609793412    ten:0.158123834058808    #da:0.157651288580932    
ng#:0.155546425434051    den:0.152582837345652    #vo:0.151864765627313
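
The construction steps can be sketched as below, under two assumptions that the slides do not state explicitly: term frequency is normalized by dividing by the count of the most frequent 3-gram (consistent with the top Dutch 3-gram having weight 1), and the dimensions of the vector space are the union of the 3-grams selected for each language (consistent with 18,717 being smaller than 26 × 2,000). All function names and the toy corpus are illustrative.

```python
from collections import Counter


def trigrams(word):
    """Character 3-grams of a word, using the '#' and '*' markers of the examples."""
    if not word:
        return []
    if len(word) == 2:
        return ["*" + word]
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(words, top_k=2000):
    """Map the top_k most frequent 3-grams to a weight in (0, 1]."""
    counts = Counter(gram for word in words for gram in trigrams(word))
    top = counts.most_common(top_k)
    if not top:
        return {}
    max_count = top[0][1]  # assumption: normalize by the highest count
    return {gram: count / max_count for gram, count in top}


def build_vector_space(corpora, top_k=2000):
    """Build one profile per language plus the sorted list of all dimensions."""
    profiles = {lang: build_profile(words, top_k) for lang, words in corpora.items()}
    dimensions = sorted(set().union(*profiles.values()))
    return profiles, dimensions


# Toy corpus, made up for illustration only.
corpora = {
    "nl": "het is een van de".split(),
    "en": "the cat and the hat".split(),
}
profiles, dimensions = build_vector_space(corpora)
```

Storing each profile as a sparse dictionary avoids materializing a dense vector with thousands of mostly zero entries.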
Use of the model

The query is represented in the vector space

The predicted language is the one with the highest cosine similarity
to the query data
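
A minimal sketch of this prediction step, assuming that language profiles and the query are stored as sparse dictionaries from 3-gram to weight. The names `cosine` and `identify` are hypothetical; the Dutch weights are rounded from the sample above, and the English weights and the query are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {3-gram: weight}."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def identify(query_vector, profiles):
    """Return the language whose profile is most similar to the query."""
    return max(profiles, key=lambda lang: cosine(query_vector, profiles[lang]))


profiles = {
    "nl": {"en#": 1.0, "*de": 0.45, "an#": 0.36},
    "en": {"#th": 1.0, "the": 0.9, "he#": 0.8},  # invented weights
}
query = {"#th": 1.0, "the": 1.0, "he#": 1.0, "an#": 0.5}
predicted = identify(query, profiles)
```

Because cosine similarity divides by both vector lengths, a short query and a profile built from millions of words can be compared directly.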
Training material
• Datasets extracted from:
– Leipzig Corpora Collection: texts from Wikipedia, news and web.
– Europarl: European Parliament Parallel Corpus. Translated proceedings of the European Parliament.
• Data selection:
– For each language, different datasets were merged.
– Order of lines in the text randomized.
– Selected 200,000 lines (around 3,500,000 words) for each language.
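
The data selection for one language might look like the sketch below. The function name `select_lines`, the fixed random seed, and the toy input are assumptions for illustration; the slides only state that datasets were merged, line order was randomized, and 200,000 lines were kept.

```python
import random


def select_lines(datasets, n_lines=200_000, seed=0):
    """Merge several lists of lines, shuffle them, and keep the first n_lines."""
    merged = [line for dataset in datasets for line in dataset]
    rng = random.Random(seed)  # assumption: fixed seed, for reproducibility
    rng.shuffle(merged)
    return merged[:n_lines]


# Two tiny stand-ins for the merged corpora of one language.
sample = select_lines([["a b", "c d"], ["e f", "g h", "i j"]], n_lines=3)
```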
Test sets
• Language data extracted from the same corpora as the training set.
• 10 test sets with 100 examples for each language.
• Each set contains samples of different length:
– 10 words
– 20 words
– 30 words
– ….
– 100 words
• Experiment: map each example to its language
Overall performance
• Performance for all languages:
– 10 words: 91% correct
– 20 words: 95.6% correct
– …
– 40 words: 97% correct
• Most of the errors involve a single language: Norwegian
– Difficult to distinguish from other Germanic and Slavic languages
– Very low recall
• In best case: P=0.7, R=0.33, F1=0.44
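
For reference, per-language precision (P), recall (R) and F1 can be computed from gold and predicted labels as in this sketch. The function name and the toy label lists are made up and do not reproduce the reported results; F1 is the harmonic mean 2PR / (P + R).

```python
def precision_recall_f1(gold, predicted, language):
    """Precision, recall and F1 for one language, given parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == language and p == language)
    fp = sum(1 for g, p in pairs if g != language and p == language)
    fn = sum(1 for g, p in pairs if g == language and p != language)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: one of three Norwegian ("no") examples is recognized.
gold = ["no", "no", "no", "da", "sv", "da"]
predicted = ["no", "da", "sv", "da", "sv", "no"]
p, r, f1 = precision_recall_f1(gold, predicted, "no")
```

Low recall means that most examples of the language are assigned to other languages, which is the pattern described for Norwegian.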
Overall performance: Latin alphabet
• Languages with Latin alphabet
– 21 languages with very different typology
– EHRI-relevant and not
• We are not yet sure which languages will be needed
• Performance:
– 10 words: 89.57%
– 20 words: 94.62%
– ….
– 40 words: 96.5%
• Without Norwegian
– 10 words: 92.9%
– 20 words: 98.6%
– 30 words: 99.65%
Overall performance: Cyrillic alphabet
• 3 languages:
– Russian
– Belarusian
– Bulgarian
• Very good results:
– 10 words: 97.3%
– 20 words: 99.3%
– 30 words (and more): 100%
Conclusions
• The representational power of a 3-gram-based language model is
sufficient for language identification.
• Simple techniques such as the vector space model and cosine similarity
give good results, with the sole exception of one language (Norwegian).
• Questions? Discussion?
Thanks!!!
