Collection descriptions in archives can contain text in more than one language without the use of language identifiers. Examples are citations of texts from the documents of the collections or from other collections and materials. This missing information is crucial for tasks like statistical modeling of the content of an institution and, in general, for several information retrieval tasks.
Building a 3-gram model for Language Identification
Kepa J. Rodriguez
RDD Colloquium
05.06.2013
Outline
• Motivation
• N-grams: a short formal introduction
• Description of the model
• Experiments and results
• Conclusions
Motivation: why do we need language identification?
The EHRI project aims to integrate text information in different languages
– Around 30 different languages?
• We can find pieces of text in different languages inside the same collection
– As citations in the description
– File descriptions and documents in different languages
• Language identification is needed for
– Learning statistical models of content
– Use of machine translation applications
– Information retrieval tasks
Our task today
Task: learn and evaluate a corpus-based language model for language identification
Learn: data in different languages from text corpora
26 languages
4 alphabets: Latin, Cyrillic, Semitic and Greek.
Evaluate: test the model using examples of different sizes
10, 20, 30 ... 100 words.
An introduction to n-grams (1)
An n-gram is a contiguous sequence of n items from a given sequence
A sequence of text, speech, biological material, etc.
n-grams are used in:
Computational linguistics
Statistical language modelling
Bio-informatics:
protein sequencing
DNA sequencing ...
etc.
n is a natural number:
1-gram, 2-gram, 3-gram, 4-gram ...
An introduction to n-grams (2)
We can build the model using n-grams of:
Words
Characters
Advantages of using characters
Reduced complexity while keeping the information:
all combinations of 3 letters are fewer than all the words in all the languages
We can extract the 3-grams from running text or from words
We extract them from words after a pre-processing step
Otherwise we would have to handle punctuation marks
Extraction from running text might be more precise; if needed, it will be tested in further experiments
Example of 3-grams in our model
Words with more than 2 characters: what
#wh, wha, hat, at#
1-character words: a
#a#
2-character words: or
*or
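The boundary-marking scheme above can be sketched in Python (assumptions: `#` marks word boundaries and `*` marks 2-character words, as in the examples on this slide; the function name is illustrative):

```python
def word_trigrams(word):
    """Character 3-grams with boundary markers, following the scheme above:
    '#' marks word boundaries, '*' marks 2-character words."""
    if len(word) == 1:
        return ["#" + word + "#"]   # a -> #a#
    if len(word) == 2:
        return ["*" + word]         # or -> *or
    padded = "#" + word + "#"       # what -> #what#
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(word_trigrams("what"))  # ['#wh', 'wha', 'hat', 'at#']
```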
Construction of the model
Extract all words for each language
Extract all 3-grams from the words and count them
Select for each language the 2,000 most frequent 3-grams
Compute Term Frequency and normalize it to a number between 0 and 1
Build a vector space model (18,717 dimensions)
nl en#:1 *de:0.446570282194783 an#:0.363387472486255 et#:0.352653410717282
#he:0.294256413293457 #va:0.273426164460632 van:0.273130052411833
ing:0.216783630925942 oor:0.207505453396899 er#:0.201139044347715
ver:0.19740679873264 het:0.191538844965602 ie#:0.18158331112493
at#:0.181232911867184 #ge:0.180348277121396 #be:0.178972589894683
een:0.176327322258743 gen:0.169800519183126 *en:0.165320590644833
nde:0.158940609793412 ten:0.158123834058808 #da:0.157651288580932
ng#:0.155546425434051 den:0.152582837345652 #vo:0.151864765627313
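The construction steps above can be sketched as follows (a minimal sketch, assuming max-count normalization to map term frequencies into the 0–1 range; the helper names are illustrative, not the project's actual code):

```python
from collections import Counter

def trigrams(word):
    # Boundary-marked 3-grams as in the earlier example: what -> #wh, wha, hat, at#
    if len(word) == 1:
        return ["#" + word + "#"]
    if len(word) == 2:
        return ["*" + word]
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_profile(words, top_k=2000):
    """Count all 3-grams of one language, keep the top_k most frequent,
    and normalize each count by the largest count so every weight lies
    between 0 and 1."""
    counts = Counter(g for w in words for g in trigrams(w))
    top = counts.most_common(top_k)
    max_count = top[0][1]
    return {gram: count / max_count for gram, count in top}
```

Building one such profile per language and concatenating their 3-gram inventories yields the shared vector space (18,717 dimensions in the slide's model).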
Use of the model
Query is represented in the vector space
Predicted language is the language with the highest cosine similarity to the query
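The prediction step can be sketched with a sparse-dict cosine similarity (`profiles` maps language codes to weight vectors like the Dutch example on the previous slide; the names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_language(query_vec, profiles):
    """Return the language whose profile vector is closest to the query."""
    return max(profiles, key=lambda lang: cosine(query_vec, profiles[lang]))
```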
Training material
• Datasets extracted from:
– Leipzig Corpora Collection: texts from Wikipedia, news and the web.
– Europarl, the European Parliament Parallel Corpus: translated proceedings of the European Parliament.
• Data selection:
– For each language different datasets were merged.
– The order of lines in the text was randomized.
– 200,000 lines (around 3,500,000 words) were selected for each language.
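The data-selection steps above could be sketched like this (the function name and seed are illustrative assumptions):

```python
import random

def select_lines(datasets, n=200_000, seed=13):
    """Merge a language's datasets, randomize the line order, and keep
    the first n lines, mirroring the selection steps on this slide."""
    merged = [line for dataset in datasets for line in dataset]
    random.Random(seed).shuffle(merged)
    return merged[:n]
```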
Test sets
• Language data extracted from the same corpora as the training set.
• 10 test sets with 100 examples for each language.
• Each set contains samples of different length:
– 10 words
– 20 words
– 30 words
– ….
– 100 words
• Experiment: map each example to its language
Overall performance
• Performance for all languages:
– 10 words: 91% correct
– 20 words: 95.6% correct
– …
– 40 words: 97% correct
• Most of the errors concern a single language: Norwegian
– Difficult to distinguish from other Germanic and Slavic languages
– Very low recall
• In the best case: P=0.7, R=0.33, F1=0.44
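The per-language scores above are the standard precision/recall/F1 measures; as a quick sanity check, the harmonic mean of the best-case Norwegian values P=0.7 and R=0.33 is about 0.448, consistent with the slide's F1=0.44:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Best-case Norwegian scores from this slide:
f1_score(0.7, 0.33)  # ~0.448
```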
Overall performance: Latin alphabet
• Languages with the Latin alphabet
– 21 languages with very different typology
– EHRI-relevant and not
• We are not yet sure which languages will be needed
• Performance:
– 10 words: 89.57%
– 20 words: 94.62%
– …
– 40 words: 96.5%
• Without Norwegian:
– 10 words: 92.9%
– 20 words: 98.6%
– 30 words: 99.65%
Overall performance: Cyrillic alphabet
• 3 languages:
– Russian
– Belarusian
– Bulgarian
• Very good results:
– 10 words: 97.3%
– 20 words: 99.3%
– 30 words (and more): 100%
Conclusions
• The representational power of a 3-gram based language model is sufficient for language identification.
• Simple techniques such as a vector space model and cosine similarity give good results, with the exception of a single language.
• Questions? Discussion?