Collection descriptions in archives can contain text in more than one language without the use of language identifiers. Examples are citations of texts from the documents of the collections or from other collections and materials. This missing information is crucial for tasks like statistical modeling of the content of an institution and, in general, for several information retrieval tasks.
Building a 3-gram model for Language Identification
Kepa J. Rodriguez
RDD Colloquium
05.06.2013
Outline
• Motivation
• N-grams: a short formal introduction
• Description of the model
• Experiments and results
• Conclusions
Motivation: why do we need language identification?
The EHRI project aims to integrate text information in different languages
– Around 30 different languages?
• We can find pieces of text in different languages inside the same collection
– As citations in the description
– File descriptions and documents in different languages
• Language identification is needed for
– Learning statistical models of content
– Use of machine translation applications
– Information retrieval tasks
Our task today
Task: learn and evaluate a corpus-based language model for language identification
Learn: data in different languages from text corpora
26 languages
4 alphabets: Latin, Cyrillic, Semitic and Greek.
Evaluate: test the model using examples of different sizes
10, 20, 30 ... 100 words.
An introduction to n-grams (1)
An n-gram is a contiguous sequence of n items from a given sequence
A sequence of text, speech, biological material, etc.
n-grams are used in:
Computational linguistics
Statistical language modelling
Bio-informatics:
protein sequencing
DNA sequencing ...
etc.
n is a natural number:
1-gram, 2-gram, 3-gram, 4-gram ...
An introduction to n-grams (2)
We can build the model using n-grams of:
Words
Characters
Advantages of using characters
Reduced complexity while keeping the information:
all combinations of 3 letters are fewer than all the words in all the languages
We can extract the 3-grams from running text or from words
We extract them from words after a pre-processing step
Otherwise we would have to handle punctuation marks
Extraction from running text might be more precise; if needed, it will be tested in further experiments
Example of 3-grams in our model
Words with more than 2 characters: what
#wh, wha, hat, at#
1-character words: a
#a#
2-character words: or
*or
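The boundary-marking scheme above can be sketched in Python (assumptions: `#` marks word boundaries and `*` marks 2-character words, as in the examples on this slide; the function name is illustrative):

```python
def word_trigrams(word):
    """Character 3-grams with boundary markers, following the scheme above:
    '#' marks word boundaries, '*' marks 2-character words."""
    if len(word) == 1:
        return ["#" + word + "#"]   # a -> #a#
    if len(word) == 2:
        return ["*" + word]         # or -> *or
    padded = "#" + word + "#"       # what -> #what#
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(word_trigrams("what"))  # ['#wh', 'wha', 'hat', 'at#']
```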
Construction of the model
Extract all words for each language
Extract all 3-grams from the words and count them
Select for each language the 2,000 most frequent 3-grams
Compute Term Frequency and normalize it to a number between 0 and 1
Build a vector space model (18,717 dimensions)
nl en#:1 *de:0.446570282194783 an#:0.363387472486255 et#:0.352653410717282
#he:0.294256413293457 #va:0.273426164460632 van:0.273130052411833
ing:0.216783630925942 oor:0.207505453396899 er#:0.201139044347715
ver:0.19740679873264 het:0.191538844965602 ie#:0.18158331112493
at#:0.181232911867184 #ge:0.180348277121396 #be:0.178972589894683
een:0.176327322258743 gen:0.169800519183126 *en:0.165320590644833
nde:0.158940609793412 ten:0.158123834058808 #da:0.157651288580932
ng#:0.155546425434051 den:0.152582837345652 #vo:0.151864765627313
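The construction steps above can be sketched as follows (a minimal sketch, assuming max-count normalization to map term frequencies into the 0–1 range; the helper names are illustrative, not the project's actual code):

```python
from collections import Counter

def trigrams(word):
    # Boundary-marked 3-grams as in the earlier example: what -> #wh, wha, hat, at#
    if len(word) == 1:
        return ["#" + word + "#"]
    if len(word) == 2:
        return ["*" + word]
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_profile(words, top_k=2000):
    """Count all 3-grams of one language, keep the top_k most frequent,
    and normalize each count by the largest count so every weight lies
    between 0 and 1."""
    counts = Counter(g for w in words for g in trigrams(w))
    top = counts.most_common(top_k)
    max_count = top[0][1]
    return {gram: count / max_count for gram, count in top}
```

Building one such profile per language and concatenating their 3-gram inventories yields the shared vector space (18,717 dimensions in the slide's model).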
Use of the model
Query is represented in the vector space
Predicted language is the language with the highest cosine similarity to the query
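The prediction step can be sketched with a sparse-dict cosine similarity (`profiles` maps language codes to weight vectors like the Dutch example on the previous slide; the names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_language(query_vec, profiles):
    """Return the language whose profile vector is closest to the query."""
    return max(profiles, key=lambda lang: cosine(query_vec, profiles[lang]))
```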
Training material
• Datasets extracted from:
– Leipzig Corpora Collection: texts from Wikipedia, news and the web.
– Europarl, the European Parliament Parallel Corpus: translated proceedings of the European Parliament.
• Data selection:
– For each language different datasets were merged.
– The order of lines in the text was randomized.
– 200,000 lines (around 3,500,000 words) were selected for each language.
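The data-selection steps above could be sketched like this (the function name and seed are illustrative assumptions):

```python
import random

def select_lines(datasets, n=200_000, seed=13):
    """Merge a language's datasets, randomize the line order, and keep
    the first n lines, mirroring the selection steps on this slide."""
    merged = [line for dataset in datasets for line in dataset]
    random.Random(seed).shuffle(merged)
    return merged[:n]
```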
Test sets
• Language data extracted from the same corpora as the training set.
• 10 test sets with 100 examples for each language.
• Each set contains samples of different length:
– 10 words
– 20 words
– 30 words
– ….
– 100 words
• Experiment: map each example to its language
Overall performance
• Performance for all languages:
– 10 words: 91% correct
– 20 words: 95.6% correct
– …
– 40 words: 97% correct
• Most of the errors concern a single language: Norwegian
– Difficult to distinguish from other Germanic and Slavic languages
– Very low recall
• In the best case: P=0.7, R=0.33, F1=0.44
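The per-language scores above are the standard precision/recall/F1 measures; as a quick sanity check, the harmonic mean of the best-case Norwegian values P=0.7 and R=0.33 is about 0.448, consistent with the slide's F1=0.44:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Best-case Norwegian scores from this slide:
f1_score(0.7, 0.33)  # ~0.448
```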
Overall performance: Latin alphabet
• Languages with the Latin alphabet
– 21 languages with very different typology
– EHRI-relevant and not
• We are not yet sure which languages will be needed
• Performance:
– 10 words: 89.57%
– 20 words: 94.62%
– …
– 40 words: 96.5%
• Without Norwegian:
– 10 words: 92.9%
– 20 words: 98.6%
– 30 words: 99.65%
Overall performance: Cyrillic alphabet
• 3 languages:
– Russian
– Belarusian
– Bulgarian
• Very good results:
– 10 words: 97.3%
– 20 words: 99.3%
– 30 words (and more): 100%
Conclusions
• The representational power of a 3-gram based language model is sufficient for language identification.
• Simple techniques such as a vector space model and cosine similarity give good results, with the exception of a single language.
• Questions? Discussion?