Enriching Transliteration Lexicon Using Automatic Transliteration Extraction

Sarvnaz Karimi
School of Computer Science and IT
RMIT University
Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin
Keywords: Transliteration, Parallel Corpus

Machine Transliteration
• Machine transliteration transforms a word from a source language to a
target language while preserving its pronunciation.
• Machine translation, cross-lingual information retrieval and cross-lingual
question answering are the main areas in which automatic transliteration is
applicable.
• Transliteration has been studied in two major areas: transliteration
generation and transliteration extraction.
• Transliteration generation takes a word in the source language
(e.g. Sydney in English) and generates its transliteration in the target
language (e.g. سیدنی in Persian).
• Transliteration extraction discovers transliteration pairs, e.g.
(Sydney, سیدنی), in bilingual texts.

Transliteration Extraction So Far!
Transliteration discovery methods in the literature consider:
• Extraction from parallel corpus:
– Statistical methods are beneficial, particularly because the
sentences/words can be aligned.
– Yet parallel corpora are hard to find for many less-computerised
languages.
• Extraction from comparable corpus:
– More evidence than statistical information alone is required to extract
pairs (e.g. temporal or phonetic information, or Web counts).
– Comparable corpora are easier to construct and find than parallel ones.
Most studies use a named entity (NE) recogniser to separate proper nouns, which
are subject to transliteration, from other words.

Persian and English Transliteration
• Transliteration generation has been studied using n-gram based and
consonant-vowel based approaches.
• Transliteration extraction has not previously been studied for this language
pair, mainly due to the lack of any parallel or comparable corpus.
• For other language pairs, transliteration extraction has been studied using
co-occurrence, temporal, edit distance or phonetic similarity measures. We aim
to apply our transliteration generation methods as a basis for this task.

Proposed Method: Application of Transliteration
Generation in Extraction
1. Each document in each language is pre-processed: it is tokenised
into a bag of words and stop-words are removed.
2. Each word in the source language is matched against a dictionary.
A word that is not found is an out-of-dictionary word whose
transliteration is sought in the target document.
3. A ranked list of possible transliterations for each such source
word is generated by the transliteration system.
4. Transliterations that match words in the target document
are considered potential pairs.
5. Each pair can be given a score based on the rank of the
transliteration and the number of times the pair is found.
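
The five steps above can be sketched in Python as follows. This is only an
illustration under stated assumptions, not the system used in the experiments:
the stop-word list is a toy one, `generate` is a hypothetical callable standing
in for the transliteration generation system, and summing reciprocal ranks is
one possible way to combine rank and pair frequency (the poster does not give
the scoring formula).

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "a", "of", "and"}  # toy stop-word list (assumption)


def tokenise(text, stop_words=frozenset()):
    """Step 1: return the set of lower-cased words in text, minus stop-words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in words if w and w not in stop_words}


def extract_pairs(source_doc, target_doc, dictionary, generate, scores):
    """Steps 2-5 for one document pair; accumulates into `scores` in place.

    `generate(word)` is a hypothetical function returning a ranked list of
    candidate transliterations, best first.
    """
    source_words = tokenise(source_doc, STOP_WORDS)
    target_words = tokenise(target_doc)
    for word in source_words:
        if word in dictionary:  # step 2: skip words found in the dictionary
            continue
        for rank, candidate in enumerate(generate(word), start=1):  # step 3
            if candidate in target_words:  # step 4: candidate occurs in target
                # Step 5 (assumed scoring): higher ranks and repeated
                # pairings across documents both increase the score.
                scores[(word, candidate)] += 1.0 / rank
    return scores
```

Calling `extract_pairs` once per aligned document pair with a shared
`defaultdict(float)` accumulates scores over the whole corpus; pairs can then
be sorted by score before being added to the lexicon.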

Experimental Setup
• An English-Persian comparable corpus of news texts, consisting of
3,474 documents, was constructed.
• An English machine-readable dictionary containing 120,177 entries
was used.
• Experiments:
– Accuracy of the extracted transliterations (fixed training collection).
Three matching methods were tried: (1) English documents are
parsed to extract their out-of-dictionary words using dictionary look-up
and stemming. (2) In addition, Persian documents are parsed so that
characters representing the same sound (allophone characters) are mapped
to one unique character. (3) The previous experiment is repeated using
knowledge of capital characters.
– Impact of seed transliteration lexicon.
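
The Persian normalisation used in the second matching method can be
illustrated with a small Python sketch. The mapping below is an assumption
made for illustration (it groups some Persian letters that share a
pronunciation); the poster does not list the exact character classes used.

```python
# Hypothetical mapping: letters pronounced alike in Persian collapse to one
# representative letter. The real experiments may have used a different table.
SAME_SOUND = {
    "ذ": "ز", "ض": "ز", "ظ": "ز",  # all pronounced /z/
    "ص": "س", "ث": "س",            # all pronounced /s/
    "ط": "ت",                      # pronounced /t/
    "ح": "ه",                      # pronounced /h/
    "غ": "ق",                      # pronounced alike in Iranian Persian
}
TABLE = str.maketrans(SAME_SOUND)


def normalise(word):
    """Replace each letter by the representative of its same-sound group."""
    return word.translate(TABLE)
```

Applying `normalise` to both the generated transliterations and the words of
the Persian document before matching lets spelling variants that differ only
in such letters count as the same word.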

Experiments and Results
Experiment 1: Accuracy
Method  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
1       4.6     3.6 (81.3)  5.8       68.9      8.5  11.3  1662  2860      70.3
2       4.1     3.6 (90.2)  5.9       69.7      8.5  11.3  1641  2496      80.4
3       6.6     5.9 (89.2)  6.9       66.8      8.4  22.6  1725  3694      75.2

Experiment 2: Train Size
Train   #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
200     2.6     2.2 (86.5)  2.8       84.2      9.1  24.0  1322  1287      78.2
300     3.1     2.5 (82.5)  4.2       71.9      8.8  23.5  1494  1579      73.3
400     3.1     2.6 (84.5)  4.6       71.3      8.8  23.4  1483  1569      74.1
500     2.8     2.5 (89.5)  4.3       78.0      8.9  23.5  1459  1507      79.0

Conclusions and Further Work
• Transliteration extraction can be helpful in automatically generating
transliteration lexicons.
• A transliteration lexicon, i.e. a dictionary of transliterations of proper
nouns and technical terms that are not translated, is beneficial in
dictionary-based machine translation applications.
• We investigated a method that uses current, still incomplete
transliteration lexicons to enrich themselves from comparable corpora.
• In future work, the role of an NE recogniser will be investigated and
compared with a simple dictionary look-up.
