Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)

Mulu Gebreegziabher
Addis Ababa, Ethiopia: IT Doctoral Program, Addis Ababa University
Prof. Laurent Besacier
Grenoble, France: University Joseph Fourier
Dr. Girma Taye & Dr. Dereje Teferi
Addis Ababa, Ethiopia: Addis Ababa University
December 2, 2011

Presentation Outline
• Introduction
• Objectives
• Experiment on the English-Amharic bilingual corpus
• ENA English-Amharic parallel news corpus
• Parliamentary English-Amharic parallel proclamation corpus
• Sentence level aligned English-Amharic parallel corpora
• Way Forward

Introduction
• MT is the application of computers to translate text from
one natural language to another.
• Machine Translation Systems:
– Machine Assisted Translation: Human Aided, Machine Aided
– Fully Automated Translation: Rule-based Systems, Empirical Systems
– Empirical Systems: Statistical Machine, Example-based

Introduction Contd…
• SMT systems are data-driven: they rely on an aligned
bilingual parallel corpus.
• The performance of an SMT system depends on the
size of the available training corpus.
• The larger the corpus, the better the
performance of the SMT system.
• To develop EASMT, parallel data has to be collected
as English-Amharic bilingual sentence pairs.
• The experiment is to be conducted on a corpus of
at least 2M word pairs (40K sentence pairs).

English-Amharic Statistical Machine Translation (EASMT)
• Translation between two disparate languages

                      Amharic          English
Language Family       Afro-Asiatic     Indo-European
Morphology            Complex          Less inflected
Syntactic Structure   SOV              SVO
Writing System        Geez Alphabet    Latin Letters

Parallel Corpus
• A parallel corpus is a collection of texts paired with
their translations into another language.
• The experiment is conducted on a training corpus in
both languages, based on expressions found in parallel
Amharic-English news, parliamentary and constitutional
documents.
• The parallel ENA news contains sentences of day-to-day
usage:
– Direct translations of each other
– Indirect translations written on the same topic in different
languages, called comparable corpora.

Objectives

The objective of the research is to study and
develop an English-Amharic Statistical
Machine Translation (EASMT) system and to
improve the translation quality by integrating
linguistic knowledge into the system.

Experiment on the English-Amharic
bilingual corpus
Mining the parallel corpus
• There are five steps in processing a bilingual text corpus
for an SMT system (Besacier et al., 2009):
– Raw data collection: proclamation and parallel
news corpora have been collected
– Document alignment: manual & automatic
– Tokenization: splitting and trimming
– Sentence splitting: done using the punctuation marks [?!. ፡፡ ]
– Sentence alignment: almost completed
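
The sentence splitting step can be illustrated with a minimal Python sketch. It assumes plain Unicode text and splits after each run of the punctuation marks listed above. The function name `split_sentences` is hypothetical, and accepting the single-character Amharic full stop (።) alongside the two-wordspace form (፡፡) is an assumption not stated on the slide. Abbreviations such as "Dr." and decimal numbers are not handled and would be split incorrectly.

```python
import re

# Sentence-final marks from the slide: ? ! . and the Amharic full stop typed
# as two Ethiopic wordspaces (፡፡). Assumption: the single character ። is
# also accepted, since both forms occur in Amharic text.
SENTENCE_END = re.compile(r"(?:፡፡|።|[?!.])+")


def split_sentences(text):
    """Split text after each run of sentence-final punctuation."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

The same function serves both sides of the corpus, so English and Amharic documents are split under one rule set before alignment.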

ENA English-Amharic parallel news corpus
• News coverage: Aug 21, 2006 - January 06, 2008

                  News Corpus         Counts    Total

Amharic           Domestic Language   10,116
                  Regional            13,655   23,771
English           Foreign Language    11,276   11,276
Amharic-English   Monitoring             494
                  Information          3,116    3,610

Table 1: ENA news corpus

• Count Summary: ENA news corpus

Collected                          Amharic     English       Total
Counts of Raw       Documents       23,771      11,276      35,047
                    Sentences      322,673     212,050     534,723
                    Words        5,277,711   3,704,644   8,982,355
                    Vocabularies   270,786     130,803     401,589
Counts of Aligned   Documents        1,036       1,036       2,072
                    Sentences       26,112      25,834      51,946
                    Words          207,200     198,461     405,661
                    Vocabularies    36,519      17,987      54,506

Table 2: The status of English-Amharic parallel news corpus on May 25, 2011

• Manual alignment at document level: Challenges
– Easy: preprocessing, including exporting from the SQL
database to Word and converting to Unicode using the
Zilla word to text converter
– Time consuming: alignment at document level is difficult
because the files are stored in different folders with
no structure, and the dates, punctuation and heading
information differ between the two languages
(parallel/comparable corpus)
– Document level alignment is done by looking at
the heading and picking the news ID from the folders

• Automatically aligned English-Amharic sample ENA
news corpora at document level
• The aligner takes the following into consideration to
align the news items:
– Start from the English corpus (which constitutes 32% of
the collected documents).
– Match only news items whose story languages differ.
– Limit the search to the neighboring Amharic corpus, looking
at 80 files around the current file.
– A scoring method is used that gives equal weights to all
matching columns.
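
The considerations above can be sketched in Python as follows. This is an illustration under stated assumptions, not the aligner used in the experiment: the `NewsItem` structure, the contents of its `columns` dictionary, and the reading of "80 files around" as 40 files on each side are all hypothetical, since the slides name neither the matching columns nor the exact window.

```python
from dataclasses import dataclass

WINDOW = 80  # assumption: 80 files in total, i.e. 40 on each side


@dataclass
class NewsItem:
    news_id: int
    language: str   # "en" or "am"
    columns: dict   # hypothetical metadata columns, e.g. date, region


def score(english, amharic):
    """Count columns holding equal values; every column has weight 1."""
    return sum(
        1
        for name, value in english.columns.items()
        if amharic.columns.get(name) == value
    )


def align_documents(items):
    """Map each English news ID to its best-scoring Amharic neighbor.

    `items` is the full list of news items in storage order.
    """
    half = WINDOW // 2
    matches = {}
    for i, item in enumerate(items):
        if item.language != "en":
            continue  # start from the English corpus
        best, best_score = None, 0
        for other in items[max(0, i - half): i + half + 1]:
            if other.language == item.language:
                continue  # only match items with a different story language
            current = score(item, other)
            if current > best_score:
                best, best_score = other, current
        if best is not None:
            matches[item.news_id] = best.news_id
    return matches
```

With equal weights, ties are resolved in favor of the first candidate in the window; a real aligner would need an explicit tie-breaking rule.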

• The output of the automatic aligner:

Aligned Corpus          Counts   Cumulative   Fraction
1-1                        383          383       0.37
1-2                        155          538       0.52
1-3                        498        1,036       1.00
Total Exact Matches        880                    0.85
Unique Amharic Corpus      968                    0.93
Unique English Corpus    1,036                    1.00

Table 4: Automatically Aligned English-Amharic Sample
ENA news items
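
The cumulative counts and the shares in Table 4 follow directly from the raw counts. The short Python check below recomputes them, taking the sample total of 1,036 as the sum of the three alignment types; the variable names are illustrative only.

```python
# Raw counts per alignment type, as reported in Table 4.
counts = {"1-1": 383, "1-2": 155, "1-3": 498}
total = sum(counts.values())  # size of the aligned sample

rows = []
running = 0
for label, count in counts.items():
    running += count
    # Each row: label, count, cumulative count, cumulative share of the sample.
    rows.append((label, count, running, round(running / total, 2)))

exact_share = round(880 / total, 2)    # total exact matches
amharic_share = round(968 / total, 2)  # unique Amharic items

for row in rows:
    print(row)
print(exact_share, amharic_share)
```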


• Some of the sample English documents were better
aligned with a document not seen during manual alignment, e.g.
– 41827 → 41791 (manual: 41827 → 41826)
• 85% of the automatic matches are exactly the same as
the manual alignment.
• Thus, the remaining 15% are new matches, which do not
necessarily indicate an error.
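
A comparison of this kind can be sketched in Python as below. The sketch assumes that both alignments are stored as dictionaries from an English news ID to an Amharic news ID, and that the first ID in the example pair is the English item. The function name `agreement` and the second ID pair in the usage example are hypothetical.

```python
def agreement(auto, manual):
    """Compare two document alignments (English ID -> Amharic ID).

    Returns the share of English IDs present in both alignments that map
    to the same Amharic ID, plus the differing pairs for manual review.
    """
    shared = [en_id for en_id in manual if en_id in auto]
    differing = {
        en_id: (auto[en_id], manual[en_id])
        for en_id in shared
        if auto[en_id] != manual[en_id]
    }
    same = len(shared) - len(differing)
    rate = same / len(shared) if shared else 0.0
    return rate, differing


# 41827 is the pair cited above; 50001 -> 50002 is a made-up agreeing pair.
auto = {41827: 41791, 50001: 50002}
manual = {41827: 41826, 50001: 50002}
rate, differing = agreement(auto, manual)
```

Returning the differing pairs rather than only the rate makes it possible to inspect each new match and decide whether it is an error or a better alignment.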

Table ENA: Aligned Sample English/Amharic News corpus

• Extended to automatically align all the English-Amharic
ENA news items

Aligned Corpus          Counts   Cumulative   Fraction
1-1                      2,928        2,928       0.26
1-2                      1,535        4,463       0.40
1-3                      6,813       11,276       1.00
Unique Amharic Corpus   10,487                    0.93
Unique English Corpus   11,276                    1.00

Table 5: Automatically Aligned English-Amharic ENA news items

Parliamentary English-Amharic parallel
proclamation corpus
• Proclamation coverage: Aug 21, 1995 - July 16, 2010

Collected                          Amharic   English     Total
Counts of Raw       Documents          632       632     1,264
Counts of Aligned   Documents          115       115       230
                    Sentences       19,115    25,730    44,845
                    Words          219,430   283,578   503,008
                    Vocabularies    32,299    17,908    50,207

Table 6: Aligned Parliamentary English-Amharic
parallel proclamation corpus

Sentence level aligned English-Amharic
parallel corpora
• The alignment process is the same for both the ENA
news items and the proclamations.
• The alignment is done using a sentence aligner called
Hunalign (similar to Gale and Church, 1993).
• Hunalign aligns bilingual text using sentence-length
information, combined with a bilingual dictionary when
one is available.
• An English-Amharic bilingual dictionary of 8,212 word-list
entries has been adopted and used (Armbruster, 2007).
• The aligner aligns English sentences to Amharic sentences
as 0-1, 1-1 or 1-2 alignments.
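
To illustrate length-based alignment with the three alignment types listed above, here is a simplified dynamic-programming sketch in Python. It is not Hunalign's actual algorithm, which also uses the dictionary: the cost function, the `SKIP_PENALTY` and `MERGE_PENALTY` values, and all names are assumptions chosen for illustration.

```python
import math

# Allowed alignment types as (English sentences, Amharic sentences).
BEADS = [(0, 1), (1, 1), (1, 2)]
SKIP_PENALTY = 3.0   # hypothetical cost of an unmatched Amharic sentence
MERGE_PENALTY = 1.0  # hypothetical extra cost of a 1-2 alignment


def bead_cost(en_len, am_len, ratio):
    """Penalize pairs whose length ratio departs from the expected ratio."""
    return abs(math.log((am_len + 1) / (ratio * en_len + 1)))


def align(en_sents, am_sents):
    """Return a list of (English indices, Amharic indices) pairs."""
    n, m = len(en_sents), len(am_sents)
    # Expected Amharic/English character-length ratio for this document.
    ratio = sum(map(len, am_sents)) / max(1, sum(map(len, en_sents)))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0:
                    step = SKIP_PENALTY
                else:
                    am_len = sum(len(s) for s in am_sents[j:nj])
                    step = bead_cost(len(en_sents[i]), am_len, ratio)
                    if dj == 2:
                        step += MERGE_PENALTY
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        return []  # no path, e.g. more English than Amharic sentences
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For example, a short English sentence followed by a long one, paired with one short and two medium Amharic sentences, yields a 1-1 alignment followed by a 1-2 alignment. Because only 0-1, 1-1 and 1-2 types are allowed, the sketch returns an empty list when there are more English than Amharic sentences.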

Sentence level aligned English-Amharic
parallel corpora
• The result of the alignment at the sentence level for
both the ENA news and the proclamation

Aligned Sentence Pairs    Counts
ENA Corpus               155,200
Proclamation Corpus       18,632
Total                    173,832

Table 7: Sentence aligned English-Amharic bilingual corpus

Way Forward

• To increase the size of the English-Amharic
proclamation corpus as much as possible.
• To further analyze the experiment conducted so far.
• To improve the translation quality using
morpho-syntactic linguistic knowledge.
