Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)

Mulu Gebreegziabher
Addis Ababa, Ethiopia: IT Doctoral Program, Addis Ababa University
Prof. Laurent Besacier
Grenoble, France: University Joseph Fourier
Dr. Girma Taye & Dr. Dereje Teferi
Addis Ababa, Ethiopia: Addis Ababa University
December 2, 2011
Presentation Outline
•   Introduction
•   Objectives
•   Experiment on the English-Amharic bilingual corpus
•   ENA English-Amharic parallel news corpus
•   Parliamentary English-Amharic parallel proclamation corpus
•   Sentence level aligned English-Amharic parallel corpora
•   Way Forward
Introduction
•   MT is the application of computers to translate text from one
    natural language to another.
•   Machine Translation Systems
  –   Machine Assisted Translation
        –   Human Aided Translation
        –   Machine Aided Translation
  –   Fully Automated Translation
        –   Rule-based Systems
        –   Empirical Systems: Statistical Machine Translation and
            Example-based Translation
Introduction Contd…
•   SMT systems are data-driven: they rely on an aligned
    bilingual parallel corpus.
•   The performance of an SMT system depends on the
    size of the available training corpus.
•   The larger the corpus, the better the
    performance of the SMT system.
•   To develop EASMT, parallel data has to be collected
    as English-Amharic bilingual sentence pairs.
•   The experiment is to be conducted on a corpus of
    at least 2M word pairs (40K sentence pairs).
Introduction Contd…
English-Amharic Statistical Machine Translation (EASMT)
• Translation between two disparate languages

                        Amharic          English
  Language Family       Afro-Asiatic     Indo-European
  Morphology            Complex          Less inflected
  Syntactic Structure   SOV              SVO
  Writing System        Geez Alphabet    Latin Letters
Introduction Contd…
Parallel Corpus
• A parallel corpus is a collection of texts paired with
   their translations into another language.
• The experiment is conducted on a training corpus in
   both languages, drawn from parallel Amharic-English
   news, parliamentary, and constitutional documents.
• The parallel ENA news contains sentences of day-to-day
   usage. They fall into two groups:
  –   Direct translations of each other
  –   Indirect translations: texts written on the same topic in
      different languages, called comparable corpora.
Objectives


The objective of the research is to study and
develop an English-Amharic Statistical
Machine Translation (EASMT) system and to
improve the translation quality by integrating
linguistic knowledge into the system.
Experiment on the English-Amharic
          bilingual corpus
Mining the parallel corpus
• Processing a bilingual text corpus for an SMT system
  takes five steps (Besacier et al., 2009):
  – Raw data collection: proclamation and parallel
      news corpora have been collected
  – Document alignment: manual and automatic
  – Tokenization: splitting and trimming
  – Sentence splitting: done using the punctuation marks [?!. ፡፡ ]
  – Sentence alignment: almost completed
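
The sentence-splitting step can be illustrated with a minimal Python sketch. It is not the authors' tool. It assumes that the marks listed on the slide (? ! . and the Amharic full stop typed as two Ethiopic wordspace characters, U+1361 U+1361) end a sentence, and it additionally accepts the single-code-point Ethiopic full stop U+1362, which is an assumption not stated on the slide. The function name `split_sentences` is hypothetical.

```python
import re

# Sentence-final marks from the slide: ? ! . and the Amharic full stop typed
# as two Ethiopic wordspace characters (U+1361 U+1361).
# Assumption: the one-character Ethiopic full stop (U+1362) is accepted too.
_SENTENCE_END = re.compile("(\u1361\u1361|\u1362|[?!.]+)")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    # With a capturing group, re.split returns:
    # [chunk, delimiter, chunk, delimiter, ..., last chunk]
    parts = _SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    tail = parts[-1].strip()  # text after the last delimiter, if any
    if tail:
        sentences.append(tail)
    return sentences
```

This naive splitter also breaks on abbreviations and decimal numbers such as "3.5", so a real pipeline would need extra rules for those cases.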
ENA English-Amharic parallel news corpus
 • News coverage: August 21, 2006 - January 6, 2008

     News Corpus        Category             Counts      Total
     Amharic            Domestic Language    10,116
                        Regional             13,655     23,771
     English            Foreign Language     11,276     11,276
     Amharic-English    Monitoring              494
                        Information           3,116      3,610

                       Table 1: ENA news corpus
ENA English-Amharic parallel news corpus
 • Count summary: ENA news corpus

     Collected            Unit              Amharic     English       Total
     Counts of Raw        Documents          23,771      11,276      35,047
                          Sentences         322,673     212,050     534,723
                          Words           5,277,711   3,704,644   8,982,355
                          Vocabularies      270,786     130,803     401,589
     Counts of Aligned    Documents           1,036       1,036       2,072
                          Sentences          26,112      25,834      51,946
                          Words             207,200     198,461     405,661
                          Vocabularies       36,519      17,987      54,506

 Table 2: The status of English-Amharic parallel news corpus on May 25, 2011
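
Counts of the kind reported in Table 2 (documents, sentences, words, vocabulary) can be produced with a few lines of Python. The sketch below is illustrative only: the function name `corpus_counts`, the input layout (a list of documents, each a list of sentence strings), and whitespace tokenization are all assumptions, not the procedure used for the ENA corpus.

```python
def corpus_counts(documents):
    """Return document, sentence, word, and vocabulary counts.

    `documents` is a list of documents; each document is a list of
    sentence strings. Words are split on whitespace (an assumption).
    """
    sentences = [sentence for doc in documents for sentence in doc]
    words = [word for sentence in sentences for word in sentence.split()]
    return {
        "documents": len(documents),
        "sentences": len(sentences),
        "words": len(words),
        "vocabulary": len(set(words)),  # number of distinct word forms
    }
```

Running the function once per language gives the two language columns. The Total column is the sum of the two.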
ENA English-Amharic parallel news corpus
 • Manual alignment at document level: challenges
    – Easy: preprocessing, including exporting from the SQL
       database to Word and converting to Unicode using
       Zilla word to text converter
    – Time-consuming: aligning at document level is difficult
       because the files are stored in different folders
       with no structure, and the dates, punctuation, and
       heading information differ between the two languages
       (parallel/comparable corpus)
    – Document-level alignment is done by looking at
       the heading and picking the news id from the folders
ENA English-Amharic parallel news corpus
 • A sample of the English-Amharic ENA news corpora was
   aligned automatically at document level
 • The aligner takes the following into consideration to
   align the news items:
    – Start from the English corpus (which constitutes 32% of the collection).
    – Match only news items whose story languages differ.
    – Limit the search to neighboring Amharic items, looking at 80
      files around the current file.
    – Use a scoring method that gives equal weight to all
      matching columns.
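
The aligner's considerations can be sketched in Python as a windowed search with an equal-weight column score. This is a minimal illustration under stated assumptions, not the authors' aligner: news items are modeled as dicts with an `id` field plus metadata columns, the column names are supplied by the caller, and "80 files around the current file" is read as 40 positions on either side in id order. The names `column_score` and `align_documents` are hypothetical.

```python
from bisect import bisect_left


def column_score(english_item, amharic_item, columns):
    """Count the columns whose values match; every column weighs the same."""
    return sum(1 for col in columns if english_item[col] == amharic_item[col])


def align_documents(english, amharic, columns, window=80):
    """Map each English item id to the best-scoring nearby Amharic item id.

    Assumption: 'nearby' means the `window` Amharic items closest to the
    English item in id order (half before, half after).
    """
    amharic = sorted(amharic, key=lambda item: item["id"])
    ids = [item["id"] for item in amharic]
    half = window // 2
    pairs = {}
    for en in english:
        pos = bisect_left(ids, en["id"])  # where the English id would sit
        candidates = amharic[max(0, pos - half):pos + half]
        best = max(candidates,
                   key=lambda am: column_score(en, am, columns),
                   default=None)
        # Keep the pair only if at least one column actually matches.
        if best is not None and column_score(en, best, columns) > 0:
            pairs[en["id"]] = best["id"]
    return pairs
```

Because the English and Amharic items are passed as separate lists, every candidate pair automatically has different story languages. Ties are resolved in favor of the candidate with the lowest id.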
ENA English-Amharic parallel news corpus
 • Output of the automatic aligner

   Aligned Corpus             Counts    Cumulative    Cumulative share
   1-1                           383           383                0.37
   1-2                           155           538                0.52
   1-3                           498         1,036                1.00
   Total Exact Matches                         880                0.85
   Unique Amharic Corpus                       968                0.93
   Unique English Corpus                     1,036                1.00

        Table 4: Automatically Aligned English-Amharic Sample
                 ENA news items
ENA English-Amharic parallel news corpus

• Some of the sample English documents were
  better aligned with a document not seen during manual alignment, e.g.
  – 41827 → 41791 (manual: 41827 → 41826)
• 85% of the automatic alignments exactly match
  the manual alignment.
• The remaining 15% are new matches, which do
  not in themselves indicate an error.


      Table ENA: Aligned Sample English/Amharic News corpus
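
The 85% figure is the share of English documents whose automatic match equals the manual one (880 of 1,036 in Table 4). A small Python sketch of that comparison follows. The function name `agreement` and the dict-based representation (English id mapped to Amharic id) are assumptions made for illustration.

```python
def agreement(manual, automatic):
    """Compare two alignments given as {english_id: amharic_id} dicts.

    Returns (exact matches, compared items, share of exact matches).
    Only English ids present in both alignments are compared.
    """
    shared = [key for key in manual if key in automatic]
    exact = sum(1 for key in shared if manual[key] == automatic[key])
    share = exact / len(shared) if shared else 0.0
    return exact, len(shared), share


# Arithmetic check against Table 4: 880 exact matches out of 1,036.
table4_share = round(880 / 1036, 2)
```

A disagreement counted this way is not automatically an error, since the automatic aligner may have found a better partner than the manual pass did.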
ENA English-Amharic parallel news corpus
 • The aligner was extended to automatically align the whole
   English-Amharic ENA news collection

        Aligned Corpus           Counts    Cumulative    Cumulative share
        1-1                       2,928         2,928                0.26
        1-2                       1,535         4,463                0.40
        1-3                       6,813        11,276                1.00
        Unique Amharic Corpus                  10,487                0.93
        Unique English Corpus                  11,276                1.00

  Table 5: Automatically Aligned English-Amharic ENA news items
Parliamentary English-Amharic parallel
        proclamation corpus
• Proclamation coverage: August 21, 1995 - July 16, 2010

   Collected            Unit            Amharic    English      Total
   Counts of Raw        Documents           632        632      1,264
   Counts of Aligned    Documents           115        115        230
                        Sentences        19,115     25,730     44,845
                        Words           219,430    283,578    503,008
                        Vocabularies     32,299     17,908     50,207

       Table 6: Aligned Parliamentary English-Amharic
                parallel proclamation corpus
Sentence level aligned English-Amharic
           parallel corpora
• The alignment process is the same for the ENA
  news items and the proclamations.
• The alignment is done using a sentence aligner called
  Hunalign (similar to Gale and Church, 1993).
• Hunalign aligns bilingual text using sentence length.
• An English-Amharic bilingual word list of 8,212 entries
  was adopted and used as the dictionary
  (Armbruster, 2007).
• The aligner maps each English sentence to Amharic in a
  0-1, 1-1, or 1-2 pattern.
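
The idea of length-based alignment with 0-1, 1-1 and 1-2 patterns can be shown with a simplified dynamic-programming sketch in Python. This is not Hunalign's actual algorithm and not the Gale and Church probability model: the cost is simply the absolute difference in character length plus a fixed penalty per pattern, and both the penalty values and the function name `align_by_length` are assumptions chosen for illustration.

```python
def align_by_length(en, am, penalties=None):
    """Align English and Amharic sentence lists by character length.

    Allowed patterns (English count, Amharic count): 0-1, 1-1, 1-2.
    Returns a list of (english_sentences, amharic_sentences) pairs.
    """
    if penalties is None:
        # Hypothetical penalties: 1-1 is free, 1-2 slightly penalized,
        # 0-1 (an unmatched Amharic sentence) strongly penalized.
        penalties = {(0, 1): 10.0, (1, 1): 0.0, (1, 2): 1.0}
    inf = float("inf")
    n, m = len(en), len(am)
    # best[i][j] = minimal cost of aligning en[:i] with am[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in penalties.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                en_len = sum(len(s) for s in en[i:ni])
                am_len = sum(len(s) for s in am[j:nj])
                cost = best[i][j] + abs(en_len - am_len) + penalty
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (di, dj)
    if best[n][m] == inf:
        raise ValueError("no alignment exists with 0-1, 1-1 and 1-2 patterns")
    # Walk the back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        pairs.append((en[i - di:i], am[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Raw character counts are a rough proxy here, because Amharic is written in a syllabic script and tends to use fewer characters than English for the same content. A practical aligner would scale the lengths by an expected ratio and add dictionary evidence.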
Sentence level aligned English-Amharic
           parallel corpora
• Result of the sentence-level alignment for
  both the ENA news and the proclamations

      Aligned Sentence Pairs            Counts
      ENA Corpus                       155,200
      Proclamation Corpus               18,632
      Total                            173,832

    Table 7: Sentence aligned English-Amharic bilingual corpus
Way Forward


• To enlarge the English-Amharic
  proclamation corpus as much as possible.
• To further analyze the experiments conducted so far.
• To increase the translation quality using
  morpho-syntactic linguistic knowledge.
Thank You!!!

Mais conteúdo relacionado

Destaque

Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
Lifeng (Aaron) Han
 
Google translate (new russian)
Google translate (new russian)Google translate (new russian)
Google translate (new russian)
Nurbek Matzhani
 
8 Google Translate
8 Google Translate8 Google Translate
8 Google Translate
aptwano
 
Techniques in Translation
Techniques in TranslationTechniques in Translation
Techniques in Translation
juvelle villafania
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
Rushdi Shams
 
5 Best Powerpoint Templates Amazing Creative Presentation Themes
5 Best Powerpoint Templates   Amazing Creative Presentation Themes5 Best Powerpoint Templates   Amazing Creative Presentation Themes
5 Best Powerpoint Templates Amazing Creative Presentation Themes
Yeasir Arafat
 

Destaque (16)

Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Google translate (new russian)
Google translate (new russian)Google translate (new russian)
Google translate (new russian)
 
8 Google Translate
8 Google Translate8 Google Translate
8 Google Translate
 
Google Translate in the Classroom
Google Translate in the ClassroomGoogle Translate in the Classroom
Google Translate in the Classroom
 
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clustering
 
Google Translate Update
Google Translate UpdateGoogle Translate Update
Google Translate Update
 
Google translate
Google translateGoogle translate
Google translate
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
Machine Translation=Google Translator
Machine Translation=Google TranslatorMachine Translation=Google Translator
Machine Translation=Google Translator
 
Machine Translation Introduction
Machine Translation IntroductionMachine Translation Introduction
Machine Translation Introduction
 
Techniques in Translation
Techniques in TranslationTechniques in Translation
Techniques in Translation
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
 
MyGengo.com State Of Global Translation Industry (2009)
MyGengo.com State Of Global Translation Industry (2009)MyGengo.com State Of Global Translation Industry (2009)
MyGengo.com State Of Global Translation Industry (2009)
 
Slideshare
SlideshareSlideshare
Slideshare
 
5 Best Powerpoint Templates Amazing Creative Presentation Themes
5 Best Powerpoint Templates   Amazing Creative Presentation Themes5 Best Powerpoint Templates   Amazing Creative Presentation Themes
5 Best Powerpoint Templates Amazing Creative Presentation Themes
 

Semelhante a Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)

LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
Lifeng (Aaron) Han
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
Lifeng (Aaron) Han
 

Semelhante a Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT) (20)

SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptx
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine Translation
 
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration ExtractionEnriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
 
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert System
 
English kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationEnglish kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translation
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifying
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Searching for the Best Machine Translation Combination
Searching for the Best Machine Translation CombinationSearching for the Best Machine Translation Combination
Searching for the Best Machine Translation Combination
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsHybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
 
“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdf
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATIONAN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
 
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTAMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
 
Jq3616701679
Jq3616701679Jq3616701679
Jq3616701679
 

Mais de Guy De Pauw

The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
Guy De Pauw
 

Mais de Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation System
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)

  • 1. Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT) Mulu Gebreegziabher Addis Ababa, Ethiopia: IT Doctoral Program, Addis Ababa University Prof. Laurent Besacier Grenoble, France: University Joseph Fourier Dr. Girma Taye & Dr. Dereje Teferi Addis Ababa, Ethiopia: Addis Ababa University December 2, 2011
  • 2. Presentation Outline • Introduction • Objectives • Experiment on the English-Amharic bilingual corpus • ENA English-Amharic parallel news corpus • Parliamentary English-Amharic parallel proclamation corpus • Sentence level aligned English-Amharic parallel corpora • Way Forward
  • 3. Introduction MT is the application of computers to translate text from one natural language to another. Machine Translation Systems Machine Assisted Fully Automated Translation Translation Human Aided Machine Aided Rule-based Empirical Systems Systems Translation Translation Statistical Machine Example-based Translation Translation
  • 4. Introduction Contd… • SMT systems are data driven that rely on bilingual parallel aligned corpus. • The performance of a SMT systems depends on the size of the available training corpus. • The larger the corpus, the better is the performance of the SMT system. • To develop EASMT, parallel data has to be collected from English-Amharic bilingual sentence pairs. • The experiment is to be conducted on at least a corpus of size 2M word pairs (40K sentence pairs).
  • 5. Introduction Contd… English-Amharic Statistical Machine Translation (EASMT) • Translation between two disparate languages Amharic English Language Family Afro-Asiatic Indo-European Morphology Complex Less inflected Syntactic Structure SOV SVO Writing System Geez Alphabet Latin Letters
  • 6. Introduction Contd… Parallel Corpus • Parallel corpus is a collection of text paired with translations into another language. • The experiment is conducted on training corpus of both languages based on expressions that are found in parallel Amharic-English news, parliamentary and constitutional documents. • The parallel ENA news contains sentences of day-to- day usage: – Direct translations of each other – Indirect translations written on the same topic in different languages called comparable corpora.
  • 7. Objectives The objective of the research is to study and develop an English-Amharic Statistical Machine Translation (EASMT) system and to improve the translation quality by integrating linguistic knowledge into the system.
  • 8. Experiment on the English-Amharic bilingual corpus Mining the parallel corpus • There are five steps to process a bilingual text corpus used for SMT system. (by Besacier et.al, 2009): – Raw data collection: proclamation and parallel news corpora have been collected – Document alignment: manual & automatic – Tokenization: splitting and trimming – Sentence splitting: done using the punct. [?!. ፡፡ ] – Sentence alignment: almost completed
  • 9. ENA English-Amharic parallel news corpus
      • News coverage: Aug 21, 2006 - January 06, 2008

        Language          News Corpus        Counts    Total
        Amharic           Domestic           10,116    23,771
                          Regional           13,655
        English           Foreign Language   11,276    11,276
        Amharic-English   Monitoring            494     3,610
                          Information         3,116

      Table 1: ENA news corpus
  • 10. ENA English-Amharic parallel news corpus
      • Count Summary: ENA news corpus

        Collected                           Amharic      English      Total
        Counts of Raw       Documents        23,771       11,276     35,047
                            Sentences       322,673      212,050    534,723
                            Words         5,277,711    3,704,644  8,982,355
                            Vocabularies    270,786      130,803    401,589
        Counts of Aligned   Documents         1,036        1,036      2,072
                            Sentences        26,112       25,834     51,946
                            Words           207,200      198,461    405,661
                            Vocabularies     36,519       17,987     54,506

      Table 2: The status of English-Amharic parallel news corpus on May 25, 2011
  • 11. ENA English-Amharic parallel news corpus
      • Manual alignment at document level: challenges
          – Easy: preprocessing, including exporting from the SQL database to Word and converting to Unicode using the Zilla word to text converter
          – Time consuming: aligning at document level is difficult because the files are stored in different folders with no structure, and the dates, punctuation and heading information differ (parallel/comparable corpus)
          – Document level alignment is done by looking at the heading and picking the news id from the folders
  • 12. ENA English-Amharic parallel news corpus
      • Automatically aligned English-Amharic sample ENA news corpora at document level
      • The aligner takes the following into consideration when aligning the news items:
          – Start from the English corpus (which constitutes 32% of the documents).
          – Match only news items whose story language differs.
          – Limit the search to neighboring Amharic files, looking 80 files around the current file.
          – Use a scoring method that gives equal weight to all matching columns.
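The aligner's rules can be sketched as below. This is an illustrative reading of the slide, not the authors' implementation. Several details are assumptions: files are ordered by news id, "80 files around" is taken as 80 neighbors on each side, and the metadata columns (such as `date` or `place`) are hypothetical. The names `NewsItem`, `score` and `align_documents` are also invented for the example.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class NewsItem:
    news_id: int
    language: str                                # e.g. "en" or "am"
    fields: dict = field(default_factory=dict)   # hypothetical metadata columns


def score(english, amharic, columns):
    """Equal-weight score: one point for every column whose values agree."""
    return sum(
        1
        for column in columns
        if english.fields.get(column) is not None
        and english.fields.get(column) == amharic.fields.get(column)
    )


def align_documents(english_items, amharic_items, columns, window=80):
    """Map each English news id to its best-scoring neighboring Amharic id."""
    # Assumption: "neighboring" means close in news-id order.
    amharic_sorted = sorted(amharic_items, key=lambda item: item.news_id)
    ids = [item.news_id for item in amharic_sorted]
    alignments = {}
    for english in english_items:            # start from the English corpus
        position = bisect.bisect_left(ids, english.news_id)
        low = max(0, position - window)      # `window` files before ...
        high = min(len(ids), position + window)  # ... and after
        best, best_score = None, 0
        for amharic in amharic_sorted[low:high]:
            if amharic.language == english.language:
                continue                     # story languages must differ
            current = score(english, amharic, columns)
            if current > best_score:
                best, best_score = amharic, current
        if best is not None:
            alignments[english.news_id] = best.news_id
    return alignments
```

Restricting the search to a window keeps the comparison count linear in the corpus size, at the cost of missing translations filed far from their source.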
  • 13. ENA English-Amharic parallel news corpus
      • The output of the automatic aligner.

        Aligned Corpus           Counts   Cumulative   %
        1-1                         383          383   0.37
        1-2                         155          538   0.52
        1-3                         498        1,036   1.00
        Total Exact Matches         880                0.85
        Unique Amharic Corpus       968                0.93
        Unique English Corpus     1,036                1.00

      Table 4: Automatically Aligned English-Amharic Sample ENA news items
  • 14. ENA English-Amharic parallel news corpus
      • Some of the sample English documents were better aligned with a document not seen during manual alignment, e.g.
          – 41827 was aligned with 41791 (manual alignment: 41827 with 41826)
      • 85% of the automatic matches are identical to the manual alignment.
      • Thus, the remaining 15% are new matches, which do not indicate an error.
      Table ENA: Aligned Sample English/Amharic News corpus
  • 15. ENA English-Amharic parallel news corpus
      • Extended to automatically align the whole set of English-Amharic ENA news items

        Aligned Corpus           Counts   Cumulative   %
        1-1                       2,928        2,928   0.26
        1-2                       1,535        4,463   0.40
        1-3                       6,813       11,276   1.00
        Unique Amharic Corpus    10,487                0.93
        Unique English Corpus    11,276                1.00

      Table 5: Automatically Aligned English-Amharic ENA news items
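The "Cumulative" and "%" columns of the aligner tables follow directly from the per-category counts; the "%" column holds proportions of the total rather than percentages. A small sketch of that arithmetic, with the hypothetical helper name `cumulative_shares`:

```python
from itertools import accumulate


def cumulative_shares(counts):
    """Return (count, running total, running total / grand total) per row."""
    total = sum(counts)
    running = accumulate(counts)
    return [(count, cum, round(cum / total, 2)) for count, cum in zip(counts, running)]


# Counts for the 1-1, 1-2 and 1-3 rows of Table 5.
for row in cumulative_shares([2928, 1535, 6813]):
    print(row)
```

Run on the Table 5 counts, this reproduces the running totals 2,928, 4,463 and 11,276 and the proportions 0.26, 0.40 and 1.00.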
  • 16. Parliamentary English-Amharic parallel proclamation corpus
      • Proclamation coverage: Aug 21, 1995 - July 16, 2010

        Collected                           Amharic    English    Total
        Counts of Raw       Documents           632        632    1,264
        Counts of Aligned   Documents           115        115      230
                            Sentences        19,115     25,730   44,845
                            Words           219,430    283,578  503,008
                            Vocabularies     32,299     17,908   50,207

      Table 6: Aligned Parliamentary English-Amharic parallel proclamation corpus
  • 17. Sentence level aligned English-Amharic parallel corpora
      • The alignment process is the same for the ENA news items and the proclamations.
      • The alignment is done using a sentence aligner called Hunalign (similar to Gale and Church, 1993).
      • Hunalign aligns bilingual text using sentence length.
      • An English-Amharic bilingual dictionary, a word list of 8,212 entries, has been adopted and used (Armbruster, 2007).
      • The aligner aligns an English sentence to Amharic as 0-1, 1-1 or 1-2.
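To illustrate the idea of length-based sentence alignment with 0-1, 1-1 and 1-2 beads, here is a simplified dynamic-programming sketch. It is not Hunalign's implementation and does not use the Gale and Church probabilistic model. It swaps in a plain relative-length cost, and the penalties `skip_penalty` and `merge_penalty` are arbitrary illustrative values. The names `length_cost` and `align_by_length` are hypothetical.

```python
def length_cost(a, b):
    """Relative length difference in [0, 1]; 0 means equal lengths."""
    return abs(a - b) / max(a + b, 1)


def align_by_length(src, tgt, skip_penalty=0.8, merge_penalty=0.1):
    """Align source to target sentences using character lengths only.

    Allowed beads: 0-1 (target sentence left unaligned), 1-1 and 1-2.
    Returns a list of (source indices, target indices) pairs.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if j >= 1:  # 0-1 bead
                candidates.append((cost[i][j - 1] + skip_penalty, (0, 1)))
            if i >= 1 and j >= 1:  # 1-1 bead
                step = length_cost(len(src[i - 1]), len(tgt[j - 1]))
                candidates.append((cost[i - 1][j - 1] + step, (1, 1)))
            if i >= 1 and j >= 2:  # 1-2 bead
                merged = len(tgt[j - 2]) + len(tgt[j - 1])
                step = length_cost(len(src[i - 1]), merged) + merge_penalty
                candidates.append((cost[i - 1][j - 2] + step, (1, 2)))
            if candidates:
                cost[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with 0-1, 1-1 and 1-2 beads")
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):  # walk the back-pointers from the end
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

A real aligner would combine such length evidence with dictionary matches, which is where a bilingual word list like the one cited on the slide comes in.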
  • 18. Sentence level aligned English-Amharic parallel corpora
      • The result of the alignment at the sentence level for both the ENA news and the proclamations

        Aligned Sentence pairs    Counts
        ENA Corpus               155,200
        Proclamation Corpus       18,632
        Total                    173,832

      Table 7: Sentence aligned English-Amharic bilingual corpus
  • 19. Way Forward
      • To increase the size of the English-Amharic proclamation corpus as much as possible.
      • To further analyze the experiment conducted so far.
      • To improve the translation quality using morpho-syntactic linguistic knowledge.