Enriching Transliteration Lexicon Using
Automatic Transliteration Extraction
Sarvnaz Karimi
School of Computer Science and IT
RMIT University
Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin
Keywords: Transliteration, Parallel Corpus
Machine Transliteration
• Machine transliteration transforms a word from a source language to a
target language while preserving its pronunciation.
• Machine translation, cross-lingual information retrieval and cross-lingual
question answering are the main areas in which automatic transliteration is
applicable.
• Transliteration has been studied in two major areas: transliteration
generation and transliteration extraction.
• Transliteration generation takes an input word in the source language
(e.g. Sydney in English) and generates its transliteration in the target
language (e.g. the Persian-script spelling of Sydney).
• Transliteration extraction discovers transliteration pairs (e.g. Sydney
paired with its Persian-script spelling) in bilingual texts.
Transliteration Extraction So Far!
Transliteration discovery methods in the literature consider:
• Extraction from parallel corpora:
– Statistical methods are beneficial, particularly because the
sentences/words can be aligned.
– Yet parallel corpora are hard to find for many less-computerised
languages.
• Extraction from comparable corpora:
– More evidence than statistical information alone is required to extract
pairs (e.g. temporal information, phonetic information or Web counts).
– Comparable corpora are easier to construct and find than parallel ones.
Most studies use a named entity (NE) recogniser to separate proper nouns, which
are subject to transliteration, from other words.
Persian and English Transliteration
• Transliteration generation has been studied using n-gram based and
consonant-vowel based approaches.
• Transliteration extraction has not previously been studied for this language
pair, mainly due to the lack of any parallel or comparable corpus.
• Elsewhere, transliteration extraction has been studied using co-occurrence,
temporal, edit distance or phonetic similarity measures. We aim to apply our
transliteration generation methods as a basis for this task.
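As an illustration of the edit-distance evidence mentioned above, here is a minimal sketch of Levenshtein distance and a length-normalised similarity. The function names, the normalisation, and the romanised example word are assumptions for illustration; the slides do not say which edit-distance variant earlier studies used, and comparing English with Persian script would first need some romanisation or character mapping.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# "sidni" is a hypothetical romanisation of a Persian candidate.
print(levenshtein("sydney", "sidni"), similarity("sydney", "sidni"))
```

A candidate pair would then be kept only if its similarity exceeds some threshold, which would have to be tuned on held-out data.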
Proposed Method: Application of Transliteration
Generation in Extraction
1. For each document in each language, we pre-process the text to
generate a bag of words (tokenise) and remove stop-words.
2. Each word in the source language is matched against a dictionary;
if it is not found, it is an out-of-dictionary word that needs
transliteration in the target document.
3. A ranked list of possible transliterations for each source
word is generated by the transliteration system.
4. Transliterations that match words in the target document
are considered potential pairs.
5. A score can be given to these pairs based on the rank of the
transliteration and the number of times they are paired.
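The five steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `transliterate` is a hypothetical stand-in for the generation system, the stop-word list is a toy one, tokenisation is plain whitespace splitting, and the score (a sum of reciprocal ranks over all documents in which a pair is found) is one possible way to combine rank and pairing frequency, since the slides do not give a formula.

```python
from collections import defaultdict

STOP_WORDS = {"the", "in", "of", "a", "and"}  # hypothetical stop-word list


def bag_of_words(text, stop_words=frozenset()):
    """Step 1: tokenise on whitespace, lowercase, drop stop-words."""
    return {tok for tok in text.lower().split() if tok not in stop_words}


def extract_pairs(doc_pairs, dictionary, transliterate, top_n=5):
    """Steps 2-5 over (source_text, target_text) document pairs.

    transliterate(word) stands in for the generation system and must
    return candidate target words ordered best-first.
    """
    scores = defaultdict(float)
    for source_text, target_text in doc_pairs:
        source_words = bag_of_words(source_text, STOP_WORDS)
        target_words = bag_of_words(target_text)
        # Step 2: out-of-dictionary source words need transliteration.
        for word in sorted(source_words - dictionary):
            # Step 3: ranked candidate list from the generation system.
            candidates = transliterate(word)[:top_n]
            for rank, candidate in enumerate(candidates, start=1):
                # Step 4: keep candidates that occur in the target document.
                if candidate in target_words:
                    # Step 5 (assumed formula): higher ranks score more,
                    # and repeated pairings accumulate.
                    scores[(word, candidate)] += 1.0 / rank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage with romanised stand-ins for Persian words (illustrative only).
toy_candidates = {"sydney": ["sidney", "sidni"]}
pairs = extract_pairs(
    [("the flood in sydney", "seyl dar sidni")],
    dictionary={"flood"},
    transliterate=lambda w: toy_candidates.get(w, []),
)
print(pairs)  # [(('sydney', 'sidni'), 0.5)]
```

Pairs whose accumulated score is high enough could then be added to the transliteration lexicon; the threshold is left open here.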
Experimental Setup
• An English-Persian comparable corpus of news texts was constructed,
consisting of 3,474 documents.
• An English machine-readable dictionary containing 120,177 entries was
used.
• Experiments:
– Accuracy of the extracted transliterations (fixed training collection).
Three matching methods were tried: (1) English documents are
parsed to extract their out-of-dictionary words using dictionary look-up
and stemming; (2) Persian documents are parsed so that, in words
containing allophone characters, those characters are mapped to one unique
character; (3) the previous experiment is repeated, adding knowledge of
capital letters.
– Impact of the seed transliteration lexicon.
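The first two matching methods can be sketched as below. Everything specific here is an assumption for illustration: the slides do not list which Persian characters were merged or which stemmer was used. The letter groups are sets of Persian letters that share a pronunciation, each collapsed to its first member, and the suffix list is a deliberately crude stand-in for a real stemmer.

```python
# Hypothetical groups of Persian letters that share a sound; every letter
# in a group is rewritten to the first one.
SAME_SOUND_GROUPS = [
    ["\N{ARABIC LETTER SEEN}", "\N{ARABIC LETTER SAD}",
     "\N{ARABIC LETTER THEH}"],                            # /s/
    ["\N{ARABIC LETTER ZAIN}", "\N{ARABIC LETTER THAL}",
     "\N{ARABIC LETTER DAD}", "\N{ARABIC LETTER ZAH}"],    # /z/
    ["\N{ARABIC LETTER TEH}", "\N{ARABIC LETTER TAH}"],    # /t/
    ["\N{ARABIC LETTER HEH}", "\N{ARABIC LETTER HAH}"],    # /h/
    ["\N{ARABIC LETTER QAF}", "\N{ARABIC LETTER GHAIN}"],  # merged in Iranian Persian
]

# Translation table: non-canonical letter -> canonical letter of its group.
NORMALISE = str.maketrans({
    letter: group[0]
    for group in SAME_SOUND_GROUPS
    for letter in group[1:]
})


def normalise_persian(word: str) -> str:
    """Method 2: collapse same-sound letters to one unique character."""
    return word.translate(NORMALISE)


SUFFIXES = ("ing", "ed", "es", "s")  # crude, hypothetical suffix stripping


def is_out_of_dictionary(word: str, dictionary: set) -> bool:
    """Method 1: look the word up directly, then after stripping a suffix."""
    lower = word.lower()
    if lower in dictionary:
        return False
    return not any(
        lower.endswith(suffix) and lower[: -len(suffix)] in dictionary
        for suffix in SUFFIXES
    )
```

Applying `normalise_persian` to both the generated candidates and the target-document words means that two spellings differing only in same-sound letters still match.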
Experiments and Results
Experiment 1: Accuracy
Method  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
1       4.6     3.6 (81.3)  5.8       68.9      8.5  11.3  1662  2860      70.3
2       4.1     3.6 (90.2)  5.9       69.7      8.5  11.3  1641  2496      80.4
3       6.6     5.9 (89.2)  6.9       66.8      8.4  22.6  1725  3694      75.2
Experiment 2: Train Size
Train   #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
200     2.6     2.2 (86.5)  2.8       84.2      9.1  24.0  1322  1287      78.2
300     3.1     2.5 (82.5)  4.2       71.9      8.8  23.5  1494  1579      73.3
400     3.1     2.6 (84.5)  4.6       71.3      8.8  23.4  1483  1569      74.1
500     2.8     2.5 (89.5)  4.3       78.0      8.9  23.5  1459  1507      79.0
Conclusions and Further Work
• Transliteration extraction can help in automatically generating
transliteration lexicons.
• A transliteration lexicon, as a dictionary of transliterations of proper nouns
and technical terms that are not translated, is beneficial in dictionary-based
machine translation applications.
• We investigated a method that uses current, incomplete
transliteration lexicons to enrich them from comparable corpora.
• In future, the role of an NE recogniser will be investigated and compared
with simple dictionary look-up.
