Statistical machine translation in a few slides
Mikel L. Forcada1,2
1Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant,
E-03071 Alacant (Spain)
2Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
April 14-16, 2009: Free/open-source MT tutorial at the CNGL
Contents
1 Translation as probability
2 “Decoding”
3 Training
4 “Log-linear”
5 Ain’t got nothin’ but the BLEUs?
6 The SMT lifecycle
The “canonical” model
Translation as probability/1
Instead of saying that a source-language (SL) sentence s in an SL text and a target-language (TL) sentence t found in an SL–TL bitext are or are not a translation of each other,
in SMT one says that they are a translation of each other with a probability p(s, t) = p(t, s) (a joint probability).
We'll assume we have such a probability model available,
or at least a reasonable estimate of it.
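To make the joint-probability view concrete, here is a minimal, purely illustrative sketch: it estimates p(s, t) by relative frequency over a tiny invented bitext. Real systems are trained on large corpora and never store p(s, t) as an explicit table.

```python
from collections import Counter

# Toy sentence-aligned bitext (invented examples, for illustration only).
bitext = [
    ("la casa", "the house"),
    ("la casa", "the house"),
    ("la casa blanca", "the white house"),
    ("el gato", "the cat"),
]

counts = Counter(bitext)
total = sum(counts.values())

def p_joint(s, t):
    """Relative-frequency estimate of the joint probability p(s, t)."""
    return counts[(s, t)] / total

print(p_joint("la casa", "the house"))   # 0.5
print(p_joint("el gato", "the house"))   # 0.0 (never seen together)
```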
Translation as probability/2
According to basic probability laws, we can write:
p(s, t) = p(t, s) = p(s|t)p(t) = p(t|s)p(s) (1)
where p(x|y) is the conditional probability of x given y.
We are interested in translating from SL to TL. That is, we
want to find the most likely translation given the SL
sentence s:
t^* = \arg\max_t p(t|s)    (2)
The “canonical” model
We can rewrite eq. (1) as
p(t|s) = \frac{p(s|t) p(t)}{p(s)}    (3)
and then combine it with (2), dropping the denominator p(s) (which does not depend on t), to get
t^* = \arg\max_t p(s|t) p(t)    (4)
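A minimal sketch of what eq. (4) asks for, assuming a reverse translation model p(s|t) and a language model p(t) are already available as plain functions, and assuming a finite toy set of candidate translations; the candidates and both toy models are invented for illustration. Real systems never enumerate candidates like this: that is precisely what decoding (next slides) is about.

```python
import math

# Toy models, invented for illustration: probabilities for a single SL sentence.
def p_reverse(s, t):
    """Toy reverse translation model p(s|t)."""
    table = {
        ("la casa blanca", "the white house"): 0.6,
        ("la casa blanca", "the house white"): 0.6,  # same words: this model can't tell word order
        ("la casa blanca", "white the house"): 0.1,
    }
    return table.get((s, t), 1e-9)

def p_lm(t):
    """Toy target-language model p(t): prefers fluent English."""
    table = {
        "the white house": 0.02,
        "the house white": 0.001,
        "white the house": 0.0001,
    }
    return table.get(t, 1e-9)

def best_translation(s, candidates):
    """arg max_t p(s|t) * p(t), computed in log space for numerical safety."""
    return max(candidates, key=lambda t: math.log(p_reverse(s, t)) + math.log(p_lm(t)))

s = "la casa blanca"
candidates = ["the white house", "the house white", "white the house"]
print(best_translation(s, candidates))  # -> "the white house": the LM breaks the tie
```

Note how the language model p(t) settles the word-order tie that the reverse model alone cannot: this is the adequacy/fluency balance mentioned on the next slide.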
“Decoding”/1
t^* = \arg\max_t p(s|t) p(t)
We have a product of two probability models:
A reverse translation model p(s|t) which tells us how likely
the SL sentence s is a translation of the candidate TL
sentence t, and
a target-language model p(t) which tells us how likely the
sentence t is in the TL side of bitexts.
These may be related (respectively) to the usual notions of
[reverse] adequacy: how much of the meaning of t is conveyed by s, and
fluency: how fluent the candidate TL sentence t is.
The arg max strikes a balance between the two.
“Decoding”/2
In SMT parlance, the process of finding t^* is called decoding.1
Obviously, it does not explore all possible translations t in the search space: there are infinitely many.
The search space is pruned.
Therefore, one just gets a reasonable t instead of the ideal t^*.
Pruning and search strategies are a very active research topic.
Free/open-source software: Moses.
1 Reading SMT articles usually entails deciphering jargon that may be very obscure to outsiders or newcomers.
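To give a flavour of what "pruning the search space" means, here is a deliberately simplified, monotone (no reordering) phrase-based decoder sketch with beam pruning. The toy phrase table and scoring are invented for illustration and bear only a family resemblance to what Moses actually does.

```python
import math

# Toy phrase table: SL phrase -> list of (TL phrase, log probability). Invented values.
phrase_table = {
    ("la",): [("the", math.log(0.9))],
    ("casa",): [("house", math.log(0.8)), ("home", math.log(0.2))],
    ("la", "casa"): [("the house", math.log(0.7))],
    ("blanca",): [("white", math.log(0.9))],
}

def decode(sl_words, beam_size=2):
    """Monotone beam search over partial hypotheses (SL position covered, TL output, log score)."""
    hyps = [(0, "", 0.0)]
    while any(pos < len(sl_words) for pos, _, _ in hyps):
        new_hyps = []
        for pos, out, score in hyps:
            if pos == len(sl_words):           # already finished: carry it along
                new_hyps.append((pos, out, score))
                continue
            # Expand with every SL phrase starting at 'pos' that is in the phrase table.
            for end in range(pos + 1, len(sl_words) + 1):
                sl_phrase = tuple(sl_words[pos:end])
                for tl_phrase, logp in phrase_table.get(sl_phrase, []):
                    new_hyps.append((end, (out + " " + tl_phrase).strip(), score + logp))
        # Pruning: keep only the beam_size best hypotheses.
        hyps = sorted(new_hyps, key=lambda h: h[2], reverse=True)[:beam_size]
    return max(hyps, key=lambda h: h[2])[1]

print(decode(["la", "casa", "blanca"]))  # -> "the house white" (monotone: no reordering allowed)
```

Because the sketch forbids reordering and prunes aggressively, it returns a merely reasonable t ("the house white") rather than the ideal t^*, which is exactly the point made above.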
Training/1
So where do these probabilities come from?
p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm).
The estimation of p(s|t) is more complex. It is usually made up of
a lexical model describing the probability that the translation of a certain TL word or sequence of words ("phrase"2) is a certain SL word or sequence of words, and
an alignment model describing the reordering of words or "phrases".
2 A very unfortunate choice in SMT jargon.
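As a rough illustration of the first point, here is a minimal sketch of estimating p(t) as a bigram language model from a tiny monolingual TL corpus; the corpus and the add-one smoothing are deliberately naive stand-ins, whereas tools such as irstlm implement far better-behaved estimators.

```python
import math
from collections import Counter

# Tiny monolingual TL corpus, invented for illustration.
corpus = [
    "the house is white",
    "the cat is in the house",
    "the white house",
]

bigrams, unigrams = Counter(), Counter()
vocab = set()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab.update(words)
    unigrams.update(words[:-1])                    # history counts
    bigrams.update(zip(words[:-1], words[1:]))     # bigram counts

def log_p_t(sentence):
    """Add-one-smoothed bigram estimate of log p(t)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(u, v)] + 1) / (unigrams[u] + len(vocab)))
        for u, v in zip(words[:-1], words[1:])
    )

print(log_p_t("the white house"))   # relatively high: its bigrams were seen in the corpus
print(log_p_t("house white the"))   # lower: unseen bigrams
```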
Training/2
The lexical model and the alignment model are estimated
using a large sentence-aligned bilingual corpus through a
complex iterative process.
An initial set of lexical probabilities is obtained by
assuming, for instance, that any word in the TL sentence
aligns with any word in its SL counterpart. And then:
Alignment probabilities are computed in accordance with the lexical probabilities.
Lexical probabilities are re-estimated in accordance with the alignment probabilities.
This process (“expectation maximization”) is repeated a
fixed number of times or until some convergence is
observed (free/open-source software: Giza++).
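A minimal sketch of this kind of iterative re-estimation, loosely following IBM Model 1 (lexical translation probabilities only, uniform alignment, no NULL word). The three-pair toy bitext is invented for illustration; real tools such as Giza++ train a cascade of far richer models.

```python
from collections import defaultdict

# Toy sentence-aligned bitext as (TL sentence, SL sentence), invented for illustration.
bitext = [
    ("the house", "la casa"),
    ("the book", "el libro"),
    ("a book", "un libro"),
]
pairs = [(t.split(), s.split()) for t, s in bitext]

# Initialisation: every SL word may translate any co-occurring TL word, uniformly
# (unnormalised; the first M step normalises it).
t_prob = defaultdict(lambda: 1.0)   # t_prob[(sl_word, tl_word)] ~ p(sl_word | tl_word)

for iteration in range(10):
    counts = defaultdict(float)     # expected counts c(sl_word, tl_word)
    totals = defaultdict(float)     # expected counts c(tl_word)
    for tl_words, sl_words in pairs:
        for sl in sl_words:
            # E step: distribute each SL word's "mass" over the TL words it may align to.
            norm = sum(t_prob[(sl, tl)] for tl in tl_words)
            for tl in tl_words:
                frac = t_prob[(sl, tl)] / norm
                counts[(sl, tl)] += frac
                totals[tl] += frac
    # M step: re-estimate the lexical probabilities from the expected counts.
    for (sl, tl), c in counts.items():
        t_prob[(sl, tl)] = c / totals[tl]

for (sl, tl), p in sorted(t_prob.items()):
    print(f"p({sl} | {tl}) = {p:.2f}")
# After a few iterations p(libro | book) grows at the expense of p(el | book) and
# p(un | book), because libro is the only SL word co-occurring with "book" in two
# different sentence pairs; words seen only once (la vs. casa for "house") stay at 0.5.
```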
Training/3
In “phrase-based” SMT, alignments may be used to extract
(SL-phrase, TL-phrase) pairs of phrases
and their corresponding probabilities
for easier decoding and to avoid “word salad”.
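A minimal sketch of one common formulation of phrase-pair extraction: a (SL-phrase, TL-phrase) box is kept if it contains at least one alignment link and no link crosses its boundary. The example sentence pair and alignment are invented for illustration; phrase translation probabilities would then be estimated from relative frequencies of the extracted pairs over the whole corpus.

```python
# Toy aligned sentence pair, invented for illustration.
sl = ["la", "casa", "blanca"]
tl = ["the", "white", "house"]
# Word alignment as (SL index, TL index) links: la-the, casa-house, blanca-white.
alignment = {(0, 0), (1, 2), (2, 1)}

def extract_phrases(sl, tl, alignment, max_len=3):
    """Extract all phrase pairs consistent with the word alignment."""
    pairs = []
    for i1 in range(len(sl)):
        for i2 in range(i1, min(i1 + max_len, len(sl))):
            for j1 in range(len(tl)):
                for j2 in range(j1, min(j1 + max_len, len(tl))):
                    inside = [(i, j) for i, j in alignment
                              if i1 <= i <= i2 and j1 <= j <= j2]
                    crossing = [(i, j) for i, j in alignment
                                if (i1 <= i <= i2) != (j1 <= j <= j2)]
                    # Consistent: at least one link inside, no link crossing the box boundary.
                    if inside and not crossing:
                        pairs.append((" ".join(sl[i1:i2 + 1]), " ".join(tl[j1:j2 + 1])))
    return pairs

for sl_phrase, tl_phrase in extract_phrases(sl, tl, alignment):
    print(f"{sl_phrase} ||| {tl_phrase}")
# la ||| the, la casa blanca ||| the white house, casa ||| house,
# casa blanca ||| white house, blanca ||| white
```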
“Log-linear”/1
More SMT jargon!
It's short for a linear combination of logarithms of probabilities.
And, sometimes, even of features that aren't logarithms or probabilities of any kind.
OK, let’s take a look at the maths.
“Log-linear”/2
One can write a more general formula:
p(t|s) = \frac{\exp\left( \sum_{k=1}^{n_F} \lambda_k f_k(t, s) \right)}{Z}    (5)
with n_F feature functions f_k(t, s), which can depend on s, t, or both.
Setting n_F = 2, λ_1 = λ_2 = 1, f_1(t, s) = log p(s|t), f_2(t, s) = log p(t), and Z = p(s), one recovers the canonical formula (3).
The best translation is then
t^* = \arg\max_t \sum_{k=1}^{n_F} \lambda_k f_k(t, s)    (6)
Most of the f_k(t, s) are logarithms, hence "log-linear".
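A minimal sketch of eq. (6), assuming the feature functions and weights are given. The two first toy features mirror the canonical model (log p(s|t) and log p(t)); the word-penalty feature, the candidate list, the values and the weights are all invented for illustration.

```python
import math

# Toy feature functions f_k(t, s) and weights λ_k, invented for illustration.
def f_reverse(t, s):   # log p(s|t), toy values
    return {"the white house": math.log(0.6), "the house white": math.log(0.6)}.get(t, -20.0)

def f_lm(t, s):        # log p(t), toy values
    return {"the white house": math.log(0.02), "the house white": math.log(0.001)}.get(t, -20.0)

def f_word_penalty(t, s):   # not a log-probability at all: just -(number of TL words)
    return -len(t.split())

features = [f_reverse, f_lm, f_word_penalty]
weights = [1.0, 1.0, 0.5]   # λ_k, e.g. as produced by tuning

def score(t, s):
    """Log-linear score: sum over k of λ_k f_k(t, s)."""
    return sum(w * f(t, s) for w, f in zip(weights, features))

s = "la casa blanca"
candidates = ["the white house", "the house white"]
print(max(candidates, key=lambda t: score(t, s)))   # -> "the white house"
```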
“Log-linear”/3
“Feature selection is a very open problem in SMT” (Lopez
2008)
Other possible feature functions include length penalties (discouraging unreasonably short or long translations), "inverted" versions of p(s|t), etc.
Where do we get the λ_k's from?
They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function that
is taken to be an indicator that correlates with translation quality, and
may be computed automatically from the output of the SMT system and the reference translations in the corpus.
This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite).
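The sketch below is not MERT proper, only a deliberately naive random-search stand-in for the tuning idea: try weight vectors, decode the tuning set with each, and keep the weights whose output scores best under the chosen quality indicator. The tuning set, its candidate lists, feature values and the exact-match "quality" function are all invented for illustration.

```python
import random

# Toy tuning set: each SL sentence comes with candidate translations, invented
# feature vectors for each candidate, and a reference translation.
tuning_set = [
    {"candidates": ["the white house", "the house white"],
     "features": {"the white house": [-0.5, -3.9], "the house white": [-0.5, -6.9]},
     "reference": "the white house"},
    {"candidates": ["a cat", "cat a"],
     "features": {"a cat": [-0.2, -2.0], "cat a": [-0.2, -8.0]},
     "reference": "a cat"},
]

def decode(example, weights):
    """Pick the candidate with the highest log-linear score under the given weights."""
    return max(example["candidates"],
               key=lambda t: sum(w * f for w, f in zip(weights, example["features"][t])))

def quality(hypotheses, references):
    """Toy quality indicator: fraction of exact matches (a crude stand-in for BLEU)."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)

def tune(n_trials=50, seed=0):
    """Naive random search over weight vectors: a crude stand-in for MERT."""
    rng = random.Random(seed)
    best_weights, best_score = None, -1.0
    for _ in range(n_trials):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        hyps = [decode(ex, weights) for ex in tuning_set]
        score = quality(hyps, [ex["reference"] for ex in tuning_set])
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

print(tune())   # e.g. some weight vector reaching quality 1.0 on this toy tuning set
```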
Ain’t got nothin’ but the BLEUs?
The most famous “quality indicator” is called BLEU, but
there are many others.
BLEU counts the fraction of the 1-word, 2-word, . . . , n-word sequences in the output that also appear in the reference translation.
Correlation with subjective assessments of quality is still an
open question.
A lot of SMT research is currently BLEU-driven and makes
little contact with real applications of MT.
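A minimal single-reference sketch of the n-gram counting behind BLEU: clipped n-gram precision for n = 1..4, their geometric mean, and a brevity penalty. Real BLEU is computed over a whole test corpus, and toolkit implementations add smoothing and other details omitted here.

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log(max(matches, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp))) if hyp else 0.0
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the white house is big", "the white house is big"))   # 1.0
print(bleu("big the house white is", "the white house is big"))   # much lower: right words, wrong order
```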
The SMT lifecycle
Development:
Training: monolingual and sentence-aligned
bilingual corpora are used to estimate
probability models (features)
Tuning: a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k.
Decoding: sentences s are fed into the SMT system and
“decoded” into their translations t.
Evaluation: the system is evaluated against a reference
corpus.
License
This work may be distributed under the terms of
the Creative Commons Attribution–Share Alike license:
http://creativecommons.org/licenses/by-sa/3.0/
the GNU GPL v. 3.0 License:
http://www.gnu.org/licenses/gpl.html
Dual license! E-mail me to get the sources: mlf@ua.es