Statistical machine translation in a few slides
Mikel L. Forcada¹,²
¹Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain)
²Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)
April 14-16, 2009: Free/open-source MT tutorial at the CNGL
Contents
1 Translation as probability
2 “Decoding”
3 Training
4 “Log-linear”
5 Ain’t got nothin’ but the BLEUs?
6 The SMT lifecycle
The “canonical” model
Translation as probability/1
Instead of saying that a source-language (SL) sentence s and a target-language (TL) sentence t, as found in an SL–TL bitext, are or are not translations of each other, in SMT one says that they are translations of each other with a probability p(s, t) = p(t, s) (a joint probability).
We'll assume we have such a probability model available, or at least a reasonable estimate of it.
The “canonical” model
Translation as probability/2
According to basic probability laws, we can write:

p(s, t) = p(t, s) = p(s|t) p(t) = p(t|s) p(s)   (1)

where p(x|y) is the conditional probability of x given y.
We are interested in translating from SL to TL. That is, we want to find the most likely translation given the SL sentence s:

t* = arg max_t p(t|s)   (2)
The “canonical” model
We can rewrite eq. (1) as

p(t|s) = p(s|t) p(t) / p(s)   (3)

and then combine it with (2); since p(s) does not depend on t, it drops out of the arg max, and we get

t* = arg max_t p(s|t) p(t)   (4)
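As a toy numeric illustration of eq. (4) (not from the original slides; all sentences and probabilities below are invented), the following Python sketch picks, among three hand-written candidates, the one maximizing p(s|t) p(t):

```python
# Toy illustration of eq. (4): t* = arg max_t p(s|t) p(t).
# All sentences and probabilities are invented for the example.
s = "la casa blanca"  # the SL sentence being translated

p_s_given_t = {  # reverse translation model p(s|t) for three candidates
    "the white house": 0.6,
    "the house white": 0.7,  # word-for-word candidate: high reverse probability
    "a white home": 0.2,
}
p_t = {  # target-language model p(t)
    "the white house": 0.10,
    "the house white": 0.001,  # disfluent: low language-model probability
    "a white home": 0.05,
}

# arg max of the product: the LM vetoes the disfluent candidate
# even though its reverse translation probability is higher.
best = max(p_s_given_t, key=lambda t: p_s_given_t[t] * p_t[t])
print(best)  # -> the white house
```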
“Decoding”/1
t* = arg max_t p(s|t) p(t)

We have a product of two probability models:
a reverse translation model p(s|t), which tells us how likely the SL sentence s is as a translation of the candidate TL sentence t, and
a target-language model p(t), which tells us how likely the sentence t is in the TL side of bitexts.
These may be related (respectively) to the usual notions of
[reverse] adequacy: how much of the meaning of t is conveyed by s;
fluency: how fluent the candidate TL sentence is.
The arg max strikes a balance between the two.
“Decoding”/2
In SMT parlance, the process of finding t* is called decoding.¹
Obviously, it does not explore all possible translations t in the search space: there are infinitely many.
The search space is pruned; therefore, one just gets a reasonable translation t instead of the ideal t*.
Pruning and search strategies are a very active research topic (a minimal sketch follows below).
Free/open-source software: Moses.

¹Reading SMT articles usually entails deciphering jargon which may be very obscure to outsiders or newcomers.
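To make pruning concrete, here is a minimal sketch of beam-pruned, monotone phrase-based decoding. It is not Moses' algorithm; the phrase table, bigram language model and beam size are all invented for the example:

```python
import math
from collections import defaultdict

# Minimal sketch of beam-pruned, monotone phrase-based decoding.
# The phrase table, bigram "LM" and beam size are toy stand-ins, not Moses.
PHRASE_TABLE = {  # SL phrase -> list of (TL phrase, p(SL phrase | TL phrase))
    ("la",): [("the", 0.7)],
    ("casa",): [("house", 0.8), ("home", 0.2)],
    ("blanca",): [("white", 0.9)],
    ("casa", "blanca"): [("white house", 0.6)],  # phrase pair with reordering
}
BIGRAMS = {("<s>", "the"): 0.5, ("the", "white"): 0.2, ("white", "house"): 0.3,
           ("the", "house"): 0.1, ("house", "white"): 0.001}

def lm_logprob(words):
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += math.log(BIGRAMS.get((prev, w), 1e-4))  # floor for unseen bigrams
        prev = w
    return logp

def decode(source, beam_size=3):
    # beams[i]: up to beam_size hypotheses covering source[:i],
    # each a (translation-model log-score, TL words so far) pair.
    beams = defaultdict(list)
    beams[0] = [(0.0, [])]
    n = len(source)
    for i in range(n):
        for score, words in beams[i]:
            for j in range(i + 1, n + 1):  # extend with a phrase over source[i:j]
                for tl, p in PHRASE_TABLE.get(tuple(source[i:j]), []):
                    beams[j].append((score + math.log(p), words + tl.split()))
        for j in list(beams):  # pruning: keep only the best few hypotheses
            beams[j] = sorted(beams[j], key=lambda h: -h[0])[:beam_size]
    # Add the LM score at the end; a real decoder folds it in incrementally.
    finished = [(sc + lm_logprob(ws), ws) for sc, ws in beams[n]]
    return max(finished)[1] if finished else None

print(" ".join(decode("la casa blanca".split())))  # -> the white house
```

Note that the multi-word entry ("casa", "blanca") -> "white house" is what lets this monotone decoder produce the reordered output.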
Training/1
So where do these probabilities come from?
p(t) may easily be estimated from a large monolingual TL corpus (free/open-source software: irstlm); a toy count-based sketch follows below.
The estimation of p(s|t) is more complex. It is usually made up of
a lexical model, describing the probability that the translation of a certain TL word or sequence of words (a “phrase”²) is a certain SL word or sequence of words, and
an alignment model, describing the reordering of words or “phrases”.

²A very unfortunate choice in SMT jargon.
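As a rough illustration of how p(t) can be estimated from monolingual text, here is a toy maximum-likelihood bigram model; real toolkits such as irstlm add smoothing and back-off, and the three-sentence corpus is invented:

```python
from collections import Counter

# Toy maximum-likelihood bigram model for p(t).
# Real toolkits such as irstlm add smoothing, back-off, etc.;
# this invented three-sentence "corpus" is just for illustration.
corpus = [
    "the white house is big",
    "the house is white",
    "the white cat sleeps",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(sentence):
    # p(t) as a product of bigram probabilities.
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sentence("the white house is big"))  # nonzero: seen word sequences
print(p_sentence("house the white"))         # 0.0: unseen bigrams, no smoothing
```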
Training/2
The lexical model and the alignment model are estimated using a large sentence-aligned bilingual corpus through a complex iterative process.
An initial set of lexical probabilities is obtained by assuming, for instance, that any word in the TL sentence aligns with any word in its SL counterpart. And then:
alignment probabilities are computed in accordance with the lexical probabilities;
lexical probabilities are re-estimated in accordance with the alignment probabilities.
This process (“expectation maximization”) is repeated a fixed number of times or until some convergence is observed (free/open-source software: Giza++); a minimal sketch of the idea follows below.
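The following is a minimal sketch of this expectation-maximization idea in the style of IBM Model 1, the simplest of the model family that Giza++ implements; the two-sentence-pair corpus is invented:

```python
from collections import defaultdict

# Minimal sketch of EM estimation of lexical probabilities, in the style
# of IBM Model 1 (the simplest of the models Giza++ implements).
# The two-sentence-pair corpus is invented for illustration.
corpus = [
    ("la casa".split(), "the house".split()),
    ("la puerta".split(), "the door".split()),
]

# Initialization: uniform lexical probabilities lex[(s_word, t_word)],
# i.e. any word in the TL sentence may align with any SL word.
sl_vocab = {w for sl, _ in corpus for w in sl}
lex = defaultdict(lambda: 1.0 / len(sl_vocab))

for _ in range(10):  # fixed number of EM iterations
    counts, totals = defaultdict(float), defaultdict(float)
    # E step: expected alignment counts under the current lexical model.
    for sl, tl in corpus:
        for s_w in sl:
            norm = sum(lex[(s_w, t_w)] for t_w in tl)
            for t_w in tl:
                c = lex[(s_w, t_w)] / norm  # expected count of this link
                counts[(s_w, t_w)] += c
                totals[t_w] += c
    # M step: re-estimate lexical probabilities from the expected counts.
    for (s_w, t_w), c in counts.items():
        lex[(s_w, t_w)] = c / totals[t_w]

print(round(lex[("la", "the")], 2))    # high: "la" co-occurs with "the" everywhere
print(round(lex[("casa", "the")], 2))  # low
```

The word "la", which co-occurs with "the" in both sentence pairs, ends up with most of the probability mass: exactly the effect the iterative re-estimation is after.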
Training/3
In “phrase-based” SMT, alignments may be used to extract (SL-phrase, TL-phrase) pairs and their corresponding probabilities, for easier decoding and to avoid “word salad”; a sketch of the extraction heuristic follows below.
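Here is a minimal sketch of the standard consistency heuristic for extracting phrase pairs from a word alignment (the Moses extractor handles unaligned words and many more details); the sentence pair and alignment links are hand-made:

```python
# Minimal sketch of consistent phrase-pair extraction from a word
# alignment; the sentence pair and alignment links are hand-made examples.
sl = "la casa blanca".split()
tl = "the white house".split()
# Alignment links as (SL position, TL position) pairs.
links = {(0, 0), (1, 2), (2, 1)}

def extract_phrases(max_len=3):
    pairs = []
    n = len(sl)
    for i1 in range(n):
        for i2 in range(i1, min(n, i1 + max_len)):
            # TL positions linked to the SL span [i1, i2].
            tps = [t for (s, t) in links if i1 <= s <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # Consistency: no link may leave the box [i1,i2] x [j1,j2].
            if all(i1 <= s <= i2 for (s, t) in links if j1 <= t <= j2):
                pairs.append((" ".join(sl[i1:i2 + 1]), " ".join(tl[j1:j2 + 1])))
    return pairs

for sp, tp in extract_phrases():
    print(f"{sp!r} <-> {tp!r}")
# Output includes ('la', 'the'), ('casa blanca', 'white house') and
# ('la casa blanca', 'the white house'); relative frequencies over a
# large corpus would give the phrase probabilities.
```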
“Log-linear”/1
More SMT jargon!
It’s short for “linear combination of logarithms of probabilities”.
And, sometimes, even of features that aren’t logarithms or probabilities of any kind.
OK, let’s take a look at the maths.
“Log-linear”/2
One can write a more general formula:

p(t|s) = exp(Σ_{k=1}^{n_F} λ_k f_k(t, s)) / Z   (5)

with n_F feature functions f_k(t, s), which can depend on s, t, or both.
Setting n_F = 2, f_1(t, s) = log p(s|t), f_2(t, s) = log p(t), λ_1 = λ_2 = 1, and Z = p(s), one recovers the canonical formula (3).
The best translation is then

t* = arg max_t Σ_{k=1}^{n_F} λ_k f_k(t, s)   (6)

Most of the f_k(t, s) are logarithms, hence “log-linear”.
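A tiny sketch of eq. (6), scoring invented candidates with two log-probability features plus one feature (a length penalty) that is not a logarithm of anything:

```python
import math

# Tiny sketch of eq. (6): score each candidate t by sum_k lambda_k f_k(t, s).
# Probabilities and weights below are invented for illustration.
candidates = {
    # t: (p(s|t), p(t))
    "the white house": (0.6, 0.10),
    "the house white": (0.7, 0.001),
    "white house the big one": (0.5, 0.02),
}
lambdas = [1.0, 1.0, 0.5]  # weights, normally tuned (see MERT below)

def features(t, p_rev, p_lm, source_len=3):
    return [
        math.log(p_rev),                    # f_1 = log p(s|t)
        math.log(p_lm),                     # f_2 = log p(t)
        -abs(len(t.split()) - source_len),  # f_3: length penalty, not a log-prob
    ]

def score(t, p_rev, p_lm):
    return sum(l * f for l, f in zip(lambdas, features(t, p_rev, p_lm)))

best = max(candidates, key=lambda t: score(t, *candidates[t]))
print(best)  # -> the white house
```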
“Log-linear”/3
“Feature selection is a very open problem in SMT” (Lopez 2008).
Other possible feature functions include length penalties (discouraging unreasonably short or long translations), “inverted” versions of p(s|t), etc.
Where do we get the λ_k’s from?
They are usually tuned so as to optimize the results on a tuning set, according to a certain objective function that
is taken to be an indicator that correlates with translation quality, and
may be automatically computed from the output of the SMT system and the reference translations in the corpus.
This is sometimes called MERT (minimum error rate training) (free/open-source software: the Moses suite); a toy tuning sketch follows below.
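As a crude stand-in for MERT (which uses a much smarter search over the λ_k than this), the sketch below grid-searches two weights so as to maximize a toy quality indicator on an invented tuning set:

```python
import itertools

# Crude stand-in for MERT: grid-search the feature weights so as to
# maximize a quality indicator on a tuning set. Real MERT uses a much
# smarter search, and the "metric" here is a toy stand-in for BLEU.
# Each tuning item: per-candidate feature vectors [f1, f2] and the index
# of the candidate closest to the reference. All numbers invented.
tuning_set = [
    ([[-0.5, -2.3], [-0.4, -6.9]], 0),
    ([[-1.0, -1.5], [-0.7, -4.0]], 0),
    ([[-2.0, -0.5], [-0.9, -0.9]], 1),
]

def metric(lambdas):
    # Fraction of tuning sentences where the model picks the best candidate.
    hits = 0
    for feats, best in tuning_set:
        scores = [sum(l * f for l, f in zip(lambdas, fv)) for fv in feats]
        hits += scores.index(max(scores)) == best
    return hits / len(tuning_set)

grid = [0.1, 0.5, 1.0, 2.0]
best_lambdas = max(itertools.product(grid, repeat=2), key=metric)
print(best_lambdas, metric(best_lambdas))
```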
Ain’t got nothin’ but the BLEUs?
The most famous “quality indicator” is called BLEU, but there are many others.
BLEU measures the fraction of the 1-word, 2-word, . . ., n-word sequences in the output that match the reference translation (a toy computation follows below).
Correlation with subjective assessments of quality is still an open question.
A lot of SMT research is currently BLEU-driven and makes little contact with real applications of MT.
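A toy single-sentence version of the BLEU computation (real BLEU is computed over a whole test set and can use multiple references); the sentences are invented:

```python
import math
from collections import Counter

# Toy single-sentence BLEU: geometric mean of clipped n-gram precisions
# (n = 1..4) times a brevity penalty. Real BLEU is computed over a whole
# test corpus and may use several references per sentence.
def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        matches = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(1, sum(cand.values()))
        log_prec += math.log(max(matches, 1e-9) / total)  # clipped precision
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(1, len(c)))
    return bp * math.exp(log_prec / max_n)

print(round(bleu("the white house is big", "the white house is big"), 3))  # 1.0
print(round(bleu("the house white is big", "the white house is big"), 3))  # near 0: no 3-gram matches
```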
The SMT lifecycle
Development:
Training: monolingual and sentence-aligned bilingual corpora are used to estimate probability models (features).
Tuning: a held-out portion of the sentence-aligned bilingual corpus is used to tune the coefficients λ_k.
Decoding: sentences s are fed into the SMT system and “decoded” into their translations t.
Evaluation: the system is evaluated against a reference corpus.
License
This work may be distributed under the terms of
the Creative Commons Attribution–Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/
the GNU GPL v. 3.0 license: http://www.gnu.org/licenses/gpl.html
Dual license! E-mail me to get the sources: mlf@ua.es