Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the world's 6,500+ languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for reusing their bi-texts.
We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii).
We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder.
Finally, we observe that for closely-related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.
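The character-level reduction can be sketched as a simple preprocessing step: rewrite each sentence so that the "tokens" are characters, and a standard phrase-based SMT system then learns subword, transliteration-like correspondences between the related languages. The `_` word-boundary marker below is an illustrative convention, not something prescribed by the talk.

```python
# Sketch of the character-level reduction: turn each sentence into a
# space-separated sequence of characters, with "_" marking the original
# word boundaries (an illustrative convention).

def to_char_level(sentence):
    """'красный цветок' -> 'к р а с н ы й _ ц в е т о к'"""
    return " ".join(sentence.replace(" ", "_"))

print(to_char_level("красный цветок"))  # к р а с н ы й _ ц в е т о к
```

Applying this to both sides of a bi-text yields a character-level training corpus; decoder output is detokenized by the inverse mapping.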
1. Combining, Adapting and Reusing Bi-texts
between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute
(collaborators: Jörg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar
August 13, 2014, Moscow, Russia
2. 2
Plan
• Part I
- Introduction to Statistical Machine Translation
• Part II
- Combining, Adapting and Reusing Bi-texts between Related
Languages: Application to Statistical Machine Translation
• Part III
- Further Discussion on SMT
4. 4
Statistical Machine Translation (SMT)
Reach Out to Asia (ROTA) has
announced its fifth Wheels
‘n’ Heels, Qatar’s largest
annual community event,
which will promote ROTA’s
partnership with the Qatar
Japan 2012 Committee. Held
at the Museum of Islamic Art
Park on 10 February, the
event will celebrate 40 years
of cordial relations between
the two countries. Essa Al
Mannai, ROTA Director, said:
“A group of 40 Japanese
students are traveling to
Doha especially to take part
in our event.
SMT systems:
- learn from human-generated translations
- extract useful knowledge and build models
- use the models to translate new sentences
6. 6
Translation as Decoding
• 1947, Warren Weaver, Rockefeller Foundation:
One naturally wonders if the problem of
translation could conceivably be treated as a
problem in cryptography. When I look at an
article in Russian, I say: ‘This is really written in
English, but it has been coded in some strange
symbols. I will now proceed to decode.’
Example:
– Это действительно написано по-английски .
– This is really written in English .
7. 7
The Basic Components of an SMT System
Look for the best English translation
that both conveys the French meaning
and is grammatical.
8. 8
Components of an SMT System
• Language Model
- English text e → P(e)
o good English → high probability
o bad English → low probability
• Translation Model
- Pair <f,e> → P(f|e)
o <f,e> are translations → high probability
o <f,e> are not translations → low probability
• Decoder
- Given P(e), P(f|e), and f, we look for the e that maximizes
P(e) · P(f|e)
9. 9
Combining P(e) and P(f|e)
How do we translate the Russian phrase “красный цветок” into English?

                P(e)   P(f|e)   P(e)·P(f|e)
a flower red     ↓       ↑          ↓
red flower a     ↓       ↑          ↓
flower red a     ↓       ↑          ↓
a red dog        ↑       ↓          ↓
dog cat mouse    ↓       ↓          ↓
a red flower     ↑       ↑          ↑
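As a toy illustration of the combination, the sketch below ranks the candidates from the table by the product P(e) · P(f|e). All probability values are invented; they only mirror the up/down arrows.

```python
# Toy noisy-channel ranking: choose the e maximizing P(e) * P(f|e).
# All probabilities are invented illustrative values, not learned.

lm = {                     # P(e): fluency of the English string
    "a flower red": 1e-6,
    "red flower a": 1e-6,
    "flower red a": 1e-6,
    "a red dog": 1e-3,
    "dog cat mouse": 1e-9,
    "a red flower": 1e-3,
}
tm = {                     # P(f|e): adequacy w.r.t. "красный цветок"
    "a flower red": 1e-2,
    "red flower a": 1e-2,
    "flower red a": 1e-2,
    "a red dog": 1e-7,
    "dog cat mouse": 1e-9,
    "a red flower": 1e-2,
}

best = max(lm, key=lambda e: lm[e] * tm[e])
print(best)  # a red flower: the only candidate high on BOTH models
```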
11. 11
Language Model
• Goal: prefer “good” to “bad” English
- “good” ≠ grammatical
- “bad” ≈ unlikely
• Examples (grammaticality):
- I do not like strong tea. good
- I do not like powerful tea. bad
- I like strong tea not. bad
- Like not tea strong do I. bad
12. 12
Example:
Grammatical but Low-probability Text
Eye halve a spelling checker
It came with my pea sea
It plainly marks four my revue
Miss steaks eye kin knot sea.
Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me a strait a weigh.
As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rare lea ever wrong.
Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect awl the weigh
My checker tolled me sew.
Торопыжка был голодный - проглотил утюг холодный. (“Toropyzhka was hungry - he swallowed a cold flat-iron”: grammatical Russian, but very improbable.)
17. 17
Modeling P(f|e) – Sentence Level
Batman did not fight any cat woman .
Бэтмен не вел бой с никакой женщиной кошкой .
• Cannot be estimated directly
18. 18
Modeling P(f|e)
Batman did not fight any cat woman .
Бэтмен не вел бой с никакой женщиной кошкой .
• Broken into smaller steps
19. 19
IBM Model 4: Generation
(Brown et al., CL 1993)
Batman did not fight any cat woman .
Batman not fight fight any cat woman .
Batman not fight fight NULL any cat woman .
Бэтмен не вел бой с никакой кошкой женщиной .
Бэтмен не вел бой с никакой женщиной кошкой .
n(3|fight): fertility probability
P-NULL: probability of inserting a NULL-generated word
t(не|not): lexical translation probability
d(8|7): distortion (reordering) probability
20. 20
IBM Model 4: Generation
(Brown et al., CL 1993)
Batman did not fight any cat woman .
Batman not fight fight any cat woman .
Batman not fight fight NULL any cat woman .
Бэтмен не вел бой с никакой кошкой женщиной .
Бэтмен не вел бой с никакой женщиной кошкой .
n(3|fight): fertility probability
P-NULL: probability of inserting a NULL-generated word
t(не|not): lexical translation probability
d(8|7): distortion (reordering) probability
• All these probabilities could be learned
if word alignments were available.
• We can learn word alignments using EM.
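The EM idea can be sketched with IBM Model 1, the simplest of the IBM models: it learns only the lexical translation table t(f|e), without the NULL word, fertility, and distortion components that Models 3-4 add. The toy bi-text below is invented for illustration.

```python
from collections import defaultdict

# Minimal IBM Model 1 EM sketch: learn word translation probabilities
# t(f|e) from a toy sentence-aligned bi-text (no NULL, fertility, or
# distortion -- those belong to Models 3-4).

bitext = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

# Uniform initialization of t(f|e).
f_vocab = {f for _, fs in bitext for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    for es, fs in bitext:
        for f in fs:                     # E-step: soft-align f to every e
            z = sum(t[(f, e)] for e in es)
            for e in es:
                count[(f, e)] += t[(f, e)] / z
                total[e] += t[(f, e)] / z
    for (f, e), c in count.items():      # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))   # converges toward 1.0
```

Even though no alignments are given, co-occurrence statistics let EM resolve the ambiguity: "buch" gravitates to "book", which in turn pins "das" to "the" and "ein" to "a".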
21. 21
Translation Model: Learned from a Bi-Text
Reach Out to Asia (ROTA) has
announced its fifth Wheels
‘n’ Heels, Qatar’s largest
annual community event,
which will promote ROTA’s
partnership with the Qatar
Japan 2012 Committee. Held
at the Museum of Islamic Art
Park on 10 February, the
event will celebrate 40 years
of cordial relations between
the two countries. Essa Al
Mannai, ROTA Director, said:
“A group of 40 Japanese
students are traveling to
Doha especially to take part
in our event.
35. 35
Phrase-Based SMT
• Sentence is broken into phrases
– Contiguous token sequences
– Not linguistic units
• Each phrase is translated in isolation
• Translated phrases are reordered
Batman has not fought a cat woman yet .
Бэтмен пока не сражался с женщиной кошкой .
(Koehn et al., HLT-NAACL 2003)
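A minimal sketch of the phrase-based idea: dynamic programming over segmentations into known phrases, translating each phrase in isolation. This version is monotone (no reordering) and has no language model, both of which real decoders such as Moses add; the phrase table entries and scores are invented.

```python
# Minimal monotone phrase-based decoding sketch: pick the best-scoring
# segmentation of the source into phrases from an (invented) phrase table.

phrase_table = {
    ("красный",): [("red", 0.8), ("crimson", 0.1)],
    ("цветок",): [("flower", 0.9)],
    ("красный", "цветок"): [("red flower", 0.7)],
}

def decode(words):
    """Dynamic programming over monotone segmentations."""
    n = len(words)
    best = [(0.0, "")] * (n + 1)          # best (score, partial output) per position
    best[0] = (1.0, "")
    for i in range(n):
        if best[i][0] == 0.0:
            continue                       # position unreachable
        for j in range(i + 1, n + 1):
            src = tuple(words[i:j])
            for tgt, p in phrase_table.get(src, []):
                score = best[i][0] * p
                if score > best[j][0]:
                    best[j] = (score, (best[i][1] + " " + tgt).strip())
    return best[n][1]

print(decode("красный цветок".split()))  # red flower
```

Note that the word-by-word path ("red" × "flower" = 0.72) here beats the single two-word phrase (0.7); the decoder considers both.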
38. 38
Sample Phrases: главен
главни прокурори chief prosecutors
главни счетоводители chief accountants
главни архитекти chief architects
главни щабове main staffs
главни улици main streets
главни методисти senior instructors
главно предизвикателство major challenge
39. 39
Sample Phrases: както
• както физическа , така и психическа ||| both
physical and psychological
• както целият регион ||| like the whole region
• както те са определени ||| as defined
• както и размера ||| as well as the size
• както и предишните редовни доклади ||| in line
with previous regular reports
• както и по други ||| and in other
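The "|||" separator above is the Moses phrase-table convention: source ||| target ||| feature scores. A minimal parser sketch; the numeric score field in the example is invented, since the slide shows only the phrase pairs.

```python
# Parse a Moses-style phrase-table line: source ||| target ||| scores.
# The score values in the example are invented for illustration.

def parse_phrase_table_line(line):
    fields = [f.strip() for f in line.split("|||")]
    src, tgt = fields[0], fields[1]
    scores = [float(s) for s in fields[2].split()] if len(fields) > 2 else []
    return src, tgt, scores

src, tgt, scores = parse_phrase_table_line(
    "както и размера ||| as well as the size ||| 0.4 0.02")
print(tgt, scores)  # as well as the size [0.4, 0.02]
```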
47. 47
How MT Evaluation is NOT Done…
• Backtranslation
- A “mythical” example (Hutchins, 1995):
o En: The spirit is willing, but the flesh is weak.
o Ru: Дух бодр, но плоть слаба.
o En: The vodka is good, but the meat is rotten.
- Not used in practice: easily gamed, e.g., by a “system” that simply copies its input:
o En: The spirit is willing, but the flesh is weak.
o Ru: The spirit is willing, but the flesh is weak.
o En: The spirit is willing, but the flesh is weak.
48. 48
The BLEU Evaluation Metric
(Papineni et al., ACL 2002)
Reference (human) translation:
The U.S. island of Guam is
maintaining a high state of alert
after the Guam airport and its
offices both received an e-mail
from someone calling himself the
Saudi Arabian Osama bin Laden
and threatening a
biological/chemical attack against
public places such as the airport .
Machine translation:
The American [?] international
airport and its the office all
receives one calls self the sand
Arab rich business [?] and so on
electronic mail , which sends out ;
The threat will be able after public
place and so on the airport to start
the biochemistry attack , [?] highly
alerts after the maintenance.
• BLEU4 formula (counts n-grams up to length 4):
BLEU4 = exp( 1.0 · log p1 + 0.5 · log p2 + 0.25 · log p3 + 0.125 · log p4
             − max(words-in-reference / words-in-machine − 1, 0) )
p1 = 1-gram precision
p2 = 2-gram precision
p3 = 3-gram precision
p4 = 4-gram precision
• Correlates well with human judgments
• Very hard to “game” it
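The formula can be sketched in code. The version below is sentence-level and uses the standard uniform weights of 0.25 per n-gram order (Papineni et al., 2002) rather than the decaying weights shown on the slide; real BLEU is computed over a whole test corpus.

```python
import math
from collections import Counter

# Sentence-level BLEU-4 sketch: geometric mean of clipped n-gram
# precisions (uniform 0.25 weights) times a brevity penalty.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c[g], r[g]) for g in c)      # clipped n-gram matches
        log_prec += 0.25 * math.log(max(overlap, 1e-9) / max(sum(c.values()), 1))
    brevity = min(0.0, 1.0 - len(ref) / len(cand))     # log brevity penalty
    return math.exp(log_prec + brevity)

good = bleu4("the cat sat on the mat", "the cat sat on the mat")
bad = bleu4("mat the on sat cat the", "the cat sat on the mat")
print(good > bad)  # True: word salad keeps unigram precision but loses n >= 2
```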
49. 49
BLEU: Multiple Reference Translations
Reference translation 1:
The U.S. island of Guam is maintaining
a high state of alert after the Guam
airport and its offices both received an
e-mail from someone calling himself
the Saudi Arabian Osama bin Laden
and threatening a biological/chemical
attack against public places such as
the airport .
Reference translation 3:
The US International Airport of Guam
and its office has received an email
from a self-claimed Arabian millionaire
named Laden , which threatens to
launch a biochemical attack on such
public places as airport . Guam
authority has been on alert .
Reference translation 4:
US Guam International Airport and its
office received an email from Mr. Bin
Laden and other rich businessman
from Saudi Arabia . They said there
would be biochemistry air raid to Guam
Airport and other public places . Guam
needs to be in high precaution about
this matter .
Reference translation 2:
Guam International Airport and its
offices are maintaining a high state of
alert after receiving an e-mail that was
from a person claiming to be the
wealthy Saudi Arabian businessman
Bin Laden and that threatened to
launch a biological and chemical attack
on the airport and other public places .
Machine translation:
The American [?] international airport
and its the office all receives one calls
self the sand Arab rich business [?]
and so on electronic mail , which
sends out ; The threat will be able
after public place and so on the
airport to start the biochemistry attack
, [?] highly alerts after the
maintenance.
(Papineni et al., ACL 2002)
51. 51
The Basic Model, Revisited
argmax_e P(e | f)
= argmax_e P(e) × P(f | e) / P(f)      [Bayes' rule]
= argmax_e P(e) × P(f | e)             [P(f) is constant for a given input]
→ argmax_e P(e)^2.4 × P(f | e)         [weighted: works better in practice]
→ argmax_e P(e)^2.4 × P(f | e) × #words(e)^1.1
  [rewards longer hypotheses, since they are unfairly penalized by P(e)]
→ ... × P(e | f)^1.1 × Plex(f | e)^1.3 × Plex(e | f)^0.9 × #phrases(e,f)^0.5 × ...
  [more features, each with its own weight]
(Och, ACL 2003)
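Taking logs, the weighted product on this slide is a log-linear model: score(e) = Σ_k w_k · log h_k(f, e). The sketch below uses invented probability values to show how the #words(e)^1.1 feature can let a longer hypothesis overtake a shorter one.

```python
import math

# Log-linear scoring sketch: 2.4*log P(e) + 1.0*log P(f|e) + w*log #words(e).
# The probability values below are invented for illustration.

def score(lm, tm, n_words, w_len):
    return 2.4 * math.log(lm) + math.log(tm) + w_len * math.log(n_words)

short = dict(lm=1e-3, tm=1e-4, n_words=3)   # short hypothesis, higher P(e)
long_ = dict(lm=8e-4, tm=1e-4, n_words=6)   # longer hypothesis, lower P(e)

# Without the length feature the short hypothesis wins; with weight 1.1
# on #words(e), the longer hypothesis overtakes it.
print(score(**short, w_len=0) > score(**long_, w_len=0))      # True
print(score(**short, w_len=1.1) < score(**long_, w_len=1.1))  # True
```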
52. 52
Maximum BLEU Training
(Och, ACL 2003)
[Flow diagram] French input → Translation System (automatic, trainable) → English MT output → Translation Quality Evaluator (automatic), which compares the output against English reference translations (sample “right answers”) and returns a BLEU score that is fed back to tune the system. The Translation System combines several component models: Language Model #1, Language Model #2, Translation Model, Length Model, and Other Features.
MERT: Minimum Error Rate Training
(optimizes BLEU directly)
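As a toy stand-in for MERT, the sketch below random-searches two feature weights against a stubbed development-set metric. Real MERT (Och, 2003) performs exact line searches per weight and optimizes actual BLEU; the metric here is an invented concave surrogate peaked at (2.4, 1.0).

```python
import random

# Toy weight tuning: random-search two feature weights to maximize a
# stubbed dev-set metric (real MERT does exact line searches over BLEU).

def dev_metric(w_lm, w_tm):
    # Invented surrogate objective, peaked at the (made-up) optimum (2.4, 1.0).
    return -((w_lm - 2.4) ** 2 + (w_tm - 1.0) ** 2)

random.seed(0)
best_w, best_score = None, float("-inf")
for _ in range(2000):
    w = (random.uniform(0, 5), random.uniform(0, 5))
    s = dev_metric(*w)
    if s > best_score:
        best_w, best_score = w, s

print(best_w)  # lands close to (2.4, 1.0)
```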
53. 53
Statistical Phrase-Based Translation
1. Training:
- P(e): train an n-gram language model
- P(f|e): generate word alignments, then build a phrase table
2. Tuning:
- Use MERT to tune the model weights
3. Evaluation:
- Run the system on test data
- Calculate BLEU