Mattingly "AI & Prompt Design: The Basics of Prompt Design"
5. bleu
1. BLEU: a Method for Automatic
Evaluation of Machine Translation
(BiLingual Evaluation Understudy)
Kishore Papineni, Salim Roukos, Todd
Ward, and Wei-Jing Zhu
Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL),
Philadelphia, July 2002, pp. 311-318
2. Viewpoint
• The idea: the closer a machine translation is to a
professional human translation, the better it is.
• To judge this quality
– A numerical metric is needed
• So, an MT evaluation system requires:
1. A numerical “translation closeness” metric
2. A corpus of good-quality human reference translations
• The closeness metric is fashioned after the word error
rate metric used in speech recognition
– Idea: use a weighted average of variable-length phrase
matches against the reference translations
3. Baseline BLEU Metric
• The primary programming task for a BLEU
implementor is to compare n-grams of the
candidate with the n-grams of the reference
translation and count the number of matches
• So, we start by computing unigram matches, sketched below
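A minimal sketch of this counting step in Python (the function names and token-list inputs are illustrative assumptions, not the authors' code):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_match_count(candidate, references, n):
    """Count candidate n-grams that occur in any reference (unclipped)."""
    cand = ngrams(candidate, n)
    pooled = set()
    for ref in references:
        pooled.update(ngrams(ref, n))
    return sum(count for gram, count in cand.items() if gram in pooled)
```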
4. n-gram precision
• Precision measure
– Counts the number of candidate translation words
(unigrams) which occur in any reference translation, then
divides by the total number of words in the candidate
translation
• However, MT can generate improbable yet high-precision
translations, as in the example below
• Intuitive fix:
– A reference word is considered exhausted after a matching
candidate word is identified
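The example from the paper that the bullets above refer to:
– Candidate: the the the the the the the
– Reference 1: The cat is on the mat
– Reference 2: There is a cat on the mat
– Standard unigram precision: 7/7 (every candidate word occurs in some reference)
– Modified unigram precision: 2/7 (“the” occurs at most twice in any single reference)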
5. Modified n-gram precision
• Modified unigram precision
– Counts the maximum number of times a word occurs in any single reference
translation
– Clips the total count of each candidate word by its maximum reference count
– Adds these clipped counts up
– Divides by the total (unclipped) number of candidate words
• Modified n-gram precision
– All candidate n-gram counts & corresponding maximum reference counts are
collected
– The candidate counts are clipped by their corresponding reference maximum
value, summed and divided by the total number of candidate n-grams
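A sketch of modified n-gram precision following these steps (illustrative Python assuming lower-cased token lists, not the authors' implementation):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate sentence."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram in any single reference translation.
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], c)
    # Clip candidate counts by the reference maxima, sum, and normalize.
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The paper's example: modified unigram precision = 2/7
cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision(cand, refs, 1))  # 0.2857...
```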
6. Modified n-gram precision on text
blocks
• Basic unit of evaluation is the sentence
• Compute the n-gram matches sentence by sentence
• Add clipped n-gram counts for all the candidate sentences
• Divide by the number of candidate n-grams in the test corpus to compute
a modified precision score
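In the paper's notation, this corpus-level score is:

p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}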
7. Ranking systems
• Human translation & machine translation
• 4 reference translations for each of 127 source sentences
• Result:
– Modified n-gram precision distinguishes the human
translation from the machine translation at every n-gram length
• From this result:
– A single n-gram precision score can distinguish a good
translation from a bad one
• To be useful, the metric must also distinguish between two
human translations that do not differ so greatly in quality
8. Ranking systems
• Translations done by:
– A human lacking native proficiency in both the source
and target languages
– A native English speaker
– Three commercial systems
• Result:
– Ranking the systems by modified n-gram precision gives
the same rank order as the human judges
9. Combining the modified n-gram
precisions
• The results in the previous slide show:
– Modified n-gram precision decays roughly exponentially with n
– modified unigram precision > bigram precision > trigram precision
• BLEU therefore averages the logarithms of the modified
n-gram precisions with uniform weights, as shown below
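With uniform weights w_n = 1/N, averaging the logarithms is equivalent to taking a geometric mean:

\exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right) = \left(\prod_{n=1}^{N} p_n\right)^{1/N}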
10. Recall
• BLEU considers multiple reference translations,
each of which may use a different word choice
to translate the same source word.
• A good candidate translation will only use
(recall) one of these possible choices, but not
all. Indeed, recalling all choices leads to a bad
translation
11. Sentence brevity penalty
• Candidate translations longer than their references are already
penalized by the modified n-gram precision measure, so an additional
penalty is needed only for candidates that are too short
• Brevity penalty factor:
– With it, a high-scoring candidate translation must match the reference
translations in length, in word choice, and in word order
• Brevity penalty is 1.0 when the candidate’s length is the same as any reference translation’s length
• c: the length of the candidate translation
• r: the effective reference corpus length
• exp(1 - r/c): the brevity penalty when the candidate is short (see the piecewise definition below)
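Putting the pieces together, the penalty as defined in the paper is:

\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}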
12. BLEU details
• Take the geometric mean of the test corpus’ modified precision scores and
then multiply the result by an exponential brevity penalty factor.
• We first compute the geometric average of the modified n-gram precisions
p_n, using n-grams up to length N and positive weights w_n summing to one
• To make the ranking behavior apparent, BLEU can also be written in the
log domain, as shown below
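In the paper's notation:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\log \mathrm{BLEU} = \min\!\left(1 - \frac{r}{c},\, 0\right) + \sum_{n=1}^{N} w_n \log p_n

The paper's baseline uses N = 4 and uniform weights w_n = 1/N.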
13. The BLEU Evaluation
• The BLEU metric ranges from 0 to 1
• A score of 1 is very rare: it requires a perfect match
• Higher is better
• A human translation scored 0.3468 against four references
and 0.2571 against two references
• Table 1: the five systems evaluated against two references
14. • Is the difference in the BLEU metric reliable?
• What is the variance of the BLEU score?
• If we were to pick another random set of 500 sentences, would we still judge S3 to
be better than S2?
• Computed the BLEU metric on 20 blocks of 25 sentences each
• Computed the means, variances, and paired t-statistics
• What Table 2 indicates:
– Table 1 uses 500 sentences; Table 2 uses blocks of 25 sentences
– A t-statistic of 1.7 or above is considered 95% significant
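A sketch of this block-level check (the per-block scores below are made-up placeholders, not the paper's data; scipy's paired t-test stands in for the hand-computed statistics):

```python
from scipy.stats import ttest_rel

# Hypothetical BLEU scores for systems S2 and S3 on the same
# 20 blocks of 25 sentences (illustrative values only).
s2 = [0.18, 0.21, 0.19, 0.22, 0.20, 0.17, 0.23, 0.19, 0.21, 0.20,
      0.18, 0.22, 0.19, 0.20, 0.21, 0.18, 0.20, 0.22, 0.19, 0.21]
s3 = [0.20, 0.25, 0.21, 0.24, 0.24, 0.19, 0.27, 0.21, 0.25, 0.22,
      0.20, 0.26, 0.21, 0.22, 0.25, 0.20, 0.22, 0.26, 0.21, 0.25]

# Paired test: both systems are scored on the same blocks.
t_stat, p_value = ttest_rel(s3, s2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# With 19 degrees of freedom, t >= 1.7 is ~95% significant (one-sided).
```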
15. Evaluation
• Two groups of human judges, ten people in each
– Monolingual group
– Bilingual group
• Evaluated the previous five systems
• Rating scale: 1 (very bad) to 5 (very good)
• Some judges rated more liberally than others