Why is evaluation important?
- Compare MT systems
- Measure progress of MT systems over time
- Diagnose MT systems
- Assess (and pay) human translators
- Quality assurance
- Tuning of SMT systems
- Decisions on fitness for purpose
- ...
Why is evaluation hard?
What does quality mean? Fluent? Adequate? Easy to post-edit?
Quality for whom/what?
- End-user: gisting (Google Translate), internal communications, or publication (dissemination)
- MT system: tuning or diagnosis
- Post-editor: draft translations (light vs heavy post-editing)
- Other applications, e.g. CLIR
Overview
Ref: Do not buy this product, it’s their craziest invention!
MT: Do buy this product, it’s their craziest invention!
- Severe if the end-user does not speak the source language
- Trivial for translators to post-edit

Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
MT: Six-hour battery, 30 minutes to full charge last.
- OK for gisting: meaning preserved
- Very costly for post-editing if style is to be preserved
Overview
How do we measure quality?
- Manual metrics: error counts, ranking, acceptability, 1-N judgements on fluency/adequacy
- Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user satisfaction, reading comprehension
- Automatic metrics:
  - Based on human references: BLEU, METEOR, TER, ...
  - Reference-less: quality estimation
Judgements on an n-point scale
Adequacy on a 5-point scale (NIST-like):
5 - All meaning expressed in the source fragment appears in the translation fragment.
4 - Most of the source fragment meaning is expressed in the translation fragment.
3 - Much of the source fragment meaning is expressed in the translation fragment.
2 - Little of the source fragment meaning is expressed in the translation fragment.
1 - None of the meaning expressed in the source fragment is expressed in the translation fragment.
Fluency on a 5-point scale (NIST-like):
5 - Native-language fluency. No grammar errors; good word choice and syntactic structure in the translation fragment.
4 - Near-native fluency. Few terminology or grammar errors, which don’t impact the overall understanding of the meaning.
3 - Not very fluent. About half of the translation contains errors.
2 - Little fluency. Wrong word choice, poor grammar and syntactic structure.
1 - No fluency. Absolutely ungrammatical and for the most part doesn’t make any sense.
Issues:
- Subjective judgements
- Hard to reach significant agreement
- Is it reliable at all?
- Can we use multiple annotators? (see the agreement sketch below)
- Are fluency and adequacy really separable?

Ref: Absolutely ungrammatical and for the most part doesn’t make any sense.
MT: Absolutely sense doesn’t ungrammatical for the and most make any part.
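When multiple annotators are used, agreement on such n-point judgements is usually quantified with a chance-corrected coefficient such as Cohen's kappa. A minimal sketch, assuming scikit-learn is available and using made-up adequacy judgements:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 adequacy judgements by two annotators on the same 8 segments
annotator_a = [5, 4, 4, 3, 2, 5, 1, 3]
annotator_b = [5, 3, 4, 3, 3, 4, 1, 2]

# Cohen's kappa corrects the raw agreement rate (here 4/8) for chance agreement
print(cohen_kappa_score(annotator_a, annotator_b))  # ~0.36: only "fair" agreement
```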
Ranking
Issues:
- Subjective judgements: what does “best” mean?
- Hard to judge for long sentences

Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators.
MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators.
MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation.

- Rankings only serve comparison purposes: the best system might not be good enough
- Absolute evaluation can do both
Error counts
- More fine-grained
- Aimed at diagnosing MT systems and at quality control of human translation
- E.g. Multidimensional Quality Metrics (MQM):
  - Covers machine and human translation quality
  - Takes the quality of the source text into account
  - The actual metric is based on a specification
MQM
Issues are selected based on a given specification (dimensions):
- Language/locale
- Subject field/domain
- Text type
- Audience
- Purpose
- Register
- Style
- Content correspondence
- Output modality, ...
Error counts: issues
- Time consuming
- Requires training, especially to distinguish between fine-grained error types
- Different errors are more relevant for different specifications: they need to be selected and weighted accordingly
Post-editing - PET
How often is post-editing (PE) a translation tool’s output faster than translating from scratch (HT)?

System    Faster than HT
Google    94.0%
Moses     86.8%
Systran   81.2%
Trados    72.4%

Comparing the time to translate from scratch with the time to PE MT, in seconds:

Annotator    HT (s)   PE (s)   HT/PE   PE/HT
Average      31.89    18.82    1.73    0.59
Deviation     9.99     6.79    0.26    0.09
User satisfaction
Solving a problem, e.g. Intel measuring user satisfaction with un-edited MT:
- A translation is good if the customer can solve their problem
- MT for customer support websites
- Overall customer satisfaction: 75% for English→Chinese
- 95% reduction in cost
- Project cycle cut from 10 days to 1 day
- From 300 to 60,000 words translated per hour
- Customers in China using MT texts were more satisfied with support than natives using the original texts (68%)!
Reading comprehension
Based on the Defense Language Proficiency Test (Jones et al., 2005).
MT quality is a function of:
1. text passage comprehension, as measured by answer accuracy, and
2. the time taken to complete a test item (read a passage + answer its questions)
Compared to Human Translation (HT): [comprehension results figure]
Task-based metrics
Issues:
- The final goal needs to be very clear
- Can be more cost/time consuming
- The final task has to have a meaningful metric
- Other elements may affect the final quality measurement (e.g. Chinese vs. American users)
Automatic metrics
- Compare the output of an MT system to one or more reference (usually human) translations: how close is the MT output to the references?
- Numerous metrics: BLEU, NIST, etc.
- Advantages:
  - Fast and cheap: minimal human labour, no need for bilingual speakers
  - Once a test set is created, it can be reused many times
  - Can be used on an ongoing basis during system development to test changes
Disadvantages:
- Very few metrics look at variable ways of saying the same thing (word level): stems, synonyms, paraphrases
- Individual sentence scores are not very reliable; aggregate scores on a large test set are required
- Very few of these metrics penalise different mismatches differently
- Reference translations are only a subset of the possible good translations
String matching
BLEU: BiLingual Evaluation Understudy
- The most widely used metric, both for MT system evaluation/comparison and for SMT tuning
- Matches n-grams between the MT output and the reference: rewards the same words in the same order
- $\#_{clip}(g)$: count of reference n-grams $g$ that occur in a hypothesis sentence $h$, clipped by the number of times $g$ appears in the reference sentence for $h$; $\#(g)$: number of n-grams in the hypotheses
- n-gram precision $p_n$ for a set of MT translations $H$:

$$p_n = \frac{\sum_{h \in H} \sum_{g \in \mathrm{ngrams}(h)} \#_{clip}(g)}{\sum_{h \in H} \sum_{g \in \mathrm{ngrams}(h)} \#(g)}$$
BLEU
- Combines the 1..N n-gram precisions as a mean of logs: $\frac{1}{N}\sum_{n=1}^{N} \log p_n$
- Precision is biased towards translations with fewer words (the denominator), so a brevity penalty penalises hypothesis sets that are shorter than the reference
- It compares the total number of words $w_h$ of the entire hypothesis set with the reference length $w_r$:

$$BP = \begin{cases} 1 & \text{if } w_h \ge w_r \\ e^{\,1 - w_r/w_h} & \text{otherwise} \end{cases}$$

$$BLEU = BP \cdot \exp\Big(\frac{1}{N}\sum_{n=1}^{N} \log p_n\Big)$$
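A minimal corpus-level BLEU sketch putting these pieces together (clipped n-gram counts, geometric mean of p1..p4, brevity penalty). It assumes a single reference per hypothesis and applies no smoothing, so it is a toy illustration rather than a drop-in replacement for standard tools:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: brevity penalty times the geometric mean of the
    clipped 1..max_n n-gram precisions (single reference, no smoothing)."""
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        clipped = total = 0
        for hyp, ref in zip(hypotheses, references):
            hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
            # clip each hypothesis n-gram count by its count in the reference
            clipped += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total += len(ngrams(hyp, n))
        if clipped == 0:
            return 0.0  # any zero precision zeroes unsmoothed BLEU
        log_p_sum += math.log(clipped / total)
    w_h = sum(len(h) for h in hypotheses)  # total hypothesis length
    w_r = sum(len(r) for r in references)  # total reference length
    bp = 1.0 if w_h >= w_r else math.exp(1 - w_r / w_h)
    return bp * math.exp(log_p_sum / max_n)

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "the Iraqi weapons will be handed to the army within two weeks".split()
print(round(bleu([hyp], [ref]), 3))
```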
- Scale: 0-1, but highly dependent on the test set
- Rewards fluency by matching higher-order n-grams (up to 4)
- Rewards adequacy through unigrams and the brevity penalty: a poor model of recall
- Synonyms and paraphrases are only handled if they appear in one of the reference translations
- All tokens are equally weighted: missing a content word = missing a determiner
- Better for evaluating changes to the same system than for comparing different MT architectures
Not good at the sentence level unless smoothing is applied (see the sketch below):

Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army
1-gram precision: 4/8
2-gram precision: 1/7
3-gram precision: 0/6
4-gram precision: 0/5
BLEU = 0
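Smoothing replaces the zero higher-order counts with small values so the geometric mean no longer collapses to zero. A sketch assuming the NLTK package is available; the exact smoothed value depends on the smoothing method chosen:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
hyp = "in two weeks Iraq 's weapons will give army".split()

# Without smoothing, the zero 3-/4-gram counts drive the score to (almost) 0
print(sentence_bleu([ref], hyp))
# method1 adds a tiny epsilon to zero counts, giving a small non-zero score
print(sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1))
```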
Importance of clipping and the brevity penalty

Ref1: the Iraqi weapons are to be handed over to the army within two weeks
Ref2: the Iraqi weapons will be surrendered to the army in two weeks
MT: the the the the

- The count for "the" is clipped at 2, its maximum count in any single reference. Unigram score = 2/4 (not 4/4)
MT: Iraqi weapons will be

1-gram precision: 4/4
2-gram precision: 3/3
3-gram precision: 2/2
4-gram precision: 1/1
- Precision ($p_n$) = 1, but the score is penalised by the brevity penalty because $w_h < w_r$
Edit distance
WER: Word Error Rate
- Levenshtein edit distance: the minimum number of substitutions (S), deletions (D) and insertions (I) needed to transform an MT sentence into the reference sentence, as a proportion of the reference length N
- Heavily penalises reordering: a correct translation in the wrong location counts as a deletion plus an insertion

$$WER = \frac{S + D + I}{N}$$

PER: Position-independent word Error Rate
- Does not penalise reordering: output and reference sentences are treated as unordered sets
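A minimal WER sketch via the standard dynamic-programming Levenshtein distance over words; PER could be obtained analogously by comparing bags of words, ignoring order:

```python
def wer(hyp, ref):
    """Word Error Rate: minimum substitutions, deletions and insertions
    needed to turn hyp into ref, divided by the reference length."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(hyp)][len(ref)] / len(ref)

print(wer("in two weeks Iraq 's weapons will give army".split(),
          "the Iraqi weapons are to be handed over to the army within two weeks".split()))
```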
Edit distance: TER
TER: Translation Error Rate
- Adds a shift operation to the WER edit operations

REF:  SAUDI ARABIA denied this week information published in the AMERICAN new york times
HYP: [this week] the saudis denied information published in the ***** new york times

- 1 shift, 2 substitutions, 1 deletion → 4 edits: TER = 4/13 ≈ 0.31
- Human-targeted TER (HTER): TER between the MT output and its post-edited version
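Exact TER requires a search over shifts (the reference implementation, tercom, finds them greedily). As a rough, self-contained sketch, HTER can be approximated with a shift-free word edit distance between the MT output and its post-edition; a moved phrase then costs a deletion plus an insertion, so this overestimates the true score:

```python
def levenshtein(a, b):
    # Row-by-row DP for the plain word edit distance (no shift operation)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

mt = "this week the saudis denied information published in the new york times".split()
pe = "saudi arabia denied this week information published in the american new york times".split()
# Approximate HTER: edits / length of the post-edited version; larger than
# the true TER of 0.31 above because the shift is charged as two edits here.
print(round(levenshtein(mt, pe) / len(pe), 2))
```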
Alignment-based
METEOR:
- Unigram precision and recall
- Aligns the MT output with the reference; with multiple references, takes the best-scoring pair
- Matching considers word inflection variations (stems) and synonyms/paraphrases
- Fluency is addressed via a direct penalty: the fragmentation of the matching
- METEOR score = F-mean discounted for fragmentation = F-mean * (1 - DF)
Example:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army

Matching (5 unigrams):
Ref: Iraqi weapons ... army ... two weeks
MT: two weeks ... Iraq’s weapons ... army

- P = 5/8 = 0.625
- R = 5/14 = 0.357
- F-mean = 10*P*R / (9*P + R) = 0.3731
- Fragmentation: 3 fragments over 5 matched words = 3/5 = 0.6
- Discounting factor: DF = 0.5 * (0.6^3) = 0.108
- METEOR = F-mean * (1 - DF) = 0.373 * 0.892 = 0.333
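The score computation from the match statistics, as a small sketch (the alignment step that produces the matched unigrams and chunks is assumed to have been done already):

```python
def meteor_score(matches, hyp_len, ref_len, chunks):
    """METEOR from unigram match statistics: recall-weighted F-mean,
    discounted by a fragmentation penalty over contiguous match chunks."""
    p = matches / hyp_len
    r = matches / ref_len
    f_mean = 10 * p * r / (9 * p + r)  # recall weighted 9:1 over precision
    frag = chunks / matches            # fragmentation of the matching
    df = 0.5 * frag ** 3               # discounting factor
    return f_mean * (1 - df)

# The slide's example: 5 matched unigrams in 3 chunks,
# hypothesis of 8 tokens, reference of 14 tokens
print(round(meteor_score(matches=5, hyp_len=8, ref_len=14, chunks=3), 3))  # 0.333
```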
Others
Metrics from the WMT shared task on metrics:
- TerrorCat
- DepRef
- MEANT and TINE
- TESLA
- LEPOR
- ROSE
- AMBER
- Many other linguistically motivated metrics where matching is not done at the word level (only)
- ...
Overview
Quality estimation (QE): metrics that provide an estimate of the quality of unseen translations
- No access to reference translations
- Quality is defined by the data:
  - Quality = Can we publish it as is?
  - Quality = Can a reader get the gist?
  - Quality = Is it worth post-editing it?
  - Quality = How much effort to fix it?
Framework
Training: X (examples of sources & translations) → feature extraction → features; features + Y (quality scores for the examples in X) → machine learning → QE model
Prediction: source text x_s' → MT system → translation x_t'; (x_s', x_t') → feature extraction → features → QE model → quality score y'
Main components to build a QE system (see the sketch below):
1. Definition of quality: what to predict
2. (Human) labelled data (for quality)
3. Features
4. Machine learning algorithm
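A toy end-to-end sketch wiring these four components together, assuming scikit-learn; the features, training pairs and quality labels are all hypothetical, and a real system would use far richer features and data:

```python
import numpy as np
from sklearn.svm import SVR

def extract_features(source, translation):
    # Component 3: toy black-box features (cf. the baseline feature list below)
    src, tgt = source.split(), translation.split()
    return [len(src), len(tgt), sum(len(w) for w in src) / len(src)]

# Components 1 and 2: hypothetical 1-5 post-editing-effort labels
train = [("the house is red", "la casa es roja", 5.0),
         ("he said that yesterday", "él dijo ayer que", 3.0),
         ("press the red button twice", "prensa botón rojo", 1.5)]

X = np.array([extract_features(s, t) for s, t, _ in train])
y = np.array([label for _, _, label in train])

# Component 4: the machine learning algorithm produces the QE model
model = SVR().fit(X, y)
print(model.predict(np.array([extract_features("a new sentence", "una frase nueva")])))
```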
Definition of quality
- Predict 1-N absolute scores for adequacy/fluency
- Predict 1-N absolute scores for post-editing effort
- Predict average post-editing time per word
- Predict relative rankings
- Predict relative rankings for the same source
- Predict the percentage of edits needed for a sentence
- Predict word-level edits and their types
- Predict BLEU, etc. scores for a document
Features
- Source text → complexity indicators
- MT system → confidence indicators
- Translation → fluency indicators
- Source text + translation → adequacy indicators
QuEst
- Goal: a framework to explore features for QE
- Feature extractors for 150+ features of all types (Java)
- Machine learning: wrappers for a number of algorithms in the scikit-learn toolkit, plus grid search and feature selection
- Open source: http://www.quest.dcs.shef.ac.uk/
State of the art in QE
WMT12-13 shared tasks
Sentence- and word-level estimation of PE effort
Datasets and language pairs:

Quality label                            Year       Languages
1-5 subjective scores                    WMT12      en-es
Ranking all sentences best-worst         WMT12/13   en-es
HTER scores                              WMT13      en-es
Post-editing time                        WMT13      en-es
Word-level edits: change/keep            WMT13      en-es
Word-level edits: keep/delete/replace    WMT13      en-es
Ranking 5 MTs per source                 WMT13      en-es; de-en

Evaluation metric:

$$MAE = \frac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N}$$
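MAE is straightforward to compute; a one-function sketch with made-up predicted and gold scores:

```python
def mae(predicted, gold):
    # Mean absolute error between predicted scores H(s_i) and gold scores V(s_i)
    return sum(abs(h - v) for h, v in zip(predicted, gold)) / len(gold)

print(mae([3.2, 4.8, 1.1], [3.0, 5.0, 2.0]))  # 0.433...
```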
Baseline system
Features:
- number of tokens in the source and target sentences
- average source token length
- average number of occurrences of words in the target
- number of punctuation marks in the source and target sentences
- LM probability of the source and target sentences
- average number of translations per source word
- % of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
- % of seen source unigrams

Learner: SVM regression with an RBF kernel, with the parameters γ and C optimised using grid search and 5-fold cross-validation on the training set (see the sketch below).
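A sketch of that learner with scikit-learn; the feature matrix is a random placeholder standing in for real extracted features, and the parameter grid is illustrative:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Placeholder data: 200 sentences, 17 features each, quality labels in [1, 5]
rng = np.random.default_rng(0)
X, y = rng.random((200, 17)), rng.random(200) * 4 + 1

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,  # 5-fold cross-validation on the training set
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)  # best (C, gamma) and its MAE
```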
Open issues
Agreement between annotators:
- Absolute value judgements: difficult to achieve consistency even in highly controlled settings
- WMT12: 30% of the initial dataset was discarded
- The remaining annotations had to be scaled
- Annotation costs: active learning to select a subset of instances to be annotated (Beck et al., ACL 2013)
- Curse of dimensionality: feature selection to identify the relevant information for each dataset (Shah et al., MT Summit 2013)

[Figure: scores for the BL, AF, BL+PR, AF+PR and FS feature sets across the WMT12, EAMT11 (en2es), EAMT11 (fr2en), EAMT09 (s1-s4) and GALE11 (s1-s2) datasets]

- A common feature set was identified, but nuanced subsets suit specific datasets
- How should estimated PE-effort scores be used? Do users prefer detailed (sub-sentence-level) estimates, an overall estimate for the complete sentence, or not seeing bad sentences at all?
- Too much information vs. hard-to-interpret scores
- Cf. IBM’s Goodness metric
- The MATECAT project is investigating this
Conclusions
- (Machine) translation evaluation & estimation: still an open problem
- Different metrics for different purposes/users, different needs, different notions of quality
- Quality estimation: learns these different notions, but requires labelled data
- Solution:
  - Think about what quality means in your scenario
  - Measure significance
  - Measure agreement if using manual metrics
  - Use various metrics
  - Invent your own metric!
Translation Quality Assessment: Evaluation and Estimation
Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk
EXPERT Winter School, 12 November 2013