Translation Quality Assessment:
Evaluation and Estimation
Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk

EXPERT Winter School, 12 November 2013

Overview

“Machine Translation evaluation is better understood than Machine Translation”
(Carbonell and Wilks, 1991)

Outline
1. Translation quality
2. Manual metrics
3. Task-based metrics
4. Reference-based metrics
5. Quality estimation
6. Conclusions

Why is evaluation important?

Compare MT systems
Measure progress of MT systems over time
Diagnosis of MT systems
Assess (and pay) human translators
Quality assurance
Tuning of SMT systems
Decision on fitness-for-purpose
...

Why is evaluation hard?
What does quality mean?
Fluent?
Adequate?
Easy to post-edit?

Quality for whom/what?
End-user: gisting (Google Translate), internal
communications, or publication (dissemination)
MT-system: tuning or diagnosis
Post-editor: draft translations (light vs heavy
post-editing)
Other applications, e.g. CLIR

Overview
Ref: Do not buy this product, it’s their craziest invention!
MT: Do buy this product, it’s their craziest invention!
Severe if end-user does not speak source language
Trivial to post-edit by translators

Ref: The battery lasts 6 hours and it can be fully recharged
in 30 minutes.
MT: Six-hour battery, 30 minutes to full charge last.
Ok for gisting - meaning preserved
Very costly for post-editing if style is to be preserved
Overview
How do we measure quality?
Manual metrics:
Error counts, ranking, acceptability, 1-N judgements on
fluency/adequacy
Task-based human metrics: productivity tests
(HTER, PE time, keystrokes), user-satisfaction, reading
comprehension

Automatic metrics:
Based on human references: BLEU, METEOR, TER, ...
Reference-less: quality estimation

Judgements in an n-point scale
Adequacy using 5-point scale (NIST-like)
5 All meaning expressed in the source fragment appears in the translation fragment.
4 Most of the source fragment meaning is expressed in the translation fragment.
3 Much of the source fragment meaning is expressed in the translation fragment.
2 Little of the source fragment meaning is expressed in the translation fragment.
1 None of the meaning expressed in the source fragment is expressed in the translation fragment.

Fluency using 5-point scale (NIST-like)
5 Native language fluency. No grammar errors, good word choice and syntactic structure in the translation fragment.
4 Near native language fluency. Few terminology or grammar errors which don’t impact the overall understanding of the meaning.
3 Not very fluent. About half of the translation contains errors.
2 Little fluency. Wrong word choice, poor grammar and syntactic structure.
1 No fluency. Absolutely ungrammatical and for the most part doesn’t make any sense.
Judgements in an n-point scale
Issues:
Subjective judgements
Hard to reach significant agreement
Is it reliable at all?
Can we use multiple annotators?

Are fluency and adequacy really separable?
Ref: Absolutely ungrammatical and for the most part doesn’t
make any sense.
MT: Absolutely sense doesn’t ungrammatical for the and most
make any part.
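When multiple annotators are used, their agreement can be quantified; Cohen's kappa is a common choice. A minimal sketch with scikit-learn, on made-up 1-5 adequacy judgements (the data and the choice of linear weights are illustrative, not from the slides):

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative 1-5 adequacy judgements by two annotators on the same segments
annotator_a = [5, 4, 4, 2, 3, 5, 1, 4]
annotator_b = [4, 4, 3, 2, 4, 5, 2, 4]

# Unweighted kappa treats a 4-vs-5 disagreement like a 1-vs-5 one;
# linear weights penalise distant scores more, which suits ordinal scales
print(cohen_kappa_score(annotator_a, annotator_b))
print(cohen_kappa_score(annotator_a, annotator_b, weights="linear"))
```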

Ranking
WMT-13 Appraise tool: rank translations best-worst (with ties)

Ranking
Issues:
Subjective judgements: what does “best” mean?
Hard to judge for long sentences
Ref: The majority of existing work focuses on predicting some
form of post-editing effort to help professional translators.
MT1: Few of the existing work focuses on predicting some form
of post-editing effort to help professional translators.
MT2: The majority of existing work focuses on predicting some
form of post-editing effort to help machine translation.

Rankings only serve for comparison purposes - the best system might not be good enough
Absolute evaluation can do both
Error counts
More fine-grained
Aimed at diagnosis of MT systems, quality control of
human translation.
E.g.: Multidimensional Quality Metrics (MQM)
Machine and human translation quality
Takes quality of source text into account
Actual metric is based on a specification

MQM
Issues selected based on a given specification (dimensions):
Language/locale
Subject field/domain
Text Type
Audience
Purpose
Register
Style
Content correspondence
Output modality, ...

MQM
Issue types (core):

Altogether: 120 categories

MQM

Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0
Combining issue types into a single score:
TQ = 100 − Acc_P − (Flu_PT − Flu_PS) − (Ver_PT − Ver_PS)
where Acc, Flu and Ver are accuracy, fluency and verity penalty scores, and the T/S subscripts denote penalties computed on the target and on the source text respectively

translate5: open source graphical (Web) interface for inline
error annotation: www.translate5.net
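The combination above is simple enough to sketch in code. How the individual penalty scores are computed (e.g. weighted issue counts per 100 words) depends on the chosen specification, so the example values below are purely illustrative:

```python
def mqm_tq(acc_pt, flu_pt, flu_ps, ver_pt, ver_ps):
    """Combine MQM penalty scores into a single Translation Quality value.

    Source-side fluency and verity penalties are subtracted from the
    target-side ones, so the translation is not blamed for problems
    already present in the source text.
    """
    return 100 - acc_pt - (flu_pt - flu_ps) - (ver_pt - ver_ps)

# Illustrative penalties: 3 accuracy points, 4 fluency points on the target
# vs 1 on the source, no verity issues
print(mqm_tq(acc_pt=3, flu_pt=4, flu_ps=1, ver_pt=0, ver_ps=0))  # 94
```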

Error counts

Issues:
Time consuming
Requires training, esp. to distinguish between
fine-grained error types
Different errors are more relevant for different
specifications: need to select and weight them accordingly

Post-editing
Productivity analysis: measure translation quality within a task. E.g. Autodesk - productivity test through post-editing:
2-day translation and post-editing, 37 participants
In-house Moses (Autodesk data: software)
Time spent on each segment

Post-editing
PET: Records time, keystrokes, edit distance

Post-editing - PET
How often post-editing (PE) a translation tool output is faster than translating from scratch (HT):

System  | Faster than HT
Google  | 94%
Moses   | 86.8%
Systran | 81.20%
Trados  | 72.40%

Comparing the time to translate from scratch with the time to PE MT, in seconds:

Annotator | HT (s) | PE (s) | HT/PE | PE/HT
Average   | 31.89  | 18.82  | 1.73  | 0.59
Deviation | 9.99   | 6.79   | 0.26  | 0.09
User satisfaction
Solving a problem, e.g. Intel measuring user satisfaction with un-edited MT
Translation is good if the customer can solve their problem
MT for Customer Support websites
Overall customer satisfaction: 75% for English→Chinese
95% reduction in cost
Project cycle from 10 days to 1 day
From 300 to 60,000 words translated/hour
Customers in China using MT texts were more satisfied
with support than natives using original texts (68%)!

Reading comprehension
Defense language proficiency test (Jones et al., 2005):

Reading comprehension

MT quality as a function of:
1. Text passage comprehension, as measured by answer accuracy, and
2. Time taken to complete a test item (read a passage + answer its questions)
Reading comprehension
Compared to Human Translation (HT):

Task-based metrics

Issues:
Final goal needs to be very clear
Can be more cost/time consuming
Final task has to have a meaningful metric
Other elements may affect the final quality measurement
(e.g. the Chinese vs. American user populations in the Intel study)

Automatic metrics

Compare output of an MT system to one or more
reference (usually human) translations: how close is the
MT output to the reference translation?
Numerous metrics: BLEU, NIST, etc.
Advantages:
Fast and cheap, minimal human labour, no need for
bilingual speakers
Once test set is created, can be reused many times
Can be used on an on-going basis during system
development to test changes

Automatic metrics

Disadvantages:
Very few metrics look at variable ways of saying the
same thing (word-level): stems, synonyms, paraphrases
Individual sentence scores are not very reliable,
aggregate scores on a large test set are required
Very few of these metrics penalise different
mismatches differently
Reference translations are only a subset of the
possible good translations

String matching
BLEU: BiLingual Evaluation Understudy
Most widely used metric, both for MT system
evaluation/comparison and SMT tuning
Matching of n-grams between MT and Ref: rewards
same words in equal order
#clip(g): count of n-grams g occurring in a hypothesis sentence h, clipped by the number of times g appears in the reference sentence for h
#(g): total number of n-grams in the hypotheses
n-gram precision p_n for a set of MT translations H:

p_n = Σ_{h∈H} Σ_{g∈ngrams(h)} #clip(g) / Σ_{h∈H} Σ_{g∈ngrams(h)} #(g)
BLEU
Combine the mean of the logs of the 1..N n-gram precisions: (1/N) Σ_n log p_n

Bias towards translations with fewer words (denominator)
Brevity penalty to penalise MT sentences that are shorter than the reference
Compares the overall number of words w_h of the entire hypotheses set with the reference length w_r:

BP = 1 if w_h ≥ w_r;  BP = e^(1 − w_r/w_h) otherwise

BLEU = BP · exp((1/N) Σ_{n=1}^{N} log p_n)
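A minimal Python sketch of this computation, with a single reference per hypothesis and no smoothing (function and variable names are ours, not from any particular toolkit):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions + brevity penalty."""
    w_h = sum(len(h) for h in hypotheses)   # total hypothesis words
    w_r = sum(len(r) for r in references)   # total reference words
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        clipped = total = 0
        for h, r in zip(hypotheses, references):
            h_counts, r_counts = Counter(ngrams(h, n)), Counter(ngrams(r, n))
            clipped += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total += sum(h_counts.values())
        if clipped == 0:
            return 0.0  # any zero precision zeroes unsmoothed BLEU
        log_p_sum += math.log(clipped / total)
    bp = 1.0 if w_h >= w_r else math.exp(1 - w_r / w_h)
    return bp * math.exp(log_p_sum / max_n)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq’s weapons will give army"
print(bleu([mt.split()], [ref.split()]))  # 0.0: the 3-gram precision is 0
```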

BLEU
Scale: 0-1, but highly dependent on the test set
Rewards fluency by matching higher-order n-grams (up to 4)
Adequacy rewarded by unigrams and brevity penalty –
poor model of recall
Synonyms and paraphrases only handled if they are in
any of the reference translations
All tokens are equally weighted: missing out on a
content word = missing out on a determiner
Better for evaluating changes in the same system than
comparing different MT architectures

BLEU

Not good at sentence-level, unless smoothing is applied:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army

1-gram precision: 4/8
2-gram precision: 1/7
3-gram precision: 0/6
4-gram precision: 0/5
BLEU = 0

BLEU
Importance of clipping and brevity penalty
Ref1: the Iraqi weapons are to be handed over to the army within two weeks
Ref2: the Iraqi weapons will be surrendered to the army in two weeks
MT: the the the the
Count for the should be clipped at 2: the max count of the word in any reference. Unigram score = 2/4 (not 4/4)

MT: Iraqi weapons will be
1-gram precision: 4/4
2-gram precision: 3/3
3-gram precision: 2/2
4-gram precision: 1/1
Precision (p_n) = 1
Precision score penalised by the brevity penalty because w_h < w_r

Edit distance
WER: Word Error Rate:
Levenshtein edit distance
Minimum proportion of insertions, deletions, and substitutions needed to transform an MT sentence into the reference sentence
Heavily penalises reorderings: correct translation in the wrong location counts as a deletion + an insertion

WER = (S + D + I) / N

PER: Position-independent word Error Rate:
Does not penalise reorderings: output and reference sentences are unordered sets
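A minimal dynamic-programming sketch of WER, with whitespace tokenisation and unit edit costs (both simplifying assumptions):

```python
def wer(hyp, ref):
    """Word Error Rate: Levenshtein distance over tokens / reference length."""
    h, r = hyp.split(), ref.split()
    # d[i][j]: edits to turn the first i words of hyp into the first j of ref
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        d[i][0] = i
    for j in range(1, len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(h)][len(r)] / len(r)

print(wer("in two weeks Iraq’s weapons will give army",
          "the Iraqi weapons are to be handed over to the army within two weeks"))
```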
Edit distance: TER
TER: Translation Error Rate
Adds a shift operation

REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times
HYP: [this week] the saudis denied information published in the ***** new york times

1 Shift, 2 Substitutions, 1 Deletion → 4 Edits:
TER = 4/13 = 0.31

Human-targeted TER (HTER): TER between MT and its post-edited version
Alignment-based
METEOR:
Unigram Precision and Recall
Align MT output with reference. Take best scoring pair
for multiple refs.
Matching considers word inflection variations (stems),
synonyms/paraphrases
Fluency addressed via a direct penalty: fragmentation of
the matching
METEOR score = F-mean score discounted for
fragmentation = F-mean * (1 - DF)

METEOR
Example:
Ref: the Iraqi weapons are to be handed over to the army within two weeks
MT: in two weeks Iraq’s weapons will give army
Matching:
Ref: Iraqi weapons army two weeks
MT: two weeks Iraq’s weapons army

P = 5/8 = 0.625
R = 5/14 = 0.357
F-mean = 10*P*R/(9*P+R) = 0.3731
Fragmentation: 3 fragments of 5 matched words = 3/5 = 0.6
Discounting factor: DF = 0.5 * (0.6^3) = 0.108
METEOR = F-mean * (1 - DF) = 0.373 * 0.892 = 0.333
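The arithmetic of this example can be packed into a small function. The constants (10/9 in the F-mean, 0.5 and 3 in the penalty) follow the slide; real METEOR implementations tune them per language:

```python
def meteor(matches, chunks, hyp_len, ref_len):
    """Reproduce the slide's METEOR arithmetic from precomputed matches."""
    p = matches / hyp_len                 # unigram precision
    r = matches / ref_len                 # unigram recall
    fmean = 10 * p * r / (9 * p + r)      # recall-weighted harmonic mean
    frag = chunks / matches               # fragmentation of the matching
    df = 0.5 * frag ** 3                  # discounting factor
    return fmean * (1 - df)

# Slide example: 5 matched unigrams in 3 chunks; MT has 8 words, Ref has 14
print(meteor(matches=5, chunks=3, hyp_len=8, ref_len=14))  # ≈ 0.333
```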
Others
WMT shared task on metrics:
TerrorCat
DepRef
MEANT and TINE
TESLA
LEPOR
ROSE
AMBER
Many other linguistically motivated metrics where
matching is not done at the word-level (only)
...
Overview
Quality estimation (QE): metrics that provide an
estimate on the quality of unseen translations
No access to reference translations
Quality defined by the data
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Quality = How much effort to fix it?
Framework

Training: examples of source texts and their translations (X), together with quality scores for each example (Y), go through feature extraction; the features and scores feed a machine learning algorithm, which produces a QE model.
Prediction: a new source text xs' and its translation xt' (output of the MT system) go through the same feature extraction; the QE model maps the features to a quality score y'.
Framework

Main components to build a QE system:
1. Definition of quality: what to predict
2. (Human) labelled data (for quality)
3. Features
4. Machine learning algorithm

Definition of quality

Predict 1-N absolute scores for adequacy/fluency
Predict 1-N absolute scores for post-editing effort
Predict average post-editing time per word
Predict relative rankings
Predict relative rankings for the same source
Predict percentage of edits needed for a sentence
Predict word-level edits and their types
Predict BLEU, etc. scores for a document

Datasets

SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html
LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download
LIMSI (14K, fr-en, en-fr, 2 post-editors): http://web.limsi.fr/Individu/wisniews/recherche/index.html

Features

Feature types (diagram): complexity indicators extracted from the source text; confidence indicators from the MT system; fluency indicators from the translation; and adequacy indicators relating source and translation.
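A few of these indicators are shallow enough to sketch directly. The feature names below are illustrative; real extractors (e.g. QuEst) add LM probabilities, translation-table statistics, n-gram frequency quartiles, and so on:

```python
def shallow_features(source, translation):
    """Three shallow, black-box QE features of the kind listed above."""
    src, tgt = source.split(), translation.split()
    return {
        "src_num_tokens": len(src),                             # complexity
        "avg_src_token_length": sum(map(len, src)) / len(src),  # complexity
        "tgt_src_length_ratio": len(tgt) / len(src),            # adequacy cue
    }

print(shallow_features("The battery lasts 6 hours .",
                       "La batterie dure 6 heures ."))
```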

QuEst
Goal: framework to explore features for QE
Feature extractors for 150+ features of all types: Java
Machine learning: wrappers for a number of algorithms
in the scikit-learn toolkit, grid search, feature selection

Open source:
http://www.quest.dcs.shef.ac.uk/
State of the art in QE
WMT12-13 shared tasks
Sentence- and word-level estimation of PE effort
Datasets and language pairs:

Quality                                | Year     | Languages
1-5 subjective scores                  | WMT12    | en-es
Ranking all sentences best-worst       | WMT12/13 | en-es
HTER scores                            | WMT13    | en-es
Post-editing time                      | WMT13    | en-es
Word-level edits: change/keep          | WMT13    | en-es
Word-level edits: keep/delete/replace  | WMT13    | en-es
Ranking 5 MTs per source               | WMT13    | en-es; de-en

Evaluation metric:

MAE = Σ_{i=1}^{N} |H(s_i) − V(s_i)| / N
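In code, assuming predicted and gold quality scores as numeric arrays (RMSE is included because the result tables below report it alongside MAE):

```python
import numpy as np

def mae(pred, gold):
    """Mean Absolute Error between predicted and gold quality scores."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return np.abs(pred - gold).mean()

def rmse(pred, gold):
    """Root Mean Squared Error, the secondary metric in the WMT tables."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return np.sqrt(((pred - gold) ** 2).mean())

print(mae([3.1, 4.0, 2.2], [3, 5, 2]), rmse([3.1, 4.0, 2.2], [3, 5, 2]))
```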
Baseline system
Features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency
quartiles 1 and 4
% of seen source unigrams
SVM regression with RBF kernel, with the parameters γ, ε and C optimised using grid search and 5-fold cross-validation on the training set
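A minimal scikit-learn sketch of this baseline setup; the random 17-feature matrix, the labels and the grid values are placeholders for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Placeholder data: one row of the 17 baseline features per translation,
# labelled with e.g. 1-5 post-editing effort scores
rng = np.random.default_rng(0)
X, y = rng.random((200, 17)), rng.random(200) * 4 + 1

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100],
                "gamma": [1e-3, 1e-2, 1e-1],
                "epsilon": [0.1, 0.2]},
    scoring="neg_mean_absolute_error",
    cv=5,  # 5-fold cross-validation on the training set, as in the baseline
)
search.fit(X, y)
quality_predictions = search.predict(X)  # in practice: a held-out test set
```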
Results - scoring sub-task (WMT12)

System ID                | MAE  | RMSE
• SDLLW M5PbestDeltaAvg  | 0.61 | 0.75
UU best                  | 0.64 | 0.79
SDLLW SVM                | 0.64 | 0.78
UU bltk                  | 0.64 | 0.79
Loria SVMlinear          | 0.68 | 0.82
UEdin                    | 0.68 | 0.82
TCD M5P-resources-only*  | 0.68 | 0.82
Baseline bb17 SVR        | 0.69 | 0.82
Loria SVMrbf             | 0.69 | 0.83
SJTU                     | 0.69 | 0.83
WLV-SHEF FS              | 0.69 | 0.85
PRHLT-UPV                | 0.70 | 0.85
WLV-SHEF BL              | 0.72 | 0.86
DCU-SYMC unconstrained   | 0.75 | 0.97
DFKI grcfs-mars          | 0.82 | 0.98
DFKI cfs-plsreg          | 0.82 | 0.99
UPC 1                    | 0.84 | 1.01
DCU-SYMC constrained     | 0.86 | 1.12
UPC 2                    | 0.87 | 1.04
TCD M5P-all              | 2.09 | 2.32

Results - scoring sub-task (WMT13)

System ID            | MAE   | RMSE
• SHEF FS            | 12.42 | 15.74
SHEF FS-AL           | 13.02 | 17.03
CNGL SVRPLS          | 13.26 | 16.82
LIMSI                | 13.32 | 17.22
DCU-SYMC combine     | 13.45 | 16.64
DCU-SYMC alltypes    | 13.51 | 17.14
CMU noB              | 13.84 | 17.46
CNGL SVR             | 13.85 | 17.28
FBK-UEdin extra      | 14.38 | 17.68
FBK-UEdin rand-svr   | 14.50 | 17.73
LORIA inctrain       | 14.79 | 18.34
Baseline bb17 SVR    | 14.81 | 18.22
TCD-CNGL open        | 14.81 | 19.00
LORIA inctraincont   | 14.83 | 18.17
TCD-CNGL restricted  | 15.20 | 19.59
CMU full             | 15.25 | 18.97
UMAC                 | 16.97 | 21.94

Open issues

Agreement between annotators
Absolute value judgements: difficult to achieve
consistency even in highly controlled settings
WMT12: 30% of initial dataset discarded
Remaining annotations had to be scaled

Open issues
Annotation costs: active learning to select subset of
instances to be annotated (Beck et al., ACL 2013)

Open issues
Curse of dimensionality: feature selection to identify relevant info for the dataset (Shah et al., MT Summit 2013)

[Chart: MAE of the BL, AF, BL+PR, AF+PR and FS feature configurations on the WMT12, EAMT11 (en-es, fr-en), EAMT09 (s1-s4) and GALE11 (s1-s2) datasets]

Common feature set identified, but nuanced subsets for specific datasets

Open issues
How to use estimated PE effort scores? Do users prefer detailed estimates (sub-sentence level), an overall estimate for the complete sentence, or not seeing bad sentences at all?
Too much information vs hard-to-interpret scores
IBM’s Goodness metric

The MATECAT project is investigating this
Conclusions
(Machine) Translation evaluation & estimation: still an
open problem
Different metrics for: different purposes/users, different
needs, different notions of quality
Quality estimation: learning of these different notions,
but requires labelled data
Solution:
Think of what quality means in your scenario
Measure significance
Measure agreement if using manual metrics
Use various metrics
Invent your own metric!

 
4 Steps to Epiphany: Streamlining Translation Quality Management at Larger La...
4 Steps to Epiphany: Streamlining Translation Quality Management at Larger La...4 Steps to Epiphany: Streamlining Translation Quality Management at Larger La...
4 Steps to Epiphany: Streamlining Translation Quality Management at Larger La...
 
ATC Summit 2016: The 7th Habit of 7 Habits of Effective MT Systems
ATC Summit 2016: The 7th Habit of 7 Habits of Effective MT SystemsATC Summit 2016: The 7th Habit of 7 Habits of Effective MT Systems
ATC Summit 2016: The 7th Habit of 7 Habits of Effective MT Systems
 
Amazon sentimental analysis
Amazon sentimental analysisAmazon sentimental analysis
Amazon sentimental analysis
 
LavaCon 2015: Efficient Translation Management - 5 Specific Metrics That Wil...
LavaCon 2015:  Efficient Translation Management - 5 Specific Metrics That Wil...LavaCon 2015:  Efficient Translation Management - 5 Specific Metrics That Wil...
LavaCon 2015: Efficient Translation Management - 5 Specific Metrics That Wil...
 
NISO Apr 29 Virtual Conference: ‘Good Enough’: Applying a Holistic Approach f...
NISO Apr 29 Virtual Conference: ‘Good Enough’: Applying a Holistic Approach f...NISO Apr 29 Virtual Conference: ‘Good Enough’: Applying a Holistic Approach f...
NISO Apr 29 Virtual Conference: ‘Good Enough’: Applying a Holistic Approach f...
 
Tech capabilities with_sa
Tech capabilities with_saTech capabilities with_sa
Tech capabilities with_sa
 
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
Pushing translation quality upstream (Klaus Fleischman, Managing Director of ...
 
BENCHMARKING MINI-SERIES PART #1: Proving Value & Quantifying the Impact of U...
BENCHMARKING MINI-SERIES PART #1: Proving Value & Quantifying the Impact of U...BENCHMARKING MINI-SERIES PART #1: Proving Value & Quantifying the Impact of U...
BENCHMARKING MINI-SERIES PART #1: Proving Value & Quantifying the Impact of U...
 
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
 
Serge astm-presentation-chicago-2014-final
Serge astm-presentation-chicago-2014-finalSerge astm-presentation-chicago-2014-final
Serge astm-presentation-chicago-2014-final
 

Mais de RIILP

Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD RIILP
 
Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic RIILP
 
Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones RIILP
 
Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones RIILP
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO RIILP
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic RIILP
 
Tony O'Dowd - KantanMT
Tony O'Dowd -  KantanMT Tony O'Dowd -  KantanMT
Tony O'Dowd - KantanMT RIILP
 
Santanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARSantanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARRIILP
 
Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU RIILP
 
Anna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMAAnna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMARIILP
 
Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD  Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD RIILP
 
Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW RIILP
 
Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA RIILP
 
Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU RIILP
 
Liling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARLiling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARRIILP
 
Sandra de luca - Acclaro
Sandra de luca - AcclaroSandra de luca - Acclaro
Sandra de luca - AcclaroRIILP
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015RIILP
 
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015RIILP
 
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015RIILP
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015RIILP
 

Mais de RIILP (20)

Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD
 
Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic
 
Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones
 
Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
 
Tony O'Dowd - KantanMT
Tony O'Dowd -  KantanMT Tony O'Dowd -  KantanMT
Tony O'Dowd - KantanMT
 
Santanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARSantanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAAR
 
Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU
 
Anna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMAAnna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMA
 
Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD  Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD
 
Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW
 
Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA
 
Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU
 
Liling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARLiling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAAR
 
Sandra de luca - Acclaro
Sandra de luca - AcclaroSandra de luca - Acclaro
Sandra de luca - Acclaro
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
 
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
 
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
 

Último

Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Kirill Klimov
 
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...ShrutiBose4
 
Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737Riya Pathan
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailAriel592675
 
Innovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfInnovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfrichard876048
 
Investment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy CheruiyotInvestment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy Cheruiyotictsugar
 
Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Seta Wicaksana
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfJos Voskuil
 
IoT Insurance Observatory: summary 2024
IoT Insurance Observatory:  summary 2024IoT Insurance Observatory:  summary 2024
IoT Insurance Observatory: summary 2024Matteo Carbone
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCRashishs7044
 
India Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample ReportIndia Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample ReportMintel Group
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCRashishs7044
 
Marketplace and Quality Assurance Presentation - Vincent Chirchir
Marketplace and Quality Assurance Presentation - Vincent ChirchirMarketplace and Quality Assurance Presentation - Vincent Chirchir
Marketplace and Quality Assurance Presentation - Vincent Chirchirictsugar
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Pereraictsugar
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCRashishs7044
 
Islamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in IslamabadIslamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in IslamabadAyesha Khan
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy Verified Accounts
 
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...ictsugar
 
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCRashishs7044
 

Último (20)

Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024Flow Your Strategy at Flight Levels Day 2024
Flow Your Strategy at Flight Levels Day 2024
 
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
Ms Motilal Padampat Sugar Mills vs. State of Uttar Pradesh & Ors. - A Milesto...
 
Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737Independent Call Girls Andheri Nightlaila 9967584737
Independent Call Girls Andheri Nightlaila 9967584737
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detail
 
Innovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdfInnovation Conference 5th March 2024.pdf
Innovation Conference 5th March 2024.pdf
 
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCREnjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
Enjoy ➥8448380779▻ Call Girls In Sector 18 Noida Escorts Delhi NCR
 
Investment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy CheruiyotInvestment in The Coconut Industry by Nancy Cheruiyot
Investment in The Coconut Industry by Nancy Cheruiyot
 
Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...Ten Organizational Design Models to align structure and operations to busines...
Ten Organizational Design Models to align structure and operations to busines...
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdf
 
IoT Insurance Observatory: summary 2024
IoT Insurance Observatory:  summary 2024IoT Insurance Observatory:  summary 2024
IoT Insurance Observatory: summary 2024
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR
 
India Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample ReportIndia Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample Report
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
 
Marketplace and Quality Assurance Presentation - Vincent Chirchir
Marketplace and Quality Assurance Presentation - Vincent ChirchirMarketplace and Quality Assurance Presentation - Vincent Chirchir
Marketplace and Quality Assurance Presentation - Vincent Chirchir
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Perera
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
 
Islamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in IslamabadIslamabad Escorts | Call 03070433345 | Escort Service in Islamabad
Islamabad Escorts | Call 03070433345 | Escort Service in Islamabad
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail Accounts
 
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
 
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
 

10. Lucia Specia (USFD) Evaluation of Machine Translation

  • 10-11. Overview
  Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
  MT: Six-hour battery, 30 minutes to full charge last.
  - OK for gisting: meaning preserved
  - Very costly for post-editing if the style is to be preserved
  • 12-13. Overview: how do we measure quality?
  - Manual metrics: error counts, ranking, acceptability, 1-N judgements on fluency/adequacy
  - Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user satisfaction, reading comprehension
  - Automatic metrics:
    - based on human references: BLEU, METEOR, TER, ...
    - reference-less: quality estimation
  • 14. Outline - section 2: Manual metrics
  • 15. Judgements on an n-point scale: adequacy using a 5-point scale (NIST-like)
  5 All meaning expressed in the source fragment appears in the translation fragment.
  4 Most of the source fragment meaning is expressed in the translation fragment.
  3 Much of the source fragment meaning is expressed in the translation fragment.
  2 Little of the source fragment meaning is expressed in the translation fragment.
  1 None of the meaning expressed in the source fragment is expressed in the translation fragment.
  • 16. Judgements on an n-point scale: fluency using a 5-point scale (NIST-like)
  5 Native-language fluency. No grammar errors, good word choice and syntactic structure in the translation fragment.
  4 Near-native fluency. Few terminology or grammar errors, which don't impact the overall understanding of the meaning.
  3 Not very fluent. About half of the translation contains errors.
  2 Little fluency. Wrong word choice, poor grammar and syntactic structure.
  1 No fluency. Absolutely ungrammatical and for the most part doesn't make any sense.
  • 17-18. Judgements on an n-point scale: issues
  - Subjective judgements
  - Hard to reach significant agreement
  - Is it reliable at all? Can we use multiple annotators?
  - Are fluency and adequacy really separable?
    Ref: Absolutely ungrammatical and for the most part doesn't make any sense.
    MT: Absolutely sense doesn't ungrammatical for the and most make any part.
  • 19. Ranking: WMT-13 Appraise tool: rank translations from best to worst (ties allowed)
  • 20-22. Ranking: issues
  - Subjective judgements: what does "best" mean?
  - Hard to judge for long sentences:
    Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators.
    MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators.
    MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation.
  - Rankings only serve for comparison purposes: the best system might not be good enough; absolute evaluation can do both
  • 23-24. Error counts
  - More fine-grained
  - Aimed at diagnosis of MT systems and quality control of human translation
  - E.g. Multidimensional Quality Metrics (MQM):
    - machine and human translation quality
    - takes the quality of the source text into account
    - the actual metric is based on a specification
  • 25. MQM: issues are selected based on a given specification (dimensions): language/locale, subject field/domain, text type, audience, purpose, register, style, content correspondence, output modality, ...
  • 26. MQM issue types (core): 120 categories altogether [figure: issue-type hierarchy]
  • 27-28. MQM
  - Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0
  - Combining issue types: TQ = 100 − Acc_P − (Flu_PT − Flu_PS) − (Ver_PT − Ver_PS)
  - translate5: open-source graphical (Web) interface for inline error annotation: www.translate5.net
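As a rough illustration of how the combination formula works, here is a minimal sketch. It assumes each term is an accumulated penalty in percentage points, with Acc/Flu/Ver the accuracy, fluency and verity dimensions and the PT/PS subscripts the target- and source-side penalties; the real metric selects and weights issues according to the chosen specification.

```python
# A sketch of the MQM combination formula above, assuming each argument is
# an accumulated penalty in percentage points: Acc_P (accuracy), Flu_PT /
# Flu_PS (fluency in target / source) and Ver_PT / Ver_PS (verity in
# target / source). Source-side penalties are subtracted so the translation
# is not penalised for problems already present in the source text.
def mqm_tq(acc_pt, flu_pt, flu_ps, ver_pt, ver_ps):
    return 100 - acc_pt - (flu_pt - flu_ps) - (ver_pt - ver_ps)

# e.g. 4 accuracy points and 6 fluency points in the target, 2 of the
# fluency points already present in the source, no verity issues:
print(mqm_tq(4, 6, 2, 0, 0))  # 92
```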
  • 30. Error counts: issues
  - Time-consuming
  - Requires training, especially to distinguish between fine-grained error types
  - Different errors are more relevant for different specifications: they need to be selected and weighted accordingly
  • 31. Outline - section 3: Task-based metrics
  • 32. Post-editing: productivity analysis, i.e. measure translation quality within a task. E.g. Autodesk's productivity test through post-editing:
  - 2 days of translation and post-editing, 37 participants
  - in-house Moses (Autodesk data: software)
  - time spent on each segment
  • 33. Post-editing: PET records time, keystrokes and edit distance
  • 34-35. Post-editing - PET [screenshots of the PET post-editing tool]
  • 36-37. Post-editing - PET
  How often post-editing (PE) a translation tool's output is faster than translating from scratch (HT):
    System          Google   Moses   Systran   Trados
    Faster than HT  94%      86.8%   81.2%     72.4%
  Comparing the time to translate from scratch with the time to post-edit the MT output, in seconds:
             Average   Deviation
    HT (s)   31.89     9.99
    PE (s)   18.82     6.79
    HT/PE    1.73      0.26
    PE/HT    0.59      0.09
  • 38-40. User satisfaction: solving a problem
  - E.g. Intel measuring user satisfaction with unedited MT: the translation is good if the customer can solve their problem
  - MT for customer-support websites:
    - overall customer satisfaction: 75% for English→Chinese
    - 95% reduction in cost
    - project cycle from 10 days to 1 day
    - from 300 to 60,000 words translated/hour
  - Customers in China using MT texts were more satisfied with support than natives using the original texts (68%)!
  • 41-42. Reading comprehension: Defense Language Proficiency Test (Jones et al., 2005) [sample test items]
  • 43. Reading comprehension: MT quality as a function of
  1 text passage comprehension, as measured by answer accuracy, and
  2 time taken to complete a test item (read a passage + answer its questions)
  • 44-45. Reading comprehension: compared to human translation (HT) [result charts]
  • 46. Task-based metrics: issues
  - The final goal needs to be very clear
  - Can be more cost/time-consuming
  - The final task has to have a meaningful metric
  - Other elements may affect the final quality measurement (e.g. Chinese vs. American readers)
  • 47. Outline - section 4: Reference-based metrics
  • 48. Automatic metrics: compare the output of an MT system to one or more reference (usually human) translations: how close is the MT output to the reference translation? Numerous metrics: BLEU, NIST, etc.
  Advantages:
  - fast and cheap, minimal human labour, no need for bilingual speakers
  - once a test set is created, it can be reused many times
  - can be used on an ongoing basis during system development to test changes
  • 49-51. Automatic metrics: disadvantages
  - Very few metrics look at variable ways of saying the same thing (at the word level): stems, synonyms, paraphrases
  - Individual sentence scores are not very reliable; aggregate scores on a large test set are required
  - Very few of these metrics penalise different mismatches differently
  - Reference translations are only a subset of the possible good translations
  • 52. String matching - BLEU: BiLingual Evaluation Understudy
  - Most widely used metric, both for MT system evaluation/comparison and for SMT tuning
  - Matches n-grams between the MT output and the reference: rewards the same words in the same order
  - #clip(g): count of a reference n-gram g occurring in a hypothesis sentence h, clipped by the number of times g appears in the reference sentence for h; #(g): number of n-grams in the hypotheses
  - n-gram precision p_n for a set of MT translations H:
    p_n = ( Σ_{h∈H} Σ_{g∈ngrams(h)} #clip(g) ) / ( Σ_{h∈H} Σ_{g∈ngrams(h)} #(g) )
  • 53-55. BLEU
  - Combine the 1..N n-gram precisions via the mean of their logs: exp( (1/N) Σ_{n=1}^{N} log p_n )
  - Precision is biased towards translations with fewer words (smaller denominator), so a brevity penalty penalises MT output that is shorter than the reference; it compares the overall number of words w_h of the entire hypothesis set with the reference length w_r:
    BP = 1 if w_h ≥ w_r; e^(1 − w_r/w_h) otherwise
  - BLEU = BP · exp( (1/N) Σ_{n=1}^{N} log p_n )
  • 56. BLEU
  - Scale: 0-1, but highly dependent on the test set
  - Rewards fluency by matching longer n-grams (up to 4)
  - Adequacy rewarded by unigrams and the brevity penalty: a poor model of recall
  - Synonyms and paraphrases are only handled if they occur in one of the reference translations
  - All tokens are equally weighted: missing out on a content word = missing out on a determiner
  - Better for evaluating changes to the same system than for comparing different MT architectures
  • 57. BLEU is not good at sentence level unless smoothing is applied:
  Ref: the Iraqi weapons are to be handed over to the army within two weeks
  MT: in two weeks Iraq's weapons will give army
  1-gram precision: 4/8; 2-gram: 1/7; 3-gram: 0/6; 4-gram: 0/5 → BLEU = 0
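To make the computation concrete, here is a minimal, unsmoothed BLEU sketch implementing the clipping, geometric mean and brevity penalty defined above; on the single sentence pair above it returns 0, since the 3- and 4-gram precisions are 0. Real evaluations should use a standard implementation.

```python
# A minimal, unsmoothed corpus-BLEU sketch (single reference per segment):
# clipped n-gram precisions for n = 1..4, geometric mean, brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypotheses, references, max_n=4):
    clipped = Counter()  # clipped n-gram matches, per order n
    totals = Counter()   # hypothesis n-gram counts, per order n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            # clip each hypothesis n-gram count by its count in the reference
            clipped[n] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n] += sum(hyp_counts.values())
    precisions = [clipped[n] / totals[n] if totals[n] else 0.0
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # log(0): unsmoothed BLEU collapses to 0
        return 0.0
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the Iraqi weapons are to be handed over to the army within two weeks".split()
mt = "in two weeks Iraq's weapons will give army".split()
print(bleu([mt], [ref]))  # 0.0: the 3-gram and 4-gram precisions are 0
```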
  • 58-59. BLEU: importance of clipping and the brevity penalty
  Ref1: the Iraqi weapons are to be handed over to the army within two weeks
  Ref2: the Iraqi weapons will be surrendered to the army in two weeks
  MT: the the the the
  The count for "the" is clipped at 2, the maximum count of the word in any single reference: unigram score = 2/4 (not 4/4)
  MT: Iraqi weapons will be
  1-gram precision: 4/4; 2-gram: 3/3; 3-gram: 2/2; 4-gram: 1/1 → precision p_n = 1, but the score is penalised by the brevity penalty because the hypothesis is shorter than the reference
  • 61. Edit distance
  WER (Word Error Rate): Levenshtein edit distance; the minimum proportion of insertions, deletions and substitutions needed to transform an MT sentence into the reference sentence:
    WER = (S + D + I) / N
  Heavily penalises reorderings: a correct translation in the wrong location counts as a deletion plus an insertion.
  PER (Position-independent word Error Rate): does not penalise reorderings; output and reference sentences are treated as unordered sets.
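A minimal WER sketch via token-level Levenshtein distance, with a hypothetical toy example:

```python
# WER as defined above: token-level Levenshtein distance (substitutions,
# insertions, deletions), normalised by the reference length N.
def wer(hyp, ref):
    n, m = len(ref), len(hyp)
    # d[i][j] = edits needed to turn the first j hypothesis tokens
    # into the first i reference tokens
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[n][m] / n

# toy example: one insertion is needed, so WER = 1/4
print(wer("a b d".split(), "a b c d".split()))  # 0.25
```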
  • 62-65. Edit distance: TER (Translation Error Rate)
  Adds a shift operation (moving a block of words) to insertions, deletions and substitutions:
  REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times
  HYP: [this week] the saudis denied information published in the ***** new york times
  1 shift, 2 substitutions, 1 deletion → 4 edits: TER = 4/13 = 0.31
  Human-targeted TER (HTER): TER between the MT output and its post-edited version
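The arithmetic of the example above, as a sketch; note that real TER tools also search for the minimal edit sequence, including shifts, which is the hard part this helper takes as given:

```python
# TER for the worked example: edit operations taken as given
# (1 shift, 2 substitutions, 1 deletion against a 13-word reference).
def ter(shifts, substitutions, deletions, insertions, ref_len):
    return (shifts + substitutions + deletions + insertions) / ref_len

print(round(ter(1, 2, 1, 0, 13), 2))  # 0.31
# HTER is the same computation with the post-edited output as the reference.
```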
  • 67. Alignment-based: METEOR (unigram precision and recall)
  - Aligns the MT output with the reference; takes the best-scoring pair when there are multiple references
  - Matching considers word inflection variants (stems) and synonyms/paraphrases
  - Fluency is addressed via a direct penalty: fragmentation of the matching
  - METEOR score = F-mean discounted for fragmentation = F-mean * (1 - DF)
  • 68-71. METEOR example
  Ref: the Iraqi weapons are to be handed over to the army within two weeks
  MT: in two weeks Iraq's weapons will give army
  Matching: Ref: Iraqi weapons army two weeks ↔ MT: two weeks Iraq's weapons army
  P = 5/8 = 0.625; R = 5/14 = 0.357
  F-mean = 10*P*R / (9*P + R) = 0.3731
  Fragmentation: 3 fragments of 5 matched words = 3/5 = 0.6
  Discounting factor: DF = 0.5 * (0.6^3) = 0.108
  METEOR = F-mean * (1 - DF) = 0.3731 * 0.892 = 0.333
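The same numbers reproduced in a short sketch; the match counts are taken as given, since the stem/synonym/paraphrase alignment step is outside its scope:

```python
# METEOR arithmetic from the example above: recall-weighted F-mean plus
# fragmentation discount, starting from already-computed match counts.
def meteor_from_counts(matches, hyp_len, ref_len, fragments,
                       gamma=0.5, beta=3.0):
    p = matches / hyp_len              # unigram precision
    r = matches / ref_len              # unigram recall
    fmean = 10 * p * r / (9 * p + r)   # recall-weighted harmonic mean
    frag = fragments / matches         # fragments per matched word
    df = gamma * frag ** beta          # fragmentation discount
    return fmean * (1 - df)

# MT has 8 tokens, Ref has 14; 5 matched unigrams in 3 contiguous fragments
print(round(meteor_from_counts(5, 8, 14, 3), 3))  # 0.333
```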
  • 72. Others, from the WMT shared task on metrics: TerrorCat, DepRef, MEANT and TINE, TESLA, LEPOR, ROSE, AMBER, and many other linguistically motivated metrics where matching is not done (only) at the word level, ...
  • 73. Outline - section 5: Quality estimation
  • 74-79. Overview: quality estimation (QE) metrics provide an estimate of the quality of unseen translations
  - No access to reference translations
  - Quality is defined by the data:
    - Quality = Can we publish it as is?
    - Quality = Can a reader get the gist?
    - Quality = Is it worth post-editing it?
    - Quality = How much effort to fix it?
  • 80-81. Framework [diagrams]
  Training: examples of source texts and their translations (X) plus quality scores for the examples in X (Y) → feature extraction → machine learning → QE model
  Prediction: a new source text x_s' and its machine translation x_t' → feature extraction → QE model → quality score y'
  • 82. Framework: main components to build a QE system
  1 Definition of quality: what to predict
  2 (Human-)labelled data (for quality)
  3 Features
  4 Machine learning algorithm
  • 83. Definition of quality
  - Predict 1-N absolute scores for adequacy/fluency
  - Predict 1-N absolute scores for post-editing effort
  - Predict average post-editing time per word
  - Predict relative rankings
  - Predict relative rankings for the same source
  - Predict the percentage of edits needed for a sentence
  - Predict word-level edits and their types
  - Predict BLEU etc. scores for a document
  • 84. Datasets
  - SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html
  - LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download
  - LIMSI (14K, fr-en, en-fr, 2 post-editors): http://web.limsi.fr/Individu/wisniews/recherche/index.html
  • 85. Features [diagram]: complexity indicators from the source text; confidence indicators from the MT system; fluency indicators from the translation; adequacy indicators from the source-translation pair
  • 86. QuEst
  - Goal: a framework to explore features for QE
  - Feature extractors for 150+ features of all types (Java)
  - Machine learning: wrappers for a number of algorithms in the scikit-learn toolkit, grid search, feature selection
  - Open source: http://www.quest.dcs.shef.ac.uk/
• 90. State of the art in QE

WMT12-13 shared tasks
Sentence- and word-level estimation of PE effort

Datasets and language pairs:

Label                                    Year        Languages
Quality 1-5 subjective scores            WMT12       en-es
Ranking all sentences best-worst         WMT12/13    en-es
HTER scores                              WMT13       en-es
Post-editing time                        WMT13       en-es
Word-level edits: change/keep            WMT13       en-es
Word-level edits: keep/delete/replace    WMT13       en-es
Ranking 5 MTs per source                 WMT13       en-es; de-en

Evaluation metric: MAE = \frac{1}{N} \sum_{i=1}^{N} |H(s_i) - V(s_i)|
where H(s_i) is the predicted score and V(s_i) the gold-standard value for sentence s_i
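The error metrics are simple to reproduce. A sketch of MAE (the formula above) and RMSE (also reported in the result tables below), with hypothetical predictions and gold labels:

```python
# MAE and RMSE for predicted vs. gold quality scores, as used at WMT12-13.
import numpy as np

def mae(pred, gold):
    return np.mean(np.abs(np.asarray(pred) - np.asarray(gold)))

def rmse(pred, gold):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(gold)) ** 2))

pred = [3.1, 4.0, 2.5]   # hypothetical predictions H(s_i)
gold = [3.5, 4.2, 2.0]   # hypothetical gold labels V(s_i)
print(mae(pred, gold), rmse(pred, gold))
```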
• 91. Baseline system

Features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
% of seen source unigrams

(A sketch of some of these extractors follows.)
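A sketch of some of these extractors. This baseline appears as "Baseline bb17 SVR" in the result tables below, i.e. 17 black-box features; the LM probabilities and per-word translation counts need external resources (language models, lexical tables), so only the shallow, resource-free counts are shown here.

```python
# Hedged sketch of a few of the baseline black-box features; the remaining
# ones (LM probability, translations per source word, n-gram quartiles)
# require external models and are omitted.
import string
from collections import Counter

def baseline_features(source, target):
    src, tgt = source.split(), target.split()
    tgt_counts = Counter(tgt)
    return {
        "src_num_tokens": len(src),
        "tgt_num_tokens": len(tgt),
        "src_avg_token_len": sum(map(len, src)) / len(src),
        "tgt_avg_occurrences": sum(tgt_counts.values()) / len(tgt_counts),
        "src_punct": sum(ch in string.punctuation for ch in source),
        "tgt_punct": sum(ch in string.punctuation for ch in target),
    }

print(baseline_features("the house is big .", "la casa es muy grande ."))
```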
• 92. Baseline system

Learner: SVM regression with RBF kernel, with the parameters γ, ε and C optimised using grid search and 5-fold cross-validation on the training set (sketched below)
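A sketch of this learner with scikit-learn, assuming invented data and an illustrative parameter grid (the shared-task grid values are not given on the slide):

```python
# Epsilon-SVR with RBF kernel, tuning gamma, epsilon and C by grid search
# with 5-fold cross-validation. Grid values are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(100, 17)            # 100 sentences x 17 baseline features
y = rng.uniform(1, 5, 100)       # e.g. 1-5 quality labels

grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10], "gamma": [0.01, 0.1], "epsilon": [0.1, 0.2]},
    scoring="neg_mean_absolute_error",   # matches the MAE evaluation
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```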
• 93. Results - scoring sub-task (WMT12)

System ID                   MAE    RMSE
• SDLLW M5PbestDeltaAvg     0.61   0.75
UU best                     0.64   0.79
SDLLW SVM                   0.64   0.78
UU bltk                     0.64   0.79
Loria SVMlinear             0.68   0.82
UEdin                       0.68   0.82
TCD M5P-resources-only*     0.68   0.82
Baseline bb17 SVR           0.69   0.82
Loria SVMrbf                0.69   0.83
SJTU                        0.69   0.83
WLV-SHEF FS                 0.69   0.85
PRHLT-UPV                   0.70   0.85
WLV-SHEF BL                 0.72   0.86
DCU-SYMC unconstrained      0.75   0.97
DFKI grcfs-mars             0.82   0.98
DFKI cfs-plsreg             0.82   0.99
UPC 1                       0.84   1.01
DCU-SYMC constrained        0.86   1.12
UPC 2                       0.87   1.04
TCD M5P-all                 2.09   2.32
• 94. Results - scoring sub-task (WMT13)

System ID              MAE     RMSE
• SHEF FS              12.42   15.74
SHEF FS-AL             13.02   17.03
CNGL SVRPLS            13.26   16.82
LIMSI                  13.32   17.22
DCU-SYMC combine       13.45   16.64
DCU-SYMC alltypes      13.51   17.14
CMU noB                13.84   17.46
CNGL SVR               13.85   17.28
FBK-UEdin extra        14.38   17.68
FBK-UEdin rand-svr     14.50   17.73
LORIA inctrain         14.79   18.34
Baseline bb17 SVR      14.81   18.22
TCD-CNGL open          14.81   19.00
LORIA inctraincont     14.83   18.17
TCD-CNGL restricted    15.20   19.59
CMU full               15.25   18.97
UMAC                   16.97   21.94
• 95. Open issues

Agreement between annotators
Absolute value judgements: difficult to achieve consistency even in highly controlled settings
WMT12: 30% of initial dataset discarded
Remaining annotations had to be scaled (one common scaling scheme is sketched below)
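The slides do not say how the remaining WMT12 annotations were scaled; one common recipe for making annotators with different biases comparable is per-annotator standardisation, sketched here with invented scores:

```python
# Per-annotator z-score standardisation: an assumed, generic scaling scheme,
# not necessarily the one applied at WMT12. Scores are invented.
import numpy as np

scores = {                      # annotator -> scores on shared sentences
    "annotator_1": np.array([4.0, 3.0, 5.0, 2.0]),
    "annotator_2": np.array([2.0, 1.0, 3.0, 1.0]),   # harsher judge
}
for name, s in scores.items():
    z = (s - s.mean()) / s.std()
    print(name, np.round(z, 2))   # comparable scales after standardisation
```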
• 96. Open issues

Annotation costs: active learning to select subset of instances to be annotated (Beck et al., ACL 2013); see the sketch below
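A hedged sketch of the idea: pool-based active learning that repeatedly queries the instance the current model is least certain about. Beck et al. compare several query strategies; the ensemble-variance heuristic below is an illustration, not their exact method.

```python
# Pool-based active learning for QE annotation: query the candidate whose
# prediction varies most across a bagged ensemble (an illustrative heuristic).
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

rng = np.random.RandomState(1)
pool_X = rng.rand(200, 17)                     # unlabelled candidates

def oracle(X):                                 # stand-in for a human annotator
    return X.sum(axis=1) / 4.0

labelled = list(range(10))                     # small seed set
for _ in range(5):                             # five annotation rounds
    X, y = pool_X[labelled], oracle(pool_X[labelled])
    ensemble = BaggingRegressor(SVR(), n_estimators=10, random_state=0).fit(X, y)
    preds = np.stack([e.predict(pool_X) for e in ensemble.estimators_])
    variance = preds.var(axis=0)
    variance[labelled] = -1                    # never re-query labelled items
    labelled.append(int(variance.argmax()))    # ask the annotator for this one
print(labelled[-5:])                           # the five queried instances
```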
• 98. Open issues

Curse of dimensionality: feature selection to identify relevant info for dataset (Shah et al., MT Summit 2013); a generic sketch follows

[Chart: scores for feature sets BL, AF, BL+PR, AF+PR and FS across nine datasets (WMT12, EAMT11 en-es, EAMT11 fr-en, EAMT09 s1-s4, GALE11 s1-s2), roughly in the 0.3-0.7 range]

Common feature set identified, but nuanced subsets for specific datasets
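A generic feature-selection sketch under stated assumptions (univariate selection from scikit-learn, synthetic data); it illustrates the idea rather than the method of Shah et al.:

```python
# With 150+ candidate features and small labelled sets, selecting a subset
# often beats using everything. Generic scikit-learn sketch, not the
# Shah et al. (MT Summit 2013) approach.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(2)
X = rng.rand(300, 150)                               # 150 candidate QE features
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.randn(300)     # only features 0 and 3 matter

selector = SelectKBest(f_regression, k=17).fit(X, y)
print(np.flatnonzero(selector.get_support())[:5])    # indices of kept features
```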
• 101. Open issues

How to use estimated PE effort scores?
Do users prefer detailed estimates (sub-sentence level), an overall estimate for the complete sentence, or not seeing bad sentences at all?
Too much information vs hard-to-interpret scores
IBM's Goodness metric
MATECAT project investigating it

(A sketch of threshold-based routing follows.)
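One hedged way of acting on such scores: threshold the predicted effort (e.g. HTER) to decide whether to publish, post-edit, or hide a sentence. The thresholds below are invented for illustration; real workflows would calibrate them per domain and post-editor.

```python
# Route each sentence by its predicted post-editing effort (here, HTER).
# Threshold values are illustrative assumptions, not established settings.
def route(predicted_hter):
    if predicted_hter < 0.1:
        return "publish as is"
    if predicted_hter < 0.4:
        return "send to post-editing"
    return "hide from post-editor / translate from scratch"

for score in (0.05, 0.25, 0.70):
    print(score, "->", route(score))
```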
• 102. Outline

1 Translation quality
2 Manual metrics
3 Task-based metrics
4 Reference-based metrics
5 Quality estimation
6 Conclusions
• 108. Conclusions

(Machine) Translation evaluation & estimation: still an open problem
Different metrics for different purposes/users, different needs, different notions of quality
Quality estimation: learns these different notions, but requires labelled data
Solution:
Think of what quality means in your scenario
Measure significance
Measure agreement if using manual metrics
Use various metrics
Invent your own metric!
• 109. Translation Quality Assessment: Evaluation and Estimation

Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk

EXPERT Winter School, 12 November 2013