ESR12 Hanna Béchara - EXPERT Summer School - Malaga 2015
Semantic Similarity Measures in Machine Translation Evaluation
Hanna Béchara
ESR12
EXPERT Project
June 27, 2015
Machine Translation Evaluation
How do we define translation quality?
Fluency? Grammaticality? Readability?
Post-editing effort?
How well it matches a reference translation?
Meaning preservation!
Semantic Textual Similarity
STS Explained
Semantic Textual Similarity (STS) captures the notion that some texts are more similar than others.
5 The two sentences are completely equivalent, as they mean the same thing.
4 The two sentences are mostly equivalent, but some unimportant details differ.
3 The two sentences are roughly equivalent, but some important information differs or is missing.
2 The two sentences are not equivalent, but share some details.
1 The two sentences are not equivalent, but are on the same topic.
0 The two sentences are on different topics.
Semantic Textual Similarity
Examples
Example 1
Sentence 1: A brown dog is attacking another animal in front of the man in pants
Sentence 2: Two dogs are fighting
Example 2
Sentence 1: A man is chopping butter into a container.
Sentence 2: A woman is cutting shrimps.
Example 3
Sentence 1: A cat is playing with a watermelon on a floor.
Sentence 2: A man is pouring oil into a pan.
Semantic Textual Similarity
How do we estimate STS?
Crowd-sourced similarity ratings, created for the SemEval workshops
EXPERT's SemEval submission:
SVM regressor
Estimates a score between 0 and 5
Trained on human-annotated sentence pairs provided by the SemEval shared tasks
Trained on a variety of features
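Below is a minimal sketch of how such an STS estimator can be set up: an SVM regressor mapping sentence-pair features to a score in [0, 5]. The two toy features and the three training pairs are illustrative assumptions only; the actual submission used a richer feature set and the full SemEval training data.

    from sklearn.svm import SVR
    import numpy as np

    def pair_features(s1, s2):
        # Two toy features: token Jaccard overlap and length ratio.
        t1, t2 = set(s1.lower().split()), set(s2.lower().split())
        jaccard = len(t1 & t2) / len(t1 | t2)
        length_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2))
        return [jaccard, length_ratio]

    # Hypothetical SemEval-style training pairs with gold scores in [0, 5].
    train = [
        ("A man is playing a guitar", "A man plays the guitar", 4.8),
        ("A cat sits on a mat", "A plane is taking off", 0.2),
        ("Two dogs are fighting", "A brown dog is attacking another animal", 3.0),
    ]
    X = np.array([pair_features(a, b) for a, b, _ in train])
    y = np.array([score for _, _, score in train])

    model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
    model.fit(X, y)

    # Predict a similarity score for a new pair, clipped to the 0-5 scale.
    new = pair_features("A man is playing a guitar", "A man plays music")
    print(round(float(np.clip(model.predict([new])[0], 0, 5)), 2))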
Methodology
Research Question
Can we estimate the score q_B of a sentence B as a function of R (the relatedness of A and B) and q_A (the quality of A)?
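Stated formally (notation assumed for this writeup, not taken from the slides):

    \hat{q}_B = f\big( R(A, B),\; q_A \big)

where R(A, B) is the semantic relatedness of the sentence pair and q_A is the known quality of sentence A's translation.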
Methodology
Machine Learning Task
Features
1 Baseline experiments: 17 QuEst features
2 STS score for the source sentence pair
3 S-BLEU score for Sentence Pair A
4 S-BLEU score comparing A to B (MT outputs)
SVM Regression Model
Predicts a score between 0 and 1
2,000 sentences for training, 500 for testing
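A sketch of this learning setup, assuming the 17 QuEst baseline features and the 3 semantic features have been precomputed per sentence (random stand-ins are used here purely to show the pipeline shape):

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)

    # Stand-in feature matrices: 17 QuEst features + 3 semantic features.
    quest_train, quest_test = rng.random((2000, 17)), rng.random((500, 17))
    sem_train, sem_test = rng.random((2000, 3)), rng.random((500, 3))

    X_train = np.hstack([quest_train, sem_train])   # Combined (20)
    X_test = np.hstack([quest_test, sem_test])
    y_train, y_test = rng.random(2000), rng.random(500)  # sentence BLEU in [0, 1]

    model = SVR(kernel="rbf")
    model.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))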
Methodology
Results

         Mean Baseline   QuEst Baseline (17)   STS (3)   Combined (20)
    MAE  0.16            0.12                  0.108     0.09

Table: Predicting the BLEU scores for DGT-TM (Mean Absolute Error)
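For reference, the figure reported is the mean absolute error over the n test sentences:

    \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert

where \hat{y}_i is the predicted and y_i the reference BLEU score; lower is better.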
Methodology
Machine Learning Task
Features
1 STS score for the source sentence pair
2 S-BLEU score for Sentence A
3 S-BLEU score comparing A to B (MT outputs)
SVM Regression Model
Predicts a score between 0 and 1
4,000 sentences for training, 500 for testing
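The S-BLEU features above are sentence-level BLEU scores. The slides do not spell out the exact variant used; a plausible stand-in is NLTK's smoothed sentence-level BLEU:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1

    def s_bleu(hypothesis, reference):
        # Sentence-level BLEU of a hypothesis against a single reference,
        # smoothed so short sentences do not collapse to zero.
        return sentence_bleu(
            [reference.split()],     # list of tokenized references
            hypothesis.split(),      # tokenized hypothesis
            smoothing_function=smooth,
        )

    # E.g. comparing the two MT outputs to each other (feature 3 above):
    print(s_bleu("Two dogs are fighting", "Two dogs are playing"))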
Methodology
Results

         Mean Baseline   STS (3)
    MAE  0.216           0.193

Table: Predicting the S-BLEU scores for SICK's backtranslations (Mean Absolute Error)
Methodology
Data Preparation
Extracted sentences from the FLICKR images dataset used for previous SemEval tasks
Each pair has a human similarity rating between 0 and 5
Each sentence has a French machine translation, and each translation has a quality score between 1 and 5 assigned through manual evaluation
Each French sentence pair produced by the machine translation is also assigned a similarity rating through manual evaluation
Methodology
Example
Sentence A
A group of kids is playing in a yard and an old man is standing in the background
Sentence B
A group of boys in a yard is playing and a man is standing in the background
Semantic similarity between A and B: 4.5
Sentence A - MT Output
Un groupe d'enfants joue dans une cour et un vieil homme est debout dans l'arrière-plan
Sentence B - MT Output
Un groupe de garçons dans une cour joue et un homme est debout dans l'arrière-plan
Semantic similarity between A - MT Output and B - MT Output: ?
Methodology
Example
Sentence A
eurozone unemployment at record 12 percent
Sentence B
eurozone unemployment hits record 12.1 % in march
Semantic similarity between A and B: 4.5
Sentence A - MT Output
lors de la zone euro 12 % de chômage record
Sentence B - MT Output
le chômage frappe 12.1 % de la zone euro en marche procès-verbal
Semantic similarity between A - MT Output and B - MT Output: ?
Methodology
Experiments
Feature Sets
1 Baseline experiments: 17 QuEst features
2 STS score for the sentence pair
3 Human evaluation score for Pair B (MT output)
4 S-BLEU score comparing Pair A to Pair B (MT outputs)
SVM Regression Model
800 sentences for training, 200 for testing
Predicts a score between 1 and 5
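A sketch of this final setup, with the 800/200 split and predictions clipped back onto the 1 to 5 annotation scale (the 20 feature values are random stand-ins; the real ones come from the four feature sets listed above):

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(1)
    X_train, X_test = rng.random((800, 20)), rng.random((200, 20))
    y_train = rng.uniform(1, 5, 800)   # human similarity ratings, 1-5
    y_test = rng.uniform(1, 5, 200)

    model = SVR(kernel="rbf")
    model.fit(X_train, y_train)
    pred = np.clip(model.predict(X_test), 1.0, 5.0)  # stay on the 1-5 scale
    print("MAE:", mean_absolute_error(y_test, pred))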
Methodology
Summing up...
Results show that semantically motivated features can improve over the quality estimation baseline
We can learn the quality of a Sentence B if we have a semantically similar Sentence A with a determined quality
However, we require access to semantically similar sentences
The End (For Now)
Enjoy the Weekend!