BLEU: a Method for Automatic Evaluation of Machine Translation
(BiLingual Evaluation Understudy)
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318
Viewpoint
• The idea: the closer a machine translation is to a
  professional human translation, the better it is.
• To judge quality, we need a numerical metric
• So, an MT evaluation system requires:
   1. A numerical “translation closeness” metric
   2. A corpus of good-quality human reference translations
• The closeness metric is modeled on the word error rate metric used in speech recognition
   – Idea: use a weighted average of variable-length phrase
     matches against the reference translations
Baseline BLEU Metric
• The primary programming task for a BLEU
  implementor is to compare n-grams of the
  candidate with the n-grams of the reference
  translation and count the number of matches

• So, we look at computing unigram matches
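The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper name `ngrams` and the two example sentences are assumptions made for this sketch, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sentences, not taken from the slides.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()

cand_bigrams = Counter(ngrams(candidate, 2))
ref_bigrams = Counter(ngrams(reference, 2))

# Counter intersection keeps, for each shared n-gram, the smaller of the two counts.
matches = sum((cand_bigrams & ref_bigrams).values())
print(matches)  # 3 shared bigrams: "the cat", "on the", "the mat"
```

Setting `n = 1` in the same helper gives the unigram matches that the next slide starts from.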
n-gram precision
• Precision measure
   – Counts up the number of candidate translation words
     ( unigrams ) which occur in any reference translation and
     then divides by the total number of words in the candidate
     translation
• However, MT systems can overgenerate “reasonable” words, yielding
  improbable but high-precision translations (for example, a candidate
  that only repeats “the”)
   – Remedy: a reference word should be considered exhausted after a
     matching candidate word is identified
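To see why plain precision is too easy to game, here is a small sketch of the unclipped unigram precision. The function name `unigram_precision` is an assumption of this sketch; the sentences follow the well-known "the the the" example from the BLEU paper, lowercased and whitespace-tokenized.

```python
def unigram_precision(candidate, references):
    """Fraction of candidate words that appear in any reference (no clipping)."""
    if not candidate:
        return 0.0
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(ref)
    hits = sum(1 for word in candidate if word in ref_vocab)
    return hits / len(candidate)

candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]

score = unigram_precision(candidate, references)
print(score)  # 1.0, although the candidate is clearly a bad translation
```

Every candidate word occurs in some reference, so the score is a perfect 7/7. This is the behavior that the clipping on the next slide is meant to prevent.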
Modified n-gram precision
• Modified unigram precision
    – Counts the maximum number of times a word occurs in any single reference
      translation
    – Clips the total count of each candidate word by its maximum reference count
    – Adds these clipped counts up
    – Divides by the total (unclipped) number of candidate words
• Modified n-gram precision
    – All candidate n-gram counts & corresponding maximum reference counts are
      collected
    – The candidate counts are clipped by their corresponding reference maximum
      value, summed and divided by the total number of candidate n-grams
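The clipping steps listed above can be sketched as follows. This is a minimal illustration under the assumption of whitespace tokenization; the names `ngrams` and `modified_precision` are hypothetical, and the function returns the clipped and total counts separately so they can later be summed over a corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one sentence."""
    cand_counts = Counter(ngrams(candidate, n))
    # Maximum number of times each n-gram occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count by its maximum reference count, then sum.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total

candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references, 1))  # (2, 7)
```

The word "the" occurs at most twice in a single reference, so the seven candidate occurrences are clipped to 2 and the modified unigram precision drops from 7/7 to 2/7.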
Modified n-gram precision on text blocks
•   Basic unit of evaluation is the sentence
•   Compute the n-gram matches sentence by sentence
•   Add clipped n-gram counts for all the candidate sentences
•   Divide by the number of candidate n-grams in the test corpus to compute
    a modified precision score
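A sketch of the corpus-level computation described above: clipped counts and candidate n-gram counts are accumulated over all sentences before the single division. The function names and the two-sentence toy corpus are assumptions of this sketch.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_modified_precision(candidates, references_per_sentence, n):
    """Sum clipped and total n-gram counts over all sentences, then divide once."""
    clipped_total = 0
    count_total = 0
    for cand, refs in zip(candidates, references_per_sentence):
        cand_counts = ngram_counts(cand, n)
        max_ref = Counter()
        for ref in refs:
            max_ref |= ngram_counts(ref, n)  # union keeps the maximum count
        clipped_total += sum((cand_counts & max_ref).values())  # intersection clips
        count_total += sum(cand_counts.values())
    return clipped_total / count_total if count_total else 0.0

# Toy corpus of two candidate sentences (illustrative only).
candidates = [
    "the the the the the the the".split(),
    "the cat is on the mat".split(),
]
references_per_sentence = [
    ["the cat is on the mat".split(), "there is a cat on the mat".split()],
    ["the cat is on the mat".split()],
]
p1 = corpus_modified_precision(candidates, references_per_sentence, 1)
print(p1)  # (2 + 6) / (7 + 6) = 8/13
```

Note that this is not the same as averaging the per-sentence precisions (2/7 and 6/6): longer sentences contribute more n-grams and therefore carry more weight.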
Ranking systems
• Modified n-gram precisions compared for a human translation and a machine translation
• 4 reference translations for each of 127 source sentences
• Result:




• From this result:
    – Any single n-gram precision score can distinguish a good translation from a bad one
• To be useful, the metric must also distinguish between two human translations that do not
  differ so greatly in quality
Ranking systems
• Translations done by:
    – A human lacking native proficiency in both the source and target languages (SL/TL)
    – A native English speaker
    – Three commercial systems




• Result:
    – The order of the systems by modified n-gram precision is the same
      as the rank order assigned by human judges
Combining the modified n-gram precisions
• The result in the previous slide shows:
  – Modified n-gram precision decays roughly exponentially with n
  – Modified unigram precision > bigram > trigram
• BLEU therefore averages the logarithms of the precisions with uniform
  weights, which is equivalent to taking their geometric mean
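The averaging of logarithms can be sketched as below. The function name and the four precision values are assumptions chosen only to mimic a roughly exponential decay; they are not results from the paper.

```python
import math

def geometric_mean_uniform(precisions):
    """Average the log precisions with uniform weights and exponentiate."""
    if not precisions or any(p <= 0 for p in precisions):
        return 0.0  # log(0) is undefined; a zero precision gives a zero mean
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Hypothetical modified precisions for n = 1..4.
precisions = [0.8, 0.4, 0.2, 0.1]

geo = geometric_mean_uniform(precisions)
arith = sum(precisions) / len(precisions)
print(round(geo, 4), round(arith, 4))  # 0.2828 0.375
```

Because the precisions shrink quickly with n, an arithmetic mean would be dominated by the unigram term. Averaging in the log domain lets the higher-order n-grams contribute on an equal footing.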
Recall
• BLEU considers multiple reference translations,
  each of which may use a different word choice
  to translate the same source word.
• A good candidate translation will only use
  (recall) one of these possible choices, but not
  all. Indeed, recalling all choices leads to a bad
  translation
Sentence brevity penalty
• Candidate translations longer than their references are already penalized by the
  modified n-gram precision measure, so only short candidates need an extra penalty
• Brevity penalty factor:
    – With it, a high-scoring candidate translation must match the reference translations in
      length, in word choice and in word order
        • The brevity penalty is 1.0 when the candidate’s length is the same as any reference translation’s length.
• c: the length of the candidate translation
• r: the effective reference corpus length
• Brevity penalty: BP = 1 if c > r, and BP = exp(1 - r/c) if c ≤ r
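A minimal sketch of the brevity penalty follows. The effective reference length is taken here as the sum, over sentences, of the reference length closest to the candidate length, as in the BLEU paper. Breaking ties toward the shorter reference is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def effective_reference_length(candidate_lengths, reference_lengths_per_sentence):
    """Sum the closest reference length for each candidate sentence.

    Ties are broken toward the shorter reference (an assumption of this sketch).
    """
    total = 0
    for c_len, ref_lens in zip(candidate_lengths, reference_lengths_per_sentence):
        total += min(ref_lens, key=lambda r_len: (abs(r_len - c_len), r_len))
    return total

def brevity_penalty(c, r):
    """Return 1 when the candidate corpus is longer than r, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

# Illustrative lengths: two candidate sentences, each with two references.
r = effective_reference_length([12, 5], [[10, 13], [4, 6]])
print(r)                       # 13 + 4 = 17
print(brevity_penalty(17, r))  # 1.0, lengths match
print(brevity_penalty(10, 20)) # exp(-1), candidate corpus is half as long
```

The penalty is computed over the whole corpus rather than per sentence, which avoids punishing short sentences too harshly.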
BLEU details
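• Take the geometric mean of the test corpus’ modified precision scores and
  then multiply the result by an exponential brevity penalty factor.
• We first compute the geometric average of the modified n-gram precisions,
  p_n, using n-grams up to length N and positive weights w_n summing to one.
• Then, with BP the brevity penalty:

  $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

• To make the behavior apparent, in the log domain:

  $$\log \mathrm{BLEU} = \min\left(1 - \frac{r}{c},\ 0\right) + \sum_{n=1}^{N} w_n \log p_n$$

• The baseline uses N = 4 and uniform weights w_n = 1/N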
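The combination step can be sketched in the log domain, where the score is the log brevity penalty, min(1 - r/c, 0), plus the weighted sum of log precisions. The function name `bleu_from_parts` and the input numbers are assumptions of this sketch; the precisions and lengths are expected to come from corpus-level counts computed elsewhere.

```python
import math

def bleu_from_parts(precisions, c, r, weights=None):
    """Combine modified precisions p_1..p_N with the brevity penalty.

    precisions: corpus-level modified n-gram precisions for n = 1..N
    c, r: candidate corpus length and effective reference corpus length
    weights: positive weights summing to one (uniform 1/N if omitted)
    """
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n
    if c == 0 or any(p <= 0 for p in precisions):
        return 0.0  # a zero precision drives the geometric mean to zero
    log_bp = min(1.0 - r / c, 0.0)
    log_precision = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return math.exp(log_bp + log_precision)

# Hypothetical inputs: four precisions and a slightly short candidate corpus.
score = bleu_from_parts([0.8, 0.4, 0.2, 0.1], c=9, r=10)
print(round(score, 4))
```

Working with logs keeps the computation stable when the higher-order precisions are small and shows the two ways a score can drop: short output or poor n-gram overlap.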
The BLEU Evaluation
• The BLEU metric ranges from 0 to 1
• A score of 1 is very rare: only a translation identical to a reference attains it
• The more reference translations per sentence, the higher the score
• A human translator scored 0.3468 against four references and 0.2571
  against two references
• Table 1: BLEU scores of the 5 systems against two references
Is the difference in BLEU metric reliable?
• What is the variance of the BLEU score?
• If we were to pick another random set of 500 sentences, would we still judge S3 to
  be better than S2?
• The test corpus was divided into 20 blocks of 25 sentences each, and BLEU was computed on each block
• Computed the means, variances, and paired t-statistics
• What Table 2 indicates:
     – Table 1 scores are computed on the 500-sentence aggregate; Table 2 scores are averages over the 25-sentence blocks
     – A t-statistic of 1.7 or above is considered 95% significant
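The block-wise comparison can be sketched with a standard paired t-statistic on per-block scores. This assumes the usual definition (mean of the paired differences divided by their standard error); the slides do not spell out the exact formula used, and the block scores below are made-up numbers for illustration only.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for two systems scored on the same blocks."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired blocks")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(n))

# Hypothetical per-block BLEU scores for two systems (4 blocks for brevity;
# the study on the slide used 20 blocks of 25 sentences).
system_a = [0.3, 0.5, 0.4, 0.6]
system_b = [0.2, 0.3, 0.3, 0.4]

t = paired_t_statistic(system_a, system_b)
print(round(t, 3))
```

Pairing the scores block by block removes the variation in block difficulty that both systems share, so a consistent small advantage can still yield a large t-statistic.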
Evaluation
• Two groups of human judges, 10 people each
  – Monolingual group
  – Bilingual group
• Each group evaluated the previous 5 systems
• Rating scale: 1 (very bad) to 5 (very good)
• Some judges were more liberal in their ratings than others
Pairwise Judgments
BLEU predictions
BLEU vs Bi, Mono-lingual Judgements
