SMT & Data Quality: Understanding Data Quality Issues
Kirti Vashee (kirti.vashee@asiaonline.net)
Making Machine Translation Work for You: Customization is Key to Quality
SMT utilizes existing linguistic resources to create customer-specific and domain-focused systems, including:
 All legacy TM, cleaned and normalized
 Dictionaries & glossaries
 Old versions of manuals
 Examples of high-quality monolingual data
[Diagram labels: Reference, Monolingual, Bilingual, Customized Translation System]
A Custom Engine is Only as Good as the Data Used
Golden Rule: The more clean, high-quality, in-domain data a custom engine is built with, the higher the quality of the translation output.
5 Rules For Creating “GREAT” Custom Engines

 Fewer post edits required on translation output
 Faster engine maturity

 Variable markers
 Custom tags, HTML, XML, Rich Text etc.
 Telegraphic style text (e.g. “pilot crash lands plane” vs. “the pilot crash landed the plane”)
 Poor quality translations
 Misaligned segments
 Misclassified content (out of domain)
 Mixed language content
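A few of the data problems listed above (markup, variable markers, possibly misaligned segments) lend themselves to simple automated checks. The following is a minimal sketch, not the cleaning process used by any particular vendor: the regular expressions, the function names and the 3.0 length-ratio threshold are all illustrative assumptions, and the word-count ratio only makes sense for languages that separate words with spaces.

```python
import re
from typing import List

# Assumed patterns for illustration only; real corpora need project-specific rules.
TAG_RE = re.compile(r"<[^>]+>")                          # HTML/XML-style tags
VAR_RE = re.compile(r"\{\d+\}|%\d*\$?[sd]|\$\{\w+\}")    # e.g. {0}, %s, %1$s, ${name}


def strip_tags(text: str) -> str:
    """Remove markup tags and collapse the whitespace left behind."""
    return " ".join(TAG_RE.sub(" ", text).split())


def issues(source: str, target: str, max_ratio: float = 3.0) -> List[str]:
    """Return labels for the data-quality issues detected in one segment pair."""
    found = []
    if TAG_RE.search(source) or TAG_RE.search(target):
        found.append("tags")
    if VAR_RE.search(source) or VAR_RE.search(target):
        found.append("variable_markers")
    # Compare word counts after tag removal; a large imbalance hints at misalignment.
    src_len = len(strip_tags(source).split())
    tgt_len = len(strip_tags(target).split())
    if min(src_len, tgt_len) == 0:
        found.append("empty_side")
    elif max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
        found.append("possible_misalignment")
    return found


if __name__ == "__main__":
    print(issues("Press <b>OK</b> to save {0}", "Appuyez sur OK pour enregistrer {0}"))
```

Flagged pairs would still need a human or a stronger tool to decide whether to repair or discard them; poor translations, out-of-domain content and mixed-language content are not covered by this sketch.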
What is Foundation Data and its Purpose?
Foundation data is a base from which to build a custom engine.
Foundation data is not sufficient on its own to deliver a high-quality engine. Custom data is required.
Foundation data reduces the amount of data a client needs to provide, lowering the barriers to entry.
Asia Online has prepared data for and trained hundreds of foundation engines using foundation data only. These foundation engines:
 Are not intended as production release engines
 Are in no way indicative of the quality that a custom engine in the same language pair would deliver
 Are intended to verify the process and any language-specific handling that is required
 Will not typically be high quality, as they have not been normalized or focused on a specific purpose
 Consist mainly of bilingual data
 Have limited monolingual data. Monolingual data is a key part of customization, and every client has a different desired grammatical style.
Add your custom data to foundation data to get quality.
Data Used to Build a Custom Engine
1. Bilingual: Source and Target Language (Pre-Aligned or Non-Aligned)
 Translation Memories (TMX, XLIFF, CSV, etc.)
 Minimum: 20,000 segments
 Recommended minimum: 100,000+ segments
 Ideal: 500,000+ segments (the more in-domain text, the better)
2. Monolingual: Target Language
 URLs of similar style and grammar in target language
 Minimum: 500MB after cleaning (plain text)
 Recommended minimum: 1GB+ after cleaning (plain text)
 Ideal: 3-4GB+ after cleaning (plain text)
3. Tuning and Test Data

Understanding the Importance of Clean, Consistent Data for Statistical Machine Translation Systems

  • 17. Tuning and Test Data: examples of what you want the output to look like
  • 18. Guides the engine’s optimization strategy
  • 19. Blind test data evaluates translation quality and quality improvement: 3,000-6,000 segments (can be extracted from existing TMs)
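Extracting a blind test set from an existing TM, as mentioned above, can be sketched as follows. This is an illustrative assumption about how a hold-out might be done, not a description of any specific tool: `split_tm`, its parameters and the fixed random seed are hypothetical, and segment pairs are assumed to be `(source, target)` tuples. Exact duplicates are dropped first so that no test pair also appears in the training data.

```python
import random


def split_tm(segments, test_size=3000, seed=0):
    """Return (training, test), holding out `test_size` unique pairs at random."""
    unique = list(dict.fromkeys(segments))  # drop exact duplicates, keep order
    if test_size >= len(unique):
        raise ValueError("test_size must be smaller than the number of unique segments")
    rng = random.Random(seed)  # fixed seed so the same split can be reproduced
    test_idx = set(rng.sample(range(len(unique)), test_size))
    test = [pair for i, pair in enumerate(unique) if i in test_idx]
    train = [pair for i, pair in enumerate(unique) if i not in test_idx]
    return train, test


if __name__ == "__main__":
    tm = [(f"source {i}", f"target {i}") for i in range(10)]
    train, test = split_tm(tm, test_size=3)
    print(len(train), len(test))
```

The test set only stays blind if it is kept out of every later training run, so the split is best made once and stored.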
  • 22. [Diagram labels: Word and Phrase Patterns, User Input, Human Translation]
  • 23. Quality Data Makes A Difference
      Clean and Consistent Data: A statistical engine learns from data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training.
      Controlled Data: Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns.
      Common Data: Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the systems.
      Current Data: Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to current style.
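The "Controlled Data" point above (fewer translation options for the same source segment) can be checked with a simple count of distinct targets per source. The sketch below is an illustration under assumptions: the function names and the `max_variants` threshold are hypothetical, and only leading and trailing whitespace is normalized before comparison.

```python
from collections import defaultdict


def translation_variants(pairs):
    """Map each source segment to the set of distinct targets seen for it."""
    variants = defaultdict(set)
    for source, target in pairs:
        variants[source.strip()].add(target.strip())
    return variants


def inconsistent_sources(pairs, max_variants=1):
    """Return sources that have more than `max_variants` distinct translations."""
    return {
        source: sorted(targets)
        for source, targets in translation_variants(pairs).items()
        if len(targets) > max_variants
    }


if __name__ == "__main__":
    tm = [
        ("Save", "Enregistrer"),
        ("Save", "Sauvegarder"),
        ("Cancel", "Annuler"),
        ("Cancel", "Annuler "),  # differs only by trailing whitespace
    ]
    print(inconsistent_sources(tm))
```

Sources reported here are candidates for review; some variation is legitimate, so the output is a worklist rather than a list of errors.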
  • 25. Mixing a wide variety of bilingual domain data together
  • 27. With “Clean Data” Correction is Possible
      Typically about 10-20 examples for each clean word or phrase.
      Each correction has statistical relevance and its impact can be clearly seen.
      Corrections usually involve adding data to fill gaps. Far less correction of actual errors.
      Clean data means the cause of errors can be understood and corrected.
      Concordance is used to create unbiased examples/phrases and ensure the scope is covered.

      Large volumes of dirty data prohibit manual correction.
      Individual corrections would not be statistically relevant.
      Manual corrections would compete against thousands of bad examples.
      Impractical to create enough examples manually.
      Understanding the cause of errors is difficult.
      Slows training and overall processing time.
      Requires more resources to process excess data.
      The only solution is to acquire more dirty data and hope the problem is fixed, but it may get worse or cause new errors.
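The contrast above can be made concrete with a small worked calculation. The sketch assumes that a phrase's translation is scored by relative frequency (matching examples divided by all examples for that source phrase), which is one common ingredient of phrase-based SMT scoring; real engines combine several models, so the numbers are illustrative only. The function name and the example counts are hypothetical.

```python
def relative_frequency(correct_examples: int, competing_examples: int) -> float:
    """Share of training examples for a source phrase that use the desired translation."""
    total = correct_examples + competing_examples
    if total == 0:
        raise ValueError("at least one example is required")
    return correct_examples / total


if __name__ == "__main__":
    # 15 added corrections with no competing examples in clean data
    print(relative_frequency(15, 0))
    # the same 15 corrections competing against 1,000 bad examples in dirty data
    print(round(relative_frequency(15, 1000), 4))
```

Under this simplification, 15 corrections dominate when nothing competes with them, but account for only about 1.5% of the evidence when 1,000 bad examples remain in the corpus.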
  • 31. Training Data: Volume vs. Quality
      * Data optimized for TM tools may often not be suitable for SMT
  • 33. Relative BLEU score comparisons
      The datasets that were cleaner at the outset produced better results, and tend to benefit and improve consistently from Asia Online’s light cleaning efforts.
      Dataset A had less data but still produced better results than Dataset B, which had twice the data volume.
      “Dirty” and noisy data has unpredictable results and is much harder to correct and improve.
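For readers unfamiliar with the metric behind these comparisons, the following is a minimal sketch of corpus-level BLEU. It assumes one reference per hypothesis, whitespace tokenization and no smoothing; the scoring setup actually used for the datasets above is not stated in the slides, and an established evaluation tool should be preferred for real measurements.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: one reference per hypothesis, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Counter intersection clips each n-gram match at its reference count.
            matched[n - 1] += sum((h_counts & r_counts).values())
            total[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["the pilot crash landed the plane"]
    print(corpus_bleu(["the pilot crash landed the plane"], reference))
    print(corpus_bleu(["the pilot crash landed the plane safely"], reference))
```

Scores are only comparable when the same test set, tokenization and references are used, which is why the slide speaks of relative comparisons.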
  • 34. Key Observations
      Consolidating clean data results in better quality SMT systems.
      Some TM tool optimized data may be considered dirty for SMT.
      Data cleaning is a critical and necessary step for high-quality SMT engines.
      Consistent terminology produces significant benefits in SMT.
      Normalization of formatting and terminology will boost SMT engine quality.
      Introducing known dirty data can reduce SMT engine quality.
      Smaller amounts of clean data can outperform systems built with as much as 2X dirty data.
      Systems built with clean data and consistent terminology tend to perform better and improve faster.
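The observation about normalizing terminology can be illustrated with a glossary-driven replacement. This is a sketch under assumptions: `normalize_terms` and the variant glossary are hypothetical, matching is case-insensitive on whole words, and the replacement always uses the preferred term's casing, which a real workflow would need to handle more carefully.

```python
import re


def normalize_terms(text, variants):
    """Replace each known variant with its preferred term and collapse whitespace."""
    # Longer variants first, so a short variant cannot pre-empt a longer one.
    for variant, preferred in sorted(variants.items(), key=lambda kv: -len(kv[0])):
        pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        text = pattern.sub(preferred, text)
    return " ".join(text.split())


if __name__ == "__main__":
    glossary = {"log in": "login", "log-in": "login"}  # assumed example glossary
    print(normalize_terms("Log in with your  log-in name", glossary))
```

Applying the same glossary to the TM and the monolingual data before training keeps one preferred form per term across the corpus.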