Understanding the Importance of Clean, Consistent Data for Statistical Machine Translation Systems
1. SMT & Data Quality: Understanding Data Quality Issues. Kirti Vashee – kirti.vashee@asiaonline.net
2. Making Machine Translation Work for You: Customization is Key to Quality. SMT utilizes existing linguistic resources to create customer-specific and domain-focused systems, including: all legacy TM (cleaned and normalized); dictionaries and glossaries; old versions of manuals; examples of high-quality monolingual data. (Diagram labels: Reference, Monolingual, Bilingual, Customized Translation System.)
13. What is Foundation Data and Its Purpose? Foundation data is a base from which to build a custom engine. Foundation data is not sufficient on its own to deliver a high-quality engine; custom data is required. Foundation data reduces the amount of data a client needs to provide, lowering the barriers to entry. Asia Online has prepared data for and trained hundreds of foundation engines using foundation data only. These foundation engines are not intended as production release engines. A foundation engine is in no way indicative of the quality that a custom engine in the same language pair would deliver. They are intended to verify the process and any language-specific handling that is required. They will not typically be high quality, as they have not been normalized or focused on a specific purpose. They consist mainly of bilingual data, with limited monolingual data. Monolingual data is a key part of customization, and every client has a different desired grammatical style. Add your custom data to foundation data to get quality.
17. Examples of what you want the output to look like
19. Blind test data: used to evaluate translation quality and quality improvement. 3,000-6,000 segments (can be extracted from existing TMs).
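One way to extract a blind test set from an existing TM is sketched below. It is a minimal illustration, not Asia Online's procedure: the function name `split_blind_test`, the `(source, target)` tuple layout, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_blind_test(segments, test_size=3000, seed=0):
    """Split (source, target) pairs into training data and a held-out blind test set."""
    # Deduplicate first so an identical pair cannot land on both sides of the split.
    unique = list(dict.fromkeys(segments))
    if test_size >= len(unique):
        raise ValueError("test_size must be smaller than the number of unique segments")
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(unique)
    return unique[test_size:], unique[:test_size]

# Toy usage with a tiny TM; a real blind set would hold thousands of segments.
tm = [(f"source {i}", f"target {i}") for i in range(10)] + [("source 0", "target 0")]
train, blind = split_blind_test(tm, test_size=3)
```

Keeping the blind segments out of the training data is what makes later quality comparisons meaningful: the engine is scored on text it has not seen.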
22. Word and Phrase Patterns: User Input, Human Translation
23. Quality Data Makes a Difference. Clean and Consistent Data: A statistical engine learns from data in the training corpus. Language Studio Pro™ contains many tools to help ensure that the data is scrubbed clean prior to training. Controlled Data: Fewer translation options for the same source segment, and “clean” translations, lead to better foundation patterns. Common Data: Higher data volume in the same subject area reinforces statistical relationships. Slight variations of the same information add robustness to the systems. Current Data: Ensure that the most current TM is used in the training data. Outdated high-frequency TM can have an undue negative impact on the translation output and should be normalized to the current style.
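The "Controlled Data" point can be checked mechanically. The sketch below flags source segments that carry more than one distinct translation after light normalization. It is illustrative only and does not represent the Language Studio Pro™ tooling: `normalize` and `find_inconsistent_sources` are hypothetical helpers, and the cleanup rules (Unicode NFC plus whitespace collapsing) are an assumed minimum.

```python
import re
import unicodedata
from collections import defaultdict

def normalize(text):
    """Apply light cleanup: Unicode NFC, collapsed whitespace, trimmed ends."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def find_inconsistent_sources(pairs):
    """Return {source: sorted targets} for sources with more than one distinct target."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[normalize(source)].add(normalize(target))
    return {s: sorted(t) for s, t in targets.items() if len(t) > 1}

# Toy TM: the first source has two genuinely different translations;
# the second differs only by stray whitespace, which normalization removes.
tm = [
    ("Press the power button.", "Appuyez sur le bouton d'alimentation."),
    ("Press the  power button. ", "Appuyez sur le bouton marche/arrêt."),
    ("Close the lid.", "Fermez le couvercle."),
    ("Close the lid.", "Fermez  le couvercle."),
]
conflicts = find_inconsistent_sources(tm)
```

Here only "Press the power button." is reported, since its two translations are genuinely different. A reviewer could then pick the current preferred translation and normalize the outdated one.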
25. Mixing a wide variety of bilingual domain data together
27. With “Clean Data” Correction is Possible. Typically about 10-20 examples are needed for each clean word or phrase. Each correction has statistical relevance, and its impact can be clearly seen. Corrections usually involve adding data to fill gaps, with far less correction of actual errors. Clean data means the cause of errors can be understood and corrected. Concordance is used to create unbiased examples/phrases and to ensure the scope is covered. By contrast, large volumes of dirty data prohibit manual correction. Individual corrections would not be statistically relevant. Manual corrections would compete against thousands of bad examples. It is impractical to create enough examples manually. Understanding the cause of errors is difficult. Dirty data slows training and overall processing time and requires more resources to process excess data. The only solution is to acquire more dirty data and hope the problem is fixed, but it may get worse or cause new errors.
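A concordance lookup of the kind mentioned above can be sketched as follows. This is a simplified illustration rather than the tool used in the talk: the `concordance` function, the 30-character context window, and the threshold of 10 examples (taken from the low end of the 10-20 guideline) are assumptions.

```python
import re

def concordance(segments, phrase, width=30):
    """Return each occurrence of phrase with up to `width` characters of context per side."""
    # Match whole words or phrases only, ignoring case.
    pattern = re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE)
    hits = []
    for segment in segments:
        for match in pattern.finditer(segment):
            left = segment[max(0, match.start() - width):match.start()]
            right = segment[match.end():match.end() + width]
            hits.append((left, match.group(0), right))
    return hits

corpus = [
    "Replace the toner cartridge when the light blinks.",
    "The toner cartridge is located behind the front cover.",
    "Close the front cover.",
]
hits = concordance(corpus, "toner cartridge")
# Assumed threshold: flag phrases with fewer than 10 examples as gaps to fill.
needs_more_examples = len(hits) < 10
```

Listing each hit with its surrounding context shows which usages are already covered, so that added examples fill real gaps instead of repeating existing ones.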
31. Training Data: Volume vs. Quality. *Data optimized for TM tools may often not be suitable for SMT.
33. Relative BLEU score comparisons. The datasets that were cleaner at the outset produced better results and tended to benefit and improve consistently from Asia Online’s light cleaning efforts. Dataset A had less data but still produced better results than Dataset B, which had twice the data volume. “Dirty” and noisy data has unpredictable results and is much harder to correct and improve.
34. Key Observations. Consolidating clean data results in better-quality SMT systems. Some TM-tool-optimized data may be considered dirty for SMT. Data cleaning is a critical and necessary step for high-quality SMT engines. Consistent terminology produces significant benefits in SMT. Normalization of formatting and terminology will boost SMT engine quality. Introducing known dirty data can reduce SMT engine quality. Systems built with smaller amounts of clean data can outperform systems built with as much as twice the volume of dirty data. Systems built with clean data and consistent terminology tend to perform better and improve faster.
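Terminology normalization, one of the observations above, can be as simple as mapping known variants to a preferred form before training. The sketch below assumes a hypothetical glossary whose keys are lowercase variants. `normalize_terms` is an illustrative helper, not part of any Asia Online product, and it does not preserve the original capitalization of a replaced term.

```python
import re

def normalize_terms(text, preferred):
    """Replace each known variant with its preferred term, matching whole words only."""
    if not preferred:
        return text
    # Longest variants first so a multi-word variant wins over any shorter one it contains.
    variants = sorted(preferred, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Glossary keys are assumed to be lowercase, so look up the lowercased match.
    return pattern.sub(lambda m: preferred[m.group(0).lower()], text)

glossary = {"e-mail": "email", "log-in": "login", "web site": "website"}
cleaned = normalize_terms("Use your E-mail to log-in to the web site.", glossary)
# cleaned == "Use your email to login to the website."
```

Applying the same glossary to both the bilingual and the monolingual training data keeps the engine from splitting its statistics across spelling variants of one term.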