WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization World

WeMT Tools and Processes
TAUS Showcase October 2013
By Olga Beregovaya

copyright © welocalize 2013. all rights reserved. www.welocalize.com

We’ll talk about:

• MT Programs
• Metrics
• Engines
• Language Tools


Current MT Programs
Dell – 27 languages
Autodesk – 11 languages
PayPal – 8 languages
Cisco – 17 languages across 3 tiers
Intuit – 20+ languages
Microsoft (pre-project support)
McAfee (pilot)
… many more in pilot stage

MT Program: Path-to-Success
Components
A set of MT engines – “mix and match”
TMT Selection Mechanisms
Post-editing Environment
Processes and metrics
Data gathering and reporting tool – what, how much, how fast and at what effort
EDUCATION EDUCATION EDUCATION
CHANGE

The recipe for success

Process and Workflow
All aspects of the localization ecosystem are taken into consideration

Selecting the right MT engine
By using our MT engine selection Scorecard we make sure all important KPIs are taken into consideration at selection time

Empowerment through education
Internal, by the use of customized Toolkits; external, through specialised Trainings.

The feedback loop
Constructive communication from post-editor to MT provider

MT KPIs:
• Productivity: Throughputs
• Productivity: Delta
• Quality: LQA
• Quality: Automatic Scores
• Cost
• GlobalSight: Connectivity
• GlobalSight: Tagging
• Human Evaluation
• Customization: Internal/External
• Customization: Time

MT Program Design - Source
• Source content classification (i.e. marketing/UI/UA/UGC)
• Length of the source segment
• Source segment morpho-syntactic complexity
• Presence/absence of pre-defined glossary terms or multi-word glossary elements, UI elements, numeric variables, product lists, ‘do-not-translate’ and transliteration lists
• Tag density - Metadata attributes and their representation in localization industry standard formats (“tags”)
• ROC – quality levels based on content use (“impact”)

3D Model: Expected productivity mapped to desired quality levels and source content complexity
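
Some of the source-side signals on this slide (segment length, tag density, presence of glossary terms) could be computed per segment roughly as follows. This is a minimal sketch under stated assumptions, not the Welocalize implementation: the function name `profile_segment`, the tag pattern and the sample glossary are all hypothetical.

```python
import re

# Assumed tag pattern: inline markup such as <b>, </b>, <ph id="1"/>
# and {0}-style numeric placeholders. Real formats will need their own pattern.
TAG_RE = re.compile(r"<[^>]+>|\{\d+\}")


def profile_segment(segment, glossary):
    """Return simple source-side features for one segment (illustrative only)."""
    tags = TAG_RE.findall(segment)
    words = TAG_RE.sub(" ", segment).split()
    lowered = " ".join(words).lower()
    # Case-insensitive substring lookup; a real system would match on tokens.
    hits = [term for term in glossary if term.lower() in lowered]
    return {
        "length_words": len(words),
        # Tag density here means tags per word (0.0 when there are no words).
        "tag_density": len(tags) / len(words) if words else 0.0,
        "glossary_hits": hits,
    }


features = profile_segment(
    "Click <b>Save As</b> to export {0} files.",
    ["Save As", "Control Panel"],
)
print(features)
```

Features like these could then feed whatever classification or thresholding the program uses; the slide does not say how they are weighted.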


MT Engine Selection Scorecard
Productivity - Throughputs
Number of post-edited words per hour
Productivity - Delta
Percentage difference between translation and post-editing time
Cost
Extrapolation, cost per word
CMS - Connectivity
Is there a connector in place?
Quality/Nature of source
Quality (Final) - LQA
Internal quality verification
Quality (MT) - Automatic Scores
A set of automatic scoring systems is used

“We have tested and used different engines so we’ve seen the good, the bad and the ugly; now we can better appreciate what we have”
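
The two productivity KPIs can be written out as simple arithmetic. The sketch below is an assumption about how they might be computed: the slide defines the delta only as a "percentage difference", so expressing it as time saved relative to translating from scratch is one possible reading, and both function names are hypothetical.

```python
def throughput_words_per_hour(post_edited_words, hours):
    """Productivity - Throughputs: post-edited words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return post_edited_words / hours


def productivity_delta_percent(translation_hours, post_editing_hours):
    """Productivity - Delta, read here as the percentage of time saved by
    post-editing compared with translating the same content from scratch
    (an assumed definition)."""
    if translation_hours <= 0:
        raise ValueError("translation_hours must be positive")
    return 100.0 * (translation_hours - post_editing_hours) / translation_hours


# Hypothetical figures: 3,000 words post-edited in 4 hours,
# versus 10 hours to translate comparable content from scratch.
print(throughput_words_per_hour(3000, 4))
print(productivity_delta_percent(10, 4))
```

A negative delta under this reading would mean post-editing took longer than translating from scratch.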

Scorecard - Metrics
Overall data
Productivity metrics

Automatic Scoring
Human Evaluation

Toolkits and Trainings
Our experience:
• Most translators know and have experienced post-editing but they have limited knowledge of any other related aspect (automatic scoring, output differences between RBMT and SMT...)
• The majority of people who work in localization have heard about MT but most of them still find it a daunting subject.
Our answer:
• Continuous MT and PE related trainings and documentation for language providers
• Customized Toolkits for different internal departments (Production, Quality, Sales, Vendor Management)

Transparency and Ownership
Theory – knowledge foundations
Practice – customized PE sessions for different client accounts

Transparency – process, engine selection/customization, evaluations
Responsibility – valid evaluations, constructive feedback, quality ownership

“Training helps a lot - After I was told some of the background information and tips and tricks for certain engines/outputs, I was much more relaxed and happy to give MT a go.”

Legacy data – best prediction tool
> Statistics from legacy knowledge base

The feedback loop
“For me the biggest advantage would be the possibility to implement a client terminology list [in SMT]”

“I wish we could easily fix the corpus for outdated terminology and characters”

“Teach the engine to properly cope with sentences containing more than one verb and/or verbs in progressive form”

“Engine retraining improved significantly the handling of tags and spaces around tags, this is a productive achievement as it saves us a lot of manual corrections.”

Feedback and Engine Improvement

“Beyond the Engine” Tools
• Teaminology - crowdsourcing platform for centralized term governance; simultaneous
concordance search of TMs and term bases => clean training data
• Dispatcher - A global community content translation application that connects user
generated content (UGC) including live chats, social media, forums, comments and
knowledge bases to customized machine translation (MT) engines for real-time
translation
• Source Candidate Scorer – scoring of candidate sentences against historically good and
bad sentences based on POS and perplexity
• Corpus Preparation Toolkit – a set of applications that streamline data preparation for MT
engine training

Source Candidate Scorer

Compares your source content to “the good” and “the bad”
legacy segments and estimates potential suitability for MT
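
The comparison against “good” and “bad” legacy segments can be illustrated with a perplexity-based score. This is only a sketch: it uses a toy add-one smoothed unigram model over words, whereas the deck mentions POS and perplexity without describing the actual models. The class `UnigramModel`, the function `suitability` and the sample sentences are hypothetical; POS tags could be substituted for words as the model's tokens.

```python
import math
from collections import Counter


class UnigramModel:
    """Add-one smoothed unigram model, a stand-in for a real language model."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def perplexity(self, sentence):
        tokens = sentence.lower().split()
        if not tokens:
            raise ValueError("cannot score an empty sentence")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in tokens
        )
        return math.exp(-log_prob / len(tokens))


def suitability(sentence, good_model, bad_model):
    """Positive when the sentence resembles the 'good' legacy data more
    closely (lower perplexity) than the 'bad' legacy data."""
    return math.log(bad_model.perplexity(sentence)) - math.log(
        good_model.perplexity(sentence)
    )


# Made-up legacy data: segments that historically post-edited well vs. badly.
good = UnigramModel(["click the save button", "open the file menu", "select the print option"])
bad = UnigramModel(["lol that update totally broke my stuff", "omg cant even log in anymore"])

print(suitability("click the print button", good, bad))
print(suitability("omg my stuff broke", good, bad))
```

With real data, higher-order n-gram models trained on much larger legacy sets would replace the toy unigram model.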

Corpus Preparation Suite
Variety of tools to prepare corpus for training MT engines such as:
• Deleting formatting tags from TMX
• Removing double spaces
• Removing duplicated punctuation (e.g. commas)
• Deleting segments where source = target
• Deleting segments containing only URLs
• Escaping characters
• Removing duplicate sentences
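
Several of the cleaning steps listed above can be sketched in a few lines. The following is an illustrative approximation, not the Welocalize toolkit: `clean_text`, `clean_corpus` and the regular expressions are assumptions, the input is taken to be plain (source, target) pairs already extracted from TMX, and character escaping is left out because the target escaping scheme is not specified in the deck.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # simple inline tags such as <b> or <ph/>
URL_ONLY_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
# Repeated punctuation such as ",,"; periods are excluded so "..." survives.
DUP_PUNCT_RE = re.compile(r"([,;:!?])\1+")
MULTI_SPACE_RE = re.compile(r" {2,}")


def clean_text(text):
    """Strip tags, collapse repeated punctuation and double spaces."""
    text = TAG_RE.sub("", text)
    text = DUP_PUNCT_RE.sub(r"\1", text)
    text = MULTI_SPACE_RE.sub(" ", text)
    return text.strip()


def clean_corpus(pairs):
    """Clean (source, target) pairs and drop unusable or duplicate ones."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = clean_text(source), clean_text(target)
        if not source or not target:
            continue  # nothing left after cleaning
        if source == target:
            continue  # untranslated segment
        if URL_ONLY_RE.match(source) or URL_ONLY_RE.match(target):
            continue  # URL-only segment
        if (source, target) in seen:
            continue  # duplicate sentence pair
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


pairs = [
    ("Click <b>Save</b>  now,, please", "Cliquez sur <b>Enregistrer</b> maintenant"),
    ("Click Save now, please", "Cliquez sur Enregistrer maintenant"),
    ("Welocalize", "Welocalize"),
    ("http://www.example.com", "http://www.example.com/fr"),
]
result = clean_corpus(pairs)
print(result)
```

Deduplicating after cleaning, rather than before, catches pairs that differ only in tags or spacing.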


Corpus Preparation: TM Creator
Aggregates training data from various relevant sources


Corpus Preparation: TMX Splitter

Extracts the relevant training corpus
based on the TMX metadata
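
Metadata-based extraction from TMX might look like the sketch below, which keeps only translation units whose `<prop>` element carries a wanted value. It uses Python's standard `xml.etree.ElementTree`; the function name `extract_by_prop`, the property type `x-content-type` and the sample TMX are hypothetical, and the actual TMX Splitter may select on different metadata.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_by_prop(tmx_text, prop_type, wanted_value):
    """Return {lang: segment text} dicts for matching translation units."""
    root = ET.fromstring(tmx_text)
    selected = []
    for tu in root.iter("tu"):
        props = {p.get("type"): (p.text or "").strip() for p in tu.findall("prop")}
        if props.get(prop_type) != wanted_value:
            continue
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        selected.append(segments)
    return selected


SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="1" segtype="sentence"
          o-tmf="demo" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <prop type="x-content-type">UI</prop>
      <tuv xml:lang="en"><seg>Save</seg></tuv>
      <tuv xml:lang="de"><seg>Speichern</seg></tuv>
    </tu>
    <tu>
      <prop type="x-content-type">Marketing</prop>
      <tuv xml:lang="en"><seg>Work smarter.</seg></tuv>
      <tuv xml:lang="de"><seg>Intelligenter arbeiten.</seg></tuv>
    </tu>
  </body>
</tmx>"""

ui_units = extract_by_prop(SAMPLE_TMX, "x-content-type", "UI")
print(ui_units)
```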

Welocalize Moses Implementation
• Why? Far more control over engine quality since we can control corpus
preparation and output post-processing
• Control over metadata handling
• Ties into our company open-source philosophy
• Have experienced personnel in-house
• Can extend and customize Moses functionality as necessary
• Have connector to TMS (GlobalSight)
RESULTS: In our internal tests with Moses/DoMT, we are getting automated
scores similar to commercial engines for the languages into which we localize
most.
Same feedback received from human evaluators


… And it works!
We are in a position to offer realistic discounts and aggressive
timelines while providing quality levels appropriate for the content


“Work-in-progress” Projects

• Ongoing improvements to our adaptation of the iOmegaT tool
(Welocalize/CNGL)
• Industry Partner in CNGL “Source Content Profiler” project
• Adoption of TMTPrime (CNGL) - MT vs. Fuzzy Match selection
mechanism
• Language and content-specific pre-processing for the in-house Moses deployment
• Teaminology – adding linguistic intelligence


Contact
Language_Tools_Group_all@welocalize.com
We speak MT - the language of the future
Welocalize, Inc.
www.welocalize.com
Headquarters
241 East 4th St. Suite 207
Frederick, Maryland 21701 USA
[t] +1.301.668.0330
[t] +1.800.370.9515 Toll Free
[f] +1.301.668.0335
[e] marketing@welocalize.com

