The document discusses recent developments in machine translation technology, covering the web, databases, and neural networks. It makes three key points:
1. New projects like Cor and ActivaTM have created more flexible machine translation systems that can translate documents and web content, and allow language assets to be stored and accessed from a single database.
2. Neural machine translation has surpassed statistical machine translation in some areas, obtaining better translation quality as measured by metrics like BLEU, TER, and WER, especially for medium-length sentences.
3. With more training data and continued improvements, neural machine translation may soon achieve human-level quality and become the predominant machine translation approach within a few years.
Pangeanic Cor-ActivaTM-Neural machine translation, TAUS Tokyo 2017
1. The Web, The Database
and The Neural
Manuel Herranz, CEO
Pangeanic TAUS Tokyo, April 2017
What changes in EN-JP?
2. The Aim
After building thousands of MT systems for different purposes and clients,
we realized that existing tools fell short in several areas: they were
“locked”, showed no innovation, or were too inflexible.
We needed systems that talked to each other, yet were independent.
This is the result of an EU research project (ActivaTM) and a national
project in Spain (Cor)
3. The Web
Cor
Eases estimation in any translation format (doc or web)
National research project with EU funding
Full platform
Used by Pangeanic, LSPs, and 3rd parties
CMS agnostic – extracts text and converts to XLIFF (doc or web)
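The extract-and-convert step can be sketched as follows. This is an illustrative assumption of how extracted segments end up in an XLIFF skeleton, not Cor's actual code; `make_xliff` and its parameters are hypothetical names.

```python
# Minimal sketch: wrap extracted text segments in an XLIFF 1.2 skeleton.
# Hypothetical helper, not Cor's actual implementation.
import xml.etree.ElementTree as ET

def make_xliff(segments, source_lang="en", target_lang="ja"):
    """Build an XLIFF 1.2 document from a list of extracted text segments."""
    xliff = ET.Element("xliff", version="1.2")
    file_el = ET.SubElement(xliff, "file", {
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
        "original": "web-or-doc",
    })
    body = ET.SubElement(file_el, "body")
    for i, text in enumerate(segments, start=1):
        unit = ET.SubElement(body, "trans-unit", id=str(i))
        ET.SubElement(unit, "source").text = text
        ET.SubElement(unit, "target")  # left empty, filled by MT or a translator
    return ET.tostring(xliff, encoding="unicode")
```

Because the output is plain XLIFF, any downstream CAT tool or MT engine can consume it regardless of which CMS the text came from.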
4. The Web
Cor
Translate only selected sections of a website (in batches)
Detect new content, or content that has been removed, to keep language versions up to date
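The change-detection idea can be sketched by comparing content hashes per URL between two crawls; `diff_site` is a hypothetical helper, not Cor's actual algorithm.

```python
# Sketch of change detection between two crawls of a site: compare
# content hashes per URL to find pages that are new, removed, or modified.
# Illustrative assumption, not Cor's actual implementation.
import hashlib

def diff_site(old_pages, new_pages):
    """old_pages/new_pages: dicts mapping URL -> extracted text."""
    old_hash = {u: hashlib.sha256(t.encode()).hexdigest() for u, t in old_pages.items()}
    new_hash = {u: hashlib.sha256(t.encode()).hexdigest() for u, t in new_pages.items()}
    added = sorted(set(new_hash) - set(old_hash))
    removed = sorted(set(old_hash) - set(new_hash))
    changed = sorted(u for u in set(old_hash) & set(new_hash)
                     if old_hash[u] != new_hash[u])
    return added, removed, changed
```

Only the added and changed pages then need to be re-sent for translation, which keeps incremental updates cheap.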
6. The Database
ActivaTM
Elasticsearch-based
All language assets in one database, irrespective of the tool that created them
Deep learning for tag handling
CAT-tool agnostic (solves interoperability issues)
Automatic fuzzy match repair
More powerful (strict) fuzzy matching than traditional CAT tools
Subsegment split
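Fuzzy matching against a translation memory can be sketched as below. The `difflib` similarity ratio stands in for whatever edit-distance metric ActivaTM actually uses; the function name and threshold are illustrative assumptions.

```python
# Sketch of fuzzy matching against a translation memory (TM).
# difflib's ratio is a stand-in metric; not ActivaTM's actual scoring.
from difflib import SequenceMatcher

def fuzzy_match(query, memory, threshold=0.75):
    """Return (source, target, score) TM entries scoring above the
    threshold, best first. memory is a list of (source, target) pairs."""
    hits = []
    for src, tgt in memory:
        score = SequenceMatcher(None, query.lower(), src.lower()).ratio()
        if score >= threshold:
            hits.append((src, tgt, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)
```

A "strict" matcher would raise the threshold and weight in-context differences (tags, numbers) more heavily before a match is offered for repair.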
7. The Database
Matrix (triangulate to create new language pairs)
Statistics on all segment units, words, domains
Remote access, API
Pre-filter prior to MT (TM+MT)
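The matrix/triangulation idea above can be sketched by joining two memories that share a pivot language on identical pivot segments; `triangulate` is a hypothetical helper, not ActivaTM's actual implementation.

```python
# Sketch of TM triangulation: given, say, EN->JA and EN->FR memories,
# join on the shared English source segment to synthesize a JA->FR pair.
# Illustrative assumption, not ActivaTM's actual algorithm.
def triangulate(tm_pivot_a, tm_pivot_b):
    """tm_pivot_a: dict pivot->lang_a, tm_pivot_b: dict pivot->lang_b.
    Returns lang_a->lang_b pairs for pivot segments present in both."""
    return {a: tm_pivot_b[p] for p, a in tm_pivot_a.items() if p in tm_pivot_b}
```

The same database also supports the TM+MT pre-filter: segments with an exact or high fuzzy match are served from the TM, and only the remainder is sent to MT.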
8. The Neural – Artificial Neural Networks for SMT
History of ANN-based machine translation and language modelling for SMT:
1997 [Castaño & Casacuberta 97] (Jaume I & U. Politécnica): Machine translation using neural networks and finite-state models (PangeaMT: https://www.prhlt.upv.es/wp/research-areas/mt-showcase)
2007 [Schwenk & Costa-jussà 07]: Smooth bilingual n-gram translation
2012 [Le & Allauzen 12, Schwenk 12]: Continuous space translation models with neural networks
2014 [Devlin & Zbib 14]: Fast and robust neural networks for SMT
Conventional SMT
The use of statistics has been controversial in computational linguistics:
Chomsky 1969: “... the notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
This was considered true by most experts in (rule-based) natural language processing and artificial intelligence.
History of Statistical Approach to MT
1989-94: IBM’s pioneering work
since 1996: only a few teams favored SMT: U. Politécnica de Valencia, RWTH Aachen, HKUST, CMU
2006/2007 Google Translate
2006-2012 Euromatrix
2009: PangeaMT
9. Training data:
TAUS data for Electronics Computer Hardware (ECH) plus SOFT (IT): 4.6M sentences / 56M words (EN)
EN and JA tokenized (tokenizer.perl and MeCab respectively)
The Neural
Seemingly… not such a big difference
Results EN->JA:
10. The Neural
BLEU: higher is better
TER: lower is better
WER: lower is better
BLEU: detects precision in ngrams
TER: derived from the Levenshtein distance, working at the character level
WER: derived from the Levenshtein distance, working at the word level
Results EN->JA:
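The word-level metric above can be made concrete with a minimal WER implementation (word-level Levenshtein distance divided by reference length). This is a straightforward sketch; production metrics also handle tokenization, casing, and multiple references.

```python
# Word Error Rate: word-level Levenshtein distance over reference length.
# Minimal sketch of the metric described above.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Running the same distance over characters instead of words gives the character-level variant used in the comparison above.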
11. The Neural Results EN->JA by length:
In short sentences (0-10 words), our SMT system gets better BLEU results, but if we look at TER and WER, at the character and word level NMT gets better results, which means less post-editing effort.
In medium sentences (11-25 words), NMT always gets better results in BLEU, WER and TER.
In long sentences (26+ words), NMT tends to get similar results to PangeaMT.
12. The Neural
A: Very good, perfect or very light post-editing
B: OK, but needs light post-editing
C: Not good, but some meaning can be understood
D: Not good at all; needs HT (human translation)
Do we need new metrics? BLEU does not seem to correlate well with the perception that NMT is much better.
13. The Neural
Tests in FIGS (FR/IT/DE/ES), RU, and PT point to a very strong preference for NMT (results to be published in May).
On average, from a set of 250 sentences, around 60-65% were good or very good (A or B); ES/PT/IT results were similar to FR.
Evaluation by translation companies and professional freelance translators.
14. Questions
Is NMT scary? Almost there (as good as human)?
Just a matter of time (data and connectors) before NMT is ubiquitous?
Where will we be in 3 years, 5 years?
Do translation companies need to change their business model and become something else?