State of the Machine Translation by Intento (July 2018)

STATE OF THE
MACHINE TRANSLATION
by Intento

July 2018

July 2018 © Intento, Inc.
About
At Intento, we make Cloud Cognitive AI easy to discover,
access, and evaluate for a specific use.
—
Evaluation is a pain for everyone: to compare different services,
you have to sign a lot of contracts and integrate many APIs.
—
As we show in this report, the Machine Translation landscape is
complex, with a 4x difference in quality and a 195x difference in
price across pre-built models available from different vendors.
—
We deliver this overview report for FREE. To evaluate on your
own dataset, reach us at hello@inten.to
2

Intento MT Gateway - that’s how we run such evaluations
- Vendor-agnostic API
- Sync and async modes
- CLI tools and SDKs
- Works with files of any size
- Much faster due to hyper-threading
Get your API key at inten.to
3

Important highlights
Amazon and SAP went from preview to production
—
Amazon, Baidu, IBM, Microsoft, and PROMT increased language coverage
—
For 7 language pairs, the available MT quality has risen by more than 5% since
March 2018: en-ko (▲25%), en-nl (▲11%), nl-en (▲14%), ru-de (▲8%), ja-fr
(▲10%), en-cs (▲5%), en-tr (▲7%) (see slide 15)
—
For 13 language pairs, the best MT provider has changed since March 2018:
en-zh, de-ru, ru-de, en-tr, en-pt, nl-en, en-nl, ja-en, zh-it, cs-en, en-cs,
en-it, ru-en
—
To get the best quality across 48 language pairs, one needs 9 engines (see
slide 18)
4

Overview
1 TRANSLATION QUALITY
2 PRICING
3 LANGUAGE COVERAGE
4 HISTORICAL PROGRESS
5 CONCLUSIONS
48 Language Pairs
19 Machine Translation Engines
5

Benchmark changes
since March 2018
Added 3 engines: ModernMT*, Alibaba**, Youdao**
—
Updated to new versions: IBM (v3/NMT), Microsoft (v3/NMT)
—
Updated SAP*** and Amazon from preview to public
—
Added detailed best and optimal engines chart (slides
18-19)
—
Added Pricing section (slide 21)
* evaluated on one language pair (cost prohibitive)
** unavailable outside of China yet
*** not evaluated (cost prohibitive & unstable)
6

Machine Translation Engines* Evaluated
Alibaba Cloud Machine Translation
Amazon Translate
Baidu Translate API
DeepL API
Google Cloud Translation API
GTCom YeeCloud MT
IBM Watson Language Translator (NMT)
IBM Watson Language Translator (SMT)
Microsoft Translator Text API (NMT)
Microsoft Translator Text API (SMT)
ModernMT API
PROMT Cloud API
SAP Translation Hub
SDL Language Cloud Translation Toolkit
Systran PNMT Enterprise Server
Systran REST Translation API
Tencent Cloud TMT API (preview)
Yandex Translate API
Youdao Cloud Translation API
* We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide
web-based, on-premise or custom MT engines, which may differ in all aspects from what we’ve evaluated.
7

1 Translation Quality
1.1 Evaluation Methodology
1.2 Available MT Quality
1.3 Top-Performing Engines
1.4 Best General-Purpose Engines
1.5 Optimal General-Purpose Engines
1.6 Price vs. Performance
8

Evaluation methodology (I)
Translation quality is evaluated by computing the LEPOR score
between reference translations and the MT output (Slide 11).
—
Currently, our goal is to evaluate the performance of translation
between the most popular languages (Slide 12).
—
We use public datasets from StatMT/WMT, CASMACAT News
Commentary and Tatoeba (Slide 13).
—
We have performed LEPOR metric convergence analysis to
identify the minimal viable number of segments in the dataset.
See Slide 14 for some details.
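A convergence analysis of this kind can be sketched as follows. This is a minimal illustration, not the procedure used for the report: it assumes sentence-level scores are available as a list, uses a normal-approximation 95% confidence interval, and the function names, the `tolerance` threshold, and the synthetic scores are all hypothetical.

```python
import math
import random
import statistics

def running_ci(scores, step=100, z=1.96):
    """Return (n, mean, half_width) for growing prefixes of `scores`."""
    results = []
    for n in range(step, len(scores) + 1, step):
        sample = scores[:n]
        mean = statistics.fmean(sample)
        # Normal-approximation CI half-width (an assumption of this sketch).
        half_width = z * statistics.stdev(sample) / math.sqrt(n)
        results.append((n, mean, half_width))
    return results

def min_segments(scores, tolerance=0.01, step=100):
    """Smallest prefix size whose CI half-width is within `tolerance`, else None."""
    for n, _mean, half_width in running_ci(scores, step=step):
        if half_width <= tolerance:
            return n
    return None

# Synthetic sentence-level scores, purely for illustration.
rng = random.Random(0)
scores = [min(1.0, max(0.0, rng.gauss(0.65, 0.15))) for _ in range(3000)]
print(min_segments(scores, tolerance=0.01))
```

The interval narrows roughly with the square root of the number of sentences, so beyond some dataset size extra sentences from the same domain barely move the estimate.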
9

Evaluation methodology (II)
We judge that the MT quality of service A is better than that of
B for the language pair C if:
- mean LEPOR score of A is greater than LEPOR of B for the
pair C, and
- lower bound of the LEPOR 95% confidence interval of A is
greater than the upper bound of the LEPOR confidence
interval of B for the pair C. See Slide 14 for an example.
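The two conditions above can be expressed as a small check. This is a sketch under stated assumptions: the report does not say how the confidence interval is computed, so a normal-approximation interval over sentence-level scores is assumed here, and `mean_ci`, `is_better`, and the sample scores are hypothetical.

```python
import math
import statistics

def mean_ci(scores, z=1.96):
    """Mean and normal-approximation 95% CI bounds (assumed CI method)."""
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

def is_better(scores_a, scores_b):
    """True if A beats B on the mean and A's CI lies entirely above B's CI."""
    mean_a, low_a, _high_a = mean_ci(scores_a)
    mean_b, _low_b, high_b = mean_ci(scores_b)
    return mean_a > mean_b and low_a > high_b

# Hypothetical sentence-level scores for two services on one language pair.
service_a = [0.70, 0.72, 0.71, 0.69, 0.73] * 20
service_b = [0.60, 0.62, 0.61, 0.59, 0.63] * 20
print(is_better(service_a, service_b))
```

Requiring non-overlapping intervals is stricter than comparing means alone: two services whose intervals overlap are treated as not distinguishable on that pair.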
—
Different language pairs (and different datasets) vary in
translation difficulty. To compare the overall MT performance of
different services, we regularize LEPOR scores across all
language pairs (see Appendix A for more details).
10

LEPOR score
LEPOR: automatic machine translation evaluation metric
considering the enhanced Length Penalty, n-gram Position
difference Penalty and Recall
—
In our evaluation, we used hLEPORA v.3.1:
—
(best metric from ACL-WMT 2013 contest)
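The formula shown on this slide did not survive extraction. As a reminder of the general shape of the metric, the published hLEPOR definition (see the links below) is a weighted harmonic mean of the length penalty $LP$, the n-gram position difference penalty $NPosPenal$, and the harmonic mean $HPR$ of precision $P$ and recall $R$. The block below is recalled from that published definition, not taken from the slide, and the weight values used in this evaluation are not stated in the report.

```latex
\[
\mathrm{hLEPOR} =
  \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}
       {\dfrac{w_{LP}}{LP} + \dfrac{w_{NPosPenal}}{NPosPenal} + \dfrac{w_{HPR}}{HPR}},
\qquad
HPR = \frac{\alpha + \beta}{\dfrac{\alpha}{R} + \dfrac{\beta}{P}}
\]
\[
\mathrm{hLEPOR}_A = \frac{1}{N} \sum_{i=1}^{N} \mathrm{hLEPOR}_i
\quad \text{(average over the } N \text{ sentences of the test set)}
\]
```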
https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
https://github.com/aaronlifenghan/aaron-project-lepor
LIKE BLEU,
BUT BETTER
11

48 Language Pairs
* https://w3techs.com/technologies/overview/content_language/all
Language groups by
web popularity*:
P1 - ≥ 2.0% websites
P2 - 0.5%-2% websites
P3 - 0.1-0.3% websites
P4 - <0.1% websites
—
We focus on the en-P1,
P1-en and P1-P1
(partially)
en ru ja de es fr pt it zh cs tr fi ro ko ar nl
en ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
ru ✓ ✓ ✓ ✓ ✓
ja ✓ ✓ ✓
de ✓ ✓ ✓ ✓ ✓
es ✓ ✓
fr ✓ ✓ ✓ ✓
pt ✓
it ✓ ✓ ✓
zh ✓ ✓ ✓
cs ✓
tr ✓
fi ✓
ro ✓
ko ✓
ar ✓
nl ✓
12

Datasets
WMT-2013 (translation task, news domain)
en-es, es-en
fr-en, en-fr
cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr
zh-en, en-zh
NewsCommentary-2011
en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh,
fr-ru, fr-es, it-pt, zh-it, en-ar, ar-en, en-nl, nl-en, fr-de, de-fr, de-it, ja-zh, zh-ja
Tatoeba
en-ko, ko-en
13

LEPOR Convergence
We used 900-3000 sentences per language pair. The metric stabilizes, and adding
more sentences from the same domain won’t change the outcome.
[Figure: regularised hLEPOR scores vs. number of sentences, with aggregated mean and confidence interval. Panels: aggregated across all language pairs; examples for individual language pairs]
14

Available MT Quality
[Figure: language-pair matrix colored by maximal available hLEPOR score (>80 %, 70 %, 60 %, 50 %, 40 %, <40 %) and marked with the minimal price for this quality]
Minimal price for this quality, per 1M char*: $$$ ≥$20, $$ $10-15, $ <$10
No. of top-performing MT providers** (rows: source language):
en 2 6 3 6 4 5 5 4 2 3 1 2 1 2 1
ru 2 3 3 3 2
ja 4 2 4
de 5 3 3 4 4
es 5 3
fr 6 3 5 8
pt 5
it 8 2 5
zh 4 4 4
cs 4
tr 4
fi 2
ro 3
ko 1
ar 5
nl 1
* base pricing tier
** up to 5% worse than the leader, SMT and NMT counted separately
15

Sample pair analysis: English-Chinese
based on WMT-17 dataset
LEPOR score   Providers               Price range (per 1M characters)
71 %          Tencent (preview)
70 %          Google, GTCom           $10-20
68 %          Baidu                   $7
66.5 %        Systran PNMT, Amazon    $15-?
65 %          Microsoft, IBM NMT      $10-21.4
BEST QUALITY: Tencent (preview)
TOP 5%: Tencent, Google, GTCom, Baidu
BEST PRICE IN TOP 5%: Baidu
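The selection logic behind this slide can be sketched as below, using the scores and prices listed for English-Chinese. Two assumptions are made: "top 5%" is read as a score of at least 95% of the best score (which reproduces the slide's grouping), and where the slide gives a price range for several providers, the lower bound is used for the whole row. The `analyse` function and the row layout are hypothetical.

```python
# (LEPOR score in %, providers, lowest listed price in USD per 1M characters or None)
ROWS = [
    (71.0, ["Tencent (preview)"], None),  # no price listed on the slide
    (70.0, ["Google", "GTCom"], 10.0),
    (68.0, ["Baidu"], 7.0),
    (66.5, ["Systran PNMT", "Amazon"], 15.0),
    (65.0, ["Microsoft", "IBM NMT"], 10.0),
]

def analyse(rows, margin=0.05):
    """Return (best, top, optimal) provider lists for one language pair."""
    best_score = max(score for score, _, _ in rows)
    threshold = best_score * (1.0 - margin)  # relative 5% margin (assumed reading)
    best = [p for score, providers, _ in rows if score == best_score for p in providers]
    top_rows = [row for row in rows if row[0] >= threshold]
    top = [p for _, providers, _ in top_rows for p in providers]
    # Engines without a public price cannot be ranked on price.
    priced = [row for row in top_rows if row[2] is not None]
    optimal = min(priced, key=lambda row: row[2])[1] if priced else []
    return best, top, optimal

best, top, optimal = analyse(ROWS)
print(best)     # ['Tencent (preview)']
print(top)      # ['Tencent (preview)', 'Google', 'GTCom', 'Baidu']
print(optimal)  # ['Baidu']
```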
16

TOP Performing MT Providers across 48 language pairs*
[Figure: bar chart of the number of language pairs (0-50) per engine: google, deepl, amazon, yandex, ibm-nmt, promt, msft-nmt, tencent, ibm-smt, baidu, systran-pnmt, gtcom, msft-smt, sdl-smt, modernmt]
best - Provides the best MT quality for a language pair
top 5% - Within 5% of the best available MT quality for a language pair
optimal - Provides the lowest price among the top 5% MT engines for a language pair
17

Best general-purpose MT engines
[Figure: language-pair matrix (en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl) showing the best engine for each pair]
MT Engines: google, deepl, amazon, yandex, ibm-nmt, promt, msft-nmt, ibm-smt, tencent
18

Optimal* general-purpose MT engines
[Figure: language-pair matrix (en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl) showing the optimal engine for each pair]
MT Engines: msft-nmt, yandex, msft-smt, baidu, google, amazon, ibm-nmt, promt, ibm-smt
* Cheapest with a performance within 5% of the best available for this language pair
19

Price vs. Performance*
As of March 2018
[Figure: engines plotted by PERFORMANCE vs. AFFORDABILITY; regions labeled ACCURATE, COST-EFFECTIVE, and NOT PUBLIC]
Performance: regularized hLEPOR score aggregated across all language pairs in the dataset
Affordability = 1/price, using public volume-based pricing tiers
Legend
• performance range: regularized average, max across all pairs, min across all pairs
• price range
* only production-ready engines shown
20

2 Public pricing
USD
per 1M
symbols
* +20% for some language pairs
** estimation based on 4.79 symbols per word
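The per-word estimation in the second footnote amounts to a unit conversion. A minimal sketch, assuming a vendor quotes a price per word and using the 4.79 symbols-per-word figure from the footnote; the function name and the sample price are hypothetical.

```python
SYMBOLS_PER_WORD = 4.79  # average from the slide's footnote

def usd_per_million_symbols(usd_per_word, symbols_per_word=SYMBOLS_PER_WORD):
    """Convert a per-word price into USD per 1M symbols."""
    words_per_million_symbols = 1_000_000 / symbols_per_word
    return usd_per_word * words_per_million_symbols

# Hypothetical per-word price, for illustration only.
print(round(usd_per_million_symbols(0.0000479), 2))
```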
21

3 Language Coverage
3.1 Supported and Unique per Provider
3.2 Coverage by Language Popularity
22

Supported and Unique Language Pairs
[Figure: bar chart, log scale from 1 to 10000, of total and unique language pairs per provider]
Providers (axis order): Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom
Total (bar labels): 6, 8, 20, 24, 34, 42, 44, 47, 72, 104, 106, 110, 110, 210, 812, 3 782, 3 660, 8 556, 10 712
Unique (bar labels): 2, 11, 2, 56, 138, 119, 1 074, 3 022
Unique language pairs - supported exclusively by one provider
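Counting total and unique pairs per provider can be sketched as follows. The provider names and supported pairs are hypothetical placeholders, not data from the report; pairs are treated as ordered (source, target) tuples.

```python
from collections import Counter

def total_and_unique(supported):
    """Map provider -> (total pairs, pairs supported by no other provider)."""
    pair_counts = Counter(
        pair for pairs in supported.values() for pair in set(pairs)
    )
    summary = {}
    for provider, pairs in supported.items():
        distinct = set(pairs)
        unique = sum(1 for pair in distinct if pair_counts[pair] == 1)
        summary[provider] = (len(distinct), unique)
    return summary

# Hypothetical coverage data, for illustration only.
SUPPORTED = {
    "provider_a": {("en", "de"), ("de", "en"), ("en", "fi")},
    "provider_b": {("en", "de"), ("de", "en")},
    "provider_c": {("en", "de"), ("en", "ko")},
}
print(total_and_unique(SUPPORTED))
```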
23

Language popularity
Language groups by
web popularity*:
P1 - ≥ 2.0% websites
P2 - 0.5%-2% websites
P3 - 0.1-0.3% websites
P4 - <0.1% websites
* https://w3techs.com/technologies/overview/content_language/all
A total of 29070 pairs are possible; 13098 are supported across all providers.
P1: en, ru, ja, de, es, fr, pt, it, zh
P2: pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu
P3: da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et
P4: hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be, si, my, br, ne, sw, km, fil, ml, pa, …
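The totals on this slide can be checked with a little arithmetic. The number of languages is not stated; 171 is inferred here from the fact that 171 × 170 ordered pairs equals 29070, so treat it as an assumption.

```python
def possible_pairs(n_languages):
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)

N_LANGUAGES = 171          # inferred, not stated on the slide
SUPPORTED_PAIRS = 13098    # from the slide

total = possible_pairs(N_LANGUAGES)
coverage = SUPPORTED_PAIRS / total
print(total)                   # 29070
print(round(coverage * 100))   # 45
```

The resulting share matches the 45% overall coverage figure quoted in the language coverage section.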
24

Language coverage by popularity
45% of possible language pairs
[Figure: 4x4 coverage matrix between language popularity groups P1-P4; cell labels: 100%, 100%, 63%, 31%, 60%, 100%, 100%, 100%, 63%, 100%, 100%, 100%, 63%, 63%, 60%, 99%]
25

Language coverage by service provider
Google Cloud Translation API
Yandex Translate API
Microsoft Translator Text API (SMT)
Microsoft Translator Text API (NMT)
Baidu Translate API
Tencent Cloud TMT API (preview)
Systran REST Translation API
Systran PNMT Enterprise Server
PROMT Cloud API
SDL Language Cloud Translation Toolkit
Youdao Cloud Translation API
SAP Translation Hub
ModernMT API
DeepL API
IBM Watson Language Translator (NMT)
Amazon Translate
IBM Watson Language Translator (SMT)
Alibaba Translate
GTCom YeeCloud MT
26

4 Historical Progress
4.1 Number of Cloud MT Vendors
4.2 MT Quality
4.3 Performance/Price Efficiency
27

Independent Cloud MT Vendors with pre-built models
Commercial: Alibaba, Amazon, Baidu, DeepL, Google, GTCom, IBM, Microsoft, ModernMT, PROMT, SAP, SDL, Systran, Yandex, Youdao
Preview: Tencent
[Figure: bar chart of the number of vendors (0-16), Commercial and Preview, for Jul 17, Nov 17, Mar 18, Jul 2018]
28

Best available MT Quality
Number of language pairs available at this level of LEPOR quality out of 14 pairs we evaluated since July 2017 (ru, de, cs, tr, fi, ro, zh to en and back)
[Figure: number of language pairs per LEPOR quality level (30 % to 80 %), with the best and worst pairs marked]
29

Best available Performance/Price Efficiency
Efficiency = (hLEPOR in %)² / (USD per 1M symbols)
Number of language pairs available at this level of efficiency out of 14 pairs we evaluated since July 2017 (ru, de, cs, tr, fi, ro, zh to en and back)
[Figure: number of language pairs per efficiency level (100 to 900), with the best and worst pairs marked]
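The efficiency formula on this slide is easy to evaluate directly. As a worked example, the sketch below plugs in the English-Chinese figures listed for Baidu on the sample pair slide (68% at $7 per 1M characters); the function name is hypothetical.

```python
def efficiency(hlepor_percent, usd_per_million_symbols):
    """Efficiency = (hLEPOR in %)^2 / (USD per 1M symbols)."""
    if usd_per_million_symbols <= 0:
        raise ValueError("price must be positive")
    return hlepor_percent ** 2 / usd_per_million_symbols

# 68 % at $7 per 1M symbols.
print(round(efficiency(68.0, 7.0), 1))  # 660.6
```

Squaring the score weights quality more heavily than price, so a cheap engine with a much lower score does not automatically rank as more efficient.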
30

5 Conclusions
Machine Translation quality and efficiency improve
monthly but are still far from ideal, so a careful choice of MT engine
is a must.
—
At the same time, the MT landscape is becoming more
fragmented as the focus shifts from having the best
algorithms to having the best data.
—
Even for the general domain, getting the best quality
across 48 language pairs requires using 9 engines
simultaneously.
31

Custom version
of this report
You may run the evaluation for your project using
our vendor-agnostic API and command-line
tools.
—
We can also help with translating your corpus
via multiple vendors or handle the whole
evaluation for your project.
—
Reach us at hello@inten.to
32

Evaluate vendors
on your own data
with no effort
—
up to +230% quality and
-87% price by choosing
the right vendor
—
save 12wk of engineering
and data science efforts
Manage and
optimise vendor
portfolio with our
smart routing AI
—
use the best vendor for each
language pair and domain
with no hassle
Single integration
and contract to
multiple vendors
and models
— 
save upfront 5-7wk per each
vendor API
—
save 1d per month per each
vendor API
Intento Single API 
routes requests to the best models
Reach us for pricing and contract
33

STATE OF THE
MACHINE TRANSLATION
by Intento (https://inten.to)

July 2018
Konstantin Savenkov
ks@inten.to
(415) 429-0021
2150 Shattuck Ave
Berkeley CA 94705
34

Appendix A
Overall performance of the MT services across many language
pairs is computed in the following way:
1. [Standardisation] We compute mean language-standardised
LEPOR score (or z-score) for each provider.
2. [Scale adjustment] We restore the original scale by multiplying
z-score for each MT provider by the global LEPOR standard
deviation and adding the global mean LEPOR score.
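The two steps above can be sketched as below. Several details are assumptions of this sketch rather than statements from the report: the z-score is computed per language pair across providers, population standard deviations are used, and the input layout, function name, and sample scores are hypothetical.

```python
import statistics

def regularized_scores(scores):
    """scores: {language_pair: {provider: mean LEPOR}} -> {provider: regularized score}."""
    all_values = [v for by_provider in scores.values() for v in by_provider.values()]
    global_mean = statistics.fmean(all_values)
    global_std = statistics.pstdev(all_values)

    # Step 1: standardise within each language pair, then average per provider.
    z_by_provider = {}
    for by_provider in scores.values():
        values = list(by_provider.values())
        pair_mean = statistics.fmean(values)
        pair_std = statistics.pstdev(values)
        for provider, value in by_provider.items():
            z = (value - pair_mean) / pair_std if pair_std else 0.0
            z_by_provider.setdefault(provider, []).append(z)

    # Step 2: restore the original scale with the global mean and deviation.
    return {
        provider: global_mean + global_std * statistics.fmean(zs)
        for provider, zs in z_by_provider.items()
    }

# Hypothetical scores for two providers on two language pairs.
SCORES = {
    "en-de": {"provider_a": 0.70, "provider_b": 0.60},
    "en-fi": {"provider_a": 0.50, "provider_b": 0.40},
}
print(regularized_scores(SCORES))
```

Standardising per pair keeps an inherently hard pair (here the lower-scoring en-fi) from dragging down a provider that simply supports more difficult languages.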
35
