State of the Machine Translation by Intento (July 2018)
2. July 2018© Intento, Inc.
About
At Intento, we make Cloud Cognitive AI easy to discover,
access, and evaluate for a specific use.
—
Evaluation is a pain for everyone: to compare different services,
you have to sign a lot of contracts and integrate many APIs.
—
As we show in this report, the Machine Translation landscape is
complex, with a 4x difference in quality and a 195x difference in
price across pre-built models available from different vendors.
—
We deliver this overview report for FREE. To evaluate on your
own dataset, reach us at hello@inten.to
3. July 2018© Intento, Inc.
Intento MT Gateway
- that’s how we run such evaluations
Vendor-agnostic API
Sync and async modes
CLI tools and SDKs
Works with files of any size
Much faster due to hyper-threading
Get your API key at inten.to
4. July 2018© Intento, Inc.
Important highlights
Amazon and SAP went from preview to production
—
Amazon, Baidu, IBM, Microsoft, and PROMT increased language coverage
—
For 7 language pairs, the available MT quality has risen by more than 5% since
March 2018: en-ko (▲25%), en-nl (▲11%), nl-en (▲14%), ru-de (▲8%), ja-fr
(▲10%), en-cs (▲5%), en-tr (▲7%) (see slide 15)
—
For 13 language pairs, the best MT provider has changed since March 2018:
en-zh, de-ru, ru-de, en-tr, en-pt, nl-en, en-nl, ja-en, zh-it, cs-en, en-cs,
en-it, ru-en
—
To get the best quality across 48 language pairs, one needs 9 engines (see
slide 18)
5. July 2018© Intento, Inc.
Overview
1 TRANSLATION QUALITY
2 PRICING
3 LANGUAGE COVERAGE
4 HISTORICAL PROGRESS
5 CONCLUSIONS
48 Language Pairs
19 Machine Translation Engines
6. July 2018© Intento, Inc.
Benchmark changes since March 2018
Added 3 engines: ModernMT*, Alibaba**, Youdao**
—
Updated to new versions: IBM (v3/NMT), Microsoft (v3/NMT)
—
Updated SAP*** and Amazon from preview to public
—
Added detailed best and optimal engines charts (slides 18-19)
—
Added Pricing section (slide 21)
* evaluated on one language pair (cost prohibitive)
** not yet available outside of China
*** not evaluated (cost prohibitive & unstable)
7. July 2018© Intento, Inc.
Machine Translation Engines* Evaluated
Alibaba Cloud Machine Translation
Amazon Translate
Baidu Translate API
DeepL API
Google Cloud Translation API
GTCom YeeCloud MT
IBM Watson NMT Language Translator
IBM Watson SMT Language Translator
Microsoft NMT Translator Text API
Microsoft SMT Translator Text API
ModernMT API
PROMT Cloud API
SAP Translation Hub
SDL Language Cloud Translation Toolkit
Systran PNMT Enterprise Server
Systran REST Translation API
Tencent Cloud TMT API (preview)
Yandex Translate API
Youdao Cloud Translation API
* We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide
web-based, on-premise or custom MT engines, which may differ in all aspects from what we’ve evaluated.
8. July 2018© Intento, Inc.
1 Translation Quality
1.1 Evaluation Methodology
1.2 Available MT Quality
1.3 Top-Performing Engines
1.4 Best General-Purpose Engines
1.5 Optimal General-Purpose Engines
1.6 Price vs. Performance
9. July 2018© Intento, Inc.
Evaluation methodology (I)
Translation quality is evaluated by computing the LEPOR score
between the reference translations and the MT output (Slide 11).
—
Currently, our goal is to evaluate the performance of translation
between the most popular languages (Slide 12).
—
We use public datasets from StatMT/WMT, CASMACAT News
Commentary and Tatoeba (Slide 13).
—
We have performed LEPOR metric convergence analysis to
identify the minimal viable number of segments in the dataset.
See Slide 14 for some details.
10. July 2018© Intento, Inc.
Evaluation methodology (II)
We judge that the MT quality of service A is better than that of
B for the language pair C if:
- mean LEPOR score of A is greater than LEPOR of B for the
pair C, and
- lower bound of the LEPOR 95% confidence interval of A is
greater than the upper bound of the LEPOR confidence
interval of B for the pair C. See Slide 14 for example.
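The two criteria above can be sketched in a few lines. This is only an illustration: the slides do not say how the confidence interval is computed, so a normal-approximation interval over per-segment scores is assumed here, and the engines and scores are made up.

```python
import math
import statistics


def mean_ci95(scores):
    """Return (mean, lower, upper) for a normal-approximation 95% interval.

    Assumption: the report does not state its interval method.
    """
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def is_better(scores_a, scores_b):
    """True if A beats B on both criteria: higher mean, non-overlapping intervals."""
    mean_a, lower_a, _ = mean_ci95(scores_a)
    mean_b, _, upper_b = mean_ci95(scores_b)
    return mean_a > mean_b and lower_a > upper_b


# Made-up per-segment scores for three hypothetical engines on one language pair.
engine_a = [0.72, 0.70, 0.74, 0.71, 0.73, 0.69, 0.75, 0.72]
engine_b = [0.61, 0.64, 0.60, 0.63, 0.62, 0.59, 0.65, 0.62]
engine_c = [0.70, 0.73, 0.69, 0.74, 0.71, 0.72, 0.68, 0.73]

print(is_better(engine_a, engine_b))  # intervals are clearly separated
print(is_better(engine_a, engine_c))  # higher mean, but intervals overlap
```

With only eight segments a t-based interval would be more appropriate; the report works with 900 to 3000 segments per pair, where the normal approximation is a reasonable stand-in.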
—
Different language pairs (and different datasets) impose different
translation complexity. To compare overall MT performance of
different services, we regularize LEPOR scores across all
language pairs (See Appendix A for more details).
11. July 2018© Intento, Inc.
LEPOR score
LEPOR: automatic machine translation evaluation metric
considering the enhanced Length Penalty, n-gram Position
difference Penalty and Recall
—
In our evaluation, we used hLEPORA v.3.1
(best metric from the ACL-WMT 2013 contest)
https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
https://github.com/aaronlifenghan/aaron-project-lepor
LIKE BLEU, BUT BETTER
12. July 2018© Intento, Inc.
48 Language Pairs
Language groups by web popularity*:
P1 - ≥ 2.0% websites
P2 - 0.5%-2% websites
P3 - 0.1-0.3% websites
P4 - <0.1% websites
—
We focus on the en-P1, P1-en and P1-P1 (partially)
* https://w3techs.com/technologies/overview/content_language/all
en ru ja de es fr pt it zh cs tr fi ro ko ar nl
en ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
ru ✓ ✓ ✓ ✓ ✓
ja ✓ ✓ ✓
de ✓ ✓ ✓ ✓ ✓
es ✓ ✓
fr ✓ ✓ ✓ ✓
pt ✓
it ✓ ✓ ✓
zh ✓ ✓ ✓
cs ✓
tr ✓
fi ✓
ro ✓
ko ✓
ar ✓
nl ✓
13. July 2018© Intento, Inc.
Datasets
WMT-2013 (translation task, news domain)
en-es, es-en
WMT-2015 (translation task, news domain)
fr-en, en-fr
WMT-2016 (translation task, news domain)
cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr
WMT-2017 (translation task, news domain)
zh-en, en-zh
NewsCommentary-2011
en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh,
fr-ru, fr-es, it-pt, zh-it, en-ar, ar-en, en-nl, nl-en, fr-de, de-fr, de-it, ja-zh, zh-ja
Tatoeba
en-ko, ko-en
14. July 2018© Intento, Inc.
LEPOR Convergence
We used 900-3000 sentences per language pair. The metric stabilizes, and adding
more sentences from the same domain won’t change the outcome.
[Charts: regularised hLEPOR scores vs. number of sentences, showing the aggregated mean and its confidence interval. Panels: "Aggregated across all language pairs" and "Examples for individual language pairs".]
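A convergence check of this kind can be sketched as follows: track the mean score and the width of its confidence interval as more segments are added. The scores below are synthetic (not real hLEPOR values), and the normal-approximation interval is an assumption, since the slides do not describe the exact procedure.

```python
import math
import random
import statistics


def ci_half_width(scores):
    """Half-width of a normal-approximation 95% interval for the mean."""
    return 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))


rng = random.Random(0)
# Synthetic per-segment scores clipped to [0, 1]; purely illustrative.
scores = [min(1.0, max(0.0, rng.gauss(0.65, 0.15))) for _ in range(3000)]

# The interval narrows roughly with 1/sqrt(n) as segments are added.
for n in (100, 300, 900, 3000):
    sample = scores[:n]
    print(n, round(statistics.fmean(sample), 3), round(ci_half_width(sample), 4))
```

Once the half-width is small compared with the score gaps between engines, adding more segments from the same domain is unlikely to change the ranking.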
15. July 2018© Intento, Inc.
en ru ja de es fr pt it zh cs tr fi ro ko ar nl
en 2 6 3 6 4 5 5 4 2 3 1 2 1 2 1
ru 2 3 3 3 2
ja 4 2 4
de 5 3 3 4 4
es 5 3
fr 6 3 5 8
pt 5
it 8 2 5
zh 4 4 4
cs 4
tr 4
fi 2
ro 3
ko 1
ar 5
nl 1
Available MT Quality
[Heatmap over the language-pair matrix above; cell colors and per-cell price markers were lost in extraction.]
Legend:
Maximal available hLEPOR score: >80 %, 70 %, 60 %, 50 %, 40 %, <40 %
Minimal price for this quality, per 1M char*: $$$ ≥$20, $$ $10-15, $ <$10
No. of top-performing MT Providers** (shown as the number in each cell)
* base pricing tier
** up to 5% worse than the leader, SMT and NMT counted separately
16. July 2018© Intento, Inc.
Sample pair analysis: English-Chinese
Based on the WMT-17 dataset
LEPOR score   Providers               Price range (per 1M characters)
71 %          Tencent (preview)
70 %          Google, GTCom           $10-20
68 %          Baidu                   $7
66.5 %        Systran PNMT, Amazon    $15-?
65 %          Microsoft, IBM NMT      $10-21.4
BEST QUALITY: Tencent (preview)
TOP 5%: Tencent, Google, GTCom, Baidu
BEST PRICE IN TOP 5%: Baidu
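The selection on this slide can be reproduced with a short sketch. The scores come from the slide; the per-engine prices are assumptions, because the slide gives shared ranges only (the lower bound of each range is used as a stand-in, and `None` marks an engine with no listed price). "Top 5%" is read as "within 5% of the leader's score, relatively", which matches the slide: the cutoff is 71 × 0.95 = 67.45, so Baidu (68) is in and Systran PNMT (66.5) is out.

```python
# name -> (LEPOR score in %, assumed USD per 1M characters or None)
engines = {
    "Tencent (preview)": (71.0, None),  # no price listed on the slide
    "Google": (70.0, 10.0),             # lower bound of "$10-20" (assumption)
    "GTCom": (70.0, 10.0),              # lower bound of "$10-20" (assumption)
    "Baidu": (68.0, 7.0),
    "Systran PNMT": (66.5, 15.0),       # lower bound of "$15-?" (assumption)
    "Amazon": (66.5, 15.0),             # lower bound of "$15-?" (assumption)
    "Microsoft": (65.0, 10.0),          # lower bound of "$10-21.4" (assumption)
    "IBM NMT": (65.0, 10.0),            # lower bound of "$10-21.4" (assumption)
}


def top_performers(engines, tolerance=0.05):
    """Engines whose score is within `tolerance` (relative) of the best score."""
    best_score = max(score for score, _ in engines.values())
    cutoff = best_score * (1 - tolerance)
    return {name for name, (score, _) in engines.items() if score >= cutoff}


def optimal_engine(engines, tolerance=0.05):
    """Cheapest engine among the top performers that has a known price."""
    top = top_performers(engines, tolerance)
    priced = {name: engines[name][1] for name in top if engines[name][1] is not None}
    return min(priced, key=priced.get)


best = max(engines, key=lambda name: engines[name][0])
print(best)
print(sorted(top_performers(engines)))
print(optimal_engine(engines))
```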
17. July 2018© Intento, Inc.
TOP Performing MT Providers
across 48 language pairs*
[Bar chart: number of language pairs (axis 0 to 50) per engine, in axis order: google, deepl, amazon, yandex, ibm-nmt, promt, msft-nmt, tencent, ibm-smt, baidu, systran-pnmt, gtcom, msft-smt, sdl-smt, modernmt. Bar values were not recoverable.]
best
Provides the best MT Quality for a language pair
top 5%
Within 5% of the best available MT Quality for a language pair
optimal
Provides the lowest price among the top 5% MT engines for a language pair
18. July 2018© Intento, Inc.
Best general-purpose MT engines
[Matrix chart: best engine for each language pair among en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl. Cell contents were not recoverable.]
MT Engines: google, deepl, amazon, yandex, ibm-nmt, promt, msft-nmt, ibm-smt, tencent
19. July 2018© Intento, Inc.
Optimal* general-purpose MT engines
[Matrix chart: optimal engine for each language pair among en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl. Cell contents were not recoverable.]
MT Engines: msft-nmt, yandex, msft-smt, baidu, google, amazon, ibm-nmt, promt, ibm-smt
* Cheapest with a performance within 5% of the best available for this language pair
20. July 2018© Intento, Inc.
Price vs. Performance*
As of March 2018
[Scatter chart. Axes: AFFORDABILITY and PERFORMANCE. Regions: ACCURATE, COST-EFFECTIVE, NOT PUBLIC.]
Performance
Regularized hLEPOR score aggregated across all language pairs in the dataset
Affordability = 1/price
Using public volume-based pricing tiers
Legend
• performance range:
- regularized average
- max across all pairs
- min across all pairs
• price range
* only production-ready engines shown
21. July 2018© Intento, Inc.
2 Public pricing
[Chart: public prices per provider, in USD per 1M symbols]
* +20% for some language pairs
** estimation based on 4.79 symbols per word
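The two footnotes suggest that some prices were presumably quoted per word and converted to a per-symbol basis, and that some pairs carry a surcharge. A minimal sketch of that arithmetic, with a made-up per-word price (the function names and the example price are hypothetical):

```python
SYMBOLS_PER_WORD = 4.79  # estimate quoted in the footnote


def usd_per_million_symbols(usd_per_word):
    """Convert a per-word price into USD per 1M symbols."""
    words_per_million_symbols = 1_000_000 / SYMBOLS_PER_WORD
    return usd_per_word * words_per_million_symbols


def with_surcharge(price, surcharge=0.20):
    """Apply the +20% surcharge that the footnote mentions for some pairs."""
    return price * (1 + surcharge)


# Hypothetical vendor charging $0.0001 per word.
base = usd_per_million_symbols(0.0001)
print(round(base, 2))
print(round(with_surcharge(base), 2))
```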
22. July 2018© Intento, Inc.
3 Language Coverage
3.1 Supported and Unique per Provider
3.2 Coverage by Language Popularity
23. July 2018© Intento, Inc.
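Supported and Unique Language Pairs
[Bar chart, logarithmic axis (1, 100, 10000): Total and Unique language pairs per provider. The mapping of bar labels to providers was not preserved in extraction.]
Providers (axis order): Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom
Total (bar labels as extracted): 6, 8, 20, 24, 34, 42, 44, 47, 72, 104, 106, 110, 110, 210, 812, 3 782, 3 660, 8 556, 10 712
Unique (bar labels as extracted): 2, 11, 2, 56, 138, 119, 1 074, 3 022
Unique language pairs - supported exclusively by one provider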
24. July 2018© Intento, Inc.
Language popularity
Language groups by
web popularity*:
P1 - ≥ 2.0% websites
P2 - 0.5%-2% websites
P3 - 0.1-0.3% websites
P4 - <0.1% websites
* https://w3techs.com/technologies/overview/content_language/all
Of a total of 29070 possible pairs, 13098 are supported across all providers
P1
en, ru, ja, de, es, fr, pt, it, zh
P2
pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu
P3
da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et
P4
hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be,
si, my, br, ne, sw, km, fil, ml, pa, …
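The totals on this slide can be cross-checked with simple arithmetic. Assuming pairs are counted as ordered (source, target) pairs of distinct languages, n languages give n × (n − 1) pairs; the sketch below solves for n and computes the supported share.

```python
import math

possible_pairs = 29070   # from the slide
supported_pairs = 13098  # from the slide

# Solve n * (n - 1) = possible_pairs for n (assumes ordered pairs).
n = (1 + math.isqrt(1 + 4 * possible_pairs)) // 2
assert n * (n - 1) == possible_pairs

coverage = supported_pairs / possible_pairs
print(n, f"{coverage:.0%}")  # 171 45%
```

The result, 171 languages and about 45% coverage, agrees with the "45% of possible language pairs" figure on the next slide.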
25. July 2018© Intento, Inc.
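Language coverage by popularity
45% of possible language pairs
[Matrix: coverage between popularity groups P1-P4 and P1-P4. Cell values as extracted, positions not recoverable: 100%, 100%, 63%, 31%, 60%, 100%, 100%, 100%, 63%, 100%, 100%, 100%, 63%, 63%, 60%, 99%]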
26. July 2018© Intento, Inc.
Language coverage by service provider
[Per-provider coverage charts were not recoverable; providers shown:]
Google Cloud Translation API
Yandex Translate API
Microsoft Translator Text API (SMT)
Microsoft Translator Text API (NMT)
Baidu Translate API
Tencent Cloud TMT API (preview)
Systran REST Translation API
Systran PNMT Enterprise Server
PROMT Cloud API
SDL Language Cloud Translation Toolkit
Youdao Cloud Translation API
SAP Translation Hub
ModernMT API
DeepL API
IBM Watson Language Translator (NMT)
Amazon Translate
IBM Watson Language Translator (SMT)
Alibaba Translate
GTCom YeeCloud MT
27. July 2018© Intento, Inc.
4 Historical Progress
4.1 Number of Cloud MT Vendors
4.2 MT Quality
4.3 Performance/Price Efficiency
28. July 2018© Intento, Inc.
Independent Cloud MT Vendors
with pre-built models
Commercial: Alibaba, Amazon, Baidu, DeepL, Google, GTCom, IBM, Microsoft, ModernMT, PROMT, SAP, SDL, Systran, Yandex, Youdao
Preview: Tencent
[Bar chart: number of vendors (axis 0 to 16) in Jul 17, Nov 17, Mar 18 and Jul 2018, split into Preview and Commercial.]
29. July 2018© Intento, Inc.
Best available MT Quality
Number of language pairs available at this level of LEPOR quality, out of the 14 pairs we have evaluated since July 2017 (ru, de, cs, tr, fi, ro, zh to en and back)
[Chart: hLEPOR levels from 30 % to 80 % for Jul 17, Nov 17, Mar 18 and Jul 18, spanning the worst pair to the best pair. Per-bar counts were not recoverable.]
30. July 2018© Intento, Inc.
Best available Performance/Price Efficiency
Efficiency = (hLEPOR in %)² / (USD per 1M symbols)
—
Number of language pairs available at this level of efficiency, out of the 14 pairs we have evaluated since July 2017 (ru, de, cs, tr, fi, ro, zh to en and back)
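As a quick worked example of the efficiency formula: the first call uses the Baidu English-Chinese figures from slide 16 (68 % at $7 per 1M characters, treating characters and symbols as the same unit, which is an assumption), and the second uses a hypothetical engine.

```python
def efficiency(hlepor_percent, usd_per_million_symbols):
    """Efficiency = (hLEPOR in %)^2 / (USD per 1M symbols)."""
    return hlepor_percent ** 2 / usd_per_million_symbols


print(round(efficiency(68, 7), 1))   # 4624 / 7, about 660.6
print(efficiency(70, 20))            # hypothetical engine: 4900 / 20 = 245.0
```

Squaring the score weights quality more heavily than price: the slightly better but almost three times pricier engine ends up far less efficient.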
[Chart: efficiency levels from 100 to 900 for Jul 17, Nov 17, Mar 18 and Jul 18, spanning the worst pair to the best pair. Per-bar counts were not recoverable.]
31. July 2018© Intento, Inc.
5 Conclusions
Machine Translation quality and efficiency improve
monthly but are still far from ideal, so a careful choice
of MT engine is a must.
—
At the same time, the MT landscape is becoming more
fragmented as the focus shifts from having the best
algorithms to having the best data.
—
Even for the general domain, getting the best quality
across 48 language pairs requires using 9 engines
simultaneously.
32. July 2018© Intento, Inc.
Custom version
of this report
You may run the evaluation for your project using
our vendor-agnostic API and command-line
tools.
—
We can also help with translating your corpus
via multiple vendors or handle the whole
evaluation for your project.
—
Reach us at hello@inten.to
33. July 2018© Intento, Inc.
Intento Single API
routes requests to the best models
Evaluate vendors on your own data with no effort
—
up to +230% quality and -87% price by choosing the right vendor
—
save 12 weeks of engineering and data science effort
Manage and optimise your vendor portfolio with our smart routing AI
—
use the best vendor for each language pair and domain with no hassle
Single integration and contract to multiple vendors and models
—
save 5-7 weeks upfront per vendor API
—
save 1 day per month per vendor API
Reach us for pricing and contract
34. STATE OF THE
MACHINE TRANSLATION
by Intento (https://inten.to)
July 2018
Konstantin Savenkov
ks@inten.to
(415) 429-0021
2150 Shattuck Ave
Berkeley CA 94705
35. July 2018© Intento, Inc.
Appendix A
Overall performance of the MT services across many language
pairs is computed in the following way:
1. [Standardisation] We compute mean language-standardised
LEPOR score (or z-score) for each provider.
2. [Scale adjustment] We restore the original scale by multiplying
z-score for each MT provider by the global LEPOR standard
deviation and adding the global mean LEPOR score.
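The two steps above can be sketched as follows. Several details are assumptions, since the appendix does not spell them out: scores are standardised within each language pair across providers, the population standard deviation is used, the global mean and deviation are taken over all pair/provider scores, and a provider missing from a pair is simply skipped. The providers and scores are made up.

```python
import statistics


def regularised_scores(scores):
    """scores: {language_pair: {provider: LEPOR}} -> {provider: overall score}."""
    all_values = [s for per_pair in scores.values() for s in per_pair.values()]
    global_mean = statistics.fmean(all_values)
    global_std = statistics.pstdev(all_values)

    # Step 1: z-score each provider within every language pair.
    z_by_provider = {}
    for per_pair in scores.values():
        values = list(per_pair.values())
        pair_mean = statistics.fmean(values)
        pair_std = statistics.pstdev(values)
        for provider, score in per_pair.items():
            z = (score - pair_mean) / pair_std if pair_std else 0.0
            z_by_provider.setdefault(provider, []).append(z)

    # Step 2: average the z-scores and restore the original LEPOR scale.
    return {
        provider: statistics.fmean(zs) * global_std + global_mean
        for provider, zs in z_by_provider.items()
    }


# Made-up scores for three hypothetical providers on an easy and a hard pair.
example = {
    "en-de": {"A": 0.70, "B": 0.60, "C": 0.65},
    "en-fi": {"A": 0.50, "B": 0.40, "C": 0.45},
}
result = regularised_scores(example)
print({provider: round(score, 3) for provider, score in result.items()})
```

Standardising per pair keeps a provider from looking strong merely because it was evaluated on easier language pairs.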