Neural Machine Translation with Sockeye
Encoder-Decoder with Attention
Cyrus M. Vahid, Principal Deep Learning Solutions Architect, Amazon AI
In This Video
• Statistical language models
• Neural machine translation and long-term dependencies
• Encoder-decoder architecture
• Sockeye
Translation
I have been dreaming of this day
Ich habe von diesem Tag geträumt.
Problems of Translation
To perform translation we need to solve the following problems:
• Modeling: Decide what our model P(E | F; θ) will look like. What parameters will it have, and how will the parameters specify a probability distribution?
• Learning: We need a method to learn appropriate values for
parameters from training data.
• Search: We need to find the most probable sentence (solving
“argmax"). This process of searching for the best hypothesis is often
called decoding.
Statistical Language Models
Statistical Prior
The translation system thus needs to be a language model itself, meaning it must be capable of producing utterances that sound natural for a given vocabulary. More formally:
P(E; θ) for the model parameters θ
Count-Based Models and n-grams
• Count-based n-gram language models
• More formally: P_ML(e_t | e_1, …, e_{t-1}) = c(e_1, …, e_t) / c(e_1, …, e_{t-1})
Count-Based n-gram LM – The Problem
• The problem is that we cannot make sentences we have not seen before in the training data:
• Example: If a sequence of four words such as "I am from Utah" does not exist in the training set, then
c_prefix(I, am, from, Utah) = 0
∴ P(e_4 = Utah | e_1 = I, e_2 = am, e_3 = from) = 0
Count-Based n-gram LM – The Solution
• The solution is to condition only on the previous n−1 words. These models are called n-grams.
• The maximum-likelihood formula thus takes the following form:
P_ML(e_t | e_{t-n+1}, …, e_{t-1}) = c(e_{t-n+1}, …, e_t) / c(e_{t-n+1}, …, e_{t-1})
• Example (bi-gram): "I am", "am from", and "from Utah" are all present in the corpus
∴ P("I am from Utah") > 0
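To make the counting concrete, here is a minimal Python sketch (not from the slides; the toy corpus and helper names are invented) of a bigram maximum-likelihood model, including the zero-probability failure mode described above:

from collections import Counter

# Toy training corpus (illustrative only).
corpus = [["i", "am", "from", "utah"], ["i", "am", "here"]]

# Count bigrams and their one-word prefixes.
bigram_counts = Counter()
prefix_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        prefix_counts[prev] += 1

def p_ml(cur, prev):
    """Maximum-likelihood estimate c(prev, cur) / c(prev)."""
    if prefix_counts[prev] == 0:
        return 0.0  # unseen context: the zero-probability problem
    return bigram_counts[(prev, cur)] / prefix_counts[prev]

print(p_ml("am", "i"))     # 1.0: "i am" follows "i" in every sentence
print(p_ml("utah", "am"))  # 0.0: the bigram "am utah" was never seen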
Count-Based n-gram LM – Problem Persists
• In bigrams we have the same problem of assigning probability 0 if a sequence of 2 words does not exist in the training data.
• In trigrams we have the same problem of assigning probability 0 if a sequence of 3 words does not exist in the training data.
• …
• In n-grams we have the same problem of assigning probability 0 if a sequence of n words does not exist in the training data.
Measuring Accuracy
Evaluation of Accuracy – Likelihood
• Likelihood: the probability of some observed data (e.g., a test set) given a model:
P(𝓔_test; θ) = ∏_{E ∈ 𝓔_test} P(E; θ)
• Example*
|𝓔_test| = 106
E1 = "I am from Pittsburgh", P(E; θ) = 2.1×10⁻⁶
E2 = "I study at a university", P(E; θ) = 1.1×10⁻⁸
E3 = "my mother is from Utah", P(E; θ) = 2×10⁻⁹
P(𝓔_test; θ) = ∏_{E ∈ 𝓔_test} P(E; θ) = 2.1×10⁻⁶ × 1.1×10⁻⁸ × 2.0×10⁻⁹ = 4.62×10⁻²³
*probabilities are arbitrary, just to demonstrate the range
Evaluation of Accuracy – Log Likelihood
• Theorem: log_b ∏_{i=1}^{n} x_i = Σ_{i=1}^{n} log_b x_i; Example: log(x·y) = log(x) + log(y)
• We have seen that likelihoods are very tiny; multiplying these small values makes the probabilities even smaller. It is easier to work with log probabilities when measuring accuracy:
log P(𝓔_test; θ) = log ∏_{E ∈ 𝓔_test} P(E; θ) = Σ_{E ∈ 𝓔_test} log P(E; θ)
• Example*
|𝓔_test| = 106
E1 = "I am from Pittsburgh", P(E; θ) = 2.1×10⁻⁶ ∴ log P(E; θ) = log 2.1 + (−6) = −5.67
E2 = "I study at a university", P(E; θ) = 1.1×10⁻⁸ ∴ log P(E; θ) = log 1.1 + (−8) = −7.96
E3 = "my mother is from Utah", P(E; θ) = 2×10⁻⁹ ∴ log P(E; θ) = log 2.0 + (−9) = −8.70
log P(𝓔_test; θ) = Σ_{E ∈ 𝓔_test} log P(E; θ) = −5.67 − 7.96 − 8.70 = −22.33
*probabilities are arbitrary, just to demonstrate the range
Evaluation of Accuracy – Perplexity
• It is common to divide log-likelihood by the number of words in the corpus. This makes cross-corpora measurement easier.
length(𝓔_test) = Σ_{E ∈ 𝓔_test} |E|
ppl(𝓔_test; θ) = b^(−log_b P(𝓔_test; θ) / length(𝓔_test))
• This accuracy indicator is called perplexity.
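As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (a sketch; the probabilities are the arbitrary ones from the example, and the word counts are taken from the example sentences):

import math

# Arbitrary per-sentence probabilities from the example above.
sentence_probs = [2.1e-6, 1.1e-8, 2.0e-9]
sentence_lengths = [4, 5, 5]  # word counts of E1, E2, E3

# Log-likelihood (base 10): the sum of per-sentence log probabilities.
log_likelihood = sum(math.log10(p) for p in sentence_probs)
print(round(log_likelihood, 2))  # -22.34 (the slide rounds each term first: -22.33)

# Perplexity: b ** (-log_b P / length), here with b = 10.
length = sum(sentence_lengths)
print(round(10 ** (-log_likelihood / length), 2))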
Missing: Semantic and Causal Structure
Neural Models - Motivation
• Statistical models lose semantic and causal structures as well as hidden information.
• Log-linear models were introduced to address such issues, but they have other problems:
NMT - Motivation
• In separate contexts, e_{t-2} = "farmers" is compatible with e_t = "hay", and so is e_{t-1} = "eats".
• If we are using feature sets depending on e_{t-1} and e_{t-2}, then a sentence like "farmers eat hay" cannot be ruled out, even though it is an unnatural sentence, or perhaps an unnatural farmer.
Neural Language Models
Step 1: Word Embedding
• In neural language models we represent words as word embeddings, a vector representation of words.
• Then we concatenate these word vectors to the length of the context (3 for trigrams).
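A minimal sketch of this lookup-and-concatenate step in NumPy (the vocabulary, dimensions, and values are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "am": 1, "from": 2}
E = rng.standard_normal((len(vocab), 4))  # embedding matrix, 4 dims per word

# Look up each context word and concatenate into a single input vector.
context = ["i", "am", "from"]
x = np.concatenate([E[vocab[w]] for w in context])
print(x.shape)  # (12,) = 3 context words x 4 dimensions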
Step 2: Uncovering Relatedness
• Different elements of a vector can potentially reflect different aspects of a word. For instance, different elements could represent countability, being an animal, …
• We run the embedding vectors through a hidden layer in order to learn combination features.
https://www.tensorflow.org/tutorials/word2vec
Step 3: Score & Probability
• Next, we calculate the score vector for each word. This is done by
performing an affine* transform of the hidden vector h with a weight
matrix and adding a bias vector.
• Finally, we get a probability estimate p by running the calculated
scores through a softmax function.
* An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on
a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). Ref:
Wolfram Mathworld
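A sketch of this score-and-softmax step in NumPy (shapes and initializations are invented; this is not code from the talk):

import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 64, 10_000

h = rng.standard_normal(hidden_size)                # hidden vector from Step 2
W = rng.standard_normal((vocab_size, hidden_size))  # output weight matrix
b = rng.standard_normal(vocab_size)                 # bias vector

s = W @ h + b  # affine transform: one score per vocabulary word

# Softmax turns the scores into the probability estimate p.
p = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
assert np.isclose(p.sum(), 1.0)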
Benefits of Neural Models
• Better generalization of contexts: n-gram models treat words as independent entities. Word embeddings allow grouping of similar words that behave similarly in predicting the next word.
• More generalizable combination of words into contexts: In an n-gram language model, we would have to remember parameters for all combinations of {cow, horse, goat} × {consume, chew, ingest} to represent the context "things farm animals eat". Hidden layers are more efficient at representing the concept.
Benefits of Neural Models
• Ability to skip previous words: n-gram models refer to all sequential words depending on the length of the context. Neural models can skip words.
• Simpler models: SMT models require complex data structures, in production often using 80 GB or more of heap space. High-quality NMT production models are generally less than 1 GB.
Are we there yet?
Examples of Long-term Dependency
He does not have much confidence in himself.
Long Term Dependencies
Dependencies – Reference&Context
• The Spanish King officially abdicated in … of his …, Felipe. Felipe will
be confirmed tomorrow as the new Spanish … .
Dependencies – Reference&Context
• The Spanish King officially abdicated in favour of his son, Felipe.
Felipe will be confirmed tomorrow as the new Spanish king. As the
new king, he will be the sovereign.
(Legend: Reference, Context, Reference & Context)
Dependencies – Reference&Context
• The Spanish King officially abdicated in favour of his son, Felipe.
Felipe will be confirmed tomorrow as the new Spanish king. As the
new king, he will be the sovereign.
• Their duo was known as "Buck and Bubbles".
Bubbles tapped along as Buck played on the stride piano and sang
until Bubbles was totally tapped out.
(Legend: Reference, Context, Reference & Context)
Dependencies – Register/Topic
• CNN is short for Convolutional Neural Network
• In our quest to implement perfect NLP tools, we have developed
state of the art RNNs. Now we can use them to … .
Dependencies – Register/Topic
• CNN is short for Convolutional Neural Network
• In our quest to implement perfect NLP tools, we have developed
state of the art RNNs. Now we can use them to wreck a nice beach.
Register/Topic
Neural Machine Translation
Problems of Translation
To perform translation we need to solve the following problems:
• Modeling: Decide what our model P(E | F; θ) will look like. What parameters will it have, and how will the parameters specify a probability distribution?
• Learning: We need a method to learn appropriate values for
parameters from training data.
• Search: We need to find the most probable sentence (solving
“argmax"). This process of searching for the best hypothesis is often
called decoding.
Inferring a Seq. from Another Seq.
• We still intend to calculate P(E; θ), but before doing so we would like to calculate the initial state of the target model based on another RNN over the source sentence F.
• The first RNN "encodes" the source sentence F, and the second RNN "decodes" the initial state into the target sentence E.
http://opennmt.net/
Encoder-Decoder Arch.
A model based on two LSTM networks: one takes a sentence and captures it into an output layer that is then used as input for another LSTM that performs the translation.
Generating Output
• So far we have seen how to build a probability model. In order to perform translation, we need to generate output, which is done by searching over candidate sentences in the target language.
• Search is performed based on several criteria:
• Random Sampling: randomly selecting an output E from the probability model: Ê ∼ P(E | F)
• Greedy 1-Best Search: finding the E that maximizes P(E | F): Ê = argmax_E P(E | F)
• Beam Search: finding the b options with the highest probability P(E | F)
Random Sampling
• The method is used when we would like to find several outputs for an input, e.g., dialog models in chat bots.
• Ancestral Sampling: sampling variable values one at a time, gradually conditioning on more context.
• The method is unbiased, but has high variance and is inefficient.
• Assume E represents the sequence "the", "actor", "walked", "onto", "the", "stage", ".", <eos>
• At time step t, sample a word from the distribution P(e_t | ê_1, …, ê_{t-1}) until ê_t = <eos>
RNN sampling trace: h_0 → "the" (the | <s>), h_1 → "actor" (actor | the), h_2 → "walked" (walked | the, actor), onto …
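A toy sketch of ancestral sampling in Python (the distribution is a stand-in bigram table, not a real RNN; all names and values are invented):

import random

# Stand-in for P(e_t | context): a toy bigram distribution.
P = {
    "<s>":    {"the": 1.0},
    "the":    {"actor": 0.6, "stage": 0.4},
    "actor":  {"walked": 1.0},
    "walked": {"onto": 1.0},
    "onto":   {"the": 1.0},
    "stage":  {"<eos>": 1.0},
}

def ancestral_sample(max_len=20):
    """Sample e_t one word at a time, conditioning on the previous word."""
    tokens = ["<s>"]
    while tokens[-1] != "<eos>" and len(tokens) < max_len:
        dist = P[tokens[-1]]
        words, probs = zip(*dist.items())
        tokens.append(random.choices(words, weights=probs)[0])
    return tokens[1:]

print(ancestral_sample())  # e.g. ['the', 'actor', 'walked', 'onto', 'the', 'stage', '<eos>']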
Greedy (Best) Search
• This is very similar to random sampling, with one difference: the goal is to pick the word in the target vocabulary that maximizes the probability:
ê_t = argmax_j log p_{t,j}^{(e)}
• Greedy search is not guaranteed to find the translation with the highest probability, but it is efficient in terms of memory and computation.
https://github.com/tensorflow/nmt
Beam Search
• Beam search keeps the b best hypotheses at each time step; b is called the width of the beam.
• Larger beam sizes have a strong length bias.
• b is a hyperparameter that should be selected to maximize accuracy.
• Not very efficient.
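A compact sketch of beam search over the same kind of stand-in distribution (a hypothetical model, not a real decoder; log probabilities keep the products numerically stable):

import math

# Stand-in for P(e_t | previous word); a real decoder would also condition on F.
P = {
    "<s>":    {"he": 0.6, "the": 0.4},
    "he":     {"walked": 0.7, "ran": 0.3},
    "the":    {"actor": 1.0},
    "actor":  {"<eos>": 1.0},
    "walked": {"<eos>": 1.0},
    "ran":    {"<eos>": 1.0},
}

def beam_search(b=2, max_len=10):
    """Keep the b highest-scoring partial hypotheses at each step."""
    beam = [(0.0, ["<s>"])]  # (log probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beam:
            if tokens[-1] == "<eos>":
                candidates.append((score, tokens))  # finished: carry over
                continue
            for word, p in P[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [word]))
        beam = sorted(candidates, reverse=True)[:b]  # prune to beam width b
        if all(t[-1] == "<eos>" for _, t in beam):
            break
    return beam

for score, tokens in beam_search():
    print(round(score, 3), " ".join(tokens[1:]))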
Limitations
• Long-distance dependencies
• Storing variable-length sentences in fixed-size vectors.
He purchased a car, so that he could drive to work and not spend his time on the train.
Er kaufte ein Auto, damit er zur Arbeit fahren und seine Zeit nicht im Zug verbringen konnte.
Attentional NMT
Attention
• The attention model comes between the encoder and the decoder and helps the decoder pick only the encoded inputs that are important for each step of the decoding process.
• For each encoded input from the encoder RNN, the attention mechanism calculates its importance.
https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
Visualization of Importance
https://aws.amazon.com/blogs/ai/amazon-at-wmt-improving-machine-translation-with-user-feedback/
https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
Attention Mechanism
https://talbaumel.github.io/attention/
Attention Mechanism
• Each input word gets a representation vector by running the RNN in both directions.
• Then we concatenate the two vectors to obtain a bidirectional representation.
• We can further concatenate these vectors into a matrix H whose columns represent the input words.
• We then calculate an attention vector α_t that can be used to combine the columns of H into a context vector c_t: c_t = H·α_t (sketched below)
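In NumPy, this combination step is a single matrix-vector product (a sketch with invented sizes; the scores a stand in for the score functions on the next slide):

import numpy as np

rng = np.random.default_rng(0)
d, n_words = 6, 4                      # hidden size and source length (assumed)
H = rng.standard_normal((d, n_words))  # columns: bidirectional word vectors

a = rng.standard_normal(n_words)       # unnormalized attention scores
alpha = np.exp(a) / np.exp(a).sum()    # attention vector alpha_t via softmax

c = H @ alpha                          # context vector c_t combines H's columns
print(c.shape)  # (6,)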
Calculating Attention Score
• First, we update the hidden state to h_t^{(e)} based on the word representation and context vectors from the previous target time step.
• Based on this h_t^{(e)}, we calculate an attention score a_t.
• We then normalize the attention score using softmax: α_t = softmax(a_t).
• We now have a context vector c_t and hidden state h_t^{(e)} for time step t, which we can pass on down to downstream tasks.
Calculating Attention Score
Method: Dot Product
• Pros: simple; no additional parameters
• Cons: input and output encodings must be in the same space
• Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} h_t^{(e)}
Method: Bilinear Functions
• Pros: input and output can be of different sizes
• Cons: more expensive than the dot product
• Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} W_a h_t^{(e)}, where W_a is a linear transformation
Method: MLP
• Pros: more flexible than the dot product; fewer parameters than bilinear functions
• Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := w_{a2}^T tanh(W_{a1} [h_t^{(e)}; h_j^{(f)}]), where W_{a1} and w_{a2} are the weight matrix and vector of the first and second layers of the MLP respectively
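The three score functions side by side in NumPy (a sketch; dimensions and parameter initializations are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
d_f, d_e, d_a = 8, 6, 5  # encoder, decoder, and MLP hidden sizes (assumed)

h_e = rng.standard_normal(d_e)   # decoder state h_t^(e)
h_f = rng.standard_normal(d_f)   # encoder state h_j^(f)

# Dot product: only valid when both states live in the same space.
h_f_same = rng.standard_normal(d_e)
dot_score = h_f_same @ h_e

# Bilinear: W_a bridges the two spaces, so the sizes may differ.
W_a = rng.standard_normal((d_f, d_e))
bilinear_score = h_f @ W_a @ h_e

# MLP: a small feed-forward net over the concatenation [h_e; h_f].
W_a1 = rng.standard_normal((d_a, d_e + d_f))
w_a2 = rng.standard_normal(d_a)
mlp_score = w_a2 @ np.tanh(W_a1 @ np.concatenate([h_e, h_f]))

print(dot_score, bilinear_score, mlp_score)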
Attention Mechanism
https://talbaumel.github.io/attention/
α_{tj}: attention weights
c_t: context vector
α_t: attention vector
Translation Performance
• BLEU: The BLEU score measures similarity to human translation. It measures how many words overlap in a given translation when compared to a reference translation. BLEU scores range from 0 to 100; the higher the score, the more the translation correlates with a human translation.
• TER: The TER score measures the amount of editing that a translator would have to perform to change a translation so it exactly matches a reference translation. By repeating this analysis on a large number of sample translations, it is possible to estimate the post-editing effort required for a project. TER scores also range from 0 to 100.
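For a feel of how BLEU behaves, NLTK ships a sentence-level implementation (illustrative only; published scores are computed at corpus level). The tokens reuse the German example sentence from earlier:

# Sentence-level BLEU with NLTK, scaled by 100 to match the 0-100 convention.
from nltk.translate.bleu_score import sentence_bleu

reference = [["ich", "habe", "von", "diesem", "tag", "geträumt"]]
hypothesis = ["ich", "habe", "von", "diesem", "tag", "geträumt"]
print(100 * sentence_bleu(reference, hypothesis))  # 100.0 for a perfect match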
Translation Performance
Sockeye/MXNet/AWS
Sockeye Encoder-Decoder Framework
• Sockeye is a sequence-to-sequence framework with attention for
Neural Machine Translation based on Apache MXNet Incubating. It
implements the well-known encoder-decoder architecture with
attention.
• Github repo: https://github.com/awslabs/sockeye
• Tutorials: https://github.com/awslabs/sockeye/tree/master/tutorials
Examples
python3 -m sockeye.train -s train.source \
    -t train.target \
    -vs dev.source \
    -vt dev.target \
    --num-embed 32 \
    --rnn-num-hidden 64 \
    --attention-type dot \
    --use-cpu \
    --metrics perplexity accuracy \
    --max-num-checkpoint-not-improved 3 \
    -o seqcopy_model

python -m sockeye.translate -m seqcopy_model
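Per the Sockeye tutorials, sockeye.translate reads source sentences from STDIN, so input can be piped in; for example (the input tokens here are purely illustrative):

echo "7 2 1 9 5" | python -m sockeye.translate -m seqcopy_model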
Sockeye Features
• Beam search inference
• Easy ensembling of multiple models
• Residual connections between RNN layers (Wu et al., 2016) [deep LSTMs with parallelism]
• Lexical biasing of output layer predictions (Arthur et al., 2016) [low-frequency words]
• Modeling coverage (Tu et al., 2016) [keeping attention history to reduce over- and under-translation]
• Context gating (Tu et al., 2017) [improving adequacy of translation by controlling the ratios of source and target context]
Sockeye Features Contd.
• Cross-entropy label smoothing (e.g., Pereyra et al., 2017)
• Layer normalization (Ba et al., 2016) [improving training time]
• Multiple supported attention mechanisms [dot, mlp, bilinear, multihead-dot, encoder last state, location]
• Multiple model architectures (encoder-decoder: Wu et al., 2016; convolutional: Gehring et al., 2017; Transformer: Vaswani et al., 2017)
Metrics and Visualization
MXNet
• Sockeye uses MXNet as its underlying deep learning framework.
• MXNet benchmarks show near-linear scaling across multiple machines.
Multi-GPU Training
Training can be carried out on multiple GPUs either by specifying multiple GPU device ids (--device-ids 0 1 2 3) or by specifying the number of GPUs required (--device-ids -n attempts to acquire n GPUs).
Amazon SageMaker
• A fully managed service that enables data scientists and developers to quickly and easily build production-ready machine-learning models.
• Amazon SageMaker includes several built-in state-of-the-art algorithms that are fine-tuned to run in distributed environments: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
• Among the built-in algorithms there is a Sequence2Sequence algorithm implemented using Sockeye.
• All you need to do is tune the model's hyperparameters and pass your data to the model.
• Another useful built-in algorithm is Amazon BlazingText, which provides Word2Vec.
Demo: NMT in SageMaker
References
• arXiv:1703.01619v1 [cs.CL], 5 Mar 2017: Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial. Graham Neubig, Language Technologies Institute, Carnegie Mellon University.
• All references and images are from arXiv:1703.01619v1 [cs.CL] unless directly referenced on the page.
• Sockeye documentation: http://sockeye.readthedocs.io/en/latest/index.html
© 2018 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part,
without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections or feedback on
the course, please email us at: aws-course-feedback@amazon.com. For all other questions, contact us at: https://aws.amazon.com/contact-
us/aws-training/. All trademarks are the property of their owners.
Thank You

Mais conteúdo relacionado

Semelhante a Neural Machine Translation with Sockeye

Building Applications with Apache MXNet
Building Applications with Apache MXNetBuilding Applications with Apache MXNet
Building Applications with Apache MXNetApache MXNet
 
Multivariate Time Series
Multivariate Time SeriesMultivariate Time Series
Multivariate Time SeriesApache MXNet
 
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...Amazon Web Services
 
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...Amazon Web Services
 
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...Amazon Web Services
 
Add Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML ServicesAdd Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML ServicesAmazon Web Services
 
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)Julien SIMON
 
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...Amazon Web Services
 
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Amazon Web Services
 
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...Amazon Web Services
 
DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker DataXDay Conference by Xebia
 
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...Amazon Web Services
 
Machine Learning Fundamentals
Machine Learning FundamentalsMachine Learning Fundamentals
Machine Learning FundamentalsSigOpt
 
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019 RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019 AWSKRUG - AWS한국사용자모임
 
MCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and GluonMCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and GluonAmazon Web Services
 
An Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAn Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAmazon Web Services
 
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Amazon Web Services
 
Deep Learning, Demystified
Deep Learning, DemystifiedDeep Learning, Demystified
Deep Learning, DemystifiedGabe Hollombe
 

Semelhante a Neural Machine Translation with Sockeye (20)

Building Applications with Apache MXNet
Building Applications with Apache MXNetBuilding Applications with Apache MXNet
Building Applications with Apache MXNet
 
Multivariate Time Series
Multivariate Time SeriesMultivariate Time Series
Multivariate Time Series
 
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
 
Deep Learning with MXNet
Deep Learning with MXNetDeep Learning with MXNet
Deep Learning with MXNet
 
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
 
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
 
Add Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML ServicesAdd Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML Services
 
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
 
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
 
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
 
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
 
DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker
 
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
 
Machine Learning Fundamentals
Machine Learning FundamentalsMachine Learning Fundamentals
Machine Learning Fundamentals
 
AI Today
AI TodayAI Today
AI Today
 
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019 RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
 
MCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and GluonMCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and Gluon
 
An Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAn Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMaker
 
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
 
Deep Learning, Demystified
Deep Learning, DemystifiedDeep Learning, Demystified
Deep Learning, Demystified
 

Mais de Apache MXNet

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingApache MXNet
 
Fine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringFine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringApache MXNet
 
Introduction to GluonNLP
Introduction to GluonNLPIntroduction to GluonNLP
Introduction to GluonNLPApache MXNet
 
Introduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningIntroduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningApache MXNet
 
Introduction to GluonCV
Introduction to GluonCVIntroduction to GluonCV
Introduction to GluonCVApache MXNet
 
Introduction to Computer Vision
Introduction to Computer VisionIntroduction to Computer Vision
Introduction to Computer VisionApache MXNet
 
Image Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesApache MXNet
 
Introduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionIntroduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionApache MXNet
 
Generative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetGenerative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetApache MXNet
 
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiDeep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiApache MXNet
 
Using Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetUsing Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetApache MXNet
 
AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...Apache MXNet
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetApache MXNet
 
Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet
 
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018Apache MXNet
 
Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet
 
ONNX and Edge Deployments
ONNX and Edge DeploymentsONNX and Edge Deployments
ONNX and Edge DeploymentsApache MXNet
 
Distributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkDistributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkApache MXNet
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model CompressionApache MXNet
 
Building Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet GluonBuilding Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet GluonApache MXNet
 

Mais de Apache MXNet (20)

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Fine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringFine-tuning BERT for Question Answering
Fine-tuning BERT for Question Answering
 
Introduction to GluonNLP
Introduction to GluonNLPIntroduction to GluonNLP
Introduction to GluonNLP
 
Introduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningIntroduction to object tracking with Deep Learning
Introduction to object tracking with Deep Learning
 
Introduction to GluonCV
Introduction to GluonCVIntroduction to GluonCV
Introduction to GluonCV
 
Introduction to Computer Vision
Introduction to Computer VisionIntroduction to Computer Vision
Introduction to Computer Vision
 
Image Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and Challenges
 
Introduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionIntroduction to Deep face detection and recognition
Introduction to Deep face detection and recognition
 
Generative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetGenerative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNet
 
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiDeep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
 
Using Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetUsing Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNet
 
AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNet
 
Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018
 
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
 
Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018
 
ONNX and Edge Deployments
ONNX and Edge DeploymentsONNX and Edge Deployments
ONNX and Edge Deployments
 
Distributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkDistributed Inference with MXNet and Spark
Distributed Inference with MXNet and Spark
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model Compression
 
Building Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet GluonBuilding Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet Gluon
 

Último

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Neural Machine Translation with Sockeye

  • 1. Neural Machine Translation with Sockeye Encoder-Decoder with Attention
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Cyrus M. Vahid Principal Deep Learning Solutions Architect at Amazon AI
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. In This Video • Statistical language models • Neural machine translations and long-term dependencies • Encoder-decoder architecture • Sockeye
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translation I have been dreaming of this day Ich habe von diesem Tag geträumt.
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Problems of Translation To perform translation we need to solve the following problems: • Modeling: Decide what our model ! " #; % will look like. What parameters will it have, and how will the parameters specify a probability distribution? • Learning: We need a method to learn appropriate values for parameters from training data. • Search: We need to find the most probable sentence (solving “argmax"). This process of searching for the best hypothesis is often called decoding.
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Statistical Language Models
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Statistical Prior The translation systems thus needs to be a language models itself, meaning being capable of building utterances that are naturally sounding for a given vocabulary. More formally: !(#; %) '() %*+ ,-)./ 0'1',.2.1+
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Count-Based n-gram Language Models • More Formally: Count Based Models and n-grams
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • The problem is we can not make sentences we have not seen before in training data: • Example: If there is a sequence of four words such as ”I am from Utah” that does not exist in training set, then Count-Based n-gram LM – The Problem !"#$%&' (, *+, ,-.+, /0*ℎ = 0 ∴ 5 67 = /0*ℎ 68 = (, 69 = *+, 6: = ,-.+ = 0
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • The solution is to focus on the n-previous words. These models are called n-grams. • The maximum likelihood formula thus take the following form: • Example: Count-Based n-gram LM – The Solution bi-gram “) *+”, ”*+ ./0+”, *12 “./0+ 34*ℎ” */6 *77 8/69614 :1 ;0/839 ∴ = “) *+ ./0+ >4*ℎ” > 0
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • In bigrams we have the same problem of assigning probability 0 if there is a sequence of 2 words that does not exist in training data • In trigams we have the same problem of assigning probability 0 if there is a sequence of 3 words that does not exist in training data. • … • In n-grams we have the same problem of assigning probability 0 if there is a sequence of n words that does not exist in training data Count-Based n-gram LM – Problem Persists
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Measuring Accuracy
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evaluation of Accuracy - Likelihood • Likelihood: Likelihood is probability of some observed data—e.g. test, given a model. ! ℇ#$%#; ' = ) * +ℇ#$%# !(*; ') • Example |ℇ#$%#| = /01 E1 = I am from pittsburg, ! *; ' = 2.1×109: ∗ E2 = I study at a university ! *; ' = 1.1×109< E3 = my mother is from utah, ! *; ' = 2×109= ! ℇ#$%#; ' = ) * +ℇ#$%# !(*; ') = >. / × /091 × /. / × /09? × >. 0 × /09@ = A. 1> × /09>B *probabilities are arbitrary just to demonstrate the range
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evaluation of Accuracy – Log Likelihood • !ℎ#$%#&: log+ ∏-./ 0 1- = ∑-./ 0 log+ 1-; 516&78#: log(1. ;) = log 1 + log(;) • We have seen probabilities of likelihood are very tiny. Multiplying these small values make the probabilities even smaller numbers. It is easier to work with log probabilities for measuring accuracy. log > ℇ@AB@; C = log+ D E F ℇGHIG > 5; C = J E F ℇGHIG log( >(5; C)) . • Example: *probabilities are arbitrary just to demonstrate the range |ℇLMNL| = OPQ E1 = I am from pittsburg, T U; V = 2.1×10[ ∴ log(P E; C = log 2.1 + −6 = −5.67 ∗ E2 = I study at a university T U; V = 1.1×10[e ∴ log(P E; C = log 1.1 + −8 = −7.96 E3 = my mother is from utah, T U; V = 2×10[h ∴ log(P E; C = log 2.0 + −9 = −8.70 T ℇLMNL; V = D U iℇLMNL T(U; V) = J E F ℇGHIG log( >(5; C)) = −j. Qk − k. lQ − m. kP = −nn. oo
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evaluation of Accuracy – Perplexity • It is common to divide log-likelihood by the number of words in the corpus. This makes cross-corpora measurement easier. length ℇ()*( = , - . ℇ/01/ |3| 445 ℇ()*(; 7 = 89(;<=(>(?/01/;@))/C)DE((ℇ/01/) • This accuracy indicator is called perplexity.
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Missing: Semantic and Causal Structure
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neural Models - Motivation • Statistical models loose semantic and causal structures as well as hidden information. • Log-Linear models were introduced to address such issues, but they have other problems:
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. NMT - Motivation • In separate contexts et-2 = “farmers” is compatible with et = “hay” and et-1 = “eats”. • If we are using feature sets depending on et-1 and et-2, then a sentence like “farmers eat hey” cannot be ruled out, even though it is an unnatural sentence or perhaps an unnatural farmer.
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neural Language Models
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step1: Word Embedding • In Neural Language Models we represent words as word embedding, a vector representation of words. • Then we concatinate these words to the length of the context (3 for trigrams).
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step2: Uncovering Relatedness • Different elements of a vector can potentially reflect different aspect of a word. For instance, different elements could represent countability, being and animal, … • We run the embedding vectors through hidden layer in order to learn combination features https://www.tensorflow.org/tutorials/word2vec
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step3: Score & Probability • Next, we calculate the score vector for each word. This is done by performing an affine* transform of the hidden vector h with a weight matrix and adding a bias vector. • Finally, we get a probability estimate p by running the calculated scores through a softmax function. * An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). Ref: Wolfram Mathworld
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of Neural Models • Better generalization of contexts: n-gram models treat words as independent entities. Word embedding allows grouping of similar words that perform similarly in prediction of the next word. • More generalizable combination of words into contexts: In an n- gram language model, we would have to remember parameters for all combinations of {cow; horse; goat} X {consume; chew; ingest} to represent the context of things farm animals eat". Hidden layer are more efficient in representing the concept.
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of Neural Models • Ability to skip previous words: n-gram models refer to all sequential words depending the length of the context. Neural Models can skip words. • Simplistic Models: SMT models require complex data-structures, in production often using 80 GBs or more of heap space. High quality NMT production models are generally less than 1 GB.
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. We are there yet!
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Examples of Long-term Dependency He does not have much confidence in himself.
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Long Term Dependencies
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Reference&Context • The Spanish King officially abdicated in … of his …, Felipe. Felipe will be confirmed tomorrow as the new Spanish … .
• 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Reference&Context • The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish king. As the new king, he will be the sovereign. Reference Context Reference & Context
• 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Reference&Context • The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish king. As the new king, he will be the sovereign. • Their duo was known as "Buck and Bubbles". Bubbles tapped along as Buck played on the stride piano and sang until Bubbles was totally tapped out. Reference Context Reference & Context
  • 31. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Register/Topic • CNN is short for Convolutional Neural Network • In our quest to implement perfect NLP tools, we have developed state of the art RNNs. Now we can use them to … .
  • 32. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Register/Topic • CNN is short for Convolutional Neural Network • In our quest to implement perfect NLP tools, we have developed state of the art RNNs. Now we can use them to wreck a nice beach. Register/Topic
  • 33. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neural Machine Translation
• 34. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Problems of Translation To perform translation we need to solve the following problems: • Modeling: Decide what our model P(E | F; θ) will look like. What parameters will it have, and how will the parameters specify a probability distribution? • Learning: We need a method to learn appropriate values for parameters from training data. • Search: We need to find the most probable sentence (solving the "argmax"). This process of searching for the best hypothesis is often called decoding.
• 35. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • We still intend to calculate P(E; θ), but before doing so we would like to calculate the initial state of the target model based on another RNN run over the source sentence F. • The first RNN "encodes" the source sentence F, and the second RNN "decodes" the initial state into the target sentence E. Inferring a Seq. from Another Seq. http://opennmt.net/
• 36. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Encoder-Decoder Arch. A model based on two LSTM networks: one takes a sentence and captures it in an output state, which is then used as the initial input for another LSTM that performs the translation. A sketch of this information flow follows below.
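A minimal sketch of the encoder-decoder flow, with a plain RNN cell standing in for the LSTMs and random untrained weights; the point is only that the encoder's final state initializes the decoder:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16
# Separate (random, untrained) weights for the source and target RNNs.
W_src, U_src = rng.normal(size=(dim, dim)) * 0.1, rng.normal(size=(dim, dim)) * 0.1
W_tgt, U_tgt = rng.normal(size=(dim, dim)) * 0.1, rng.normal(size=(dim, dim)) * 0.1

def rnn_step(W, U, x, h):
    """One step of a plain RNN cell (an LSTM would also carry a cell state)."""
    return np.tanh(W @ x + U @ h)

# "Encode": fold the source word embeddings into one fixed-size state.
src_embeddings = [rng.normal(size=dim) for _ in range(5)]   # 5 source words
h = np.zeros(dim)
for x in src_embeddings:
    h = rnn_step(W_src, U_src, x, h)

# "Decode": start the target RNN from the encoder's final state.
tgt_embeddings = [rng.normal(size=dim) for _ in range(4)]   # 4 target words
for y in tgt_embeddings:
    h = rnn_step(W_tgt, U_tgt, y, h)  # each h would feed a softmax over target words
print(h[:4])
```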
• 37. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Generating Output • So far we have seen how to build a probability model. In order to perform translation, we need to generate output from it by searching over possible target sentences. • Search is performed based on several criteria: • Random Sampling: randomly selecting an output E from the probability model: Ê ~ P(E | F) • Greedy 1-Best Search: finding the E that maximizes P(E | F): Ê = argmax_E P(E | F) • Beam Search: finding the b options with the highest probability P(E | F)
• 38. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Random Sampling • The method is used when we would like to find several outputs for an input, such as in dialog models for chat bots. • Ancestral Sampling: sampling variable values one at a time, gradually conditioning on more context. • The method is unbiased, but has high variance and is inefficient. • Example: suppose E represents "the", "actor", "walked", "onto", "the", "stage", ".", </s>. At each time step t we sample a word from the distribution P(e_t | ê_1^{t-1}) until ê_t = </s>. A sampling-loop sketch follows below. [Figure: an RNN sampling one word per step: "the", then "actor" given "the", then "walked" given "the actor", …]
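A minimal sketch of ancestral sampling; `next_word_distribution` is a hypothetical stand-in for the model's softmax output, not a real Sockeye function:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["the", "actor", "walked", "onto", "stage", ".", "</s>"]

def next_word_distribution(prefix):
    # Stand-in for p(e_t | e_1^{t-1}, F); a real model would run the decoder here.
    logits = rng.normal(size=len(vocab))
    logits[vocab.index("</s>")] += 0.5 * len(prefix)  # make ending likelier over time
    p = np.exp(logits - logits.max())
    return p / p.sum()

sentence = []
while (not sentence or sentence[-1] != "</s>") and len(sentence) < 20:
    p = next_word_distribution(sentence)
    sentence.append(str(rng.choice(vocab, p=p)))  # unbiased sample, but high variance
print(" ".join(sentence))
```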
• 39. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Greedy (Best) Search • This is very similar to random sampling, with one difference: at each step, the goal is to pick the target word that maximizes the probability: ê_t = argmax_w log p_w^{(t)} • Greedy search is not guaranteed to find the translation with the highest probability, but it is efficient in terms of memory and computation. https://github.com/tensorflow/nmt
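A minimal, self-contained sketch of the greedy variant: the sampling step is replaced by an argmax. `step_probs` is again a hypothetical stand-in for the decoder's softmax output:

```python
import numpy as np

vocab = ["the", "actor", "walked", "onto", "the", "stage", ".", "</s>"]

def step_probs(prefix):
    # Stand-in distribution that just walks through `vocab` deterministically.
    p = np.full(len(vocab), 1e-6)
    p[min(len(prefix), len(vocab) - 1)] = 1.0
    return p / p.sum()

prefix = []
while not prefix or prefix[-1] != "</s>":
    prefix.append(vocab[int(np.argmax(step_probs(prefix)))])  # e_t = argmax_w p(w | prefix)
print(" ".join(prefix))  # the actor walked onto the stage . </s>
```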
• 40. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Beam Search • Beam Search is a search method that keeps the b best hypotheses at each time step; b is called the width of the beam. • Larger beam sizes have a strong length bias. • b is a hyperparameter that should be selected to maximize accuracy. • Not very efficient. A simplified sketch follows below.
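A simplified, self-contained beam-search sketch; `step_log_probs` is a hypothetical stand-in for the decoder's log-softmax output, and real systems add refinements such as length normalization:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["a", "b", "c", "</s>"]

def step_log_probs(prefix):
    # Stand-in for log p(e_t | prefix, F); a real model would run the decoder here.
    logits = rng.normal(size=len(vocab))
    logits[-1] += 0.7 * len(prefix)          # make </s> likelier as prefixes grow
    return logits - np.logaddexp.reduce(logits)

def beam_search(b=3, max_len=10):
    beams = [([], 0.0)]                      # (hypothesis, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beams:
            if hyp and hyp[-1] == "</s>":    # finished hypotheses pass through
                candidates.append((hyp, score))
                continue
            lp = step_log_probs(hyp)
            for w, l in zip(vocab, lp):      # expand each live hypothesis by one word
                candidates.append((hyp + [w], score + l))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]  # keep b best
        if all(h and h[-1] == "</s>" for h, _ in beams):
            break
    return beams

for hyp, score in beam_search():
    print(f"{score:7.3f}  {' '.join(hyp)}")
```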
• 41. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Limitations • Long-distance dependencies • Storing variable-length sentences in fixed-size vectors. He purchased a car, so that he could drive to work and not spend his time on the train. Er kaufte ein Auto, damit er zur Arbeit fahren und seine Zeit nicht im Zug verbringen konnte.
  • 42. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attentional NMT
• 43. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention • The attention model sits between the encoder and the decoder and helps the decoder pick only the encoded inputs that are important for each step of the decoding process. • For each encoded input from the encoder RNN, the attention mechanism calculates its importance. https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
• 44. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Visualization of Importance https://aws.amazon.com/blogs/ai/amazon-at-wmt-improving-machine-translation-with-user-feedback/ https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
  • 45. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention Mechanism https://talbaumel.github.io/attention/
• 46. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention Mechanism • Each input word gets a hidden vector from running the RNN in each direction. • We then concatenate the two directions' vectors to obtain a bidirectional representation. • We can further concatenate these vectors into a matrix H whose columns represent the input words. • We then calculate an attention vector α_t that is used to combine the columns of H into a context vector c_t:
• 47. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Calculating Attention Score • First, we update the decoder hidden state h_t^{(e)} based on the word representation and context vectors from the previous target time step. • Based on this h_t^{(e)}, we calculate an attention score a_t. • We then normalize the attention score using a softmax, giving the attention vector α_t. • We now have a context vector c_t and hidden state h_t^{(e)} for time step t, which we can pass on to downstream tasks. A sketch of these steps follows below.
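A minimal sketch of one attention step with dot-product scoring, using random stand-in vectors for the encodings:

```python
import numpy as np

rng = np.random.default_rng(5)
src_len, dim = 6, 8
H = rng.normal(size=(dim, src_len))   # columns: bidirectional source word vectors
h_t = rng.normal(size=dim)            # decoder hidden state h_t^{(e)} at step t

scores = H.T @ h_t                    # attention scores a_t (dot product per column)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                  # softmax -> attention vector alpha_t
c_t = H @ alpha                       # context vector: weighted sum of the columns of H
print(alpha.round(3), c_t.shape)
```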
• 48. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Calculating Attention Score
Dot Product. Pros: simple, no additional parameters. Cons: input and output encodings must be in the same space. Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} h_t^{(e)}
Bilinear Functions. Pros: input and output can have different sizes. Cons: more expensive than the dot product. Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} W_a h_t^{(e)}, where W_a is a learned transformation.
MLP. Pros: more flexible than the dot product, fewer parameters than bilinear functions. Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := w_{a2}^T tanh(W_{a1}[h_t^{(e)}; h_j^{(f)}]), where W_{a1} and w_{a2} are the weight matrix and vector of the first and second layers of the MLP respectively.
Sketches of all three follow below.
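Minimal sketches of the three scoring functions above, with random untrained weights and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)
d_f, d_e, d_a = 8, 6, 10
h_f, h_e = rng.normal(size=d_f), rng.normal(size=d_e)  # source / target vectors

# Dot product: requires d_f == d_e (same space), so shown with matching sizes.
h_f8, h_e8 = rng.normal(size=8), rng.normal(size=8)
score_dot = h_f8 @ h_e8

# Bilinear: W_a maps between the two spaces, so d_f and d_e may differ.
W_a = rng.normal(size=(d_f, d_e))
score_bilinear = h_f @ W_a @ h_e

# MLP: concatenate, apply a small feed-forward layer, then project to a scalar.
W_a1 = rng.normal(size=(d_a, d_e + d_f))
w_a2 = rng.normal(size=d_a)
score_mlp = w_a2 @ np.tanh(W_a1 @ np.concatenate([h_e, h_f]))

print(score_dot, score_bilinear, score_mlp)
```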
• 49. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention Mechanism https://talbaumel.github.io/attention/ α_t: attention weights (the attention vector); c_t: context vector
• 50. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translation Performance • BLEU: The BLEU score measures similarity to human translation. It measures how many words overlap in a given translation when compared to a reference translation. BLEU scores range from 0 to 100; the higher the score, the more the translation correlates with a human translation. • TER: The TER score measures the amount of editing that a translator would have to perform to change a translation so that it exactly matches a reference translation. By repeating this analysis on a large number of sample translations, it is possible to estimate the post-editing effort required for a project. TER scores also range from 0 to 100; here, lower is better.
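In practice these scores can be computed with, e.g., the sacrebleu package (an assumption here, not necessarily the tool used on this slide); a quick illustration:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = ["Ich habe von diesem Tag geträumt ."]
references = [["Ich habe von diesem Tag geträumt ."]]  # one reference stream

print(BLEU().corpus_score(hypotheses, references))  # BLEU = 100.00 for an exact match
print(TER().corpus_score(hypotheses, references))   # TER = 0.00 (no edits needed)
```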
  • 51. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translation Performance
  • 52. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye/MXNet/AWS
• 53. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye Encoder-Decoder Framework • Sockeye is a sequence-to-sequence framework with attention for Neural Machine Translation, based on Apache MXNet (Incubating). It implements the well-known encoder-decoder architecture with attention. • GitHub repo: https://github.com/awslabs/sockeye • Tutorials: https://github.com/awslabs/sockeye/tree/master/tutorials
• 54. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Examples • python3 -m sockeye.train -s train.source -t train.target -vs dev.source -vt dev.target --num-embed 32 --rnn-num-hidden 64 --attention-type dot --use-cpu --metrics perplexity accuracy --max-num-checkpoint-not-improved 3 -o seqcopy_model • python -m sockeye.translate -m seqcopy_model
• 55. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye Features • Beam search inference • Easy ensembling of multiple models • Residual connections between RNN layers (Wu et al., 2016) [deep LSTM with parallelism] • Lexical biasing of output layer predictions (Arthur et al., 2016) [low-frequency words] • Modeling coverage (Tu et al., 2016) [keeping attention history to reduce over- and under-translation] • Context gating (Tu et al., 2017) [improving adequacy of translation by controlling ratios of source and target context]
• 56. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye Features Contd. • Cross-entropy label smoothing (e.g., Pereyra et al., 2017) • Layer normalization (Ba et al., 2016) [improving training time] • Multiple supported attention mechanisms [dot, mlp, bilinear, multihead-dot, encoder last state, location] • Multiple model architectures (encoder-decoder: Wu et al., 2016; convolutional: Gehring et al., 2017; Transformer: Vaswani et al., 2017)
  • 57. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Metrics and Visualization
• 58. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. MXNet • Sockeye uses MXNet as its underlying deep learning framework. • MXNet benchmarks show near-linear scaling across multiple machines.
• 59. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Multi-GPU Training Training can be carried out on multiple GPUs, either by specifying multiple GPU device ids: --device-ids 0 1 2 3, or by specifying the number of GPUs required: --device-ids -n attempts to acquire n GPUs.
• 60. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker • A fully managed service that enables data scientists and developers to quickly and easily build production-ready machine-learning models. • Amazon SageMaker includes several built-in state-of-the-art algorithms that are fine-tuned to run in distributed environments. https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html • Among the built-in algorithms there is a Sequence2Sequence algorithm implemented using Sockeye. • All you need to do is tune the model's hyperparameters and pass your data to the model. • Another useful built-in algorithm is Amazon BlazingText, which provides Word2Vec.
  • 61. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demo: NMT in SageMaker
• 62. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. References • Graham Neubig. Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial. arXiv:1703.01619v1 [cs.CL], 5 Mar 2017. Language Technologies Institute, Carnegie Mellon University. • All references and images are from arXiv:1703.01619v1 [cs.CL] unless directly referenced on the page. • Sockeye documentation: http://sockeye.readthedocs.io/en/latest/index.html
• 63. © 2018 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. For corrections or feedback on the course, please email us at: aws-course-feedback@amazon.com. For all other questions, contact us at: https://aws.amazon.com/contact-us/aws-training/. All trademarks are the property of their owners. Thank You