Neural Machine Translation with Sockeye
Encoder-Decoder with Attention
Cyrus M. Vahid, Principal Deep Learning Solutions Architect, Amazon AI
In This Video
• Statistical language models
• Neural machine translation and long-term dependencies
• Encoder-decoder architecture
• Sockeye
Translation
I have been dreaming of this day
Ich habe von diesem Tag geträumt.
Problems of Translation
To perform translation we need to solve the following problems:
• Modeling: Decide what our model P(E | F; θ) will look like. What parameters will it have, and how will the parameters specify a probability distribution?
• Learning: We need a method to learn appropriate values for
parameters from training data.
• Search: We need to find the most probable sentence (solving
“argmax"). This process of searching for the best hypothesis is often
called decoding.
Statistical Language Models
Statistical Prior
The translation system thus needs to be a language model itself, meaning it must be capable of producing utterances that sound natural for a given vocabulary. More formally:
P(E; θ) for the model parameters θ
Count-Based Models and n-grams
• Count-based n-gram language models
• More formally: P_ML(e_t | e_1, …, e_{t-1}) = c(e_1, …, e_t) / c(e_1, …, e_{t-1})
Count-Based n-gram LM – The Problem
• The problem is that we cannot make sentences we have not seen before in the training data:
• Example: If a sequence of four words such as "I am from Utah" does not exist in the training set, then
c_prefix(I, am, from, Utah) = 0
∴ P(e_4 = Utah | e_1 = I, e_2 = am, e_3 = from) = 0
Count-Based n-gram LM – The Solution
• The solution is to condition only on the previous n−1 words. These models are called n-grams.
• The maximum-likelihood formula thus takes the following form:
P_ML(e_t | e_{t-n+1}, …, e_{t-1}) = c(e_{t-n+1}, …, e_t) / c(e_{t-n+1}, …, e_{t-1})
• Example (bi-gram): "I am", "am from", and "from Utah" are all present in the corpus
∴ P("I am from Utah") > 0
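To make the counting concrete, here is a minimal Python sketch (not from the slides; the toy corpus and helper names are invented) of a bigram maximum-likelihood model, including the zero-probability failure mode described above:

from collections import Counter

# Toy training corpus (illustrative only).
corpus = [["i", "am", "from", "utah"], ["i", "am", "here"]]

# Count bigrams and their one-word prefixes.
bigram_counts = Counter()
prefix_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        prefix_counts[prev] += 1

def p_ml(cur, prev):
    """Maximum-likelihood estimate c(prev, cur) / c(prev)."""
    if prefix_counts[prev] == 0:
        return 0.0  # unseen context: the zero-probability problem
    return bigram_counts[(prev, cur)] / prefix_counts[prev]

print(p_ml("am", "i"))     # 1.0: "i am" follows "i" in every sentence
print(p_ml("utah", "am"))  # 0.0: the bigram "am utah" was never seen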
Count-Based n-gram LM – Problem Persists
• In bigrams we have the same problem of assigning probability 0 if a sequence of 2 words does not exist in the training data.
• In trigrams we have the same problem of assigning probability 0 if a sequence of 3 words does not exist in the training data.
• …
• In n-grams we have the same problem of assigning probability 0 if a sequence of n words does not exist in the training data.
Measuring Accuracy
Evaluation of Accuracy – Likelihood
• Likelihood: the probability of some observed data (e.g., a test set) given a model:
P(𝓔_test; θ) = ∏_{E ∈ 𝓔_test} P(E; θ)
• Example*
|𝓔_test| = 106
E1 = "I am from Pittsburgh", P(E; θ) = 2.1×10⁻⁶
E2 = "I study at a university", P(E; θ) = 1.1×10⁻⁸
E3 = "my mother is from Utah", P(E; θ) = 2×10⁻⁹
P(𝓔_test; θ) = ∏_{E ∈ 𝓔_test} P(E; θ) = 2.1×10⁻⁶ × 1.1×10⁻⁸ × 2.0×10⁻⁹ = 4.62×10⁻²³
*probabilities are arbitrary, just to demonstrate the range
Evaluation of Accuracy – Log Likelihood
• Theorem: log_b ∏_{i=1}^{n} x_i = Σ_{i=1}^{n} log_b x_i; Example: log(x·y) = log(x) + log(y)
• We have seen that likelihoods are very tiny; multiplying these small values makes the probabilities even smaller. It is easier to work with log probabilities when measuring accuracy:
log P(𝓔_test; θ) = log ∏_{E ∈ 𝓔_test} P(E; θ) = Σ_{E ∈ 𝓔_test} log P(E; θ)
• Example*
|𝓔_test| = 106
E1 = "I am from Pittsburgh", P(E; θ) = 2.1×10⁻⁶ ∴ log P(E; θ) = log 2.1 + (−6) = −5.67
E2 = "I study at a university", P(E; θ) = 1.1×10⁻⁸ ∴ log P(E; θ) = log 1.1 + (−8) = −7.96
E3 = "my mother is from Utah", P(E; θ) = 2×10⁻⁹ ∴ log P(E; θ) = log 2.0 + (−9) = −8.70
log P(𝓔_test; θ) = Σ_{E ∈ 𝓔_test} log P(E; θ) = −5.67 − 7.96 − 8.70 = −22.33
*probabilities are arbitrary, just to demonstrate the range
Evaluation of Accuracy – Perplexity
• It is common to divide log-likelihood by the number of words in the corpus. This makes cross-corpora measurement easier.
length(𝓔_test) = Σ_{E ∈ 𝓔_test} |E|
ppl(𝓔_test; θ) = b^(−log_b P(𝓔_test; θ) / length(𝓔_test))
• This accuracy indicator is called perplexity.
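As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (a sketch; the probabilities are the arbitrary ones from the example, and the word counts are taken from the example sentences):

import math

# Arbitrary per-sentence probabilities from the example above.
sentence_probs = [2.1e-6, 1.1e-8, 2.0e-9]
sentence_lengths = [4, 5, 5]  # word counts of E1, E2, E3

# Log-likelihood (base 10): the sum of per-sentence log probabilities.
log_likelihood = sum(math.log10(p) for p in sentence_probs)
print(round(log_likelihood, 2))  # -22.34 (the slide rounds each term first: -22.33)

# Perplexity: b ** (-log_b P / length), here with b = 10.
length = sum(sentence_lengths)
print(round(10 ** (-log_likelihood / length), 2))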
Missing: Semantic and Causal Structure
Neural Models - Motivation
• Statistical models lose semantic and causal structures as well as hidden information.
• Log-linear models were introduced to address such issues, but they have other problems:
NMT - Motivation
• In separate contexts, e_{t-2} = "farmers" is compatible with e_t = "hay", and so is e_{t-1} = "eats".
• If we are using feature sets depending on e_{t-1} and e_{t-2}, then a sentence like "farmers eat hay" cannot be ruled out, even though it is an unnatural sentence, or perhaps an unnatural farmer.
Neural Language Models
Step 1: Word Embedding
• In neural language models we represent words as word embeddings, a vector representation of words.
• Then we concatenate these word vectors to the length of the context (3 for trigrams).
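A minimal sketch of this lookup-and-concatenate step in NumPy (the vocabulary, dimensions, and values are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
vocab = {"i": 0, "am": 1, "from": 2}
E = rng.standard_normal((len(vocab), 4))  # embedding matrix, 4 dims per word

# Look up each context word and concatenate into a single input vector.
context = ["i", "am", "from"]
x = np.concatenate([E[vocab[w]] for w in context])
print(x.shape)  # (12,) = 3 context words x 4 dimensions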
Step 2: Uncovering Relatedness
• Different elements of a vector can potentially reflect different aspects of a word. For instance, different elements could represent countability, being an animal, …
• We run the embedding vectors through a hidden layer in order to learn combination features.
https://www.tensorflow.org/tutorials/word2vec
Step 3: Score & Probability
• Next, we calculate the score vector for each word. This is done by
performing an affine* transform of the hidden vector h with a weight
matrix and adding a bias vector.
• Finally, we get a probability estimate p by running the calculated
scores through a softmax function.
* An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on
a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). Ref:
Wolfram Mathworld
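A sketch of this score-and-softmax step in NumPy (shapes and initializations are invented; this is not code from the talk):

import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 64, 10_000

h = rng.standard_normal(hidden_size)                # hidden vector from Step 2
W = rng.standard_normal((vocab_size, hidden_size))  # output weight matrix
b = rng.standard_normal(vocab_size)                 # bias vector

s = W @ h + b  # affine transform: one score per vocabulary word

# Softmax turns the scores into the probability estimate p.
p = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
assert np.isclose(p.sum(), 1.0)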
Benefits of Neural Models
• Better generalization of contexts: n-gram models treat words as independent entities. Word embeddings allow grouping of similar words that behave similarly in predicting the next word.
• More generalizable combination of words into contexts: In an n-gram language model, we would have to remember parameters for all combinations of {cow, horse, goat} × {consume, chew, ingest} to represent the context "things farm animals eat". Hidden layers are more efficient at representing the concept.
Benefits of Neural Models
• Ability to skip previous words: n-gram models refer to all sequential words depending on the length of the context. Neural models can skip words.
• Simpler models: SMT models require complex data structures, in production often using 80 GB or more of heap space. High-quality NMT production models are generally less than 1 GB.
Are we there yet?
Examples of Long-term Dependency
He does not have much confidence in himself.
Long Term Dependencies
Dependencies – Reference&Context
• The Spanish King officially abdicated in … of his …, Felipe. Felipe will
be confirmed tomorrow as the new Spanish … .
Dependencies – Reference&Context
• The Spanish King officially abdicated in favour of his son, Felipe.
Felipe will be confirmed tomorrow as the new Spanish king. As the
new king, he will be the sovereign.
(Legend: Reference, Context, Reference & Context)
Dependencies – Reference&Context
• The Spanish King officially abdicated in favour of his son, Felipe.
Felipe will be confirmed tomorrow as the new Spanish king. As the
new king, he will be the sovereign.
• Their duo was known as "Buck and Bubbles".
Bubbles tapped along as Buck played on the stride piano and sang
until Bubbles was totally tapped out.
(Legend: Reference, Context, Reference & Context)
Dependencies – Register/Topic
• CNN is short for Convolutional Neural Network
• In our quest to implement perfect NLP tools, we have developed
state of the art RNNs. Now we can use them to … .
Dependencies – Register/Topic
• CNN is short for Convolutional Neural Network
• In our quest to implement perfect NLP tools, we have developed
state of the art RNNs. Now we can use them to wreck a nice beach.
Register/Topic
Neural Machine Translation
Problems of Translation
To perform translation we need to solve the following problems:
• Modeling: Decide what our model P(E | F; θ) will look like. What parameters will it have, and how will the parameters specify a probability distribution?
• Learning: We need a method to learn appropriate values for
parameters from training data.
• Search: We need to find the most probable sentence (solving
“argmax"). This process of searching for the best hypothesis is often
called decoding.
Inferring a Seq. from Another Seq.
• We still intend to calculate P(E; θ), but before doing so we would like to calculate the initial state of the target model based on another RNN over the source sentence F.
• The first RNN "encodes" the source sentence F, and the second RNN "decodes" the initial state into the target sentence E.
http://opennmt.net/
Encoder-Decoder Arch.
A model based on two LSTM networks: one takes a sentence and captures it into an output layer that is then used as input for another LSTM that performs the translation.
Generating Output
• So far we have seen how to build a probability model. In order to perform translation, we need to generate output, which is done by searching over candidate sentences in the target language.
• Search is performed based on several criteria:
• Random Sampling: randomly selecting an output E from the probability model: Ê ∼ P(E | F)
• Greedy 1-Best Search: finding the E that maximizes P(E | F): Ê = argmax_E P(E | F)
• Beam Search: finding the b options with the highest probability P(E | F)
Random Sampling
• The method is used when we would like to find several outputs for an input, e.g., dialog models in chat bots.
• Ancestral Sampling: sampling variable values one at a time, gradually conditioning on more context.
• The method is unbiased, but has high variance and is inefficient.
• Assume E represents the sequence "the", "actor", "walked", "onto", "the", "stage", ".", <eos>
• At time step t, sample a word from the distribution P(e_t | ê_1, …, ê_{t-1}) until ê_t = <eos>
RNN sampling trace: h_0 → "the" (the | <s>), h_1 → "actor" (actor | the), h_2 → "walked" (walked | the, actor), onto …
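A toy sketch of ancestral sampling in Python (the distribution is a stand-in bigram table, not a real RNN; all names and values are invented):

import random

# Stand-in for P(e_t | context): a toy bigram distribution.
P = {
    "<s>":    {"the": 1.0},
    "the":    {"actor": 0.6, "stage": 0.4},
    "actor":  {"walked": 1.0},
    "walked": {"onto": 1.0},
    "onto":   {"the": 1.0},
    "stage":  {"<eos>": 1.0},
}

def ancestral_sample(max_len=20):
    """Sample e_t one word at a time, conditioning on the previous word."""
    tokens = ["<s>"]
    while tokens[-1] != "<eos>" and len(tokens) < max_len:
        dist = P[tokens[-1]]
        words, probs = zip(*dist.items())
        tokens.append(random.choices(words, weights=probs)[0])
    return tokens[1:]

print(ancestral_sample())  # e.g. ['the', 'actor', 'walked', 'onto', 'the', 'stage', '<eos>']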
Greedy (Best) Search
• This is very similar to random sampling, with one difference: the goal is to pick the word in the target vocabulary that maximizes the probability:
ê_t = argmax_j log p_{t,j}^{(e)}
• Greedy search is not guaranteed to find the translation with the highest probability, but it is efficient in terms of memory and computation.
https://github.com/tensorflow/nmt
Beam Search
• Beam search keeps the b best hypotheses at each time step; b is called the width of the beam.
• Larger beam sizes have a strong length bias.
• b is a hyperparameter that should be selected to maximize accuracy.
• Not very efficient.
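A compact sketch of beam search over the same kind of stand-in distribution (a hypothetical model, not a real decoder; log probabilities keep the products numerically stable):

import math

# Stand-in for P(e_t | previous word); a real decoder would also condition on F.
P = {
    "<s>":    {"he": 0.6, "the": 0.4},
    "he":     {"walked": 0.7, "ran": 0.3},
    "the":    {"actor": 1.0},
    "actor":  {"<eos>": 1.0},
    "walked": {"<eos>": 1.0},
    "ran":    {"<eos>": 1.0},
}

def beam_search(b=2, max_len=10):
    """Keep the b highest-scoring partial hypotheses at each step."""
    beam = [(0.0, ["<s>"])]  # (log probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beam:
            if tokens[-1] == "<eos>":
                candidates.append((score, tokens))  # finished: carry over
                continue
            for word, p in P[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [word]))
        beam = sorted(candidates, reverse=True)[:b]  # prune to beam width b
        if all(t[-1] == "<eos>" for _, t in beam):
            break
    return beam

for score, tokens in beam_search():
    print(round(score, 3), " ".join(tokens[1:]))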
Limitations
• Long-distance dependencies
• Storing variable-length sentences in fixed-size vectors.
He purchased a car, so that he could drive to work and not spend his time on the train.
Er kaufte ein Auto, damit er zur Arbeit fahren und seine Zeit nicht im Zug verbringen konnte.
Attentional NMT
Attention
• The attention model comes between the encoder and the decoder and helps the decoder pick only the encoded inputs that are important for each step of the decoding process.
• For each encoded input from the encoder RNN, the attention mechanism calculates its importance.
https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
Visualization of Importance
https://aws.amazon.com/blogs/ai/amazon-at-wmt-improving-machine-translation-with-user-feedback/
https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
Attention Mechanism
https://talbaumel.github.io/attention/
Attention Mechanism
• Each input word gets a representation vector by running the RNN in both directions.
• Then we concatenate the two vectors to obtain a bidirectional representation.
• We can further concatenate these vectors into a matrix H whose columns represent the input words.
• We then calculate an attention vector α_t that can be used to combine the columns of H into a context vector c_t: c_t = H·α_t (sketched below)
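In NumPy, this combination step is a single matrix-vector product (a sketch with invented sizes; the scores a stand in for the score functions on the next slide):

import numpy as np

rng = np.random.default_rng(0)
d, n_words = 6, 4                      # hidden size and source length (assumed)
H = rng.standard_normal((d, n_words))  # columns: bidirectional word vectors

a = rng.standard_normal(n_words)       # unnormalized attention scores
alpha = np.exp(a) / np.exp(a).sum()    # attention vector alpha_t via softmax

c = H @ alpha                          # context vector c_t combines H's columns
print(c.shape)  # (6,)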
Calculating Attention Score
• First, we update the hidden state to h_t^{(e)} based on the word representation and context vectors from the previous target time step.
• Based on this h_t^{(e)}, we calculate an attention score a_t.
• We then normalize the attention score using softmax: α_t = softmax(a_t).
• We now have a context vector c_t and hidden state h_t^{(e)} for time step t, which we can pass on down to downstream tasks.
Calculating Attention Score
Method: Dot Product
• Pros: simple; no additional parameters
• Cons: input and output encodings must be in the same space
• Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} h_t^{(e)}
Method: Bilinear Functions
• Pros: input and output can be of different sizes
• Cons: more expensive than the dot product
• Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} W_a h_t^{(e)}, where W_a is a linear transformation
Method: MLP
• Pros: more flexible than the dot product; fewer parameters than bilinear functions
• Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := w_{a2}^T tanh(W_{a1} [h_t^{(e)}; h_j^{(f)}]), where W_{a1} and w_{a2} are the weight matrix and vector of the first and second layers of the MLP respectively
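The three score functions side by side in NumPy (a sketch; dimensions and parameter initializations are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
d_f, d_e, d_a = 8, 6, 5  # encoder, decoder, and MLP hidden sizes (assumed)

h_e = rng.standard_normal(d_e)   # decoder state h_t^(e)
h_f = rng.standard_normal(d_f)   # encoder state h_j^(f)

# Dot product: only valid when both states live in the same space.
h_f_same = rng.standard_normal(d_e)
dot_score = h_f_same @ h_e

# Bilinear: W_a bridges the two spaces, so the sizes may differ.
W_a = rng.standard_normal((d_f, d_e))
bilinear_score = h_f @ W_a @ h_e

# MLP: a small feed-forward net over the concatenation [h_e; h_f].
W_a1 = rng.standard_normal((d_a, d_e + d_f))
w_a2 = rng.standard_normal(d_a)
mlp_score = w_a2 @ np.tanh(W_a1 @ np.concatenate([h_e, h_f]))

print(dot_score, bilinear_score, mlp_score)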
Attention Mechanism
https://talbaumel.github.io/attention/
α_{tj}: attention weights
c_t: context vector
α_t: attention vector
Translation Performance
• BLEU: The BLEU score measures similarity to human translation. It measures how many words overlap in a given translation when compared to a reference translation. BLEU scores range from 0 to 100; the higher the score, the more the translation correlates with a human translation.
• TER: The TER score measures the amount of editing that a translator would have to perform to change a translation so it exactly matches a reference translation. By repeating this analysis on a large number of sample translations, it is possible to estimate the post-editing effort required for a project. TER scores also range from 0 to 100.
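For a feel of how BLEU behaves, NLTK ships a sentence-level implementation (illustrative only; published scores are computed at corpus level). The tokens reuse the German example sentence from earlier:

# Sentence-level BLEU with NLTK, scaled by 100 to match the 0-100 convention.
from nltk.translate.bleu_score import sentence_bleu

reference = [["ich", "habe", "von", "diesem", "tag", "geträumt"]]
hypothesis = ["ich", "habe", "von", "diesem", "tag", "geträumt"]
print(100 * sentence_bleu(reference, hypothesis))  # 100.0 for a perfect match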
Translation Performance
Sockeye/MXNet/AWS
Sockeye Encoder-Decoder Framework
• Sockeye is a sequence-to-sequence framework with attention for
Neural Machine Translation based on Apache MXNet Incubating. It
implements the well-known encoder-decoder architecture with
attention.
• Github repo: https://github.com/awslabs/sockeye
• Tutorials: https://github.com/awslabs/sockeye/tree/master/tutorials
Examples
python3 -m sockeye.train -s train.source \
    -t train.target \
    -vs dev.source \
    -vt dev.target \
    --num-embed 32 \
    --rnn-num-hidden 64 \
    --attention-type dot \
    --use-cpu \
    --metrics perplexity accuracy \
    --max-num-checkpoint-not-improved 3 \
    -o seqcopy_model

python -m sockeye.translate -m seqcopy_model
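Per the Sockeye tutorials, sockeye.translate reads source sentences from STDIN, so input can be piped in; for example (the input tokens here are purely illustrative):

echo "7 2 1 9 5" | python -m sockeye.translate -m seqcopy_model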
Sockeye Features
• Beam search inference
• Easy ensembling of multiple models
• Residual connections between RNN layers (Wu et al., 2016) [deep LSTMs with parallelism]
• Lexical biasing of output layer predictions (Arthur et al., 2016) [low-frequency words]
• Modeling coverage (Tu et al., 2016) [keeping attention history to reduce over- and under-translation]
• Context gating (Tu et al., 2017) [improving adequacy of translation by controlling the ratios of source and target context]
Sockeye Features Contd.
• Cross-entropy label smoothing (e.g., Pereyra et al., 2017)
• Layer normalization (Ba et al., 2016) [improving training time]
• Multiple supported attention mechanisms [dot, mlp, bilinear, multihead-dot, encoder last state, location]
• Multiple model architectures (encoder-decoder: Wu et al., 2016; convolutional: Gehring et al., 2017; Transformer: Vaswani et al., 2017)
Metrics and Visualization
MXNet
• Sockeye uses MXNet as its underlying deep learning framework.
• MXNet benchmarks show near-linear scaling across multiple machines.
Multi-GPU Training
Training can be carried out on multiple GPUs either by specifying multiple GPU device ids (--device-ids 0 1 2 3) or by specifying the number of GPUs required (--device-ids -n attempts to acquire n GPUs).
Amazon SageMaker
• A fully managed service that enables data scientists and developers to quickly and easily build production-ready machine-learning models.
• Amazon SageMaker includes several built-in state-of-the-art algorithms that are fine-tuned to run in distributed environments: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
• Among the built-in algorithms there is a Sequence2Sequence algorithm implemented using Sockeye.
• All you need to do is tune the model's hyperparameters and pass your data to the model.
• Another useful built-in algorithm is Amazon BlazingText, which provides Word2Vec.
Demo: NMT in SageMaker
References
• arXiv:1703.01619v1 [cs.CL], 5 Mar 2017: Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial. Graham Neubig, Language Technologies Institute, Carnegie Mellon University.
• All references and images are from arXiv:1703.01619v1 [cs.CL] unless directly referenced on the page.
• Sockeye documentation: http://sockeye.readthedocs.io/en/latest/index.html
© 2018 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part,
without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections or feedback on
the course, please email us at: aws-course-feedback@amazon.com. For all other questions, contact us at: https://aws.amazon.com/contact-
us/aws-training/. All trademarks are the property of their owners.
Thank You

Mais conteúdo relacionado

Semelhante a Neural Machine Translation with Sockeye

Building Applications with Apache MXNet
Building Applications with Apache MXNetBuilding Applications with Apache MXNet
Building Applications with Apache MXNetApache MXNet
 
Multivariate Time Series
Multivariate Time SeriesMultivariate Time Series
Multivariate Time SeriesApache MXNet
 
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...Amazon Web Services
 
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...Amazon Web Services
 
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...Amazon Web Services
 
Add Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML ServicesAdd Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML ServicesAmazon Web Services
 
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)Julien SIMON
 
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...Amazon Web Services
 
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Amazon Web Services
 
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...Amazon Web Services
 
DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker DataXDay Conference by Xebia
 
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...Amazon Web Services
 
Machine Learning Fundamentals
Machine Learning FundamentalsMachine Learning Fundamentals
Machine Learning FundamentalsSigOpt
 
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019 RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019 AWSKRUG - AWS한국사용자모임
 
MCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and GluonMCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and GluonAmazon Web Services
 
An Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAn Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAmazon Web Services
 
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Amazon Web Services
 
Deep Learning, Demystified
Deep Learning, DemystifiedDeep Learning, Demystified
Deep Learning, DemystifiedGabe Hollombe
 

Semelhante a Neural Machine Translation with Sockeye (20)

Building Applications with Apache MXNet
Building Applications with Apache MXNetBuilding Applications with Apache MXNet
Building Applications with Apache MXNet
 
Multivariate Time Series
Multivariate Time SeriesMultivariate Time Series
Multivariate Time Series
 
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
Build Deep Learning Applications Using MXNet and Amazon SageMaker (AIM418) - ...
 
Deep Learning with MXNet
Deep Learning with MXNetDeep Learning with MXNet
Deep Learning with MXNet
 
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
Amazon SageMaker Ground Truth: Build High-Quality and Accurate ML Training Da...
 
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
Creating Movie Magic with Computational Simulations (AIS303) - AWS re:Invent ...
 
Add Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML ServicesAdd Intelligence to Applications with AWS ML Services
Add Intelligence to Applications with AWS ML Services
 
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
Innovating with Machine Learning on AWS - Travel & Hospitality (November 2018)
 
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
AWS Machine Learning Week SF: Build Intelligent Applications with AWS ML Serv...
 
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
Automatic Model Tuning Using Amazon SageMaker (AIM412) - AWS re:Invent 2018
 
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
Sequence-to-Sequence Modeling with Apache MXNet, Sockeye, and Amazon SageMake...
 
DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker DataXDay - Machine learning models at scale with Amazon SageMaker
DataXDay - Machine learning models at scale with Amazon SageMaker
 
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
Add Intelligence to Applications with AWS ML Services: Machine Learning Week ...
 
Machine Learning Fundamentals
Machine Learning FundamentalsMachine Learning Fundamentals
Machine Learning Fundamentals
 
AI Today
AI TodayAI Today
AI Today
 
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019 RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
RoboMaker로 DeepRacer 자율 주행차 만들기 :: 유정열 - AWS Community Day 2019
 
MCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and GluonMCL310_Building Deep Learning Applications with Apache MXNet and Gluon
MCL310_Building Deep Learning Applications with Apache MXNet and Gluon
 
An Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMakerAn Introduction to Reinforcement Learning with Amazon SageMaker
An Introduction to Reinforcement Learning with Amazon SageMaker
 
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
 
Deep Learning, Demystified
Deep Learning, DemystifiedDeep Learning, Demystified
Deep Learning, Demystified
 

Mais de Apache MXNet

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingApache MXNet
 
Fine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringFine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringApache MXNet
 
Introduction to GluonNLP
Introduction to GluonNLPIntroduction to GluonNLP
Introduction to GluonNLPApache MXNet
 
Introduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningIntroduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningApache MXNet
 
Introduction to GluonCV
Introduction to GluonCVIntroduction to GluonCV
Introduction to GluonCVApache MXNet
 
Introduction to Computer Vision
Introduction to Computer VisionIntroduction to Computer Vision
Introduction to Computer VisionApache MXNet
 
Image Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesApache MXNet
 
Introduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionIntroduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionApache MXNet
 
Generative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetGenerative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetApache MXNet
 
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiDeep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiApache MXNet
 
Using Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetUsing Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetApache MXNet
 
AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...Apache MXNet
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetApache MXNet
 
Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet
 
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018Apache MXNet
 
Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet
 
ONNX and Edge Deployments
ONNX and Edge DeploymentsONNX and Edge Deployments
ONNX and Edge DeploymentsApache MXNet
 
Distributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkDistributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkApache MXNet
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model CompressionApache MXNet
 
Building Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet GluonBuilding Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet GluonApache MXNet
 

Mais de Apache MXNet (20)

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Fine-tuning BERT for Question Answering
Fine-tuning BERT for Question AnsweringFine-tuning BERT for Question Answering
Fine-tuning BERT for Question Answering
 
Introduction to GluonNLP
Introduction to GluonNLPIntroduction to GluonNLP
Introduction to GluonNLP
 
Introduction to object tracking with Deep Learning
Introduction to object tracking with Deep LearningIntroduction to object tracking with Deep Learning
Introduction to object tracking with Deep Learning
 
Introduction to GluonCV
Introduction to GluonCVIntroduction to GluonCV
Introduction to GluonCV
 
Introduction to Computer Vision
Introduction to Computer VisionIntroduction to Computer Vision
Introduction to Computer Vision
 
Image Segmentation: Approaches and Challenges
Image Segmentation: Approaches and ChallengesImage Segmentation: Approaches and Challenges
Image Segmentation: Approaches and Challenges
 
Introduction to Deep face detection and recognition
Introduction to Deep face detection and recognitionIntroduction to Deep face detection and recognition
Introduction to Deep face detection and recognition
 
Generative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNetGenerative Adversarial Networks (GANs) using Apache MXNet
Generative Adversarial Networks (GANs) using Apache MXNet
 
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.aiDeep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
Deep Learning With Apache MXNet On Video by Ben Taylor @ ziff.ai
 
Using Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNetUsing Java to deploy Deep Learning models with MXNet
Using Java to deploy Deep Learning models with MXNet
 
AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...AI powered emotion recognition: From Inception to Production - Global AI Conf...
AI powered emotion recognition: From Inception to Production - Global AI Conf...
 
MXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNetMXNet Paris Workshop - Intro To MXNet
MXNet Paris Workshop - Intro To MXNet
 
Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018Apache MXNet ODSC West 2018
Apache MXNet ODSC West 2018
 
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
 
Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018Apache MXNet EcoSystem - ACNA2018
Apache MXNet EcoSystem - ACNA2018
 
ONNX and Edge Deployments
ONNX and Edge DeploymentsONNX and Edge Deployments
ONNX and Edge Deployments
 
Distributed Inference with MXNet and Spark
Distributed Inference with MXNet and SparkDistributed Inference with MXNet and Spark
Distributed Inference with MXNet and Spark
 
AI On the Edge: Model Compression
AI On the Edge: Model CompressionAI On the Edge: Model Compression
AI On the Edge: Model Compression
 
Building Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet GluonBuilding Content Recommendation Systems using MXNet Gluon
Building Content Recommendation Systems using MXNet Gluon
 

Último

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Neural Machine Translation with Sockeye

  • 1. Neural Machine Translation with Sockeye Encoder-Decoder with Attention
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Cyrus M. Vahid Principal Deep Learning Solutions Architect at Amazon AI
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. In This Video • Statistical language models • Neural machine translations and long-term dependencies • Encoder-decoder architecture • Sockeye
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translation I have been dreaming of this day Ich habe von diesem Tag geträumt.
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Problems of Translation To perform translation we need to solve the following problems: • Modeling: Decide what our model ! " #; % will look like. What parameters will it have, and how will the parameters specify a probability distribution? • Learning: We need a method to learn appropriate values for parameters from training data. • Search: We need to find the most probable sentence (solving “argmax"). This process of searching for the best hypothesis is often called decoding.
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Statistical Language Models
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Statistical Prior The translation systems thus needs to be a language models itself, meaning being capable of building utterances that are naturally sounding for a given vocabulary. More formally: !(#; %) '() %*+ ,-)./ 0'1',.2.1+
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Count-Based n-gram Language Models • More Formally: Count Based Models and n-grams
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • The problem is we can not make sentences we have not seen before in training data: • Example: If there is a sequence of four words such as ”I am from Utah” that does not exist in training set, then Count-Based n-gram LM – The Problem !"#$%&' (, *+, ,-.+, /0*ℎ = 0 ∴ 5 67 = /0*ℎ 68 = (, 69 = *+, 6: = ,-.+ = 0
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • The solution is to focus on the n-previous words. These models are called n-grams. • The maximum likelihood formula thus take the following form: • Example: Count-Based n-gram LM – The Solution bi-gram “) *+”, ”*+ ./0+”, *12 “./0+ 34*ℎ” */6 *77 8/69614 :1 ;0/839 ∴ = “) *+ ./0+ >4*ℎ” > 0
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • In bigrams we have the same problem of assigning probability 0 if there is a sequence of 2 words that does not exist in training data • In trigams we have the same problem of assigning probability 0 if there is a sequence of 3 words that does not exist in training data. • … • In n-grams we have the same problem of assigning probability 0 if there is a sequence of n words that does not exist in training data Count-Based n-gram LM – Problem Persists
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Measuring Accuracy
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evaluation of Accuracy - Likelihood • Likelihood: Likelihood is probability of some observed data—e.g. test, given a model. ! ℇ#$%#; ' = ) * +ℇ#$%# !(*; ') • Example |ℇ#$%#| = /01 E1 = I am from pittsburg, ! *; ' = 2.1×109: ∗ E2 = I study at a university ! *; ' = 1.1×109< E3 = my mother is from utah, ! *; ' = 2×109= ! ℇ#$%#; ' = ) * +ℇ#$%# !(*; ') = >. / × /091 × /. / × /09? × >. 0 × /09@ = A. 1> × /09>B *probabilities are arbitrary just to demonstrate the range
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evaluation of Accuracy – Log Likelihood • !ℎ#$%#&: log+ ∏-./ 0 1- = ∑-./ 0 log+ 1-; 516&78#: log(1. ;) = log 1 + log(;) • We have seen probabilities of likelihood are very tiny. Multiplying these small values make the probabilities even smaller numbers. It is easier to work with log probabilities for measuring accuracy. log > ℇ@AB@; C = log+ D E F ℇGHIG > 5; C = J E F ℇGHIG log( >(5; C)) . • Example: *probabilities are arbitrary just to demonstrate the range |ℇLMNL| = OPQ E1 = I am from pittsburg, T U; V = 2.1×10[ ∴ log(P E; C = log 2.1 + −6 = −5.67 ∗ E2 = I study at a university T U; V = 1.1×10[e ∴ log(P E; C = log 1.1 + −8 = −7.96 E3 = my mother is from utah, T U; V = 2×10[h ∴ log(P E; C = log 2.0 + −9 = −8.70 T ℇLMNL; V = D U iℇLMNL T(U; V) = J E F ℇGHIG log( >(5; C)) = −j. Qk − k. lQ − m. kP = −nn. oo
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Evaluation of Accuracy – Perplexity • It is common to divide log-likelihood by the number of words in the corpus. This makes cross-corpora measurement easier. length ℇ()*( = , - . ℇ/01/ |3| 445 ℇ()*(; 7 = 89(;<=(>(?/01/;@))/C)DE((ℇ/01/) • This accuracy indicator is called perplexity.
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Missing: Semantic and Causal Structure
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neural Models - Motivation • Statistical models loose semantic and causal structures as well as hidden information. • Log-Linear models were introduced to address such issues, but they have other problems:
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. NMT - Motivation • In separate contexts et-2 = “farmers” is compatible with et = “hay” and et-1 = “eats”. • If we are using feature sets depending on et-1 and et-2, then a sentence like “farmers eat hey” cannot be ruled out, even though it is an unnatural sentence or perhaps an unnatural farmer.
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neural Language Models
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step1: Word Embedding • In Neural Language Models we represent words as word embedding, a vector representation of words. • Then we concatinate these words to the length of the context (3 for trigrams).
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step2: Uncovering Relatedness • Different elements of a vector can potentially reflect different aspect of a word. For instance, different elements could represent countability, being and animal, … • We run the embedding vectors through hidden layer in order to learn combination features https://www.tensorflow.org/tutorials/word2vec
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step3: Score & Probability • Next, we calculate the score vector for each word. This is done by performing an affine* transform of the hidden vector h with a weight matrix and adding a bias vector. • Finally, we get a probability estimate p by running the calculated scores through a softmax function. * An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after transformation). Ref: Wolfram Mathworld
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of Neural Models • Better generalization of contexts: n-gram models treat words as independent entities. Word embedding allows grouping of similar words that perform similarly in prediction of the next word. • More generalizable combination of words into contexts: In an n- gram language model, we would have to remember parameters for all combinations of {cow; horse; goat} X {consume; chew; ingest} to represent the context of things farm animals eat". Hidden layer are more efficient in representing the concept.
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of Neural Models • Ability to skip previous words: n-gram models refer to all sequential words depending the length of the context. Neural Models can skip words. • Simplistic Models: SMT models require complex data-structures, in production often using 80 GBs or more of heap space. High quality NMT production models are generally less than 1 GB.
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. We are there yet!
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Examples of Long-term Dependency He does not have much confidence in himself.
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Long Term Dependencies
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Reference&Context • The Spanish King officially abdicated in … of his …, Felipe. Felipe will be confirmed tomorrow as the new Spanish … .
• 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Reference&Context • The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish king. As the new king, he will be the sovereign. Reference Context Reference & Context
• 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Reference&Context • The Spanish King officially abdicated in favour of his son, Felipe. Felipe will be confirmed tomorrow as the new Spanish king. As the new king, he will be the sovereign. • Their duo was known as "Buck and Bubbles". Bubbles tapped along as Buck played on the stride piano and sang until Bubbles was totally tapped out. Reference Context Reference & Context
  • 31. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Register/Topic • CNN is short for Convolutional Neural Network • In our quest to implement perfect NLP tools, we have developed state of the art RNNs. Now we can use them to … .
  • 32. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Dependencies – Register/Topic • CNN is short for Convolutional Neural Network • In our quest to implement perfect NLP tools, we have developed state of the art RNNs. Now we can use them to wreck a nice beach. Register/Topic
  • 33. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Neural Machine Translation
• 34. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Problems of Translation To perform translation we need to solve the following problems: • Modeling: Decide what our model P(E | F; θ) will look like. What parameters will it have, and how will the parameters specify a probability distribution? • Learning: We need a method to learn appropriate values for parameters from training data. • Search: We need to find the most probable sentence (solving the "argmax"). This process of searching for the best hypothesis is often called decoding.
• 35. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • We still intend to calculate P(E; θ), but before doing so we would like to calculate the initial state of the target model based on another RNN run over the source sentence F. • The first RNN "encodes" the source sentence F, and the second RNN "decodes" the initial state into the target sentence E. Inferring a Seq. from Another Seq. http://opennmt.net/
• 36. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Encoder-Decoder Arch. A model based on two LSTM networks: one takes a sentence and captures it in an output state, which is then used as the initial input for another LSTM that performs the translation. A sketch of this information flow follows below.
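A minimal sketch of the encoder-decoder flow, with a plain RNN cell standing in for the LSTMs and random untrained weights; the point is only that the encoder's final state initializes the decoder:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 16
# Separate (random, untrained) weights for the source and target RNNs.
W_src, U_src = rng.normal(size=(dim, dim)) * 0.1, rng.normal(size=(dim, dim)) * 0.1
W_tgt, U_tgt = rng.normal(size=(dim, dim)) * 0.1, rng.normal(size=(dim, dim)) * 0.1

def rnn_step(W, U, x, h):
    """One step of a plain RNN cell (an LSTM would also carry a cell state)."""
    return np.tanh(W @ x + U @ h)

# "Encode": fold the source word embeddings into one fixed-size state.
src_embeddings = [rng.normal(size=dim) for _ in range(5)]   # 5 source words
h = np.zeros(dim)
for x in src_embeddings:
    h = rnn_step(W_src, U_src, x, h)

# "Decode": start the target RNN from the encoder's final state.
tgt_embeddings = [rng.normal(size=dim) for _ in range(4)]   # 4 target words
for y in tgt_embeddings:
    h = rnn_step(W_tgt, U_tgt, y, h)  # each h would feed a softmax over target words
print(h[:4])
```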
• 37. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Generating Output • So far we have seen how to build a probability model. In order to perform translation, we need to generate output from it by searching over possible target sentences. • Search is performed based on several criteria: • Random Sampling: randomly selecting an output E from the probability model: Ê ~ P(E | F) • Greedy 1-Best Search: finding the E that maximizes P(E | F): Ê = argmax_E P(E | F) • Beam Search: finding the b options with the highest probability P(E | F)
• 38. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Random Sampling • The method is used when we would like to find several outputs for an input, such as in dialog models for chat bots. • Ancestral Sampling: sampling variable values one at a time, gradually conditioning on more context. • The method is unbiased, but has high variance and is inefficient. • Example: suppose E represents "the", "actor", "walked", "onto", "the", "stage", ".", </s>. At each time step t we sample a word from the distribution P(e_t | ê_1^{t-1}) until ê_t = </s>. A sampling-loop sketch follows below. [Figure: an RNN sampling one word per step: "the", then "actor" given "the", then "walked" given "the actor", …]
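A minimal sketch of ancestral sampling; `next_word_distribution` is a hypothetical stand-in for the model's softmax output, not a real Sockeye function:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["the", "actor", "walked", "onto", "stage", ".", "</s>"]

def next_word_distribution(prefix):
    # Stand-in for p(e_t | e_1^{t-1}, F); a real model would run the decoder here.
    logits = rng.normal(size=len(vocab))
    logits[vocab.index("</s>")] += 0.5 * len(prefix)  # make ending likelier over time
    p = np.exp(logits - logits.max())
    return p / p.sum()

sentence = []
while (not sentence or sentence[-1] != "</s>") and len(sentence) < 20:
    p = next_word_distribution(sentence)
    sentence.append(str(rng.choice(vocab, p=p)))  # unbiased sample, but high variance
print(" ".join(sentence))
```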
• 39. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Greedy (Best) Search • This is very similar to random sampling, with one difference: at each step, the goal is to pick the target word that maximizes the probability: ê_t = argmax_w log p_w^{(t)} • Greedy search is not guaranteed to find the translation with the highest probability, but it is efficient in terms of memory and computation. https://github.com/tensorflow/nmt
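A minimal, self-contained sketch of the greedy variant: the sampling step is replaced by an argmax. `step_probs` is again a hypothetical stand-in for the decoder's softmax output:

```python
import numpy as np

vocab = ["the", "actor", "walked", "onto", "the", "stage", ".", "</s>"]

def step_probs(prefix):
    # Stand-in distribution that just walks through `vocab` deterministically.
    p = np.full(len(vocab), 1e-6)
    p[min(len(prefix), len(vocab) - 1)] = 1.0
    return p / p.sum()

prefix = []
while not prefix or prefix[-1] != "</s>":
    prefix.append(vocab[int(np.argmax(step_probs(prefix)))])  # e_t = argmax_w p(w | prefix)
print(" ".join(prefix))  # the actor walked onto the stage . </s>
```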
• 40. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Beam Search • Beam Search is a search method that keeps the b best hypotheses at each time step; b is called the width of the beam. • Larger beam sizes have a strong length bias. • b is a hyperparameter that should be selected to maximize accuracy. • Not very efficient. A simplified sketch follows below.
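A simplified, self-contained beam-search sketch; `step_log_probs` is a hypothetical stand-in for the decoder's log-softmax output, and real systems add refinements such as length normalization:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["a", "b", "c", "</s>"]

def step_log_probs(prefix):
    # Stand-in for log p(e_t | prefix, F); a real model would run the decoder here.
    logits = rng.normal(size=len(vocab))
    logits[-1] += 0.7 * len(prefix)          # make </s> likelier as prefixes grow
    return logits - np.logaddexp.reduce(logits)

def beam_search(b=3, max_len=10):
    beams = [([], 0.0)]                      # (hypothesis, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beams:
            if hyp and hyp[-1] == "</s>":    # finished hypotheses pass through
                candidates.append((hyp, score))
                continue
            lp = step_log_probs(hyp)
            for w, l in zip(vocab, lp):      # expand each live hypothesis by one word
                candidates.append((hyp + [w], score + l))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]  # keep b best
        if all(h and h[-1] == "</s>" for h, _ in beams):
            break
    return beams

for hyp, score in beam_search():
    print(f"{score:7.3f}  {' '.join(hyp)}")
```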
• 41. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Limitations • Long-distance dependencies • Storing variable-length sentences in fixed-size vectors. He purchased a car, so that he could drive to work and not spend his time on the train. Er kaufte ein Auto, damit er zur Arbeit fahren und seine Zeit nicht im Zug verbringen konnte.
  • 42. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attentional NMT
• 43. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention • The attention model sits between the encoder and the decoder and helps the decoder pick only the encoded inputs that are important for each step of the decoding process. • For each encoded input from the encoder RNN, the attention mechanism calculates its importance. https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
• 44. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Visualization of Importance https://aws.amazon.com/blogs/ai/amazon-at-wmt-improving-machine-translation-with-user-feedback/ https://aws.amazon.com/blogs/ai/train-neural-machine-translation-models-with-sockeye/
  • 45. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention Mechanism https://talbaumel.github.io/attention/
• 46. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention Mechanism • Each input word gets a hidden vector from running the RNN in each direction. • We then concatenate the two directions' vectors to obtain a bidirectional representation. • We can further concatenate these vectors into a matrix H whose columns represent the input words. • We then calculate an attention vector α_t that is used to combine the columns of H into a context vector c_t:
• 47. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Calculating Attention Score • First, we update the decoder hidden state h_t^{(e)} based on the word representation and context vectors from the previous target time step. • Based on this h_t^{(e)}, we calculate an attention score a_t. • We then normalize the attention score using a softmax, giving the attention vector α_t. • We now have a context vector c_t and hidden state h_t^{(e)} for time step t, which we can pass on to downstream tasks. A sketch of these steps follows below.
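A minimal sketch of one attention step with dot-product scoring, using random stand-in vectors for the encodings:

```python
import numpy as np

rng = np.random.default_rng(5)
src_len, dim = 6, 8
H = rng.normal(size=(dim, src_len))   # columns: bidirectional source word vectors
h_t = rng.normal(size=dim)            # decoder hidden state h_t^{(e)} at step t

scores = H.T @ h_t                    # attention scores a_t (dot product per column)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                  # softmax -> attention vector alpha_t
c_t = H @ alpha                       # context vector: weighted sum of the columns of H
print(alpha.round(3), c_t.shape)
```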
• 48. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Calculating Attention Score
Dot Product. Pros: simple, no additional parameters. Cons: input and output encodings must be in the same space. Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} h_t^{(e)}
Bilinear Functions. Pros: input and output can have different sizes. Cons: more expensive than the dot product. Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := h_j^{(f)T} W_a h_t^{(e)}, where W_a is a learned transformation.
MLP. Pros: more flexible than the dot product, fewer parameters than bilinear functions. Formula: attn_score(h_j^{(f)}, h_t^{(e)}) := w_{a2}^T tanh(W_{a1}[h_t^{(e)}; h_j^{(f)}]), where W_{a1} and w_{a2} are the weight matrix and vector of the first and second layers of the MLP respectively.
Sketches of all three follow below.
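Minimal sketches of the three scoring functions above, with random untrained weights and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)
d_f, d_e, d_a = 8, 6, 10
h_f, h_e = rng.normal(size=d_f), rng.normal(size=d_e)  # source / target vectors

# Dot product: requires d_f == d_e (same space), so shown with matching sizes.
h_f8, h_e8 = rng.normal(size=8), rng.normal(size=8)
score_dot = h_f8 @ h_e8

# Bilinear: W_a maps between the two spaces, so d_f and d_e may differ.
W_a = rng.normal(size=(d_f, d_e))
score_bilinear = h_f @ W_a @ h_e

# MLP: concatenate, apply a small feed-forward layer, then project to a scalar.
W_a1 = rng.normal(size=(d_a, d_e + d_f))
w_a2 = rng.normal(size=d_a)
score_mlp = w_a2 @ np.tanh(W_a1 @ np.concatenate([h_e, h_f]))

print(score_dot, score_bilinear, score_mlp)
```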
• 49. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Attention Mechanism https://talbaumel.github.io/attention/ α_t: attention weights (the attention vector); c_t: context vector
• 50. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translation Performance • BLEU: The BLEU score measures similarity to human translation. It measures how many words overlap in a given translation when compared to a reference translation. BLEU scores range from 0 to 100; the higher the score, the more the translation correlates with a human translation. • TER: The TER score measures the amount of editing that a translator would have to perform to change a translation so that it exactly matches a reference translation. By repeating this analysis on a large number of sample translations, it is possible to estimate the post-editing effort required for a project. TER scores also range from 0 to 100; here, lower is better.
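In practice these scores can be computed with, e.g., the sacrebleu package (an assumption here, not necessarily the tool used on this slide); a quick illustration:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = ["Ich habe von diesem Tag geträumt ."]
references = [["Ich habe von diesem Tag geträumt ."]]  # one reference stream

print(BLEU().corpus_score(hypotheses, references))  # BLEU = 100.00 for an exact match
print(TER().corpus_score(hypotheses, references))   # TER = 0.00 (no edits needed)
```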
  • 51. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Translation Performance
  • 52. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye/MXNet/AWS
• 53. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye Encoder-Decoder Framework • Sockeye is a sequence-to-sequence framework with attention for Neural Machine Translation, based on Apache MXNet (Incubating). It implements the well-known encoder-decoder architecture with attention. • GitHub repo: https://github.com/awslabs/sockeye • Tutorials: https://github.com/awslabs/sockeye/tree/master/tutorials
• 54. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Examples • python3 -m sockeye.train -s train.source -t train.target -vs dev.source -vt dev.target --num-embed 32 --rnn-num-hidden 64 --attention-type dot --use-cpu --metrics perplexity accuracy --max-num-checkpoint-not-improved 3 -o seqcopy_model • python -m sockeye.translate -m seqcopy_model
• 55. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye Features • Beam search inference • Easy ensembling of multiple models • Residual connections between RNN layers (Wu et al., 2016) [deep LSTM with parallelism] • Lexical biasing of output layer predictions (Arthur et al., 2016) [low-frequency words] • Modeling coverage (Tu et al., 2016) [keeping attention history to reduce over- and under-translation] • Context gating (Tu et al., 2017) [improving adequacy of translation by controlling ratios of source and target context]
• 56. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sockeye Features Contd. • Cross-entropy label smoothing (e.g., Pereyra et al., 2017) • Layer normalization (Ba et al., 2016) [improving training time] • Multiple supported attention mechanisms [dot, mlp, bilinear, multihead-dot, encoder last state, location] • Multiple model architectures (encoder-decoder: Wu et al., 2016; convolutional: Gehring et al., 2017; Transformer: Vaswani et al., 2017)
  • 57. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Metrics and Visualization
• 58. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. MXNet • Sockeye uses MXNet as its underlying deep learning framework. • MXNet benchmarks show near-linear scaling across multiple machines.
• 59. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Multi-GPU Training Training can be carried out on multiple GPUs, either by specifying multiple GPU device ids: --device-ids 0 1 2 3, or by specifying the number of GPUs required: --device-ids -n attempts to acquire n GPUs.
• 60. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker • A fully managed service that enables data scientists and developers to quickly and easily build production-ready machine-learning models. • Amazon SageMaker includes several built-in state-of-the-art algorithms that are fine-tuned to run in distributed environments. https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html • Among the built-in algorithms there is a Sequence2Sequence algorithm implemented using Sockeye. • All you need to do is tune the model's hyperparameters and pass your data to the model. • Another useful built-in algorithm is Amazon BlazingText, which provides Word2Vec.
  • 61. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demo: NMT in SageMaker
• 62. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. References • Graham Neubig. Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial. arXiv:1703.01619v1 [cs.CL], 5 Mar 2017. Language Technologies Institute, Carnegie Mellon University. • All references and images are from arXiv:1703.01619v1 [cs.CL] unless directly referenced on the page. • Sockeye documentation: http://sockeye.readthedocs.io/en/latest/index.html
• 63. © 2018 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. For corrections or feedback on the course, please email us at: aws-course-feedback@amazon.com. For all other questions, contact us at: https://aws.amazon.com/contact-us/aws-training/. All trademarks are the property of their owners. Thank You