Transformer: Attention is all you need
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
2019. 05. 30.
Jeong-Gwan Lee
1
2
Table of contents
☐ RNN-based Encoder-Decoder model
• RNN Encoder-Decoder model
• Attention mechanism
• Limitation of RNN
☐ Transformer
• Encoder Part
• Embedding & Positional Encoding
• Scaled Dot-Product Self-Attention & Multi-Head Self-Attention
• Position-wise Feed-Forward Networks
• Decoder Part
• Masked Self-Attention
• Encoder-Decoder Attention
• Output Part & Inference Visualization
☐ Why Self-Attention?
☐ Training & Results
☐ Appendix
Machine Translation Model
3
Machine Translation problem
Input sequence: 나는 소년입니다. → Output sequence: I am a boy.
4
RNN Encoder-Decoder model
→ A static context vector (c) leads to loss of information, especially in long sentences.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, Yoshua Bengio
"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” EMNLP 2014
[Figure: encoder RNNs read the input sequence into a single fixed context vector c, which the decoder RNNs use, starting from <GO>, to generate the output sequence.]
Different context vector for generating the next word?
5
6
Attention mechanism
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align
and translate." arXiv preprint arXiv:1409.0473 (2014).
→ The attention mechanism can model c_i (the attended context vector) regardless of the input length.
Define c_i = Σ_j α_ij h_j, a convex combination of the encoder hidden states h_j,
where α_ij = exp(e_ij) / Σ_k exp(e_ik) and e_ij = a(s_{i-1}, h_j).
The "alignment" model a scores how well the input position j matches the output position i.
[Figure: encoder states h_1 … h_T and decoder states s_{i-1}, s_i connected by the attention weights α_ij.]
7
Attention mechanism
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align
and translate." arXiv preprint arXiv:1409.0473 (2014).
[Figure: the same attention mechanism, annotated so that the decoder state s_{i-1} plays the role of the Query and the encoder states h_j play the roles of Keys and Values.]
Query: the "criterion" for generating (i.e., related to) the current word
Key: the candidates that can be attended to
Alignment: Query-Key match scoring
Value: the representation associated with each Key
Attention(Query, Key, Value) = Alignment(Query, Key) * Value
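To make the query/key/value view concrete, here is a minimal NumPy sketch of one Bahdanau-style (additive) attention step. The function and variable names, dimensions, and random weights are illustrative assumptions, not code from either paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                       # numerical stability
    e = np.exp(x)
    return e / e.sum()

def additive_attention(s_prev, h, W_q, W_k, v):
    """One attention step.
    s_prev : (d_dec,)   query = previous decoder state s_{i-1}
    h      : (T, d_enc) keys/values = encoder hidden states h_1..h_T
    """
    # Alignment scores: e_ij = v^T tanh(W_q s_{i-1} + W_k h_j)
    scores = np.tanh(s_prev @ W_q + h @ W_k) @ v     # (T,)
    alpha = softmax(scores)                          # attention weights, sum to 1
    context = alpha @ h                              # convex combination of the values
    return context, alpha

# Toy usage with random weights (illustrative only)
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 8, 16
h = rng.normal(size=(T, d_enc))
s_prev = rng.normal(size=(d_dec,))
W_q, W_k, v = rng.normal(size=(d_dec, d_att)), rng.normal(size=(d_enc, d_att)), rng.normal(size=(d_att,))
context, alpha = additive_attention(s_prev, h, W_q, W_k, v)
print(alpha.round(3), context.shape)                 # weights over the T inputs; context is (d_enc,)
```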
8
Limits of recurrent neural network
☐ RNN generates each hidden state in series.
• This sequential nature precludes parallelization at training time.
☐ RNN has a long-term dependency problem.
• It's hard to memorize long-term contextual information.
☐ RNN can see one word at a time.
• CNN can see more words at a time.
[Example: "Cats, which ate …, were full." — choosing "were" over "was" requires remembering the distant subject "Cats".]
9
Transformer
☐ An Encoder-Decoder model based on the (self-)attention mechanism, without recurrence or convolutions.
☐ Regardless of sequence length, it captures long-term dependencies between input and output and allows more parallelization.
☐ Originally targeting translation tasks (WMT'14 EN-DE, EN-FR), it achieved state-of-the-art BLEU scores.
10
Encoder part
☐ Encoder: 6 identical layers
• Multi-Head self-attention layer + Position-wise feed-forward network
• Residual connection[1] & Layer normalization[2]
• To facilitate these residual connections, all sub-layers and embedding layers produce outputs of dimension d_model = 512.
[1] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[2] Lei Ba, Jimmy, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016).
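As a rough sketch (not the authors' code), each sub-layer is wrapped as LayerNorm(x + Sublayer(x)); the helpers below are simplified stand-ins, e.g. the layer norm omits the learned gain/bias parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization over the feature dimension (no learned gain/bias).
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    # Residual connection around a sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x)); valid because every sub-layer preserves d_model.
    return layer_norm(x + fn(x))

# One encoder layer would then be, schematically:
#   x = sublayer(x, multi_head_self_attention)
#   x = sublayer(x, position_wise_ffn)
```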
11
Embedding & Positional Encoding
[Figure: input tokens (T) → learned embedding (T, d_model) → + positional encoding (T, d_model).]
☐ It uses learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.
☐ The Transformer gets an input of length T at one time, and the word embedding itself doesn't carry positional information.
→ Positional encoding injects information about the position of the tokens:
PE(pos, 2q) = sin(pos / 10000^(2q/d_model)),  PE(pos, 2q+1) = cos(pos / 10000^(2q/d_model))
(the wavelengths form a geometric progression from 2π to 10000 · 2π).
☐ The Transformer easily learns to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
12
Embedding & Positional Encoding
[Figure: the (T, d_model) positional-encoding matrix, indexed by position pos and dimension index q = 0 … d_model/2 − 1; each dimension is a sinusoid of a different wavelength, so every position receives a distinct pattern of values (compared in the figure at a few positions such as pos = 2, 3, 10, 20).]
☐ It uses learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.
☐ The Transformer gets an input of length T at one time, and the word embedding itself doesn't carry positional information.
→ Positional encoding injects information about the position of the tokens
(the wavelengths form a geometric progression from 2π to 10000 · 2π).
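A minimal NumPy sketch of the sinusoidal positional encoding described above; the function name and the example sizes are illustrative, not taken from the paper's code.

```python
import numpy as np

def positional_encoding(T, d_model):
    """Return a (T, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(T)[:, None]                        # (T, 1) token positions
    q = np.arange(d_model // 2)[None, :]               # (1, d_model/2) frequency index
    angle = pos / np.power(10000.0, 2 * q / d_model)   # wavelengths form a geometric progression
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions: cosine
    return pe

pe = positional_encoding(T=50, d_model=512)
# The encoder/decoder input is the word embedding plus the positional encoding:
#   x = embed(tokens) + pe[:len(tokens)]
```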
13
Self-Attention
[Figure: each word of "I am a boy" acts in turn as the Query and attends over all words of the same sentence as Keys.]
→ Self-attention allows each word to refer to the other words in the same sequence, to compute a better encoded representation of the sequence.
Attention = Alignment(Query, Key) * Value
Query: the criterion for generating (related to) the current word
Key: the candidates that can be attended to
Alignment: Query-Key match scoring
Value: the representation of each key
14
Scaled Dot-Product Self-Attention(single head)
[Figure: the input (T, d_model) is projected by three Linear layers into Query, Key, and Value matrices, each (T, d_model); the (T, T) alignment matrix softmax(QKᵀ/√d_k) forms a convex combination over the Values — the same attention mechanism as before — giving an output of shape (T, d_model).]
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
15
Scaled Dot-Product Self-Attention(single head)
[Figure: the (T, T) score matrix is the product of Q (T, d_model) and Kᵀ (d_model, T); each entry is the dot product of one query row with one key row.]
Why scaled by √d_k ?
16
Scaled Dot-Product Self-Attention(single head)
Why scaled by √d_k ?
→ As d_model grows larger, the dot products grow larger in magnitude, pushing the softmax into regions with very small gradients.
→ Scaling by √d_k keeps the variance of the scores small, so the softmax stays well-behaved.
[Figure: softmax of unscaled scores (large variance) is nearly one-hot; after scaling (small variance) the distribution is smoother.]
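A minimal NumPy sketch of scaled dot-product attention as just described; the optional mask argument is included here so the same helper can be reused for the masked decoder attention later. The names and interface are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v) -> output (T_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (T_q, T_k) alignment, scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked-out positions get ~zero weight
    weights = softmax(scores, axis=-1)          # each row is a convex combination
    return weights @ V, weights
```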
17
Multi-Head Self-Attention
• It allows the model to jointly attend to information from different representation subspaces at different positions.
e.g. head 1: word class, head 2: pronoun, head 3: singular/plural
[Figure: the input (T, d_model) is projected h times into per-head Q, K (T, d_k) and V (T, d_v); the h head outputs (T, d_v) are concatenated into (T, h·d_v) and projected back to (T, d_model).]
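A sketch of multi-head self-attention built on the scaled_dot_product_attention helper above. The "project once, split into h slices" formulation and the weight shapes (d_k = d_v = d_model / h) are illustrative assumptions consistent with the slide.

```python
import numpy as np

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (T, d_model); W_q, W_k, W_v: (d_model, d_model); W_o: (h*d_v, d_model)."""
    T, d_model = x.shape
    d_k = d_model // h                           # per-head dimension (d_k = d_v here)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # (T, d_model) each
    heads = []
    for i in range(h):                           # each head attends in its own subspace
        sl = slice(i * d_k, (i + 1) * d_k)
        out, _ = scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl])
        heads.append(out)                        # (T, d_v)
    return np.concatenate(heads, axis=-1) @ W_o  # concat -> (T, h*d_v) -> (T, d_model)
```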
18
Encoder Zoom In
[Figure: zoom-in from the Transformer Encoder stack, to one Multi-Head Self-Attention layer, to a single Scaled Dot-Product Self-Attention head, ending in the Encoder Output.]
19
Position-wise Feed-Forward Networks
☐ The representation of every "single" position is fed into the same feed-forward network and is transformed independently:
FFN(x) = max(0, x·W1 + b1)·W2 + b2 (Linear → ReLU → Linear).
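A position-wise FFN sketch under the assumption of the paper's two linear layers with a ReLU in between; because it only operates on the last axis, the same weights are applied to every position independently.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (T, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    The same network is applied to each of the T positions independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2    # FFN(x) = max(0, xW1 + b1)W2 + b2
```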
20
Decoder part
☐ Decoder: 6 identical layers
• Masked multi-head self-attention layer + Multi-head encoder-decoder attention layer + Position-wise feed-forward network
• The masked multi-head self-attention layer prevents positions from attending to subsequent positions.
[Figure: the decoder stack receives the Encoder Output in its encoder-decoder attention sub-layer.]
21
Masked Self-Attention
☐ In the decoder, owing to the auto-regressive characteristic, a new word is generated based only on the previous words.
☐ Masking prevents each position from attending to not-yet-generated (subsequent) positions.
[Figure: for the target "<GO> I am a boy", the (T, T) score matrix is masked so that the row for step t can only attend to columns ≤ t (the entries above the diagonal are crossed out before the softmax). The figure assumes the training phase, where the whole target sequence is fed in at once.]
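A sketch of masked (causal) self-attention that reuses the scaled_dot_product_attention helper defined earlier; np.tril builds the "attend only to positions ≤ t" mask. Names are illustrative.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder self-attention: position t may only attend to positions <= t."""
    T = Q.shape[0]
    causal_mask = np.tril(np.ones((T, T), dtype=bool))   # True on/below the diagonal
    return scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```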
22
Encoder-Decoder Attention
In the encoder-decoder attention layers, the Queries come from the decoder, while the Keys and Values come from the Encoder Output.
[Figure: the decoder input (U, d_model) is projected into Queries, and the Encoder Output (T, d_model) into Keys and Values, so the alignment matrix has shape (U, T); e.g. the target tokens "<GO> I am a boy" attend over the source tokens "나 는 한 소 년 이 다".]
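A sketch of the encoder-decoder attention, again reusing scaled_dot_product_attention from above; dec_x (U, d_model), enc_out (T, d_model), and the projection matrices are illustrative names.

```python
def encoder_decoder_attention(dec_x, enc_out, W_q, W_k, W_v):
    """Queries from the decoder, Keys/Values from the encoder output."""
    Q = dec_x @ W_q                # (U, d_model)
    K = enc_out @ W_k              # (T, d_model)
    V = enc_out @ W_v              # (T, d_model)
    out, weights = scaled_dot_product_attention(Q, K, V)   # weights: (U, T)
    return out                     # (U, d_model)
```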
23
Output part & Inference Visualization
☐ It uses the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
[Figure: greedy decoding at inference time — feeding "<GO>" together with the Encoder Output predicts "I"; feeding "<GO>, I" predicts "am"; "<GO>, I, am" predicts "a"; "<GO>, I, am, a" predicts "boy".]
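A greedy-decoding sketch of the inference loop above; transformer_decoder, go_id, and eos_id are placeholders for whatever model and vocabulary are being used (assumptions, not the paper's code).

```python
import numpy as np

def greedy_decode(enc_out, transformer_decoder, go_id, eos_id, max_len=50):
    """Generate target tokens one at a time, feeding previous outputs back in."""
    tokens = [go_id]
    for _ in range(max_len):
        logits = transformer_decoder(tokens, enc_out)   # (len(tokens), vocab_size)
        next_id = int(np.argmax(logits[-1]))            # linear + softmax + argmax on the last position
        tokens.append(next_id)
        if next_id == eos_id:                           # stop at end-of-sequence
            break
    return tokens[1:]                                   # drop <GO>
```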
24
Why self-attention?
☐ Three desiderata
1. Total computational complexity per layer
2. The amount of computation that can be parallelized
(the minimum number of sequential operations)
3. The path length between long-range dependencies (the distance between any two positions)
• A key factor affecting the ability to learn such dependencies
25
Why self-attention?
[Figure: how information flows between positions of the sequence "나는 소년이다" under each layer type — self-attention connects every pair of positions with a single matrix multiplication, a recurrent layer passes information step by step through the RNN weights, and a convolutional layer's kernel covers only a local window at a time.]
26
Why self-attention?
☐ Self-attention layers are faster than recurrent layers when n << d, which is most often the case in machine translation.
☐ For tasks involving very long sequences, self-attention could be restricted to consider only a neighborhood of size r around each position.
☐ As a side benefit, self-attention could yield more interpretable models.
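For reference, the per-layer comparison reported in Table 1 of the paper (n = sequence length, d = representation dimension, k = convolution kernel size, r = neighborhood size):

Layer type | Complexity per layer | Sequential operations | Maximum path length
Self-Attention | O(n²·d) | O(1) | O(1)
Recurrent | O(n·d²) | O(n) | O(n)
Convolutional | O(k·n·d²) | O(1) | O(log_k n)
Self-Attention (restricted to r) | O(r·n·d) | O(1) | O(n/r)

This is why self-attention wins on per-layer cost when n < d, while also keeping the path between any two positions constant.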
27
Training
☐ Task (Machine Translation)
1. WMT 2014 English-German dataset (4.5 million sentence pairs)
2. WMT 2014 English-French dataset (36 million sentence pairs)
☐ Training spec
• Adam optimizer (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹)
• Dropout on the output of each sub-layer (before it is added to the residual input) and on the sums of the embeddings and positional encodings, with P_drop = 0.1
• (base model) 12 hours with 8 NVIDIA P100 GPUs
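For example, the optimizer configuration above would look roughly like this in PyTorch; the stand-in model and the lr value are placeholders, since the paper varies the learning rate with a warmup schedule.

```python
import torch

model = torch.nn.Linear(512, 512)    # stand-in for the actual Transformer model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,                          # placeholder; the paper uses a warmup-then-decay schedule
    betas=(0.9, 0.98),
    eps=1e-9,
)
dropout = torch.nn.Dropout(p=0.1)     # applied to sub-layer outputs and embedding + PE sums
```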
28
Results
☐ Machine Translation
[Table: BLEU scores and training cost on the WMT 2014 EN-DE and EN-FR tasks; the "1/4" annotation highlights the reduced training cost relative to the best previous models.]
29
Results
☐ Model Variations
Thank you for Attention!
30
31
Appendix 1: Self-attention visualization
• Results of encoder self-attention in layer 5 of 6.
• The heads appear to attend to a distant dependency of the verb 'making', completing the phrase 'making … more difficult'.
• Different colors represent different heads.
[Figure: attention weights drawn from the Query word to its Key words.]
32
Appendix 1: Self-attention visualization
• Results of encoder self-attention in layer 5 of 6.
• The heads clearly learned to perform different tasks.