Show, Attend and Tell:
Neural Image Caption Generation with Visual Attention
by Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel,
Yoshua Bengio, ICML 2015
Presented by Eun-ji Lee
2015.10.14
Data Mining Research Lab
Sogang University
Contents
1. Introduction
2. Image Caption Generation with Attention Mechanism
a. LSTM Tutorial
b. Model Details: Encoder & Decoder
3. Learning Stochastic “Hard” vs Deterministic “Soft” Attention
a. Stochastic “Hard” Attention
b. Deterministic “Soft” Attention
c. Training Procedure
4. Experiments
1. Introduction
"Scene understanding"
"Rather than compress an entire image into a static representation, attention allows for
salient features to dynamically come to the forefront as needed."
"hard" attention & "soft" attention
2-a. LSTM tutorial (1)
• $\mathbf{x}_t$ : the input to the memory cell layer at time $t$
• $W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, V_o$ : weight matrices
• $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_c, \mathbf{b}_o$ : bias vectors
1. $\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i)$ (input gate)
2. $\widetilde{\mathbf{C}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c)$ (candidate state)
3. $\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f)$ (forget gate)
4. $\mathbf{C}_t = \mathbf{i}_t * \widetilde{\mathbf{C}}_t + \mathbf{f}_t * \mathbf{C}_{t-1}$ (memory cells' new state)
5. $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + V_o \mathbf{C}_t + \mathbf{b}_o)$ (output gate)
6. $\mathbf{h}_t = \mathbf{o}_t * \tanh(\mathbf{C}_t)$ (output, or hidden state)
http://deeplearning.net/tutorial/lstm.html#lstm
[Figure: LSTM memory cell, showing the input $\mathbf{x}_t$, gates $\mathbf{i}_t, \mathbf{f}_t, \mathbf{o}_t$, cell states $\mathbf{C}_{t-1}, \mathbf{C}_t$, and output $\mathbf{h}_t$]
2-a. LSTM tutorial (2)
• $\mathbf{x}_t$ : the input to the memory cell layer at time $t$
• $W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o, V_o$ : weight matrices
• $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_c, \mathbf{b}_o$ : bias vectors
1. $\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i)$
2. $\widetilde{\mathbf{C}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c)$
3. $\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f)$
4. $\mathbf{C}_t = \mathbf{i}_t * \widetilde{\mathbf{C}}_t + \mathbf{f}_t * \mathbf{C}_{t-1}$
5. $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + V_o \mathbf{C}_t + \mathbf{b}_o)$ $\Rightarrow$ $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o)$ (the $V_o \mathbf{C}_t$ term is dropped in the simplified version)
6. $\mathbf{h}_t = \mathbf{o}_t * \tanh(\mathbf{C}_t)$
http://deeplearning.net/tutorial/lstm.html#lstm
2-a. LSTM tutorial (3)
1. $\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i)$
2. $\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f)$
3. $\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o)$
4. $\widetilde{\mathbf{C}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c)$
5. $\mathbf{C}_t = \mathbf{i}_t * \widetilde{\mathbf{C}}_t + \mathbf{f}_t * \mathbf{C}_{t-1}$
6. $\mathbf{h}_t = \mathbf{o}_t * \tanh(\mathbf{C}_t)$
http://deeplearning.net/tutorial/lstm.html#lstm
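To make the recurrences concrete, here is a minimal numpy sketch of one step of this simplified LSTM. The tutorial linked above is Theano code; this sketch, its function names, and the toy dimensions are our own illustration, not the tutorial's implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    # One step of the simplified LSTM above (no V_o C_t term).
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    C_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state
    C_t = i_t * C_tilde + f_t * C_prev                          # new cell state
    h_t = o_t * np.tanh(C_t)                                    # hidden state / output
    return h_t, C_t

# Toy usage: input dim 4, hidden dim 3, random stand-in weights.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in 'ifoc'}
U = {k: rng.normal(size=(3, 3)) for k in 'ifoc'}
b = {k: np.zeros(3) for k in 'ifoc'}
h1, C1 = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, U, b)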
2-b. Model Details: Encoder
The model takes a single raw image and generates a caption $\mathbf{y}$ encoded as a sequence of
1-of-$K$ encoded words.
• Caption : $y = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\}$, $\mathbf{y}_i \in \mathbb{R}^K$ ($K$: vocab size, $C$: caption length)
• Image : $a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$ ($D$: dim. of the representation corresponding to a part of the image)
[Figure: the encoder extracts $L$ annotation vectors $\mathbf{a}_1, \ldots, \mathbf{a}_L$ from image regions; the decoder emits the caption words $\mathbf{y}_i$]
2-b. Model Details: Encoder
• Caption : $y = \{\mathbf{y}_1, \ldots, \mathbf{y}_C\}$, $\mathbf{y}_i \in \mathbb{R}^K$ ($K$: vocab size, $C$: caption length)
• Image : $a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$ ($D$: dim. of the representation corresponding to a part of the image)
"We extract features from a lower convolutional layer unlike previous work which instead
used a fully connected layer."
• We use an LSTM[1] that produces a caption by generating one word at every time step,
conditioned on a context vector, the previous hidden state, and the previously
generated words.
2-b. Model Details: Decoder (LSTM)
[Figure: decoder LSTM cell producing the word $\mathbf{y}_t$ from the context vector $\hat{\mathbf{z}}_t$, the previous hidden state $\mathbf{h}_{t-1}$, and the previous word $\mathbf{y}_{t-1}$]
[1] Hochreiter & Schmidhuber, 1997
2-b. LSTM
• $\mathbf{i}_t = \sigma(W_i E\mathbf{y}_{t-1} + U_i \mathbf{h}_{t-1} + Z_i \hat{\mathbf{z}}_t + \mathbf{b}_i)$,
• $\mathbf{f}_t = \sigma(W_f E\mathbf{y}_{t-1} + U_f \mathbf{h}_{t-1} + Z_f \hat{\mathbf{z}}_t + \mathbf{b}_f)$,
• $\mathbf{c}_t = \mathbf{f}_t \mathbf{c}_{t-1} + \mathbf{i}_t \tanh(W_c E\mathbf{y}_{t-1} + U_c \mathbf{h}_{t-1} + Z_c \hat{\mathbf{z}}_t + \mathbf{b}_c)$,
• $\mathbf{o}_t = \sigma(W_o E\mathbf{y}_{t-1} + U_o \mathbf{h}_{t-1} + Z_o \hat{\mathbf{z}}_t + \mathbf{b}_o)$,
• $\mathbf{h}_t = \mathbf{o}_t \tanh(\mathbf{c}_t)$.
$\mathbf{i}_t, \mathbf{f}_t, \mathbf{c}_t, \mathbf{o}_t, \mathbf{h}_t$ are the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM.
$W_\bullet$, $U_\bullet$, $Z_\bullet$ and $\mathbf{b}_\bullet$ are learned weight matrices and biases.
$\mathbf{E} \in \mathbb{R}^{m \times K}$ : an embedding matrix. $m$ : embedding dim., $n$ : LSTM dim.
$\sigma$ : logistic sigmoid activation.
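A sketch of one decoder step under the same assumptions as the earlier LSTM sketch: each gate additionally sees the previous word's embedding $E\mathbf{y}_{t-1}$ and the context vector $\hat{\mathbf{z}}_t$ through the $Z$ matrices. All names are hypothetical illustration, not the authors' code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoder_lstm_step(Ey_prev, h_prev, c_prev, z_t, W, U, Z, b):
    # Ey_prev: embedded previous word (dim m); z_t: context vector (dim D).
    i_t = sigmoid(W['i'] @ Ey_prev + U['i'] @ h_prev + Z['i'] @ z_t + b['i'])
    f_t = sigmoid(W['f'] @ Ey_prev + U['f'] @ h_prev + Z['f'] @ z_t + b['f'])
    o_t = sigmoid(W['o'] @ Ey_prev + U['o'] @ h_prev + Z['o'] @ z_t + b['o'])
    g_t = np.tanh(W['c'] @ Ey_prev + U['c'] @ h_prev + Z['c'] @ z_t + b['c'])
    c_t = f_t * c_prev + i_t * g_t     # memory cell update
    h_t = o_t * np.tanh(c_t)           # hidden state
    return h_t, c_t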
2-b. Context vector $\hat{\mathbf{z}}_t$
• A dynamic representation of the relevant part of the image input at time $t$:
$\hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$
• The weight $\alpha_i$ of each annotation vector $\mathbf{a}_i$ is computed by an attention model $f_{att}$, for which we use a multilayer perceptron conditioned on $\mathbf{h}_{t-1}$:
$e_{ti} = f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1})$, $\qquad \alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$
• Interpretation of $\alpha_{ti}$:
- (Stochastic attention) the probability that location $i$ is the right place to focus for producing the next word.
- (Deterministic attention) the relative importance to give to location $i$ in blending the $\mathbf{a}_i$'s together.
($a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$)
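A minimal sketch of the attention weights, assuming a one-hidden-layer MLP for $f_{att}$ (the paper does not fix its exact architecture; the shapes and names here are ours):

import numpy as np

def softmax(e):
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

def attention_weights(a, h_prev, W_a, W_h, w):
    # f_att scores each of the L locations, conditioned on h_{t-1}.
    e = np.tanh(a @ W_a.T + h_prev @ W_h.T) @ w   # scores e_{ti}, shape (L,)
    return softmax(e)                             # weights alpha_{ti}, sum to 1

# Toy usage with the paper's L = 196 annotation vectors of dim D = 512.
rng = np.random.default_rng(0)
L, D, n, H = 196, 512, 64, 32
alpha = attention_weights(rng.normal(size=(L, D)), rng.normal(size=n),
                          rng.normal(size=(H, D)) * 0.01,
                          rng.normal(size=(H, n)) * 0.01,
                          rng.normal(size=H))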
2-b. Initialization (LSTM)
• The initial memory state and hidden state of the LSTM are predicted by an average of
the annotation vectors fed through two separate MLPs ($f_{init,c}$ and $f_{init,h}$):
$\mathbf{c}_0 = f_{init,c}\!\left(\frac{1}{L}\sum_i^L \mathbf{a}_i\right)$, $\qquad \mathbf{h}_0 = f_{init,h}\!\left(\frac{1}{L}\sum_i^L \mathbf{a}_i\right)$
($a = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$)
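A sketch of the initialization, treating each init MLP as a single tanh layer for illustration (the exact depth of $f_{init,c}$ and $f_{init,h}$ is an assumption here):

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(196, 512))              # L annotation vectors a_i
a_mean = a.mean(axis=0)                      # (1/L) * sum_i a_i

# Hypothetical single-layer versions of f_init,c and f_init,h.
W_c0 = rng.normal(size=(64, 512)) * 0.01
W_h0 = rng.normal(size=(64, 512)) * 0.01
c0 = np.tanh(W_c0 @ a_mean)                  # initial memory state
h0 = np.tanh(W_h0 @ a_mean)                  # initial hidden state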
2-b. Output word probability
• We use a deep output layer (Pascanu et al., 2014) to compute the output word
probability:
$p(\mathbf{y}_t \mid \mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp\!\left(\mathbf{L}_0(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}}_t)\right)$
where $\mathbf{L}_0 \in \mathbb{R}^{K \times m}$, $\mathbf{L}_h \in \mathbb{R}^{m \times n}$, $\mathbf{L}_z \in \mathbb{R}^{m \times D}$, and $\mathbf{E}$ are learned parameters initialized randomly.
• Vector exponential (aside):
$\exp \mathbf{v} = \mathbf{1} + \mathbf{v} + \frac{1}{2!}\mathbf{v}^2 + \frac{1}{3!}\mathbf{v}^3 + \cdots = \mathbf{1}\cosh\|\mathbf{v}\| + \frac{\mathbf{v}}{\|\mathbf{v}\|}\sinh\|\mathbf{v}\|$
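A sketch of the deep output layer as a normalized softmax over the $K$-word vocabulary (function name is ours; shapes follow the dimensions stated above):

import numpy as np

def word_probabilities(Ey_prev, h_t, z_t, L0, Lh, Lz):
    # Ey_prev: (m,), h_t: (n,), z_t: (D,); L0: (K, m), Lh: (m, n), Lz: (m, D).
    logits = L0 @ (Ey_prev + Lh @ h_t + Lz @ z_t)   # shape (K,)
    logits = logits - logits.max()                  # for numerical stability
    p = np.exp(logits)
    return p / p.sum()                              # p(y_t | a, y_1..t-1)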
3-a. Stochastic "Hard" Attention
• We represent the location variable $s_t$ as where the model decides to focus attention
when generating the $t^{th}$ word ($s_t$ : attention location variable, i.e. the part to attend to at time $t$). $s_{t,i}$ is an indicator one-hot variable which is set to 1 if
the $i$-th location (out of $L$) is the one used to extract visual features.
$p(s_{t,i} = 1 \mid s_{j<t}, \mathbf{a}) = \alpha_{t,i}, \qquad \hat{\mathbf{z}}_t = \sum_i s_{t,i} \mathbf{a}_i$
• As before, with $\mathbf{a} = \{\mathbf{a}_1, \ldots, \mathbf{a}_L\}$, $\mathbf{a}_i \in \mathbb{R}^D$:
$\alpha_i = \dfrac{\exp(f_{att}(\mathbf{a}_i, \mathbf{h}_{t-1}))}{\sum_{k=1}^{L} \exp(f_{att}(\mathbf{a}_k, \mathbf{h}_{t-1}))}, \qquad \hat{\mathbf{z}}_t = \phi(\{\mathbf{a}_i\}, \{\alpha_i\})$
• Binary vs. one-hot encoding of a location index:
Binary  One-hot
00      0001
01      0010
10      0100
11      1000
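A sketch of the hard-attention context: sample a one-hot $s_t$ from a multinoulli distribution with parameters $\alpha$, so $\hat{\mathbf{z}}_t$ is a single selected annotation vector (names ours):

import numpy as np

def hard_context(a, alpha, rng):
    # Sample s_t ~ Multinoulli(alpha); z_t = sum_i s_{t,i} a_i, i.e. the chosen a_i.
    i = rng.choice(len(alpha), p=alpha)
    s_t = np.zeros(len(alpha))
    s_t[i] = 1.0
    return s_t @ a

rng = np.random.default_rng(0)
a = rng.normal(size=(196, 512))       # annotation vectors
alpha = rng.dirichlet(np.ones(196))   # attention probabilities over locations
z_t = hard_context(a, alpha, rng)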
3-a. A new objective function $L_s$
• A variational lower bound on the marginal log-likelihood $\log p(\mathbf{y} \mid \mathbf{a})$ of observing the
sequence of words $\mathbf{y}$ given image features $\mathbf{a}$:
$L_s = \sum_s p(s \mid \mathbf{a}) \log p(\mathbf{y} \mid s, \mathbf{a}) \le \log \sum_s p(s \mid \mathbf{a})\, p(\mathbf{y} \mid s, \mathbf{a}) = \log p(\mathbf{y} \mid \mathbf{a})$
$\dfrac{\partial L_s}{\partial W} = \sum_s p(s \mid \mathbf{a}) \left[ \dfrac{\partial \log p(\mathbf{y} \mid s, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \mid s, \mathbf{a}) \dfrac{\partial \log p(s \mid \mathbf{a})}{\partial W} \right]$
3-a. Approximation of the gradient
• Monte Carlo based sampling approximation of the gradient with respect to the model
parameters:
$\tilde{s}_t^n \sim \text{Multinoulli}_L(\{\alpha_t^n\}), \qquad \tilde{s}^n = (\tilde{s}_1^n, \tilde{s}_2^n, \ldots)$
$\dfrac{\partial L_s}{\partial W} \approx \dfrac{1}{N} \sum_{n=1}^{N} \left[ \dfrac{\partial \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a}) \dfrac{\partial \log p(\tilde{s}^n \mid \mathbf{a})}{\partial W} \right]$
Monte Carlo method
• An algorithm that computes the value of a function stochastically using random numbers.
• Used to approximate quantities that have no closed-form expression or are too complex to compute exactly.
(ex) Estimating $\pi$: (number of points inside the circle) / (total number of points) $\approx \pi/4$
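The $\pi$ example above as a runnable one-liner style sketch: sample uniform points in the unit square and count the fraction inside the quarter circle.

import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((1_000_000, 2))          # uniform points in the unit square
inside = (pts ** 2).sum(axis=1) <= 1.0    # inside the quarter circle of radius 1
print(4 * inside.mean())                  # ~3.1416, since inside-fraction ~ pi/4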
3-a. Variance Reduction
• A moving average baseline:
Upon seeing the $k^{th}$ mini-batch, the moving average baseline is estimated as an
accumulated sum of the previous log-likelihoods with exponential decay:
$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\mathbf{y} \mid \tilde{s}_k, \mathbf{a})$
• An entropy term on the multinoulli distribution, $H[s]$, is added:
$\dfrac{\partial L_s}{\partial W} \approx \dfrac{1}{N} \sum_{n=1}^{N} \left[ \dfrac{\partial \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a})}{\partial W} + \lambda_r \left( \log p(\mathbf{y} \mid \tilde{s}^n, \mathbf{a}) - b \right) \dfrac{\partial \log p(\tilde{s}^n \mid \mathbf{a})}{\partial W} + \lambda_e \dfrac{\partial H[\tilde{s}^n]}{\partial W} \right]$
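The baseline update itself is a one-line exponential moving average; a trivial sketch with stand-in mini-batch log-likelihoods:

b = 0.0                                   # moving-average baseline b_0
for log_like in (-42.0, -40.5, -39.8):    # stand-in values of log p(y | s~_k, a)
    b = 0.9 * b + 0.1 * log_like          # b_k = 0.9 b_{k-1} + 0.1 log p(y | s~_k, a)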
3-a. Stochastic "Hard" Attention
• In making a hard choice at every point, $\phi(\{\mathbf{a}_i\}, \{\alpha_i\})$ is a function that returns a
sampled $\mathbf{a}_i$ at every point in time, based upon a multinoulli distribution parameterized
by $\alpha$.
3-b. Deterministic "Soft" Attention
• Take the expectation of the context vector $\hat{\mathbf{z}}_t$ directly,
$\mathbb{E}_{p(s_t \mid a)}[\hat{\mathbf{z}}_t] = \sum_{i=1}^{L} \alpha_{t,i} \mathbf{a}_i$
and formulate a deterministic attention model by computing a soft attention
weighted annotation vector $\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \sum_i^L \alpha_i \mathbf{a}_i$.
• This corresponds to feeding in a soft $\alpha$-weighted context into the system.
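In code the soft context is just a weighted average of the annotation matrix; a minimal sketch (random stand-ins for $\mathbf{a}$ and $\alpha$):

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(196, 512))       # L annotation vectors
alpha = rng.dirichlet(np.ones(196))   # soft attention weights, sum to 1
z_t = alpha @ a                       # E[z_t] = sum_i alpha_i a_i, shape (512,)
# The hard variant would instead pick a single row of a.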
3-b. Deterministic "Soft" Attention
• Learning the deterministic attention can be understood as approximately optimizing
the marginal likelihood over the attention locations $s_t$.
• The hidden activation of the LSTM, $\mathbf{h}_t$, is a linear projection of the stochastic context vector
$\hat{\mathbf{z}}_t$ followed by a tanh non-linearity.
• To the first-order Taylor approximation, the expected value $\mathbb{E}_{p(s_t \mid a)}[\mathbf{h}_t]$ is equal to
computing $\mathbf{h}_t$ using a single forward prop with the expected context vector
$\mathbb{E}_{p(s_t \mid a)}[\hat{\mathbf{z}}_t]$.
3-b. Deterministic "Soft" Attention
• Let $\mathbf{n}_t = \mathbf{L}_0(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}}_t)$
($\mathbf{n}_{t,i}$ : $\mathbf{n}_t$ computed by setting $\hat{\mathbf{z}}_t = \mathbf{a}_i$)
• Define the normalized weighted geometric mean (NWGM) for the softmax $k^{th}$ word
prediction:
$NWGM[p(y_t = k \mid a)] = \dfrac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \dfrac{\exp(\mathbb{E}_{p(s_t \mid a)}[n_{t,k}])}{\sum_j \exp(\mathbb{E}_{p(s_t \mid a)}[n_{t,j}])}$
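The second equality is an algebraic identity ($\prod_i \exp(n_{ik})^{p_i} = \exp(\sum_i p_i n_{ik})$); a quick numerical check with random stand-in scores:

import numpy as np

rng = np.random.default_rng(0)
L, K = 196, 10
n = rng.normal(size=(L, K))        # row i: scores n_{t,:,i} computed with z_t = a_i
p = rng.dirichlet(np.ones(L))      # p(s_{t,i} = 1 | a)

# NWGM directly: per-word product over locations of exp(n)^p, then normalize.
g = np.prod(np.exp(n) ** p[:, None], axis=0)
nwgm = g / g.sum()

# Softmax of the expected scores E[n_{t,k}] = sum_i p_i n_{t,k,i}.
e = np.exp(p @ n)
soft = e / e.sum()

print(np.allclose(nwgm, soft))     # True: the two sides of the identity agree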
3-b. Deterministic "Soft" Attention
• The NWGM can be approximated well by $\mathbb{E}[\mathbf{n}_t] = \mathbf{L}_0(\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbb{E}[\mathbf{h}_t] + \mathbf{L}_z \mathbb{E}[\hat{\mathbf{z}}_t])$.
(This shows that the NWGM of a softmax unit is obtained by applying softmax to the expectations of the
underlying linear projections.)
• Also, from the results in (Baldi & Sadowski, 2014), $NWGM[p(\mathbf{y}_t = k \mid \mathbf{a})] \approx \mathbb{E}[p(\mathbf{y}_t = k \mid \mathbf{a})]$
under softmax activation.
• This means the expectation of the outputs over all possible attention locations
induced by the random variable $s_t$ is computed by simple feedforward propagation with the
expected context vector $\mathbb{E}[\hat{\mathbf{z}}_t]$.
• In other words, the deterministic attention model is an approximation to the marginal
likelihood over the attention locations:
$p(X \mid \alpha) = \int_\theta p(X \mid \theta)\, p(\theta \mid \alpha)\, d\theta$ (marginal likelihood over $\theta$)
3-b-1. Doubly Stochastic Attention
• By construction, $\sum_i \alpha_{t,i} = 1$, as they are the output of a softmax:
$\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$
• In training the deterministic version of our model, we introduce a form of doubly
stochastic regularization where $\sum_t \alpha_{t,i} \approx 1$.
(This can be interpreted as encouraging the model to pay equal attention to every part of the image over
the course of generation.)
• This penalty was important for improving the overall BLEU score, and it leads to richer
and more descriptive captions.
3-b-1. Doubly Stochastic Attention
• In addition, the soft attention model predicts a gating scalar $\beta$ from the previous hidden
state $\mathbf{h}_{t-1}$ at each time step $t$, such that
$\phi(\{\mathbf{a}_i\}, \{\alpha_i\}) = \beta \sum_i^L \alpha_i \mathbf{a}_i$, where $\beta_t = \sigma(f_\beta(\mathbf{h}_{t-1}))$.
• This gating variable lets the decoder decide whether to put more emphasis on
language modeling or on the context at each time step.
• Qualitatively, we observe that the gating variable is larger when the decoder describes
an object in the image.
3-b. Soft Attention Model
• The soft attention model is trained end-to-end by minimizing the following penalized
negative log-likelihood:
$L_d = -\log p(y \mid a) + \lambda \sum_i^L \left(1 - \sum_t^C \alpha_{ti}\right)^2$
where the target $\tau$ of the $\sum_t \alpha_{ti} \approx \tau$ penalty was simply fixed to 1.
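A direct sketch of this loss as a function of the caption log-likelihood and the per-step attention weights (names ours):

import numpy as np

def soft_attention_loss(log_p_y, alphas, lam):
    # alphas: (C, L) attention weights, one row per generated word.
    penalty = ((1.0 - alphas.sum(axis=0)) ** 2).sum()   # sum_i (1 - sum_t alpha_ti)^2
    return -log_p_y + lam * penalty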
3-c. Training
• Both variants of our attention model were trained with SGD using adaptive learning
rate algorithms.
• To create the $a_i$, we used the Oxford VGGnet pretrained on ImageNet without finetuning. We
use the 14 × 14 × 512 feature map of the 4th convolutional layer before max pooling.
This means our decoder operates on the flattened 196 × 512 ($L \times D$) encoding.
• (MS COCO) The soft attention model took less than 3 days to train (NVIDIA Titan Black GPU).
• GoogLeNet or Oxford VGG can give a boost in performance over AlexNet.
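The flattening step is a simple reshape of the convolutional feature map into the $L \times D$ annotation matrix the decoder attends over (zeros stand in for real VGG features):

import numpy as np

feat = np.zeros((14, 14, 512))    # conv feature map (stand-in for the VGG output)
a = feat.reshape(196, 512)        # flattened L x D = 196 x 512 annotation matrix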
4. Experiments
• Data:
Dataset     Images          References
Flickr8k    8,000 images    5 reference sentences / image
Flickr30k   30,000 images   5 reference sentences / image
MS COCO     82,738 images   more than 5 / image
• Metric : BLEU (Bilingual Evaluation Understudy)
- An algorithm for evaluating the quality of text which has been machine translated from one natural
language to another.
- Quality is considered to be the correspondence between a machine's output and
that of a human: "the closer a machine translation is to a professional human translation, the better it
is" – this is the central idea behind BLEU.
4. Experiments
• We are able to significantly improve the state of the art performance METEOR on MS
COCO that we speculate is connected to some of the regularization technique and
our lower level representation.
• Our approach is much more flexible, since the model can attend to “non object”
salient regions.
4. Experiments
Reference
• Papers
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Kelvin Xu et al., ICML
2015
• Useful websites
- Deep learning library overview and RNN tutorial (Korean): http://aikorea.org/
- LSTM tutorial: http://deeplearning.net/tutorial/lstm.html#lstm
- BLEU: a Method for Automatic Evaluation of Machine Translation
(http://www.aclweb.org/anthology/P02-1040.pdf)