CS7015 (Deep Learning) : Lecture 14
Long Short Term Memory Cells (LSTMs), Gated Recurrent Units (GRUs)
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module : Selective Read, Selective Write, Selective Forget - The Whiteboard Analogy
[Figure: an RNN unrolled over time, with inputs x1, x2, ..., xt, states s1, s2, ..., st, outputs y1, y2, ..., yt, and the parameters U, V, W shared across all time steps]
The state (si) of an RNN records information from all previous time steps.
At each new timestep the old information gets morphed by the current input.
One could imagine that after t steps the information stored at time step t − k (for some k < t) gets completely morphed, so much that it would be impossible to extract the original information stored at time step t − k.
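To make the morphing concrete, here is a minimal numpy sketch (not from the lecture) of a vanilla RNN update; the dimensions, the random parameters and the use of a logistic nonlinearity are assumptions made purely for illustration:

    import numpy as np

    np.random.seed(0)
    n, d, T = 4, 3, 10                     # assumed state size, input size, sequence length
    W, U = np.random.randn(n, n), np.random.randn(n, d)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    s = np.zeros(n)
    for t in range(T):
        x_t = np.random.randn(d)           # made-up input at time step t
        s = sigmoid(W @ s + U @ x_t)       # the whole state is re-written (morphed) at every step
    print(s)                               # the contribution of the first input is now entangled in s

Nothing in this update lets the network keep a part of the state untouched or drop a part of it explicitly; every entry is re-computed at every step.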
A similar problem occurs when the information flows backwards (backpropagation).
It is very hard to assign the responsibility of the error caused at time step t to the events that occurred at time step t − k.
This responsibility is, of course, in the form of gradients, and we studied this problem while discussing the backward flow of gradients.
We saw a formal argument for this while discussing vanishing gradients.
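As a rough reminder of that argument, the sketch below (an illustration with assumed scalar weights, not the lecture's derivation) accumulates the factor d s_t / d s_0 for a scalar RNN; each step contributes a factor sigma'(a) * w, and since sigma' is at most 0.25 the product shrinks rapidly:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    np.random.seed(0)
    w, u = 1.5, 0.5                        # assumed scalar weights
    x = np.random.randn(50)                # made-up input sequence
    s, grad = 0.0, 1.0                     # grad accumulates d s_t / d s_0
    for t in range(50):
        a = w * s + u * x[t]
        s = sigmoid(a)
        grad *= sigmoid(a) * (1.0 - sigmoid(a)) * w   # per-step factor, at most 0.25 * |w|
        if t in (4, 19, 49):
            print(f"after {t + 1:2d} steps, d s_t / d s_0 is roughly {grad:.1e}")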
Let us see an analogy for this.
We can think of the state as a fixed size memory.
Compare this to a fixed size white board that you use to record information.
At each time step (periodic intervals) we keep writing something to the board.
Effectively at each time step we morph the information recorded till that time point.
After many timesteps it would be impossible to see how the information at time step t − k contributed to the state at timestep t.
Continuing our whiteboard analogy, suppose we are interested in deriving an expression on the whiteboard.
We follow the following strategy at each time step:
Selectively write on the board
Selectively read the already written content
Selectively forget (erase) some content
Let us look at each of these in detail.
Selective write
a = 1, b = 3, c = 5, d = 11. Compute ac(bd + a) + ad. Say the “board” can have only 3 statements at a time.
Steps of the derivation:
1 ac
2 bd
3 bd + a
4 ac(bd + a)
5 ad
6 ac(bd + a) + ad
Board so far: ac = 5, bd = 33
There may be many steps in the derivation but we may just skip a few.
In other words we select what to write.
Selective read
Board so far: ac = 5, bd = 33, bd + a = 34
While writing one step we typically read some of the previous steps we have already written and then decide what to write next.
For example at Step 3, information from Step 2 is important.
In other words we select what to read.
Selective forget
Once the board is full, we need to delete some obsolete information.
But how do we decide what to delete? We will typically delete the least useful information.
In other words we select what to forget.
Board at the end of the derivation: ad + ac(bd + a) = 181, ac(bd + a) = 170, ad = 11
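The whole derivation, with its selective writes, reads and forgets, can be played out in a few lines of code. This is only a toy sketch of the analogy: the three-slot board and the hard-coded decisions about what to forget simply mirror the steps listed above.

    # toy whiteboard with room for only 3 statements
    a, b, c, d = 1, 3, 5, 11
    board = {}

    def write(name, value):
        assert len(board) < 3, "board full - forget something first"
        board[name] = value                                   # selective write
        print("write ", name, "=", value, "| board:", board)

    def forget(name):
        del board[name]                                       # selective forget
        print("forget", name, "        | board:", board)

    write("ac", a * c)                                        # step 1
    write("bd", b * d)                                        # step 2
    write("bd+a", board["bd"] + a)                            # step 3: selectively read bd
    forget("bd")                                              # bd is no longer useful
    write("ac(bd+a)", board["ac"] * board["bd+a"])            # step 4: read ac and bd+a
    forget("bd+a")
    write("ad", a * d)                                        # step 5
    forget("ac")
    write("ad+ac(bd+a)", board["ad"] + board["ac(bd+a)"])     # step 6: the final answer, 181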
There are various other scenarios where we can motivate the need for selective write, read and forget.
For example, you could think of our brain as something which can store only a finite number of facts.
At different time steps we selectively read, write and forget some of these facts.
Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write and forget.
Module : Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs)
Questions
Can we give a concrete example where RNNs also need to selectively read, write and forget?
How do we convert this intuition into mathematical equations? We will see this over the next few slides.
Review: The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance.
[Figure: an RNN reading the review word by word ("The", "first", ..., "performance") and producing a positive/negative (+/−) prediction at the end]
Consider the task of predicting the sentiment (positive/negative) of a review.
The RNN reads the document from left to right and after every word updates the state.
By the time we reach the end of the document the information obtained from the first few words is completely lost.
Ideally we want to
forget the information added by stop words (a, the, etc.)
selectively read the information added by previous sentiment bearing words (awesome, amazing, etc.)
selectively write new information from the current word to the state
Questions
Can we give a concrete example where RNNs also need to selectively read, write and forget?
How do we convert this intuition into mathematical equations?
Recall that the blue colored vector (st) is called the state of the RNN.
It has a finite size (st ∈ R^n) and is used to store all the information up to timestep t.
This state is analogous to the whiteboard and sooner or later it will get overloaded and the information from the initial states will get morphed beyond recognition.
Wishlist: selective write, selective read and selective forget to ensure that this finite sized state vector is used effectively.
[Figure: the previous state st−1 = (−1.4, −0.4, 1, ..., −2), the current input xt = (0.7, −0.2, 1.1, ..., −0.3) and the new state st = (−0.9, 0.2, 1, ..., −1.9), connected by selective read, selective write and selective forget operations]
Just to be clear, we have computed a state st−1 at timestep t − 1 and now we want to overload it with new information (xt) and compute a new state (st).
While doing so we want to make sure that we use selective write, selective read and selective forget so that only important information is retained in st.
We will now see how to implement these items from our wishlist.
Selective Write
[Figure: the usual RNN update from st−1 and xt to st, but with some entries of st−1 zeroed out before being passed on]
Recall that in RNNs we use st−1 to compute st:
st = σ(W st−1 + U xt) (ignoring bias)
But now instead of passing st−1 as it is to st we want to pass (write) only some portions of it to the next state.
In the strictest case our decisions could be binary (for example, retain the 1st and 3rd entries and delete the rest of the entries).
But a more sensible way of doing this would be to assign a value between 0 and 1 which determines what fraction of the current state to pass on to the next state.
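As a tiny illustration, using the example numbers shown on the slide (the vectors below are taken from the illustration and are otherwise arbitrary), the binary and the fractional decisions look like this:

    import numpy as np

    s_prev = np.array([-1.4, -0.4, 1.0, -2.0])    # a small state vector, as in the figure
    binary = np.array([1.0, 0.0, 1.0, 0.0])       # strict decision: keep only the 1st and 3rd entries
    fraction = np.array([0.2, 0.34, 0.9, 0.29])   # softer decision: keep a fraction of each entry
    print(binary * s_prev)                        # 2nd and 4th entries are wiped out completely
    print(fraction * s_prev)                      # every entry is scaled down by its own fraction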
[Figure: each element of st−1 is multiplied (⊙) by the corresponding element of a gate vector ot−1 to produce ht−1, the selectively written version of the state]
We introduce a vector ot−1 which decides what fraction of each element of st−1 should be passed to the next state.
Each element of ot−1 gets multiplied with the corresponding element of st−1.
Each element of ot−1 is restricted to be between 0 and 1.
But how do we compute ot−1? How does the RNN know what fraction of the state to pass on?
Well, the RNN has to learn ot−1 along with the other parameters (W, U, V).
We compute ot−1 and ht−1 as
ot−1 = σ(Wo ht−2 + Uo xt−1 + bo)
ht−1 = ot−1 ⊙ σ(st−1)
The parameters Wo, Uo, bo need to be learned along with the existing parameters W, U, V.
The sigmoid (logistic) function ensures that the values are between 0 and 1.
ot is called the output gate as it decides how much to pass (write) to the next time step.
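In code, the selective write step looks roughly like this; the numpy sketch below uses assumed dimensions and random parameters purely for illustration (in a real model Wo, Uo, bo would be learned):

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    np.random.seed(0)
    n, d = 4, 3                                               # assumed state and input dimensions

    W_o, U_o, b_o = np.random.randn(n, n), np.random.randn(n, d), np.zeros(n)  # output-gate parameters
    s_prev = np.random.randn(n)                               # s_{t-1}, computed at the previous step
    h_prev2, x_prev = np.random.randn(n), np.random.randn(d)  # stand-ins for h_{t-2} and x_{t-1}

    o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)      # output gate, every entry in (0, 1)
    h_prev = o_prev * sigmoid(s_prev)                         # selective write: pass only a fraction of each entry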
Selective Read
[Figure: ht−1 (obtained from the selective write) and the new input xt are combined through the parameters W and U to give ˜st]
We will now use ht−1 to compute the new state at the next time step.
We will also use xt which is the new input at time step t:
˜st = σ(W ht−1 + U xt + b)
Note that W, U and b are similar to the parameters that we used in the RNN (for simplicity we have not shown the bias b in the figure).
˜st thus captures all the information from the previous state (ht−1) and the current input xt.
However, we may not want to use all this new information and only selectively read from it before constructing the new cell state st.
To do this we introduce another gate called the input gate:
it = σ(Wi ht−1 + Ui xt + bi)
and use it ⊙ ˜st as the selectively read state information.
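A corresponding sketch of the selective read (same caveats as before: assumed sizes, random stand-ins for quantities that would have been computed earlier):

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    np.random.seed(1)
    n, d = 4, 3                                               # assumed dimensions

    W, U, b = np.random.randn(n, n), np.random.randn(n, d), np.zeros(n)         # RNN-like parameters
    W_i, U_i, b_i = np.random.randn(n, n), np.random.randn(n, d), np.zeros(n)   # input-gate parameters
    h_prev, x_t = np.random.randn(n), np.random.randn(d)      # stand-ins for h_{t-1} and x_t

    s_tilde = sigmoid(W @ h_prev + U @ x_t + b)               # candidate state: all the new information
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)             # input gate
    read = i_t * s_tilde                                      # selectively read part of the new information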
So far we have the following:
Previous state: st−1
Output gate: ot−1 = σ(Wo ht−2 + Uo xt−1 + bo)
Selectively Write: ht−1 = ot−1 ⊙ σ(st−1)
Current (temporary) state: ˜st = σ(W ht−1 + U xt + b)
Input gate: it = σ(Wi ht−1 + Ui xt + bi)
Selectively Read: it ⊙ ˜st
Selective Forget
How do we combine st−1 and ˜st to get the new state?
Here is one simple (but effective) way of doing this:
st = st−1 + it ⊙ ˜st
But we may not want to use the whole of st−1 but forget some parts of it.
To do this we introduce the forget gate:
ft = σ(Wf ht−1 + Uf xt + bf)
st = ft ⊙ st−1 + it ⊙ ˜st
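And the selective forget step, sketched the same way (assumed sizes; the candidate state and input gate below are random stand-ins for the quantities computed in the selective read step):

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    np.random.seed(2)
    n, d = 4, 3                                               # assumed dimensions

    W_f, U_f, b_f = np.random.randn(n, n), np.random.randn(n, d), np.zeros(n)   # forget-gate parameters
    s_prev, h_prev, x_t = np.random.randn(n), np.random.randn(n), np.random.randn(d)
    s_tilde, i_t = np.random.randn(n), sigmoid(np.random.randn(n))  # stand-ins from the selective read step

    f_t = sigmoid(W_f @ h_prev + U_f @ x_t + b_f)             # forget gate
    s_t = f_t * s_prev + i_t * s_tilde                        # keep a fraction of the old state, add the selectively read part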
[Figure: the full LSTM computation at time step t: selective forget of st−1, selective read of ˜st, the new state st, and the selective write that produces ht; the green box in the original slide marks the computations of time step t]
We now have the full set of equations for LSTMs.
The green box, together with the selective write operations following it, shows all the computations which happen at timestep t.
Gates:
ot = σ(Wo ht−1 + Uo xt + bo)
it = σ(Wi ht−1 + Ui xt + bi)
ft = σ(Wf ht−1 + Uf xt + bf)
States:
˜st = σ(W ht−1 + U xt + b)
st = ft ⊙ st−1 + it ⊙ ˜st
ht = ot ⊙ σ(st) and rnnout = ht
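Putting the pieces together, here is a minimal numpy implementation of one LSTM time step that follows these equations literally; the dimensions, initialization and the toy driver loop are assumptions, and standard library implementations typically use tanh rather than the logistic function for ˜st and for squashing st:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, s_prev, h_prev, p):
        # one time step, following the equations above (sigma everywhere, as on the slides)
        o_t = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])   # output gate
        i_t = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])   # input gate
        f_t = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])   # forget gate
        s_tilde = sigmoid(p["W"] @ h_prev + p["U"] @ x_t + p["b"])  # candidate state
        s_t = f_t * s_prev + i_t * s_tilde                          # selective forget + selective read
        h_t = o_t * sigmoid(s_t)                                    # selective write; h_t is the output at time t
        return s_t, h_t

    # toy driver with assumed sizes and random parameters
    np.random.seed(0)
    n, d, T = 4, 3, 5
    p = {k: np.random.randn(n, n) for k in ("Wo", "Wi", "Wf", "W")}
    p.update({k: np.random.randn(n, d) for k in ("Uo", "Ui", "Uf", "U")})
    p.update({k: np.zeros(n) for k in ("bo", "bi", "bf", "b")})
    s, h = np.zeros(n), np.zeros(n)
    for t in range(T):
        s, h = lstm_step(np.random.randn(d), s, h, p)
    print(h)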
Note
LSTMs have many variants which include different numbers of gates and also different arrangements of gates.
The one which we just saw is one of the most popular variants of LSTM.
Another equally popular variant of LSTM is the Gated Recurrent Unit, which we will see next.
[Figure: one GRU timestep worked out on example vectors — ot ⊙ st−1 and xt feed the candidate ˜st via W, U and σ, and st = (1 − it) ⊙ st−1 + it ⊙ ˜st]
The full set of equations for GRUs
Gates:
ot = σ(Wost−1 + Uoxt + bo)
it = σ(Wist−1 + Uixt + bi)
States:
˜st = σ(W(ot ⊙ st−1) + Uxt + b)
st = (1 − it) ⊙ st−1 + it ⊙ ˜st
No explicit forget gate (the forget and input gates are tied)
The gates depend directly on st−1 and not on an intermediate ht−1 as is the case in LSTMs
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
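A matching sketch for one GRU timestep; gru_step and the params dictionary are again just illustrative names, and σ is used for the candidate as in the equations above.

```python
import numpy as np

def sigmoid(x):                      # same helper as in the LSTM sketch above
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, s_prev, params):
    # Gates: note that they depend directly on the previous state s_prev
    o_t = sigmoid(params["Wo"] @ s_prev + params["Uo"] @ x_t + params["bo"])
    i_t = sigmoid(params["Wi"] @ s_prev + params["Ui"] @ x_t + params["bi"])
    # States: the candidate reads a gated version of the previous state
    s_tilde = sigmoid(params["W"] @ (o_t * s_prev) + params["U"] @ x_t + params["b"])
    s_t = (1.0 - i_t) * s_prev + i_t * s_tilde   # forget gate tied to the input gate: (1 - i_t)
    return s_t
```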
Module : How LSTMs avoid the problem of vanishing gradients
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the same worked LSTM timestep on example vectors — selective write, selective read, selective forget]
Intuition
During forward propagation the
gates control the flow of information
They prevent any irrelevant
information from being written to
the state
Similarly, during backward propagation they control the flow of gradients
It is easy to see that during the backward pass the gradients will get multiplied by the gates
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the same worked LSTM timestep on example vectors — selective write, selective read, selective forget]
If the state at time t − 1 did not contribute much to the state at time t (i.e., if
∥ft∥ → 0 and ∥ot−1∥ → 0) then during backpropagation the gradients flowing
into st−1 will vanish
But this kind of vanishing gradient is fine (since st−1 did not contribute to st,
we don’t want to hold it responsible for the crimes of st)
The key difference from vanilla RNNs is that the flow of information and
gradients is controlled by the gates which ensure that the gradients vanish only
when they should (i.e., when st−1 didn’t contribute much to st)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
We will now see an illustrative proof of how the gates control the flow of gradients
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: an RNN unrolled over time with inputs x1, . . . , xt, per-timestep losses L1(θ), . . . , L4(θ), and the recurrent weight W carrying the state s0 → s1 → s2 → s3 → s4]
Recall that RNNs had this
multiplicative term which caused the
gradients to vanish
∂Lt(θ)/∂W = ∂Lt(θ)/∂st ∑_{k=1}^{t} ∏_{j=k}^{t−1} (∂sj+1/∂sj) (∂+sk/∂W)
In particular, if the loss at L4(θ) was high because W was not good enough
to compute s1 correctly then this information will not be propagated back to
W as the gradient ∂Lt(θ)/∂W along this long path will vanish
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
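To make the multiplicative term concrete, here is a small numeric sketch of how the running product of per-step Jacobians ∂sj+1/∂sj typically shrinks in a vanilla RNN. The sizes, the weight scale and the use of tanh are assumptions made purely for illustration.

```python
import numpy as np

np.random.seed(0)
d, T = 50, 30
W = 0.05 * np.random.randn(d, d)      # recurrent weight matrix (illustrative small scale)
s = np.random.randn(d)

prod = np.eye(d)                      # running product of Jacobians d s_{j+1} / d s_j
norms = []
for _ in range(T):
    s = np.tanh(W @ s)
    J = np.diag(1.0 - s ** 2) @ W     # Jacobian of one tanh-RNN step
    prod = J @ prod
    norms.append(np.linalg.norm(prod))
print(norms[::5])                     # the norm typically decays towards 0: a vanishing gradient
```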
[Figure: the same unrolled RNN with losses L1(θ), . . . , L4(θ) and the recurrent weight W connecting s0 → s1 → s2 → s3 → s4]
In general, the gradient of Lt(θ)
w.r.t. θi vanishes when the gradients
flowing through each and every
path from Lt(θ) to θi vanish.
On the other hand, the gradient of
Lt(θ) w.r.t. θi explodes when the
gradient flowing through at least
one path explodes.
We will first argue that in the case of
LSTMs there exists at least one path
through which the gradients can flow
effectively (and hence no vanishing
gradients)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: dependency graph for timestep k — sk−1 and hk−1 feed ˜sk, fk, ik, ok, which produce sk and then hk]
We will start with the dependency graph
involving different variables in LSTMs
Starting with the states at timestep k − 1
ok = σ(Wohk−1 + Uoxk + bo)
For simplicity we will omit the parameters for now and return to them later
ik = σ(Wihk−1 + Uixk + bi)
fk = σ(Wf hk−1 + Uf xk + bf )
˜sk = σ(Whk−1 + Uxk + b)
sk = fk ⊙ sk−1 + ik ⊙ ˜sk
hk = ok ⊙ σ(sk)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the dependency graph continued from timestep k up to timestep t, ending in ht and Lt(θ), with parameters such as Wo, Wf, Wi, W placed on the corresponding edges]
Starting from hk−1 and sk−1 we have reached
hk and sk
And the recursion will now continue till the
last timestep
For simplicity and ease of illustration, instead
of considering the parameters (W, Wo, Wi,
Wf , U, Uo, Ui, Uf ) as separate nodes in
the graph we will just put them on the
appropriate edges. (We show only a few
parameters and not all)
We are now interested in knowing if the
gradient from Lt(θ) flows back to an arbitrary
timestep k
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the dependency graph from timestep k to timestep t, ending in ht and Lt(θ), with parameters such as Wo, Wf, Wi, W on the corresponding edges]
For example, we are interested in knowing if
the gradient flows to Wf through sk
In other words, if Lt(θ) was high because Wf
failed to compute an appropriate value for sk
then this information should flow back to Wf
through the gradients
We can ask a similar question about the other
parameters (for example, Wi, Wo, W, etc.)
How does LSTM ensure that this gradient
does not vanish even at arbitrary time steps?
Let us see
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the dependency graph from timestep k to timestep t, ending in ht and Lt(θ), with parameters such as Wo, Wf, Wi, W on the corresponding edges]
It is sufficient to show that ∂Lt(θ)/∂sk does not vanish (because if this does
not vanish we can reach Wf through sk)
First, we observe that there are multiple paths
from Lt(θ) to sk (you just need to reverse the
direction of the arrows for backpropagation)
For example, there is one path through sk+1,
another through hk
Further, there are multiple paths to reach hk itself (as should be obvious from
the number of outgoing arrows from hk)
So at this point just convince yourself that
there are many paths from Lt(θ) to sk
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the dependency graph from timestep k to timestep t with the path Lt(θ) → ht → st → st−1 → · · · → sk highlighted]
Consider one such path (highlighted) which
will contribute to the gradient
Let us denote the gradient along this path as
t0
t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 · . . . · ∂sk+1/∂sk
The first term ∂Lt(θ)/∂ht is fine and it doesn’t vanish (ht is directly connected
to Lt(θ) and there are no intermediate nodes which can cause the gradient to
vanish)
We will now look at the other terms ∂ht/∂st and ∂st/∂st−1 (∀t)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the dependency graph from timestep k to timestep t with the path Lt(θ) → ht → st → st−1 → · · · → sk highlighted]
Let us first look at ∂ht/∂st
Recall that
ht = ot ⊙ σ(st)
Note that hti only depends on oti and sti and not on any other elements of ot and st
∂ht/∂st will thus be a square diagonal matrix ∈ Rd×d whose diagonal is
ot ⊙ σ′(st) ∈ Rd (see slide 35 of Lecture 14)
We will represent this diagonal matrix by
D(ot ⊙ σ′(st))
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
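For concreteness, a tiny self-contained sketch of this Jacobian (the helper name dht_dst is mine, not from the lecture):

```python
import numpy as np

def dht_dst(o_t, s_t):
    # d h_t / d s_t for h_t = o_t * sigmoid(s_t): the diagonal matrix D(o_t * sigma'(s_t))
    sig = 1.0 / (1.0 + np.exp(-s_t))
    return np.diag(o_t * sig * (1.0 - sig))
```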
[Figure: the dependency graph from timestep k to timestep t with the path Lt(θ) → ht → st → st−1 → · · · → sk highlighted]
Now let us consider ∂st/∂st−1
Recall that
st = ft ⊙ st−1 + it ⊙ ˜st
Notice that ˜st also depends on st−1 so we
cannot treat it as a constant
So once again we are dealing with an ordered network and thus ∂st/∂st−1 will
be a sum of an explicit term and an implicit term (see slide 37 from Lecture 14)
For simplicity, let us assume that the gradient
from the implicit term vanishes (we are
assuming a worst case scenario)
And the gradient from the explicit term
(treating ˜st as a constant) is given by D(ft)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
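Under that worst-case assumption, the surviving (explicit) part of ∂st/∂st−1 is just a diagonal matrix built from the forget gate; as a one-line sketch (the helper name is mine):

```python
import numpy as np

def dst_dst_prev_explicit(f_t):
    # Explicit part of d s_t / d s_{t-1}, treating the candidate state as a constant: D(f_t)
    return np.diag(f_t)
```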
[Figure: the dependency graph from timestep k to timestep t with the path Lt(θ) → ht → st → st−1 → · · · → sk highlighted]
We now return to our full expression for t0:
t0 = ∂Lt(θ)/∂ht · ∂ht/∂st · ∂st/∂st−1 · . . . · ∂sk+1/∂sk
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft) . . . D(fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(ft ⊙ . . . ⊙ fk+1)
   = L′t(ht) · D(ot ⊙ σ′(st)) · D(⊙_{i=k+1}^{t} fi)
The red terms don’t vanish and the blue terms
contain a multiplication of the forget gates
The forget gates thus regulate the gradient
flow depending on the explicit contribution of
a state (st) to the next state st+1
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
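A rough numeric illustration of that regulation, with made-up sizes and gate ranges: when the forget gates stay close to 1 the accumulated scaling survives, and when they stay close to 0 it dies out, which is exactly the behaviour we want.

```python
import numpy as np

np.random.seed(1)
d, steps = 8, 20
open_gates = np.random.uniform(0.7, 1.0, size=(steps, d))   # forget gates mostly open (close to 1)
shut_gates = np.random.uniform(0.0, 0.2, size=(steps, d))   # forget gates mostly shut (close to 0)

scale_open = np.prod(open_gates, axis=0)   # elementwise product f_t * ... * f_{k+1}
scale_shut = np.prod(shut_gates, axis=0)
print(scale_open.mean())   # small but clearly non-zero: the gradient still reaches far back
print(scale_shut.mean())   # essentially zero: the gradient (rightly) does not reach back
```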
[Figure: the dependency graph from timestep k to timestep t with the path Lt(θ) → ht → st → st−1 → · · · → sk highlighted]
If during the forward pass st did not contribute much to st+1 (because ft → 0)
then during backpropagation also the gradient will not reach st
This is fine because if st did not contribute much to st+1 then there is no
reason to hold it responsible during backpropagation (ft does the same
regulation during the forward pass and the backward pass, which is fair)
Thus there exists this one path along which the gradient doesn’t vanish when
it shouldn’t
And as argued, as long as the gradient flows back to Wf through one of the
paths (t0) through sk, we are fine!
Of course the gradient flows back only when required, as regulated by the fi’s
(but let me just say it one last time that this is fair)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
[Figure: the dependency graph from timestep k to timestep t with the path Lt(θ) → ht → ot → ht−1 → · · · → hk−1 highlighted]
Now we will see why LSTMs do not solve the
problem of exploding gradients
We will show a path through which the
gradient can explode
Let us compute one term (say t1) of ∂Lt(θ)/∂hk−1 corresponding to the
highlighted path
t1 = ∂Lt(θ)/∂ht · (∂ht/∂ot · ∂ot/∂ht−1) . . . (∂hk/∂ok · ∂ok/∂hk−1)
   = L′t(ht) · (D(σ(st) ⊙ o′t) · Wo) . . . (D(σ(sk) ⊙ o′k) · Wo)
∥t1∥ ≤ ∥L′t(ht)∥ (∥K∥ ∥Wo∥)^(t−k+1)
Depending on the norm of the matrix Wo, the gradient ∂Lt(θ)/∂hk−1 may explode
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 14
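A small numeric sketch of that bound in action, with made-up sizes and scales: repeatedly multiplying by a large-norm Wo along the highlighted h-path can make the gradient grow geometrically, even though each diagonal gate factor stays bounded.

```python
import numpy as np

np.random.seed(2)
d, steps = 8, 20
Wo = 5.0 * np.random.randn(d, d)          # output-gate recurrent matrix with a deliberately large norm
v = np.random.randn(d)                    # stand-in for the gradient flowing along the h-path
for _ in range(steps):
    gate_jac = np.diag(np.random.uniform(0.0, 0.25, size=d))   # bounded diagonal gate factor
    v = gate_jac @ Wo @ v
print(np.linalg.norm(v))                  # typically grows geometrically in the number of steps
```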