CS7015 (Deep Learning): Lecture 4
Feedforward Neural Networks, Backpropagation
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation
Module 4.1: Feedforward Neural Networks (a.k.a.
multilayered network of neurons)
[Figure: a feedforward network with inputs x1, x2, ..., xn; pre-activations a1, a2, a3; activations h1, h2; output hL = ŷ = f(x); and parameters (W1, b1), (W2, b2), (W3, b3)]
The input to the network is an n-dimensional vector
The network contains L − 1 hidden layers (2, in this case) having n neurons each
Finally, there is one output layer containing k neurons (say, corresponding to k classes)
Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (ai and hi are vectors)
The input layer can be called the 0-th layer and the output layer can be called the L-th layer
Wi ∈ R^(n×n) and bi ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L)
WL ∈ R^(n×k) and bL ∈ R^k are the weight and bias between the last hidden layer and the output layer (L = 3 in this case)
The pre-activation at layer i is given by
ai(x) = bi + Wi hi−1(x)
The activation at layer i is given by
hi(x) = g(ai(x))
where g is called the activation function (for example, logistic, tanh, linear, etc.)
The activation at the output layer is given by
f(x) = hL(x) = O(aL(x))
where O is the output activation function (for example, softmax, linear, etc.)
To simplify notation we will refer to ai(x) as ai and hi(x) as hi
With the simplified notation:
ai = bi + Wi hi−1
hi = g(ai)
f(x) = hL = O(aL)
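The layer recurrence ai = bi + Wi hi−1, hi = g(ai), f(x) = O(aL) can be sketched in plain Python. This is an illustrative sketch, not the lecture's code: it assumes a logistic g and a softmax O, with each weight matrix stored as a list of rows.

```python
import math

def logistic(v):
    # elementwise logistic activation g
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    # output activation O: exponentiate and normalize
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def affine(W, h, b):
    # pre-activation a = b + W h  (W stored as a list of rows)
    return [bj + sum(wji * hi for wji, hi in zip(row, h))
            for row, bj in zip(W, b)]

def forward(x, weights, biases):
    # weights, biases: lists [W1, ..., WL], [b1, ..., bL]
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = logistic(affine(W, h, b))                   # hidden: h_i = g(a_i)
    return softmax(affine(weights[-1], h, biases[-1]))  # output: O(a_L)
```

For a tiny 2-2-2 network, `forward([1.0, 2.0], [W1, W2], [b1, b2])` returns a length-2 vector that sums to 1, i.e. a probability distribution over the k = 2 classes.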
Data: {xi, yi}, i = 1, ..., N
Model:
ŷi = f(xi) = O(W3 g(W2 g(W1 xi + b1) + b2) + b3)
Parameters:
θ = [W1, ..., WL, b1, ..., bL] (L = 3)
Algorithm: Gradient Descent with Backpropagation (we will see soon)
Objective/Loss/Error function: Say,
min (1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} (ŷij − yij)²
In general, min L(θ), where L(θ) is some function of the parameters
Module 4.2: Learning Parameters of Feedforward
Neural Networks (Intuition)
The story so far...
We have introduced feedforward neural networks
We are now interested in finding an algorithm for learning the parameters of
this model
Recall our gradient descent algorithm

Algorithm: gradient_descent()
t ← 0;
max_iterations ← 1000;
Initialize w0, b0;
while t++ < max_iterations do
    wt+1 ← wt − η∇wt;
    bt+1 ← bt − η∇bt;
end

We can write it more concisely as

Algorithm: gradient_descent()
t ← 0;
max_iterations ← 1000;
Initialize θ0 = [w0, b0];
while t++ < max_iterations do
    θt+1 ← θt − η∇θt;
end

where ∇θt = [∂L(θ)/∂wt, ∂L(θ)/∂bt]^T

Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W1, W2, ..., WL, b1, b2, ..., bL]
We can still use the same algorithm for learning the parameters of our model, with
Initialize θ0 = [W1^0, ..., WL^0, b1^0, ..., bL^0];
and ∇θt = [∂L(θ)/∂W1,t, ..., ∂L(θ)/∂WL,t, ∂L(θ)/∂b1,t, ..., ∂L(θ)/∂bL,t]^T
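The concise update θt+1 ← θt − η∇θt works for any parameter vector once we can compute the gradient. A minimal sketch on a toy two-parameter loss; the quadratic loss and all names here are illustrative, not from the lecture:

```python
def gradient_descent(grad, theta0, eta=0.1, max_iterations=1000):
    # theta_{t+1} <- theta_t - eta * grad(theta_t)
    theta = list(theta0)
    for _ in range(max_iterations):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

# Toy loss L(w, b) = (w - 3)^2 + (b + 1)^2 with a known gradient
grad = lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)]
w, b = gradient_descent(grad, [0.0, 0.0])  # converges towards (3, -1)
```

The same loop applies unchanged when θ collects all the Wi and bi; only the gradient computation (backpropagation) changes.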
Except that now our ∇θ looks much more nasty:

∇θ =
⎡ ∂L(θ)/∂W1,11 ... ∂L(θ)/∂W1,1n   ∂L(θ)/∂W2,11 ... ∂L(θ)/∂W2,1n   ...   ∂L(θ)/∂WL,11 ... ∂L(θ)/∂WL,1k   ∂L(θ)/∂b11 ... ∂L(θ)/∂bL1 ⎤
⎢ ∂L(θ)/∂W1,21 ... ∂L(θ)/∂W1,2n   ∂L(θ)/∂W2,21 ... ∂L(θ)/∂W2,2n   ...   ∂L(θ)/∂WL,21 ... ∂L(θ)/∂WL,2k   ∂L(θ)/∂b12 ... ∂L(θ)/∂bL2 ⎥
⎢       ⋮                ⋮                ⋮                ⋮         ...         ⋮                ⋮               ⋮             ⋮     ⎥
⎣ ∂L(θ)/∂W1,n1 ... ∂L(θ)/∂W1,nn   ∂L(θ)/∂W2,n1 ... ∂L(θ)/∂W2,nn   ...   ∂L(θ)/∂WL,n1 ... ∂L(θ)/∂WL,nk   ∂L(θ)/∂b1n ... ∂L(θ)/∂bLk ⎦

∇θ is thus composed of
∇W1, ∇W2, ..., ∇WL−1 ∈ R^(n×n), ∇WL ∈ R^(n×k),
∇b1, ∇b2, ..., ∇bL−1 ∈ R^n and ∇bL ∈ R^k
We need to answer two questions
How to choose the loss function L(θ)?
How to compute ∇θ, which is composed of
∇W1, ∇W2, ..., ∇WL−1 ∈ R^(n×n), ∇WL ∈ R^(n×k),
∇b1, ∇b2, ..., ∇bL−1 ∈ R^n and ∇bL ∈ R^k?
Module 4.3: Output Functions and Loss Functions
The choice of loss function depends on the problem at hand
We will illustrate this with the help of two examples
[Figure: a neural network with L − 1 hidden layers mapping movie features xi (isActor Damon, isDirector Nolan, ...) to three outputs (Critics Rating, imdb Rating, RT Rating); yi = {7.5 8.2 7.7}]
Consider our movie example again, but this time we are interested in predicting ratings
Here yi ∈ R^3
The loss function should capture how much ŷi deviates from yi
If yi ∈ R^n then the squared error loss can capture this deviation
L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{3} (ŷij − yij)²
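The squared error loss above can be computed directly. A small illustrative sketch (function and variable names are not from the lecture):

```python
def squared_error(Y_hat, Y):
    # L(theta) = (1/N) * sum_i sum_j (yhat_ij - y_ij)^2
    N = len(Y)
    return sum((yh - y) ** 2
               for yhat_i, y_i in zip(Y_hat, Y)
               for yh, y in zip(yhat_i, y_i)) / N

# one training example: three predicted ratings vs. the true ratings
loss = squared_error([[7.0, 8.0, 8.0]], [[7.5, 8.2, 7.7]])
```

Here the loss is (0.5² + 0.2² + 0.3²)/1 = 0.38; each term penalizes the deviation of one predicted rating.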
A related question: What should the output function ‘O’ be if yi ∈ R?
More specifically, can it be the logistic function?
No, because it restricts ŷi to a value between 0 & 1, but we want ŷi ∈ R
So, in such cases it makes sense to have ‘O’ as a linear function
f(x) = hL = O(aL) = WO aL + bO
ŷi = f(xi) is no longer bounded between 0 and 1
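The point about boundedness is easy to see numerically. A tiny illustrative sketch, where the scalar weights WO and bO are made-up values:

```python
import math

def logistic(a):
    # always confined to the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def linear(a, WO=2.0, bO=1.0):
    # O(aL) = WO * aL + bO: unbounded, suitable when y is in R
    return WO * a + bO
```

However large or small the pre-activation, the logistic output stays strictly inside (0, 1), while the linear output can reach any real value.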
[Figure: a neural network with L − 1 hidden layers classifying an image into Apple, Mango, Orange, Banana; y = [1 0 0 0]]
Now let us consider another problem for which a different loss function would be appropriate
Suppose we want to classify an image into 1 of k classes
Here again we could use the squared error loss to capture the deviation
But can you think of a better function?
19/57
Neural network with
L − 1 hidden layers
Apple Mango Orange Banana
y = [1 0 0 0]
Notice that y is a probability
distribution
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
19/57
Neural network with
L − 1 hidden layers
Apple Mango Orange Banana
y = [1 0 0 0]
Notice that y is a probability
distribution
Therefore we should also ensure that
ˆy is a probability distribution
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
19/57
x1 x2 xn
W1
a1
W2
a2
h1
W3
a3
h2
b1
b2
b3
hL = ˆy = f(x) Notice that y is a probability
distribution
Therefore we should also ensure that
ˆy is a probability distribution
What choice of the output activation
‘O’ will ensure this ?
aL = WLhL−1 + bL
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
19/57
x1 x2 xn
W1
a1
W2
a2
h1
W3
a3
h2
b1
b2
b3
hL = ˆy = f(x) Notice that y is a probability
distribution
Therefore we should also ensure that
ˆy is a probability distribution
What choice of the output activation
‘O’ will ensure this ?
aL = WLhL−1 + bL
ˆyj = O(aL)j =
eaL,j
k
i=1 eaL,i
O(aL)j is the jth element of ˆy and aL,j
is the jth element of the vector aL.
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
19/57
x1 x2 xn
W1
a1
W2
a2
h1
W3
a3
h2
b1
b2
b3
hL = ˆy = f(x) Notice that y is a probability
distribution
Therefore we should also ensure that
ˆy is a probability distribution
What choice of the output activation
‘O’ will ensure this ?
aL = WLhL−1 + bL
ˆyj = O(aL)j =
eaL,j
k
i=1 eaL,i
O(aL)j is the jth element of ˆy and aL,j
is the jth element of the vector aL.
This function is called the softmax
function
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
20/57
Neural network with
L − 1 hidden layers
Apple Mango Orange Banana
y = [1 0 0 0]
Now that we have ensured that both
y & ˆy are probability distributions
can you think of a function which
captures the difference between
them?
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
20/57
Neural network with
L − 1 hidden layers
Apple Mango Orange Banana
y = [1 0 0 0]
Now that we have ensured that both
y & ˆy are probability distributions
can you think of a function which
captures the difference between
them?
Cross-entropy
L (θ) = −
k
c=1
yc log ˆyc
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
20/57
Neural network with
L − 1 hidden layers
Apple Mango Orange Banana
y = [1 0 0 0]
Now that we have ensured that both
y & ˆy are probability distributions
can you think of a function which
captures the difference between
them?
Cross-entropy
L (θ) = −
k
c=1
yc log ˆyc
Notice that
yc = 1 if c = (the true class label)
= 0 otherwise
∴ L (θ) = − log ˆy
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
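With a one-hot y, the cross-entropy sum collapses to −log ŷℓ. A minimal illustrative sketch:

```python
import math

def cross_entropy(y_hat, y):
    # L(theta) = - sum_c y_c * log(yhat_c)
    return -sum(yc * math.log(yhc) for yc, yhc in zip(y, y_hat))

y = [1, 0, 0, 0]              # true class: Apple (one-hot)
y_hat = [0.7, 0.1, 0.1, 0.1]  # predicted distribution
loss = cross_entropy(y_hat, y)  # reduces to -log(0.7)
```

Only the term for the true class ℓ survives, so minimizing the loss pushes ŷℓ towards 1.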
So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function
minimize_θ L(θ) = − log ŷℓ
or maximize_θ −L(θ) = log ŷℓ
But wait!
Is ŷℓ a function of θ = [W1, W2, ..., WL, b1, b2, ..., bL]?
Yes, it is indeed a function of θ:
ŷℓ = [O(W3 g(W2 g(W1 x + b1) + b2) + b3)]ℓ
What does ŷℓ encode?
It is the probability that x belongs to the ℓ-th class (bring it as close to 1)
log ŷℓ is called the log-likelihood of the data
                     Outputs
                     Real Values     Probabilities
Output Activation    Linear          Softmax
Loss Function        Squared Error   Cross Entropy

Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often
For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
Module 4.4: Backpropagation (Intuition)
[Figure: the network with inputs x1, x2, ..., xd, with the weight W112 (connecting input x2 to the first neuron of the first hidden layer) highlighted]
Let us focus on this one weight (W112).
To learn this weight using SGD we need a formula for ∂L(θ)/∂W112.
We will see how to calculate this.
First let us take the simple case when we have a deep but thin network:
x1 → a11 → h11 → a21 → h21 → aL1 → ŷ = f(x) → L(θ)
with one weight per layer (W111, W211, ..., WL11).
In this case it is easy to find the derivative by chain rule.
∂L(θ)/∂W111 = (∂L(θ)/∂ŷ)(∂ŷ/∂aL1)(∂aL1/∂h21)(∂h21/∂a21)(∂a21/∂h11)(∂h11/∂a11)(∂a11/∂W111)
∂L(θ)/∂W111 = (∂L(θ)/∂h11)(∂h11/∂W111)   (just compressing the chain rule)
∂L(θ)/∂W211 = (∂L(θ)/∂h21)(∂h21/∂W211)
∂L(θ)/∂WL11 = (∂L(θ)/∂aL1)(∂aL1/∂WL11)
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
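These products of local derivatives can be checked mechanically. Below is a small sketch (my own illustration; a logistic activation and a squared-error loss are assumed for concreteness, not taken from the lecture) that compares the chain-rule product for ∂L/∂W_{111} in a one-neuron-per-layer network with L = 3 against a numerical derivative:

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic activation (assumed)

def forward(w1, w2, w3, x, y):
    a1 = w1 * x;  h1 = sigma(a1)
    a2 = w2 * h1; h2 = sigma(a2)
    aL = w3 * h2; yhat = sigma(aL)
    return a1, h1, a2, h2, aL, yhat, (yhat - y) ** 2   # squared-error loss (assumed)

w1, w2, w3, x, y = 0.5, -0.3, 0.8, 1.5, 1.0
a1, h1, a2, h2, aL, yhat, loss = forward(w1, w2, w3, x, y)

# Chain rule: dL/dW111 = dL/dyhat * dyhat/daL * daL/dh2 * dh2/da2 * da2/dh1 * dh1/da1 * da1/dW111
dL_dyhat  = 2 * (yhat - y)
dyhat_daL = yhat * (1 - yhat)          # sigma'(aL)
daL_dh2   = w3
dh2_da2   = h2 * (1 - h2)
da2_dh1   = w2
dh1_da1   = h1 * (1 - h1)
da1_dw1   = x
grad_w1 = dL_dyhat * dyhat_daL * daL_dh2 * dh2_da2 * da2_dh1 * dh1_da1 * da1_dw1

# Numerical check with a central difference
eps = 1e-6
num = (forward(w1 + eps, w2, w3, x, y)[-1] - forward(w1 - eps, w2, w3, x, y)[-1]) / (2 * eps)
```

The analytic product and the numerical estimate agree, which is the point of the slide: in a thin network the derivative is just one long product of local derivatives.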
27/57
Let us see an intuitive explanation of backpropagation before we get into the
mathematical details
28/57
We get a certain loss at the output and we try to
figure out who is responsible for this loss.
So, we talk to the output layer and say “Hey! You
are not producing the desired output, better take
responsibility”.
The output layer says “Well, I take responsibility
for my part but please understand that I am only
as good as the hidden layer and weights below
me”. After all . . .

f(x) = ŷ = O(W_L h_{L−1} + b_L)

[Figure: a feedforward network with inputs x_1, x_2, …, x_n, weights W_1, W_2, W_3, biases b_1, b_2, b_3, pre-activations a_1, a_2, a_3, hidden layers h_1, h_2 and output loss −log ŷ_ℓ.]
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
29/57
So, we talk to W_L, b_L and h_L and ask them “What is
wrong with you?”
W_L and b_L take full responsibility but h_L says “Well,
please understand that I am only as good as the
pre-activation layer”.
The pre-activation layer in turn says that it is only as
good as the hidden layer and weights below it.
We continue in this manner and realize that the
responsibility lies with all the weights and biases (i.e.
all the parameters of the model).
But instead of talking to them directly, it is easier to
talk to them through the hidden layers and output
layers (and this is exactly what the chain rule allows
us to do):

∂L(θ)/∂W_{111}   [talk to the weight directly]
= (∂L(θ)/∂ŷ) (∂ŷ/∂a_3)   [talk to the output layer]
× (∂a_3/∂h_2) (∂h_2/∂a_2)   [talk to the previous hidden layer]
× (∂a_2/∂h_1) (∂h_1/∂a_1)   [talk to the previous hidden layer]
× (∂a_1/∂W_{111})   [and now talk to the weights]
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
30/57
Quantities of interest (roadmap for the remaining part):
Gradient w.r.t. output units
Gradient w.r.t. hidden units
Gradient w.r.t. weights and biases

∂L(θ)/∂W_{111}   [talk to the weight directly]
= (∂L(θ)/∂ŷ) (∂ŷ/∂a_3)   [talk to the output layer]
× (∂a_3/∂h_2) (∂h_2/∂a_2)   [talk to the previous hidden layer]
× (∂a_2/∂h_1) (∂h_1/∂a_1)   [talk to the previous hidden layer]
× (∂a_1/∂W_{111})   [and now talk to the weights]

Our focus is on Cross entropy loss and Softmax output.
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
31/57
Module 4.5: Backpropagation: Computing Gradients
w.r.t. the Output Units
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
33/57
Let us first consider the partial derivative
w.r.t. the i-th output.

L(θ) = −log ŷ_ℓ   (ℓ = true class label)

∂/∂ŷ_i (L(θ)) = ∂/∂ŷ_i (−log ŷ_ℓ)
= −1/ŷ_ℓ   if i = ℓ
= 0   otherwise

More compactly,

∂/∂ŷ_i (L(θ)) = −1_{(i=ℓ)} / ŷ_ℓ
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
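A quick numerical sanity check of this formula (my own sketch, with made-up numbers): the gradient of −log ŷ_ℓ w.r.t. ŷ is −1/ŷ_ℓ at index ℓ and 0 everywhere else.

```python
import numpy as np

yhat = np.array([0.1, 0.7, 0.2])   # some softmax output, k = 3 (made-up values)
ell = 1                            # true class label

# dL/dyhat_i = -1(i = ell) / yhat_ell
grad = np.zeros_like(yhat)
grad[ell] = -1.0 / yhat[ell]

# Check every component against a central difference of L(yhat) = -log(yhat_ell)
eps = 1e-6
for i in range(len(yhat)):
    d = np.zeros_like(yhat); d[i] = eps
    num = (-np.log((yhat + d)[ell]) + np.log((yhat - d)[ell])) / (2 * eps)
    assert abs(num - grad[i]) < 1e-6   # matches the analytic formula
```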
34/57
∂/∂ŷ_i (L(θ)) = −1_{(ℓ=i)} / ŷ_ℓ

We can now talk about the gradient
w.r.t. the vector ŷ:

∇_ŷ L(θ) = [ ∂L(θ)/∂ŷ_1, …, ∂L(θ)/∂ŷ_k ]ᵀ
= −(1/ŷ_ℓ) [ 1_{(ℓ=1)}, 1_{(ℓ=2)}, …, 1_{(ℓ=k)} ]ᵀ
= −(1/ŷ_ℓ) e(ℓ)

where e(ℓ) is a k-dimensional vector
whose ℓ-th element is 1 and all other
elements are 0.
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
35/57
What we are actually interested in is

∂L(θ)/∂a_{Li} = ∂(−log ŷ_ℓ)/∂a_{Li}
= (∂(−log ŷ_ℓ)/∂ŷ_ℓ) (∂ŷ_ℓ/∂a_{Li})

Does ŷ_ℓ depend on a_{Li}? Indeed, it does:

ŷ_ℓ = exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′})

Having established this, we will now
derive the full expression on the next
slide.
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
36/57
∂/∂a_{Li} (−log ŷ_ℓ)
= −(1/ŷ_ℓ) ∂ŷ_ℓ/∂a_{Li}
= −(1/ŷ_ℓ) ∂/∂a_{Li} softmax(a_L)_ℓ
= −(1/ŷ_ℓ) ∂/∂a_{Li} [ exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′}) ]
= −(1/ŷ_ℓ) [ (∂/∂a_{Li} exp(a_{Lℓ})) / Σ_{i′} exp(a_{Li′}) − exp(a_{Lℓ}) (∂/∂a_{Li} Σ_{i′} exp(a_{Li′})) / (Σ_{i′} exp(a_{Li′}))² ]
= −(1/ŷ_ℓ) [ 1_{(ℓ=i)} exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′}) − (exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′})) (exp(a_{Li}) / Σ_{i′} exp(a_{Li′})) ]
= −(1/ŷ_ℓ) [ 1_{(ℓ=i)} softmax(a_L)_ℓ − softmax(a_L)_ℓ softmax(a_L)_i ]
= −(1/ŷ_ℓ) [ 1_{(ℓ=i)} ŷ_ℓ − ŷ_ℓ ŷ_i ]
= −(1_{(ℓ=i)} − ŷ_i)

(using the quotient rule:
∂/∂x [ g(x)/h(x) ] = (∂g(x)/∂x) (1/h(x)) − (g(x)/h(x)²) (∂h(x)/∂x))
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
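The end result of this derivation — the gradient of the cross-entropy loss w.r.t. the pre-activations is ŷ − e(ℓ) — can be verified with finite differences. A sketch (my own, with made-up logits):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # shift by the max for numerical stability
    return e / e.sum()

aL = np.array([1.0, -0.5, 2.0, 0.3])   # made-up pre-activations, k = 4
ell = 2                                # true class
yhat = softmax(aL)

# Derived result: dL/da_Li = -(1(ell = i) - yhat_i), i.e. grad = yhat - e(ell)
e_ell = np.zeros_like(aL); e_ell[ell] = 1.0
grad = yhat - e_ell

# Numerical check on L(aL) = -log softmax(aL)_ell, one coordinate at a time
eps = 1e-6
num = np.zeros_like(aL)
for i in range(len(aL)):
    d = np.zeros_like(aL); d[i] = eps
    num[i] = (-np.log(softmax(aL + d)[ell]) + np.log(softmax(aL - d)[ell])) / (2 * eps)
```

This closed form is why the softmax output and cross-entropy loss are usually fused: no Jacobian of the softmax ever needs to be materialized.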
37/57
So far we have derived the partial derivative w.r.t.
the i-th element of a_L:

∂L(θ)/∂a_{Li} = −(1_{(ℓ=i)} − ŷ_i)

We can now write the gradient w.r.t. the vector a_L:

∇_{a_L} L(θ) = [ ∂L(θ)/∂a_{L1}, …, ∂L(θ)/∂a_{Lk} ]ᵀ
= [ −(1_{(ℓ=1)} − ŷ_1), −(1_{(ℓ=2)} − ŷ_2), …, −(1_{(ℓ=k)} − ŷ_k) ]ᵀ
= −(e(ℓ) − ŷ)
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
38/57
Module 4.6: Backpropagation: Computing Gradients
w.r.t. Hidden Units
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
40/57
Chain rule along multiple paths: if a
function p(z) can be written as a function of
intermediate results q_m(z), then we have:

∂p(z)/∂z = Σ_m (∂p(z)/∂q_m(z)) (∂q_m(z)/∂z)

In our case:
p(z) is the loss function L(θ)
z = h_{ij}
q_m(z) = a_{i+1,m}
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
41/57
(Intentionally left blank)
42/57
∂L(θ)/∂h_{ij} = Σ_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) (∂a_{i+1,m}/∂h_{ij})
= Σ_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) W_{i+1,m,j}   (since a_{i+1} = W_{i+1} h_i + b_{i+1})

Now consider these two vectors,

∇_{a_{i+1}} L(θ) = [ ∂L(θ)/∂a_{i+1,1}, …, ∂L(θ)/∂a_{i+1,k} ]ᵀ ;  W_{i+1,·,j} = [ W_{i+1,1,j}, …, W_{i+1,k,j} ]ᵀ

W_{i+1,·,j} is the j-th column of W_{i+1}; see that,

(W_{i+1,·,j})ᵀ ∇_{a_{i+1}} L(θ) = Σ_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) W_{i+1,m,j}
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
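In code, the sum above is exactly a dot product between the j-th column of W_{i+1} and ∇_{a_{i+1}} L(θ). A tiny sketch (my own, with made-up shapes and a stand-in gradient vector):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3
W_next = rng.standard_normal((k, n))   # W_{i+1}: maps h_i (n-dim) to a_{i+1} (k-dim)
grad_a_next = rng.standard_normal(k)   # stand-in for grad of L w.r.t. a_{i+1}

j = 1
# Sum form: sum_m dL/da_{i+1,m} * W_{i+1,m,j}
sum_form = sum(grad_a_next[m] * W_next[m, j] for m in range(k))
# Dot-product form: (W_{i+1,.,j})^T grad_{a_{i+1}} L
dot_form = W_next[:, j] @ grad_a_next
```

The two forms are numerically identical, which is what lets the next slide stack them into a single matrix-vector product.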
43/57
We have,
∂L (θ)
∂hij
= (Wi+1,.,j)T
ai+1 L (θ)
x1 x2 xn
− log ˆy
W1
a1
W2
a2
h1
W3
a3
h2
b1
b2
b3
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
43/57
We have,
∂L (θ)
∂hij
= (Wi+1,.,j)T
ai+1 L (θ)
We can now write the gradient w.r.t. hi
hi
L (θ)
x1 x2 xn
− log ˆy
W1
a1
W2
a2
h1
W3
a3
h2
b1
b2
b3
Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
43/57
We have,
∂L (θ)
∂hij
= (Wi+1,.,j)T
ai+1 L (θ)
We can now write the gradient w.r.t. hi
hi
L (θ) =












=










[Figure: feedforward network with inputs x1 … xn, pre-activations a1, a2, a3, hidden layers h1, h2, weights W1, W2, W3, biases b1, b2, b3, and loss −log ŷ]

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4

43/57
We have,

∂L(θ)/∂h_{ij} = (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

We can now write the gradient w.r.t. h_i:

∇_{h_i} L(θ) = [ ∂L(θ)/∂h_{i1}, ∂L(θ)/∂h_{i2}, …, ∂L(θ)/∂h_{in} ]^T
             = [ (W_{i+1,·,1})^T ∇_{a_{i+1}} L(θ), (W_{i+1,·,2})^T ∇_{a_{i+1}} L(θ), …, (W_{i+1,·,n})^T ∇_{a_{i+1}} L(θ) ]^T
             = (W_{i+1})^T ( ∇_{a_{i+1}} L(θ) )

We are almost done, except that we do not know how to calculate ∇_{a_{i+1}} L(θ) for i < L − 1. We will see how to compute that.
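In code, this vector equation is a single matrix–vector product with the transposed weights of the layer above. A minimal numpy sketch (shapes and variable names are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

W_next = rng.standard_normal((n, n))   # W_{i+1}: weights of the layer above
grad_a_next = rng.standard_normal(n)   # already-computed gradient w.r.t. a_{i+1}

# nabla_{h_i} L(theta) = (W_{i+1})^T (nabla_{a_{i+1}} L(theta))
grad_h = W_next.T @ grad_a_next

# Entry j is (W_{i+1,.,j})^T nabla_{a_{i+1}} L(theta): a dot product
# with the j-th *column* of W_{i+1}
assert np.allclose(grad_h[2], W_next[:, 2] @ grad_a_next)
```

The column-wise scalar equation and the matrix form are the same computation; the transpose is what turns columns of W_{i+1} into rows.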
44/57
∇_{a_i} L(θ) = [ ∂L(θ)/∂a_{i1}, …, ∂L(θ)/∂a_{in} ]^T

∂L(θ)/∂a_{ij} = ∂L(θ)/∂h_{ij} · ∂h_{ij}/∂a_{ij}
              = ∂L(θ)/∂h_{ij} · g′(a_{ij})    [∵ h_{ij} = g(a_{ij})]

∇_{a_i} L(θ) = [ ∂L(θ)/∂h_{i1} g′(a_{i1}), …, ∂L(θ)/∂h_{in} g′(a_{in}) ]^T
             = ∇_{h_i} L(θ) ⊙ [ …, g′(a_{ik}), … ]
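Because the activation acts elementwise, the chain rule here is a Hadamard (elementwise) product, not a matrix product. A short numpy sketch, assuming a logistic g (the choice of g is an assumption for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
a_i = rng.standard_normal(4)       # pre-activations of layer i
grad_h = rng.standard_normal(4)    # gradient w.r.t. h_i (already computed)

# h_ij = g(a_ij), so dh_ij/da_ij = g'(a_ij); the chain rule is elementwise
g_prime = sigmoid(a_i) * (1.0 - sigmoid(a_i))   # g'(z) for the logistic g
grad_a = grad_h * g_prime                       # Hadamard product
```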
45/57
Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters
46/57
Quantities of interest (roadmap for the remaining part):
Gradient w.r.t. output units
Gradient w.r.t. hidden units
Gradient w.r.t. weights and biases

∂L(θ)/∂W_{111} = ∂L(θ)/∂ŷ · ∂ŷ/∂a_3 · ∂a_3/∂h_2 · ∂h_2/∂a_2 · ∂a_2/∂h_1 · ∂h_1/∂a_1 · ∂a_1/∂W_{111}
(talk to the weight directly → talk to the output layer → talk to the previous hidden layer → talk to the previous hidden layer → and now talk to the weights)

Our focus is on cross-entropy loss and softmax output.
47/57
Recall that,

a_k = b_k + W_k h_{k−1}

∂a_{ki}/∂W_{kij} = h_{k−1,j}

∂L(θ)/∂W_{kij} = ∂L(θ)/∂a_{ki} · ∂a_{ki}/∂W_{kij} = ∂L(θ)/∂a_{ki} · h_{k−1,j}

∇_{W_k} L(θ) = [ ∂L(θ)/∂W_{k11}  ∂L(θ)/∂W_{k12}  …  ∂L(θ)/∂W_{k1n}
                 …               …               …  …
                 …               …               …  ∂L(θ)/∂W_{knn} ]
48/57
Intentionally left blank
49/57
Let's take a simple example of W_k ∈ R^{3×3} and see what each entry looks like:

∇_{W_k} L(θ) = [ ∂L(θ)/∂W_{k11}  ∂L(θ)/∂W_{k12}  ∂L(θ)/∂W_{k13}
                 ∂L(θ)/∂W_{k21}  ∂L(θ)/∂W_{k22}  ∂L(θ)/∂W_{k23}
                 ∂L(θ)/∂W_{k31}  ∂L(θ)/∂W_{k32}  ∂L(θ)/∂W_{k33} ],   where ∂L(θ)/∂W_{kij} = ∂L(θ)/∂a_{ki} · ∂a_{ki}/∂W_{kij}

∇_{W_k} L(θ) = [ ∂L(θ)/∂a_{k1} h_{k−1,1}  ∂L(θ)/∂a_{k1} h_{k−1,2}  ∂L(θ)/∂a_{k1} h_{k−1,3}
                 ∂L(θ)/∂a_{k2} h_{k−1,1}  ∂L(θ)/∂a_{k2} h_{k−1,2}  ∂L(θ)/∂a_{k2} h_{k−1,3}
                 ∂L(θ)/∂a_{k3} h_{k−1,1}  ∂L(θ)/∂a_{k3} h_{k−1,2}  ∂L(θ)/∂a_{k3} h_{k−1,3} ]
             = ∇_{a_k} L(θ) · h_{k−1}^T
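The 3×3 matrix above is exactly an outer product of the pre-activation gradient with the previous layer's activations. A minimal numpy sketch (sizes are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
grad_a = rng.standard_normal(3)    # nabla_{a_k} L(theta)
h_prev = rng.standard_normal(3)    # h_{k-1}

# nabla_{W_k} L(theta) = nabla_{a_k} L(theta) . h_{k-1}^T  (outer product)
grad_W = np.outer(grad_a, h_prev)

# Entry (i, j) is dL/da_ki * h_{k-1,j}, matching the 3x3 matrix entry-by-entry
assert np.isclose(grad_W[1, 2], grad_a[1] * h_prev[2])
```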
50/57
Finally, coming to the biases:

a_{ki} = b_{ki} + Σ_j W_{kij} h_{k−1,j}

∂L(θ)/∂b_{ki} = ∂L(θ)/∂a_{ki} · ∂a_{ki}/∂b_{ki} = ∂L(θ)/∂a_{ki}

We can now write the gradient w.r.t. the vector b_k:

∇_{b_k} L(θ) = [ ∂L(θ)/∂a_{k1}, ∂L(θ)/∂a_{k2}, …, ∂L(θ)/∂a_{kn} ]^T = ∇_{a_k} L(θ)
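The bias gradient equals the pre-activation gradient because each a_ki depends on b_ki with coefficient 1. A quick finite-difference sketch confirms that the Jacobian ∂a_k/∂b_k is the identity (sizes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 3))
h_prev = rng.standard_normal(3)
b = rng.standard_normal(3)

def a_of_b(b):
    # a_k = b_k + W_k h_{k-1}
    return b + W @ h_prev

# Check da_kj / db_ki = 1 if i == j else 0, by finite differences
eps = 1e-6
for i in range(3):
    db = np.zeros(3); db[i] = eps
    jac_col = (a_of_b(b + db) - a_of_b(b)) / eps
    assert np.allclose(jac_col, np.eye(3)[i], atol=1e-5)
```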
51/57
Module 4.8: Backpropagation: Pseudo code
52/57
Finally, we have all the pieces of the puzzle:

∇_{a_L} L(θ)                  (gradient w.r.t. output layer)
∇_{h_k} L(θ), ∇_{a_k} L(θ)    (gradient w.r.t. hidden layers, 1 ≤ k < L)
∇_{W_k} L(θ), ∇_{b_k} L(θ)    (gradient w.r.t. weights and biases, 1 ≤ k ≤ L)

We can now write the full learning algorithm.
53/57
Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ_0 = [W^0_1, …, W^0_L, b^0_1, …, b^0_L];
  while t++ < max_iterations do
    h_1, h_2, …, h_{L−1}, a_1, a_2, …, a_L, ŷ = forward_propagation(θ_t);
    ∇θ_t = backward_propagation(h_1, …, h_{L−1}, a_1, …, a_L, y, ŷ);
    θ_{t+1} ← θ_t − η ∇θ_t;
  end
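The loop structure is independent of how the gradient is obtained. A minimal sketch of the same update rule on a toy problem whose gradient is known in closed form (the quadratic loss, target, and step size are illustrative assumptions standing in for forward/backward propagation):

```python
import numpy as np

# L(theta) = ||theta - target||^2 is minimized at theta = target
target = np.array([1.0, -2.0, 0.5])

def gradient(theta):
    # stands in for backward_propagation: grad of the toy loss
    return 2.0 * (theta - target)

theta = np.zeros(3)   # stands in for the initialized theta_0
eta = 0.1             # learning rate
for t in range(1000): # max_iterations
    theta = theta - eta * gradient(theta)   # theta_{t+1} = theta_t - eta * grad
```

Each step shrinks the error by a constant factor (1 − 2η), so the iterate converges to the target.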
54/57
Algorithm: forward_propagation(θ)
  for k = 1 to L − 1 do
    a_k = b_k + W_k h_{k−1};
    h_k = g(a_k);
  end
  a_L = b_L + W_L h_{L−1};
  ŷ = O(a_L);
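The forward pass translates almost line-for-line into numpy. A hedged sketch assuming g = logistic and O = softmax (consistent with the lecture's cross-entropy/softmax focus; the layer sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def forward_propagation(x, Ws, bs):
    """Ws, bs hold W_1..W_L and b_1..b_L; returns all h's, a's, and y_hat."""
    hs, As, h = [x], [], x
    for W, b in zip(Ws[:-1], bs[:-1]):   # k = 1 .. L-1
        a = b + W @ h
        h = sigmoid(a)                   # h_k = g(a_k)
        As.append(a); hs.append(h)
    As.append(bs[-1] + Ws[-1] @ h)       # a_L = b_L + W_L h_{L-1}
    return hs, As, softmax(As[-1])       # y_hat = O(a_L)

rng = np.random.default_rng(4)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(2)]
hs, As, y_hat = forward_propagation(rng.standard_normal(3), Ws, bs)
```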
55/57
Just do a forward propagation and compute all h_i's, a_i's, and ŷ.

Algorithm: back_propagation(h_1, h_2, …, h_{L−1}, a_1, a_2, …, a_L, y, ŷ)
  // Compute output gradient;
  ∇_{a_L} L(θ) = −(e(y) − ŷ);
  for k = L to 1 do
    // Compute gradients w.r.t. parameters;
    ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}^T;
    ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
    // Compute gradients w.r.t. layer below;
    ∇_{h_{k−1}} L(θ) = W_k^T (∇_{a_k} L(θ));
    // Compute gradients w.r.t. layer below (pre-activation);
    ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ [ …, g′(a_{k−1,j}), … ];
  end
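The pseudocode above can be sketched in numpy for a small two-layer network with logistic g and softmax output — a hedged illustration, not the lecture's reference code. A finite-difference check on one weight confirms the analytic gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, Ws, bs):
    hs, As, h = [x], [], x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = b + W @ h; h = sigmoid(a)
        As.append(a); hs.append(h)
    As.append(bs[-1] + Ws[-1] @ h)
    return hs, As, softmax(As[-1])

def backward(hs, As, y, y_hat, Ws):
    """Gradients of L = -log y_hat[y] w.r.t. each W_k and b_k."""
    e_y = np.zeros_like(y_hat); e_y[y] = 1.0
    grad_a = -(e_y - y_hat)                      # output gradient
    gWs, gbs = [], []
    for k in range(len(Ws) - 1, -1, -1):         # k = L down to 1
        gWs.append(np.outer(grad_a, hs[k]))      # grad_a . h_{k-1}^T
        gbs.append(grad_a)                       # bias gradient = grad_a
        if k > 0:
            grad_h = Ws[k].T @ grad_a            # W_k^T (grad w.r.t. a_k)
            s = sigmoid(As[k - 1])
            grad_a = grad_h * s * (1.0 - s)      # elementwise chain rule
    return gWs[::-1], gbs[::-1]

rng = np.random.default_rng(5)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(2)]
x, y = rng.standard_normal(3), 1

hs, As, y_hat = forward(x, Ws, bs)
gWs, gbs = backward(hs, As, y, y_hat, Ws)

# Finite-difference check on one weight entry
eps = 1e-6
Ws[0][1, 2] += eps
num = (-np.log(forward(x, Ws, bs)[2][y]) - (-np.log(y_hat[y]))) / eps
assert abs(num - gWs[0][1, 2]) < 1e-3
```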
56/57
Module 4.9: Derivative of the activation function
57/57
Now, the only thing we need to figure out is how to compute g′.

Logistic function:

g(z) = σ(z) = 1 / (1 + e^{−z})

g′(z) = (−1) · 1/(1 + e^{−z})² · d/dz (1 + e^{−z})
      = (−1) · 1/(1 + e^{−z})² · (−e^{−z})
      = e^{−z} / (1 + e^{−z})²
      = g(z) (1 − g(z))
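The closed form g′(z) = g(z)(1 − g(z)) is easy to sanity-check against a numerical derivative. A short numpy sketch (the grid and tolerance are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # g'(z) = e^{-z} / (1 + e^{-z})^2, which simplifies to g(z)(1 - g(z))
    return sigmoid(z) * (1.0 - sigmoid(z))

z = np.linspace(-5, 5, 101)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
assert np.allclose(numeric, sigmoid_prime(z), atol=1e-8)
```

Expressing g′ in terms of g itself is what makes backpropagation cheap: the forward-pass activations can be reused to compute the derivative.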
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 

Último (20)

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 

deep learning

  • 4/57 [figure: layered network with inputs x1 … xn, pre-activations a1, a2, a3, activations h1, h2, and output h_L = ŷ = f(x), connected by weights W1, W2, W3 and biases b1, b2, b3] The input to the network is an n-dimensional vector. The network contains L − 1 hidden layers (2, in this case) having n neurons each. Finally, there is one output layer containing k neurons (say, corresponding to k classes). Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation (a_i and h_i are vectors). The input layer can be called the 0-th layer and the output layer the L-th layer. W_i ∈ R^(n×n) and b_i ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L). W_L ∈ R^(n×k) and b_L ∈ R^k are the weight and bias between the last hidden layer and the output layer (L = 3 in this case).
  • 5/57 The pre-activation at layer i is given by a_i(x) = b_i + W_i h_{i−1}(x). The activation at layer i is given by h_i(x) = g(a_i(x)), where g is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by f(x) = h_L(x) = O(a_L(x)), where O is the output activation function (for example, softmax, linear, etc.). To simplify notation we will refer to a_i(x) as a_i and h_i(x) as h_i.
  • 6/57 In this simplified notation: the pre-activation at layer i is a_i = b_i + W_i h_{i−1}; the activation at layer i is h_i = g(a_i), where g is the activation function (for example, logistic, tanh, linear, etc.); and the activation at the output layer is f(x) = h_L = O(a_L), where O is the output activation function (for example, softmax, linear, etc.).
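The recurrence above (a_i = b_i + W_i h_{i−1}, h_i = g(a_i), f(x) = O(a_L)) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the lecture: the shapes, random seed, and function names are assumptions, and the identity output activation stands in for O.

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh, O=lambda a: a):
    """One forward pass through an L-layer feedforward network.

    weights[i] and biases[i] connect layer i to layer i+1; g is the
    hidden activation and O the output activation (identity here).
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h   # pre-activation: a_i = b_i + W_i h_{i-1}
        h = g(a)        # activation:     h_i = g(a_i)
    a_L = biases[-1] + weights[-1] @ h
    return O(a_L)       # f(x) = h_L = O(a_L)

# Tiny example: n = 3 inputs, two hidden layers of 3 neurons, k = 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)), rng.standard_normal((3, 3)),
           rng.standard_normal((2, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(3),
          rng.standard_normal(2)]
y_hat = forward(rng.standard_normal(3), weights, biases)
```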
  • 7/57 Data: {x_i, y_i}, i = 1, …, N. Model: ŷ_i = f(x_i) = O(W3 g(W2 g(W1 x + b1) + b2) + b3). Parameters: θ = W1, …, WL, b1, b2, …, bL (L = 3). Algorithm: gradient descent with backpropagation (we will see soon). Objective/loss/error function: say, min (1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} (ŷ_ij − y_ij)². In general, min L(θ), where L(θ) is some function of the parameters.
  • 8/57 Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)
  • 9/57 The story so far: we have introduced feedforward neural networks, and we are now interested in finding an algorithm for learning the parameters of this model.
  • 10/57 Recall our gradient descent algorithm. We can write it more concisely as:
    Algorithm: gradient_descent()
      t ← 0; max_iterations ← 1000;
      initialize θ_0 = [W_1^0, …, W_L^0, b_1^0, …, b_L^0];
      while t++ < max_iterations do
        θ_{t+1} ← θ_t − η ∇θ_t;
      end
    where ∇θ_t = [∂L(θ)/∂W_{1,t}, …, ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, …, ∂L(θ)/∂b_{L,t}]^T. Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W1, W2, …, WL, b1, b2, …, bL]. We can still use the same algorithm for learning the parameters of our model.
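The update rule θ_{t+1} ← θ_t − η ∇θ_t from the algorithm above can be sketched as a minimal loop. The quadratic toy objective below is an illustrative stand-in for L(θ), not the network's loss:

```python
import numpy as np

def gradient_descent(theta0, grad, eta=0.1, max_iterations=1000):
    """theta_{t+1} = theta_t - eta * grad(theta_t), as on the slide.

    theta is a flat parameter vector; grad returns dL/dtheta at theta.
    """
    theta = theta0
    for _ in range(max_iterations):
        theta = theta - eta * grad(theta)
    return theta

# Toy objective L(theta) = ||theta||^2, whose gradient is 2 * theta;
# gradient descent drives theta towards the minimiser at the origin.
theta_star = gradient_descent(np.array([1.0, -2.0]), lambda th: 2 * th)
```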
  • 11/57 Except that now our ∇θ looks much more nasty: it stacks the partial derivative of L(θ) with respect to every entry of every weight matrix and every bias vector,
    ∇θ = [∂L(θ)/∂W_{1,11}, …, ∂L(θ)/∂W_{1,nn}, ∂L(θ)/∂W_{2,11}, …, ∂L(θ)/∂W_{2,nn}, …, ∂L(θ)/∂W_{L,11}, …, ∂L(θ)/∂W_{L,nk}, ∂L(θ)/∂b_{11}, …, ∂L(θ)/∂b_{Lk}]^T.
    ∇θ is thus composed of ∇W1, ∇W2, …, ∇W_{L−1} ∈ R^(n×n), ∇W_L ∈ R^(n×k), ∇b1, ∇b2, …, ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k.
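As a sanity check on the size of this stacked vector, one can count its entries for small n, k, L. The concrete numbers below are illustrative, not from the lecture:

```python
# Entries of the flattened gradient (= number of parameters):
# L - 1 hidden weight matrices of size n x n with n-dimensional biases,
# plus one output weight matrix of size n x k with a k-dimensional bias.
n, k, L = 4, 3, 3
num_params = (L - 1) * (n * n + n) + (n * k + k)
```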
  • 12/57 We need to answer two questions. How to choose the loss function L(θ)? How to compute ∇θ, which is composed of ∇W1, ∇W2, …, ∇W_{L−1} ∈ R^(n×n), ∇W_L ∈ R^(n×k), ∇b1, ∇b2, …, ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?
  • 13/57 Module 4.3: Output Functions and Loss Functions
  • 14/57 We need to answer two questions. How to choose the loss function L(θ)? How to compute ∇θ, which is composed of ∇W1, ∇W2, …, ∇W_{L−1} ∈ R^(n×n), ∇W_L ∈ R^(n×k), ∇b1, ∇b2, …, ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?
  • 15/57 [figure: neural network with L − 1 hidden layers mapping movie features (isActor Damon, isDirector Nolan, …) to Critics, IMDb and RT ratings; y_i = {7.5, 8.2, 7.7}] The choice of loss function depends on the problem at hand. We will illustrate this with the help of two examples. Consider our movie example again, but this time we are interested in predicting ratings. Here y_i ∈ R^3. The loss function should capture how much ŷ_i deviates from y_i. If y_i ∈ R^n then the squared error loss can capture this deviation: L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{3} (ŷ_ij − y_ij)².
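The squared error loss above can be computed directly. A minimal sketch; the true ratings come from the slide's example, while the predictions are illustrative:

```python
import numpy as np

def squared_error(y_hat, y):
    """L(theta) = (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2."""
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

# One movie (N = 1): true ratings from the slide vs. a hypothetical prediction
y = np.array([[7.5, 8.2, 7.7]])
y_hat = np.array([[7.0, 8.0, 8.0]])
loss = squared_error(y_hat, y)
```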
  • 16/57 A related question: What should the output function ‘O’ be if y_i ∈ R? More specifically, can it be the logistic function? No, because it restricts ŷ_i to a value between 0 and 1, but we want ŷ_i ∈ R. So, in such cases it makes sense to have ‘O’ as a linear function: f(x) = h_L = O(a_L) = W_O a_L + b_O. Then ŷ_i = f(x_i) is no longer bounded between 0 and 1.
  • 17/57 (Intentionally left blank)
• 74. 18/57 [Figure: neural network with $L-1$ hidden layers classifying an image as Apple/Mango/Orange/Banana, with true label $y = [1\ 0\ 0\ 0]$] Now let us consider another problem for which a different loss function would be appropriate. Suppose we want to classify an image into one of $k$ classes. Here again we could use the squared error loss to capture the deviation. But can you think of a better function?
• 79. 19/57 Notice that $y$ is a probability distribution, so we should also ensure that $\hat{y}$ is a probability distribution. What choice of the output activation 'O' will ensure this? With $a_L = W_L h_{L-1} + b_L$, take

$\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$

where $O(a_L)_j$ is the $j$-th element of $\hat{y}$ and $a_{L,j}$ is the $j$-th element of the vector $a_L$. This function is called the softmax function.
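As a quick sanity check, the softmax function above can be sketched in a few lines of pure Python (the helper name `softmax` and the max-subtraction trick are ours, not from the slides):

```python
import math

def softmax(a):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the ratio, so the result is unchanged.
    m = max(a)
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# All entries are positive and sum to 1, i.e. a valid probability distribution.
```

Note that larger pre-activations get larger probabilities, but every class keeps a non-zero share of the mass.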
• 82. 20/57 Now that we have ensured that both $y$ and $\hat{y}$ are probability distributions, can you think of a function which captures the difference between them? Cross-entropy:

$\mathscr{L}(\theta) = -\sum_{c=1}^{k} y_c \log \hat{y}_c$

Notice that $y_c = 1$ if $c = \ell$ (the true class label) and $y_c = 0$ otherwise, so $\mathscr{L}(\theta) = -\log \hat{y}_\ell$.
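The cross-entropy loss above collapses to $-\log \hat{y}_\ell$ for a one-hot label, which a short sketch makes concrete (function name `cross_entropy` is ours):

```python
import math

def cross_entropy(y, y_hat):
    # y is a one-hot label vector, so only the true-class term survives:
    # L = -sum_c y_c * log(y_hat_c) = -log(y_hat_true)
    return -sum(yc * math.log(pc) for yc, pc in zip(y, y_hat) if yc > 0)

loss = cross_entropy([1, 0, 0, 0], [0.7, 0.1, 0.1, 0.1])
# equals -log(0.7): the loss only depends on the probability
# assigned to the true class.
```

Driving the true-class probability toward 1 drives the loss toward 0, which is exactly the minimization objective on the next slide.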
• 88. 21/57 So, for a classification problem (where we have to choose one of $k$ classes), we use the following objective function: minimize over $\theta$ the loss $\mathscr{L}(\theta) = -\log \hat{y}_\ell$, or equivalently maximize $-\mathscr{L}(\theta) = \log \hat{y}_\ell$. But wait! Is $\hat{y}_\ell$ a function of $\theta = [W_1, W_2, \ldots, W_L, b_1, b_2, \ldots, b_L]$? Yes, it is indeed a function of $\theta$:

$\hat{y}_\ell = [O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)]_\ell$

What does $\hat{y}_\ell$ encode? It is the probability that $x$ belongs to the $\ell$-th class (we want to bring it as close to 1 as possible). $\log \hat{y}_\ell$ is called the log-likelihood of the data.
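The composed expression $\hat{y} = O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)$ can be sketched as a forward pass in pure Python. This is only an illustrative sketch: the helper names (`matvec`, `vadd`, `forward`) are ours, and we assume the logistic function for $g$ (any activation would do):

```python
import math

def matvec(W, v):
    # Matrix-vector product with W given as a list of rows.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

def g(a):
    # Elementwise logistic activation for the hidden layers (one choice of g).
    return [1.0 / (1.0 + math.exp(-x)) for x in a]

def softmax(a):
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [v / s for v in e]

def forward(x, layers):
    # layers = [(W1, b1), (W2, b2), (W3, b3)]: apply g at each hidden layer,
    # then the softmax output function 'O' at the last layer, i.e.
    # y_hat = O(W3 g(W2 g(W1 x + b1) + b2) + b3).
    h = x
    for W, b in layers[:-1]:
        h = g(vadd(matvec(W, h), b))
    W_L, b_L = layers[-1]
    return softmax(vadd(matvec(W_L, h), b_L))
```

Every weight and bias sits inside this composition, which is why $\hat{y}_\ell$ is a function of all of $\theta$.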
• 95. 22/57 To summarize: for real-valued outputs, use a linear output activation with the squared error loss; for probability outputs, use the softmax output activation with the cross-entropy loss. Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often. For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy.
• 96. 23/57 Module 4.4: Backpropagation (Intuition)
• 97. 24/57 We need to answer two questions: (1) How to choose the loss function $\mathscr{L}(\theta)$? (2) How to compute $\theta$, which is composed of $W_1, W_2, \ldots, W_{L-1} \in \mathbb{R}^{n \times n}$, $W_L \in \mathbb{R}^{n \times k}$, $b_1, b_2, \ldots, b_{L-1} \in \mathbb{R}^n$ and $b_L \in \mathbb{R}^k$?
• 101. 25/57 Let us focus on this one weight ($W_{112}$). To learn this weight using SGD we need a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W_{112}}$. We will see how to calculate this.

Algorithm: gradient_descent()
  t ← 0; max_iterations ← 1000;
  Initialize θ₀;
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η∇θ_t;
  end
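The update rule in the pseudocode can be sketched directly (a toy one-parameter version, with a made-up quadratic loss for illustration):

```python
def gradient_descent(grad, theta0, eta=0.1, max_iterations=1000):
    # theta_{t+1} = theta_t - eta * grad(theta_t), mirroring the pseudocode above.
    theta = theta0
    for _ in range(max_iterations):
        theta = theta - eta * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
# Converges to the minimizer theta = 3.
```

The rest of the lecture is about obtaining `grad` for a real network, i.e. the formula for $\partial \mathscr{L}(\theta)/\partial W_{112}$.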
• 107. 26/57 First let us take the simple case when we have a deep but thin network. In this case it is easy to find the derivative by the chain rule:

$\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial h_{21}} \frac{\partial h_{21}}{\partial a_{21}} \frac{\partial a_{21}}{\partial h_{11}} \frac{\partial h_{11}}{\partial a_{11}} \frac{\partial a_{11}}{\partial W_{111}}$

$\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{11}} \frac{\partial h_{11}}{\partial W_{111}}$ (just compressing the chain rule)

$\frac{\partial \mathscr{L}(\theta)}{\partial W_{211}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{21}} \frac{\partial h_{21}}{\partial W_{211}}$

$\frac{\partial \mathscr{L}(\theta)}{\partial W_{L11}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial W_{L11}}$
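For a scalar "deep but thin" network the chain rule above can be written out term by term and checked numerically. This is only an illustrative sketch: we use two layers instead of $L$, assume the sigmoid activation and a squared-error loss (neither is fixed by the slide), and the function names are ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w1, w2, x):
    # Scalar network: x -> a1 -> h1 -> a2 -> h2
    a1 = w1 * x
    h1 = sigmoid(a1)
    a2 = w2 * h1
    h2 = sigmoid(a2)
    return a1, h1, a2, h2

def dloss_dw1(w1, w2, x, y):
    # Chain rule, factor by factor (loss L = (h2 - y)^2):
    # dL/dw1 = dL/dh2 * dh2/da2 * da2/dh1 * dh1/da1 * da1/dw1
    a1, h1, a2, h2 = forward(w1, w2, x)
    dL_dh2 = 2.0 * (h2 - y)
    dh2_da2 = h2 * (1.0 - h2)   # sigmoid derivative
    da2_dh1 = w2
    dh1_da1 = h1 * (1.0 - h1)
    da1_dw1 = x
    return dL_dh2 * dh2_da2 * da2_dh1 * dh1_da1 * da1_dw1
```

A finite-difference check confirms that multiplying the local derivatives along the single path gives the correct total derivative.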
• 108. 27/57 Let us see an intuitive explanation of backpropagation before we get into the mathematical details
• 111. 28/57 We get a certain loss at the output and we try to figure out who is responsible for this loss. So, we talk to the output layer and say "Hey! You are not producing the desired output, better take responsibility." The output layer says "Well, I take responsibility for my part, but please understand that I am only as good as the hidden layer and weights below me." After all,

$f(x) = \hat{y} = O(W_L h_{L-1} + b_L)$
• 117. 29/57 So, we talk to $W_L$, $b_L$ and $h_L$ and ask them "What is wrong with you?" $W_L$ and $b_L$ take full responsibility, but $h_L$ says "Well, please understand that I am only as good as the pre-activation layer." The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it. We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model). But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):

$\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\text{talk to the weight directly}} = \underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial a_3}}_{\text{talk to the output layer}}\;\underbrace{\frac{\partial a_3}{\partial h_2}\frac{\partial h_2}{\partial a_2}}_{\text{talk to the previous hidden layer}}\;\underbrace{\frac{\partial a_2}{\partial h_1}\frac{\partial h_1}{\partial a_1}}_{\text{talk to the previous hidden layer}}\;\underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\text{and now talk to the weights}}$
• 123. 30/57 Quantities of interest (roadmap for the remaining part): the gradient w.r.t. the output units, the gradient w.r.t. the hidden units, and the gradient w.r.t. the weights and biases. All of these fall out of the chain-rule factorization of $\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}$ above, which talks to the weight through the output layer and the previous hidden layers. Our focus is on the cross-entropy loss and the softmax output.
• 124. 31/57 Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units
• 125. 32/57 Quantities of interest (roadmap for the remaining part): the gradient w.r.t. the output units, the gradient w.r.t. the hidden units, and the gradient w.r.t. the weights, using the chain-rule factorization of $\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}$ from the previous module. Our focus is on the cross-entropy loss and the softmax output.
• 133. 33/57 Let us first consider the partial derivative w.r.t. the $i$-th output. $\mathscr{L}(\theta) = -\log \hat{y}_\ell$ ($\ell$ = true class label).

$\frac{\partial}{\partial \hat{y}_i}\mathscr{L}(\theta) = \frac{\partial}{\partial \hat{y}_i}(-\log \hat{y}_\ell) = -\frac{1}{\hat{y}_\ell}$ if $i = \ell$, and $0$ otherwise.

More compactly, $\frac{\partial}{\partial \hat{y}_i}\mathscr{L}(\theta) = -\frac{\mathbb{1}_{(i=\ell)}}{\hat{y}_\ell}$
• 147. 34/57 $\frac{\partial}{\partial \hat{y}_i}\mathscr{L}(\theta) = -\frac{\mathbb{1}_{(\ell=i)}}{\hat{y}_\ell}$. We can now talk about the gradient w.r.t. the vector $\hat{y}$:

$\nabla_{\hat{y}} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}_1} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}_k} \end{bmatrix} = -\frac{1}{\hat{y}_\ell} \begin{bmatrix} \mathbb{1}_{\ell=1} \\ \mathbb{1}_{\ell=2} \\ \vdots \\ \mathbb{1}_{\ell=k} \end{bmatrix} = -\frac{1}{\hat{y}_\ell}\, e(\ell)$

where $e(\ell)$ is a $k$-dimensional vector whose $\ell$-th element is 1 and all other elements are 0.
• 153. 35/57 What we are actually interested in is

$\frac{\partial \mathscr{L}(\theta)}{\partial a_{L,i}} = \frac{\partial(-\log \hat{y}_\ell)}{\partial a_{L,i}} = \frac{\partial(-\log \hat{y}_\ell)}{\partial \hat{y}_\ell} \frac{\partial \hat{y}_\ell}{\partial a_{L,i}}$

Does $\hat{y}_\ell$ depend on $a_{L,i}$? Indeed, it does:

$\hat{y}_\ell = \frac{\exp(a_{L,\ell})}{\sum_i \exp(a_{L,i})}$

Having established this, we will now derive the full expression on the next slide.
• 163. 36/57 Using the quotient rule $\frac{\partial}{\partial x}\frac{g(x)}{h(x)} = \frac{\partial g(x)}{\partial x}\frac{1}{h(x)} - \frac{g(x)}{h(x)^2}\frac{\partial h(x)}{\partial x}$:

$\frac{\partial}{\partial a_{L,i}}\left(-\log \hat{y}_\ell\right) = -\frac{1}{\hat{y}_\ell}\frac{\partial \hat{y}_\ell}{\partial a_{L,i}} = -\frac{1}{\hat{y}_\ell}\frac{\partial}{\partial a_{L,i}}\,\mathrm{softmax}(a_L)_\ell = -\frac{1}{\hat{y}_\ell}\frac{\partial}{\partial a_{L,i}}\frac{\exp(a_{L,\ell})}{\sum_{i'}\exp(a_{L,i'})}$

$= -\frac{1}{\hat{y}_\ell}\left(\frac{\mathbb{1}_{(\ell=i)}\exp(a_{L,\ell})}{\sum_{i'}\exp(a_{L,i'})} - \frac{\exp(a_{L,\ell})}{\sum_{i'}\exp(a_{L,i'})}\cdot\frac{\exp(a_{L,i})}{\sum_{i'}\exp(a_{L,i'})}\right)$

$= -\frac{1}{\hat{y}_\ell}\left(\mathbb{1}_{(\ell=i)}\,\mathrm{softmax}(a_L)_\ell - \mathrm{softmax}(a_L)_\ell\,\mathrm{softmax}(a_L)_i\right) = -\frac{1}{\hat{y}_\ell}\left(\mathbb{1}_{(\ell=i)}\,\hat{y}_\ell - \hat{y}_\ell\,\hat{y}_i\right) = -\left(\mathbb{1}_{(\ell=i)} - \hat{y}_i\right)$
  • 174. 37/57 So far we have derived the partial derivative w.r.t. the i-th element of aL ∂L (θ) ∂aL,i = −(1 =i − ˆyi) We can now write the gradient w.r.t. the vector aL aL L (θ) =     ∂L (θ) ∂aL1 ... ∂L (θ) ∂aLk     =      − (1 =1 − ˆy1) − (1 =2 − ˆy2) ... − (1 =k − ˆyk)      = −(e( ) − ˆy) x1 x2 xn − log ˆy W1 a1 W2 a2 h1 W3 a3 h2 b1 b2 b3 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
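The result ∇_{a_L} L(θ) = −(e(y) − ŷ) = ŷ − e(y) is easy to check numerically. Below is a small pure-Python sketch (the logits and label are made-up toy values) comparing the analytic gradient against a centered finite difference of the cross-entropy loss −log ŷ_y:

```python
import math

def softmax(a):
    m = max(a)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

def loss(a, y):
    # cross-entropy with softmax output: -log yhat_y
    return -math.log(softmax(a)[y])

a = [1.0, -0.5, 2.0]   # toy pre-activations a_L
y = 1                  # toy true class
yhat = softmax(a)

# analytic gradient: dL/da_i = yhat_i - 1[y == i], i.e. -(e(y) - yhat)
grad = [yhat[i] - (1.0 if i == y else 0.0) for i in range(3)]

eps = 1e-6
for i in range(3):
    ap = a[:]; ap[i] += eps
    am = a[:]; am[i] -= eps
    numeric = (loss(ap, y) - loss(am, y)) / (2 * eps)
    assert abs(numeric - grad[i]) < 1e-5
```

Note that the gradient components sum to zero, since the ŷ_i sum to one.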
  • 175. 38/57 Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
  • 176. 39/57 Quantities of interest (roadmap for the remaining part): gradient w.r.t. output units; gradient w.r.t. hidden units; gradient w.r.t. weights and biases.
    ∂L(θ)/∂W_{111} (talk to the weight directly) = ∂L(θ)/∂ŷ · ∂ŷ/∂a_3 (talk to the output layer) · ∂a_3/∂h_2 · ∂h_2/∂a_2 (talk to the previous hidden layer) · ∂a_2/∂h_1 · ∂h_1/∂a_1 (talk to the previous hidden layer) · ∂a_1/∂W_{111} (and now talk to the weights)
    Our focus is on cross-entropy loss and softmax output.
  • 181. 40/57 Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results q_m(z), then we have:
    ∂p(z)/∂z = Σ_m ∂p(z)/∂q_m(z) · ∂q_m(z)/∂z
    In our case: p(z) is the loss function L(θ), z = h_{ij}, and q_m(z) = a_{L,m}
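A quick numeric sanity check of the multi-path chain rule, with made-up intermediates q_1(z) = z² and q_2(z) = sin z feeding into p = q_1 · q_2:

```python
import math

# p depends on z only through two intermediates:
#   q1(z) = z**2, q2(z) = sin(z), p = q1 * q2
# chain rule along multiple paths:
#   dp/dz = (dp/dq1) q1'(z) + (dp/dq2) q2'(z)
z = 0.7
q1, q2 = z**2, math.sin(z)
dp_dq1, dp_dq2 = q2, q1                       # partials of p = q1*q2
analytic = dp_dq1 * 2 * z + dp_dq2 * math.cos(z)

# compare against a centered finite difference of p as a direct function of z
p = lambda z: z**2 * math.sin(z)
eps = 1e-6
numeric = (p(z + eps) - p(z - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
```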
  • 182. 41/57Intentionally left blank Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
  • 197. 42/57
    ∂L(θ)/∂h_{ij} = Σ_{m=1}^{k} ∂L(θ)/∂a_{i+1,m} · ∂a_{i+1,m}/∂h_{ij} = Σ_{m=1}^{k} ∂L(θ)/∂a_{i+1,m} · W_{i+1,m,j}   [∵ a_{i+1} = W_{i+1} h_i + b_{i+1}]
    Now consider these two vectors:
    ∇_{a_{i+1}} L(θ) = [∂L(θ)/∂a_{i+1,1}, …, ∂L(θ)/∂a_{i+1,k}]ᵀ ;  W_{i+1,·,j} = [W_{i+1,1,j}, …, W_{i+1,k,j}]ᵀ
    W_{i+1,·,j} is the j-th column of W_{i+1}; see that
    (W_{i+1,·,j})ᵀ ∇_{a_{i+1}} L(θ) = Σ_{m=1}^{k} ∂L(θ)/∂a_{i+1,m} · W_{i+1,m,j}
  • 210. 43/57 We have ∂L(θ)/∂h_{ij} = (W_{i+1,·,j})ᵀ ∇_{a_{i+1}} L(θ). We can now write the gradient w.r.t. h_i:
    ∇_{h_i} L(θ) = [∂L(θ)/∂h_{i1}, ∂L(θ)/∂h_{i2}, …, ∂L(θ)/∂h_{in}]ᵀ = [(W_{i+1,·,1})ᵀ ∇_{a_{i+1}} L(θ), (W_{i+1,·,2})ᵀ ∇_{a_{i+1}} L(θ), …, (W_{i+1,·,n})ᵀ ∇_{a_{i+1}} L(θ)]ᵀ = (W_{i+1})ᵀ (∇_{a_{i+1}} L(θ))
    We are almost done, except that we do not know how to calculate ∇_{a_{i+1}} L(θ) for i < L − 1. We will see how to compute that.
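The identity ∇_{h_i} L(θ) = (W_{i+1})ᵀ ∇_{a_{i+1}} L(θ) is just the per-component sum over paths rewritten as a matrix-vector product. A tiny pure-Python sketch with made-up numbers (a 2×3 weight matrix and a stand-in gradient vector):

```python
# a = W h + b, with W of shape 2x3; grad_a stands in for dL/da (values made up)
W = [[0.1, -0.2, 0.3],
     [0.4,  0.5, -0.6]]
grad_a = [1.5, -2.0]

# component form: dL/dh_j = sum_m (dL/da_m) * W[m][j]
grad_h_sum = [sum(grad_a[m] * W[m][j] for m in range(2)) for j in range(3)]

# matrix form: grad_h = W^T grad_a
WT = [[W[m][j] for m in range(2)] for j in range(3)]
grad_h_mat = [sum(WT[j][m] * grad_a[m] for m in range(2)) for j in range(3)]

# the two formulations agree
assert all(abs(s - m) < 1e-12 for s, m in zip(grad_h_sum, grad_h_mat))
```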
  • 224. 44/57
    ∇_{a_i} L(θ) = [∂L(θ)/∂a_{i1}, …, ∂L(θ)/∂a_{in}]ᵀ
    ∂L(θ)/∂a_{ij} = ∂L(θ)/∂h_{ij} · ∂h_{ij}/∂a_{ij} = ∂L(θ)/∂h_{ij} · g′(a_{ij})   [∵ h_{ij} = g(a_{ij})]
    ∇_{a_i} L(θ) = [∂L(θ)/∂h_{i1} g′(a_{i1}), …, ∂L(θ)/∂h_{in} g′(a_{in})]ᵀ = ∇_{h_i} L(θ) ⊙ […, g′(a_{ik}), …]
    where ⊙ denotes the element-wise (Hadamard) product.
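A short sketch of ∇_{a_i} L(θ) = ∇_{h_i} L(θ) ⊙ g′(a_i) for the logistic activation, with made-up pre-activations and an upstream gradient; it also checks g′ against a finite difference:

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

a_i = [0.5, -1.0, 2.0]      # toy pre-activations of layer i
grad_h = [0.3, -0.7, 1.1]   # stand-in for dL/dh_i from the layer above

# h_ij = g(a_ij)  =>  dL/da_ij = (dL/dh_ij) * g'(a_ij)
# for the logistic g, g'(z) = g(z)(1 - g(z))
gprime = [sigmoid(z) * (1 - sigmoid(z)) for z in a_i]
grad_a = [gh * gp for gh, gp in zip(grad_h, gprime)]  # element-wise product

# verify g' numerically at each point
for z, gp in zip(a_i, gprime):
    numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
    assert abs(numeric - gp) < 1e-8
```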
  • 225. 45/57 Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
  • 226. 46/57 Quantities of interest (roadmap for the remaining part): gradient w.r.t. output units; gradient w.r.t. hidden units; gradient w.r.t. weights and biases.
    ∂L(θ)/∂W_{111} (talk to the weight directly) = ∂L(θ)/∂ŷ · ∂ŷ/∂a_3 (talk to the output layer) · ∂a_3/∂h_2 · ∂h_2/∂a_2 (talk to the previous hidden layer) · ∂a_2/∂h_1 · ∂h_1/∂a_1 (talk to the previous hidden layer) · ∂a_1/∂W_{111} (and now talk to the weights)
    Our focus is on cross-entropy loss and softmax output.
  • 233. 47/57 Recall that a_k = b_k + W_k h_{k−1}, so
    ∂a_{ki}/∂W_{kij} = h_{k−1,j}
    ∂L(θ)/∂W_{kij} = ∂L(θ)/∂a_{ki} · ∂a_{ki}/∂W_{kij} = ∂L(θ)/∂a_{ki} · h_{k−1,j}
    ∇_{W_k} L(θ) is the matrix whose (i, j)-th entry is ∂L(θ)/∂W_{kij}.
  • 234. 48/57Intentionally left blank Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
  • 242. 49/57 Let's take a simple example of W_k ∈ ℝ^{3×3} and see what each entry looks like.
    ∇_{W_k} L(θ) is the 3×3 matrix with entries ∂L(θ)/∂W_{kij} = ∂L(θ)/∂a_{ki} · ∂a_{ki}/∂W_{kij}. Substituting ∂a_{ki}/∂W_{kij} = h_{k−1,j}:
    ∇_{W_k} L(θ) =
    [ ∂L(θ)/∂a_{k1} · h_{k−1,1}   ∂L(θ)/∂a_{k1} · h_{k−1,2}   ∂L(θ)/∂a_{k1} · h_{k−1,3} ]
    [ ∂L(θ)/∂a_{k2} · h_{k−1,1}   ∂L(θ)/∂a_{k2} · h_{k−1,2}   ∂L(θ)/∂a_{k2} · h_{k−1,3} ]
    [ ∂L(θ)/∂a_{k3} · h_{k−1,1}   ∂L(θ)/∂a_{k3} · h_{k−1,2}   ∂L(θ)/∂a_{k3} · h_{k−1,3} ]
    = ∇_{a_k} L(θ) · h_{k−1}ᵀ (an outer product)
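The outer-product form ∇_{W_k} L(θ) = ∇_{a_k} L(θ) · h_{k−1}ᵀ takes one line in plain Python (the two vectors below are made-up stand-ins):

```python
grad_a = [0.2, -0.5, 0.9]    # stand-in for dL/da_k
h_prev = [1.0, -2.0, 0.5]    # stand-in for h_{k-1}

# dL/dW_kij = (dL/da_ki) * h_{k-1,j}  =>  grad_W = grad_a h_prev^T (outer product)
grad_W = [[ga * h for h in h_prev] for ga in grad_a]

# entry (1, 2) matches the scalar formula
assert grad_W[1][2] == grad_a[1] * h_prev[2]
```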
  • 249. 50/57 Finally, coming to the biases:
    a_{ki} = b_{ki} + Σ_j W_{kij} h_{k−1,j}
    ∂L(θ)/∂b_{ki} = ∂L(θ)/∂a_{ki} · ∂a_{ki}/∂b_{ki} = ∂L(θ)/∂a_{ki}
    We can now write the gradient w.r.t. the vector b_k:
    ∇_{b_k} L(θ) = [∂L(θ)/∂a_{k1}, ∂L(θ)/∂a_{k2}, …, ∂L(θ)/∂a_{kn}]ᵀ = ∇_{a_k} L(θ)
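Since ∂a_{ki}/∂b_{ki} = 1, the bias gradient simply equals the pre-activation gradient. A small sketch (the 2×2 layer and the toy downstream loss L(a) = a_0² + 3a_1 are both made up for the demonstration) confirms this with a finite difference:

```python
# a_ki = b_ki + sum_j W_kij h_{k-1,j}  =>  da_ki/db_ki = 1, so dL/db_ki = dL/da_ki
W = [[0.1, 0.2], [0.3, 0.4]]
b = [0.5, -0.5]
h = [1.0, 2.0]

def a(b):
    return [b[i] + sum(W[i][j] * h[j] for j in range(2)) for i in range(2)]

def L(b):
    av = a(b)
    return av[0]**2 + 3 * av[1]   # toy downstream loss with known dL/da

av = a(b)
grad_a = [2 * av[0], 3.0]         # dL/da for the toy loss
eps = 1e-6
for i in range(2):
    bp = b[:]; bp[i] += eps
    bm = b[:]; bm[i] -= eps
    num = (L(bp) - L(bm)) / (2 * eps)
    assert abs(num - grad_a[i]) < 1e-5   # dL/db_i equals dL/da_i
```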
  • 250. 51/57 Module 4.8: Backpropagation: Pseudo code Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
  • 255. 52/57 Finally, we have all the pieces of the puzzle:
    ∇_{a_L} L(θ) (gradient w.r.t. output layer)
    ∇_{h_k} L(θ), ∇_{a_k} L(θ) (gradient w.r.t. hidden layers, 1 ≤ k < L)
    ∇_{W_k} L(θ), ∇_{b_k} L(θ) (gradient w.r.t. weights and biases, 1 ≤ k ≤ L)
    We can now write the full learning algorithm.
  • 260. 53/57 Algorithm: gradient_descent()
    t ← 0; max_iterations ← 1000;
    Initialize θ_0 = [W_1⁰, …, W_L⁰, b_1⁰, …, b_L⁰];
    while t++ < max_iterations do
        h_1, h_2, …, h_{L−1}, a_1, a_2, …, a_L, ŷ = forward_propagation(θ_t);
        ∇θ_t = backward_propagation(h_1, h_2, …, h_{L−1}, a_1, a_2, …, a_L, y, ŷ);
        θ_{t+1} ← θ_t − η ∇θ_t;
    end
  • 266. 54/57 Algorithm: forward_propagation(θ)   (with h_0 = x)
    for k = 1 to L − 1 do
        a_k = b_k + W_k h_{k−1};
        h_k = g(a_k);
    end
    a_L = b_L + W_L h_{L−1};
    ŷ = O(a_L);
  • 276. 55/57 Just do a forward propagation and compute all h_i's, a_i's, and ŷ.
    Algorithm: back_propagation(h_1, h_2, …, h_{L−1}, a_1, a_2, …, a_L, y, ŷ)
    // Compute output gradient;
    ∇_{a_L} L(θ) = −(e(y) − ŷ);
    for k = L to 1 do
        // Compute gradients w.r.t. parameters;
        ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}ᵀ;
        ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
        // Compute gradients w.r.t. layer below;
        ∇_{h_{k−1}} L(θ) = W_kᵀ (∇_{a_k} L(θ));
        // Compute gradients w.r.t. layer below (pre-activation);
        ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ […, g′(a_{k−1,j}), …];
    end
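The pseudocode can be instantiated in a few lines of plain Python. The sketch below (layer sizes, random weights, and the input/label are all made up) runs one forward and one backward pass for L = 2 with logistic hidden units, softmax output, and cross-entropy loss, then checks one weight gradient against a finite difference:

```python
import math, random

random.seed(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_out)]
b2 = [0.0] * n_out
x, y = [0.5, -1.0, 2.0], 1

def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))

def softmax(a):
    m = max(a); e = [math.exp(v - m) for v in a]
    s = sum(e); return [v / s for v in e]

def forward(W1, b1, W2, b2, x):
    a1 = [b1[i] + sum(W1[i][j] * x[j] for j in range(n_in)) for i in range(n_hid)]
    h1 = [sigmoid(v) for v in a1]
    a2 = [b2[i] + sum(W2[i][j] * h1[j] for j in range(n_hid)) for i in range(n_out)]
    return a1, h1, a2, softmax(a2)

a1, h1, a2, yhat = forward(W1, b1, W2, b2, x)

# backward pass, following the pseudocode line by line
grad_a2 = [yhat[i] - (1.0 if i == y else 0.0) for i in range(n_out)]  # -(e(y)-yhat)
grad_W2 = [[grad_a2[i] * h1[j] for j in range(n_hid)] for i in range(n_out)]
grad_b2 = grad_a2[:]
grad_h1 = [sum(W2[i][j] * grad_a2[i] for i in range(n_out)) for j in range(n_hid)]
grad_a1 = [grad_h1[j] * h1[j] * (1 - h1[j]) for j in range(n_hid)]    # g'(a) = h(1-h)
grad_W1 = [[grad_a1[i] * x[j] for j in range(n_in)] for i in range(n_hid)]
grad_b1 = grad_a1[:]

# finite-difference check on one entry of W1
def loss(W1):
    *_, yh = forward(W1, b1, W2, b2, x)
    return -math.log(yh[y])

eps = 1e-6
Wp = [row[:] for row in W1]; Wp[2][1] += eps
Wm = [row[:] for row in W1]; Wm[2][1] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)
assert abs(num - grad_W1[2][1]) < 1e-5
```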
  • 277. 56/57 Module 4.9: Derivative of the activation function Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
  • 281. 57/57 Now, the only thing we need to figure out is how to compute g′.
    Logistic function:
    g(z) = σ(z) = 1 / (1 + e^{−z})
    g′(z) = (−1) · 1/(1 + e^{−z})² · d/dz (1 + e^{−z}) = (−1) · 1/(1 + e^{−z})² · (−e^{−z})
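Simplifying the derivation above gives the well-known closed form g′(z) = σ(z)(1 − σ(z)); a quick numerical check at a few made-up points:

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# g'(z) = g(z)(1 - g(z)); compare against a centered finite difference
for z in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    closed = sigmoid(z) * (1 - sigmoid(z))
    numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
    assert abs(closed - numeric) < 1e-8
```

This is why, with logistic units, the pseudocode can compute g′(a) directly from the stored activations as h(1 − h) without revisiting a.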