"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
Deep learning
1. 1/55
CS7015 (Deep Learning) : Lecture 7
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras

An autoencoder is a special type of feedforward neural network which does the following:
- Encodes its input $x_i$ into a hidden representation $h = g(Wx_i + b)$
- Decodes the input again from this hidden representation: $\hat{x}_i = f(W^*h + c)$

The model is trained to minimize a loss function which will ensure that $\hat{x}_i$ is close to $x_i$ (we will see some such loss functions soon).

[Figure: a feedforward network mapping the input $x_i$ through encoder weights $W$ to the hidden representation $h$, and through decoder weights $W^*$ to the reconstruction $\hat{x}_i$]
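To make the encoder-decoder structure concrete, here is a minimal NumPy sketch of this forward pass (a sketch only: the sigmoid choice for $g$ and $f$, the sizes, and all variable names are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n, k = 8, 3                                  # illustrative sizes: n inputs, k hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(k, n))       # encoder weights
b = np.zeros(k)                              # encoder bias
W_star = rng.normal(scale=0.1, size=(n, k))  # decoder weights W*
c = np.zeros(n)                              # decoder bias

x = rng.random(n)                  # an input x_i
h = sigmoid(W @ x + b)             # encode:  h = g(W x_i + b)
x_hat = sigmoid(W_star @ h + c)    # decode:  x_hat = f(W* h + c)
```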
Let us consider the case where $\dim(h) < \dim(x_i)$.

If we are still able to reconstruct $\hat{x}_i$ perfectly from $h$, then what does it say about $h$? $h$ is a loss-free encoding of $x_i$: it captures all the important characteristics of $x_i$. Do you see an analogy with PCA?

An autoencoder where $\dim(h) < \dim(x_i)$ is called an undercomplete autoencoder.
Let us consider the case when $\dim(h) \geq \dim(x_i)$.

In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$. Such an identity encoding is useless in practice, as it does not really tell us anything about the important characteristics of the data.

An autoencoder where $\dim(h) \geq \dim(x_i)$ is called an overcomplete autoencoder.
The Road Ahead
- Choice of $f(x_i)$ and $g(x_i)$
- Choice of loss function
(binary inputs, e.g. $x_i = [0\ 1\ 1\ 0\ 1]$)

Suppose all our inputs are binary (each $x_{ij} \in \{0, 1\}$). Which of the following functions would be most apt for the decoder?
- $\hat{x}_i = \tanh(W^*h + c)$
- $\hat{x}_i = W^*h + c$
- $\hat{x}_i = \operatorname{logistic}(W^*h + c)$

Logistic, as it naturally restricts all outputs to be between 0 and 1.

$g$ is typically chosen as the sigmoid function.
(real-valued inputs, e.g. $x_i = [0.25\ 0.5\ 1.25\ 3.5\ 4.5]$)

Suppose all our inputs are real (each $x_{ij} \in \mathbb{R}$). Which of the following functions would be most apt for the decoder?
- $\hat{x}_i = \tanh(W^*h + c)$
- $\hat{x}_i = W^*h + c$
- $\hat{x}_i = \operatorname{logistic}(W^*h + c)$

What will logistic and tanh do? They will restrict the reconstructed $\hat{x}_i$ to lie in $[0,1]$ or $[-1,1]$, whereas we want $\hat{x}_i \in \mathbb{R}^n$; so the linear decoder $\hat{x}_i = W^*h + c$ is the apt choice here.

Again, $g$ is typically chosen as the sigmoid function.
The Road Ahead
- Choice of loss function
Consider the case when the inputs are real-valued. The objective of the autoencoder is to reconstruct $\hat{x}_i$ to be as close to $x_i$ as possible. This can be formalized using the following objective function:

$$\min_{W,W^*,c,b}\ \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}(\hat{x}_{ij} - x_{ij})^2$$

i.e.,

$$\min_{W,W^*,c,b}\ \frac{1}{m}\sum_{i=1}^{m}(\hat{x}_i - x_i)^T(\hat{x}_i - x_i)$$

We can then train the autoencoder just like a regular feedforward network using backpropagation. All we need is a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathscr{L}(\theta)}{\partial W}$, which we will see now.
$$\mathscr{L}(\theta) = (\hat{x}_i - x_i)^T(\hat{x}_i - x_i)$$

Treat the autoencoder as a two-layer network with $h_0 = x_i$, pre-activations $a_1, a_2$, hidden activations $h_1 = h$, and output $h_2 = \hat{x}_i$. Note that the loss function is shown for only one training example.

$$\frac{\partial \mathscr{L}(\theta)}{\partial W^*} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2}\,\frac{\partial h_2}{\partial a_2}\,\frac{\partial a_2}{\partial W^*}$$

$$\frac{\partial \mathscr{L}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2}\,\frac{\partial h_2}{\partial a_2}\,\frac{\partial a_2}{\partial h_1}\,\frac{\partial h_1}{\partial a_1}\,\frac{\partial a_1}{\partial W}$$

We have already seen how to calculate most of these factors when we learnt backpropagation. The only new term is:

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_2} = \frac{\partial \mathscr{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i}\{(\hat{x}_i - x_i)^T(\hat{x}_i - x_i)\} = 2(\hat{x}_i - x_i)$$
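As a sanity check, these gradients can be written out directly in NumPy (a sketch, assuming a sigmoid encoder and a linear decoder for real-valued inputs; the $2(\hat{x}_i - x_i)$ factor is the one derived above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 8, 3
W, b = rng.normal(scale=0.1, size=(k, n)), np.zeros(k)     # encoder
W_star, c = rng.normal(scale=0.1, size=(n, k)), np.zeros(n)  # decoder W*
x = rng.random(n)

# Forward pass (linear decoder for real-valued inputs)
a1 = W @ x + b
h1 = sigmoid(a1)
a2 = W_star @ h1 + c
h2 = a2                            # h2 = x_hat

# Backward pass for L = (x_hat - x)^T (x_hat - x)
dL_dh2 = 2 * (h2 - x)              # dL/dh2 = 2(x_hat - x)
dL_da2 = dL_dh2                    # identity output, so dh2/da2 = I
dL_dWs = np.outer(dL_da2, h1)      # dL/dW*
dL_dh1 = W_star.T @ dL_da2
dL_da1 = dL_dh1 * h1 * (1 - h1)    # sigmoid derivative: h1(1 - h1)
dL_dW = np.outer(dL_da1, x)        # dL/dW
```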
(binary inputs)

Consider the case when the inputs are binary. We use a sigmoid decoder, which will produce outputs between 0 and 1 that can be interpreted as probabilities. For a single $n$-dimensional $i$-th input we can use the following (cross-entropy) loss function:

$$\min\Big\{-\sum_{j=1}^{n}\big(x_{ij}\log\hat{x}_{ij} + (1 - x_{ij})\log(1 - \hat{x}_{ij})\big)\Big\}$$

What value of $\hat{x}_{ij}$ will minimize this function? If $x_{ij} = 1$? If $x_{ij} = 0$? Indeed, the above function is minimized when $\hat{x}_{ij} = x_{ij}$!

Again, all we need is a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathscr{L}(\theta)}{\partial W}$ to use backpropagation.
$$\mathscr{L}(\theta) = -\sum_{j=1}^{n}\big(x_{ij}\log\hat{x}_{ij} + (1 - x_{ij})\log(1 - \hat{x}_{ij})\big)$$

Again, with $h_0 = x_i$ and $h_2 = \hat{x}_i$:

$$\frac{\partial \mathscr{L}(\theta)}{\partial W^*} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2}\,\frac{\partial h_2}{\partial a_2}\,\frac{\partial a_2}{\partial W^*} \qquad \frac{\partial \mathscr{L}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2}\,\frac{\partial h_2}{\partial a_2}\,\frac{\partial a_2}{\partial h_1}\,\frac{\partial h_1}{\partial a_1}\,\frac{\partial a_1}{\partial W}$$

We have already seen how to calculate the remaining factors when we learnt backpropagation. The first two terms on the RHS can be computed as:

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}} \qquad \frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j})(1 - \sigma(a_{2j}))$$

where

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_2} = \Big[\frac{\partial \mathscr{L}(\theta)}{\partial h_{21}},\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{22}},\ \dots,\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{2n}}\Big]^T$$
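A small NumPy sketch of these two factors; multiplying them elementwise gives the well-known simplification $\frac{\partial \mathscr{L}(\theta)}{\partial a_2} = \hat{x}_i - x_i$ for a sigmoid output with cross-entropy loss (the simplification is a standard fact, not stated on the slide; the numbers below are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0., 1., 1., 0., 1.])            # binary input x_i
a2 = np.array([-0.3, 0.8, 1.5, -1.0, 0.2])    # decoder pre-activation (illustrative)
x_hat = sigmoid(a2)                           # sigmoid decoder output

dL_dh2 = -x / x_hat + (1 - x) / (1 - x_hat)   # per-dimension dL/dh2j
dh2_da2 = x_hat * (1 - x_hat)                 # sigma(a)(1 - sigma(a))

# The product simplifies to x_hat - x:
assert np.allclose(dL_dh2 * dh2_da2, x_hat - x)
```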
Module 7.2: Link between PCA and Autoencoders
We will now see that the encoder part of an autoencoder is equivalent to PCA if we:
- use a linear encoder,
- use a linear decoder,
- use the squared error loss function, and
- normalize the inputs to

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}}\Big(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\Big)$$

[Figure: the autoencoder $x_i \to h \to \hat{x}_i$ shown alongside PCA, where $P^TX^TXP = D$ and $u_1, u_2$ are the principal directions of the data]
First let us consider the implication of normalizing the inputs to

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}}\Big(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\Big)$$

The operation in the brackets ensures that the data now has zero mean along each dimension $j$ (we are subtracting the mean).

Let $X'$ be this zero-mean data matrix; then what the above normalization gives us is $X = \frac{1}{\sqrt{m}}X'$.

Now $X^TX = \frac{1}{m}(X')^TX'$ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

$$\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}(x_{ij} - \hat{x}_{ij})^2$$

is obtained when we use a linear encoder.
$$\min_{\theta}\ \sum_{i=1}^{m}\sum_{j=1}^{n}(x_{ij} - \hat{x}_{ij})^2 \quad (1)$$

This is equivalent to

$$\min_{W^*,H}\ \|X - HW^*\|_F^2 \qquad \text{where } \|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2}$$

(just writing expression (1) in matrix form and using the definition of $\|A\|_F$; we are ignoring the biases).

From SVD we know that the optimal solution to the above problem is given by

$$HW^* = U_{.,\leq k}\Sigma_{k,k}V^T_{.,\leq k}$$

By matching variables, one possible solution is

$$H = U_{.,\leq k}\Sigma_{k,k} \qquad W^* = V^T_{.,\leq k}$$
We will now show that $H$ is a linear encoding and find an expression for the encoder weights $W$.

$$\begin{aligned}
H &= U_{.,\leq k}\Sigma_{k,k}\\
  &= (XX^T)(XX^T)^{-1}U_{.,\leq k}\Sigma_{k,k} && \text{(pre-multiplying } (XX^T)(XX^T)^{-1} = I\text{)}\\
  &= (XV\Sigma^TU^T)(U\Sigma V^TV\Sigma^TU^T)^{-1}U_{.,\leq k}\Sigma_{k,k} && \text{(using } X = U\Sigma V^T\text{)}\\
  &= XV\Sigma^TU^T(U\Sigma\Sigma^TU^T)^{-1}U_{.,\leq k}\Sigma_{k,k} && (V^TV = I)\\
  &= XV\Sigma^TU^TU(\Sigma\Sigma^T)^{-1}U^TU_{.,\leq k}\Sigma_{k,k} && ((ABC)^{-1} = C^{-1}B^{-1}A^{-1})\\
  &= XV\Sigma^T(\Sigma\Sigma^T)^{-1}U^TU_{.,\leq k}\Sigma_{k,k} && (U^TU = I)\\
  &= XV\Sigma^T(\Sigma^T)^{-1}\Sigma^{-1}U^TU_{.,\leq k}\Sigma_{k,k} && ((AB)^{-1} = B^{-1}A^{-1})\\
  &= XV\Sigma^{-1}I_{.,\leq k}\Sigma_{k,k} && (U^TU_{.,\leq k} = I_{.,\leq k})\\
  &= XVI_{.,\leq k} && (\Sigma^{-1}I_{.,\leq k} = \Sigma^{-1}_{k,k})\\
  &= XV_{.,\leq k}
\end{aligned}$$

Thus $H$ is a linear transformation of $X$ and $W = V_{.,\leq k}$.
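A quick numerical check of this result (a sketch: it verifies that the linear encoding $XV_{.,\leq k}$ matches $U_{.,\leq k}\Sigma_{k,k}$ on random data; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 10, 3
X = rng.normal(size=(m, n))
X = (X - X.mean(axis=0)) / np.sqrt(m)   # normalize as in the slides

U, S, Vt = np.linalg.svd(X, full_matrices=False)
H_svd = U[:, :k] * S[:k]                # U_{.,<=k} Sigma_{k,k}
H_lin = X @ Vt[:k].T                    # X V_{.,<=k}: a linear encoding of X
assert np.allclose(H_svd, H_lin)        # the two expressions agree
```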
We have the encoder $W = V_{.,\leq k}$.

From SVD, we know that $V$ is the matrix of eigenvectors of $X^TX$. From PCA, we know that $P$ is the matrix of eigenvectors of the covariance matrix. We saw earlier that, if the entries of $X$ are normalized by

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}}\Big(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\Big)$$

then $X^TX$ is indeed the covariance matrix.

Thus, the encoder matrix for the linear autoencoder ($W$) and the projection matrix ($P$) for PCA could indeed be the same. Hence proved.
Remember
The encoder of a linear autoencoder is equivalent to PCA if we:
- use a linear encoder,
- use a linear decoder,
- use a squared error loss function, and
- normalize the inputs to $\hat{x}_{ij} = \frac{1}{\sqrt{m}}\big(x_{ij} - \frac{1}{m}\sum_{k=1}^{m}x_{kj}\big)$
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders. Here (as stated earlier) the model can simply learn to copy $x_i$ to $h$ and then $h$ to $\hat{x}_i$. To avoid poor generalization, we need to introduce regularization.
The simplest solution is to add an L2-regularization term to the objective function:

$$\min_{\theta = \{W,W^*,b,c\}}\ \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}(\hat{x}_{ij} - x_{ij})^2 + \lambda\|\theta\|^2$$

This is very easy to implement and just adds a term $\lambda W$ to the gradient $\frac{\partial \mathscr{L}(\theta)}{\partial W}$ (and similarly for the other parameters).
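In code this is one extra line per parameter (a sketch, following the slide's convention of folding the constant factor from $\lambda\|\theta\|^2$ into $\lambda$; `grad_W` stands in for the backpropagated gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))   # encoder weights (illustrative)
lam = 1e-4                               # regularization strength (illustrative)

grad_W = np.zeros_like(W)                # stands in for dL/dW from backprop
grad_W += lam * W                        # L2 term added to the gradient
```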
Another trick is to tie the weights of the encoder and decoder, i.e., $W^* = W^T$. This effectively reduces the capacity of the autoencoder and acts as a regularizer.
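A minimal sketch of weight tying (assuming sigmoid activations; with tied weights only $W$, $b$ and $c$ remain to be learned):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 8, 3
W = rng.normal(scale=0.1, size=(k, n))
b, c = np.zeros(k), np.zeros(n)

x = rng.random(n)
h = sigmoid(W @ x + b)
x_hat = sigmoid(W.T @ h + c)   # decoder reuses W: W* = W^T
```

Note that the gradient of the loss with respect to $W$ then receives contributions from both the encoder and the decoder paths.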
A denoising autoencoder simply corrupts the input data using a probabilistic process $P(\tilde{x}_{ij}\,|\,x_{ij})$ before feeding it to the network. A simple $P(\tilde{x}_{ij}\,|\,x_{ij})$ used in practice is the following:

$$P(\tilde{x}_{ij} = 0\,|\,x_{ij}) = q \qquad P(\tilde{x}_{ij} = x_{ij}\,|\,x_{ij}) = 1 - q$$

In other words, with probability $q$ the input is flipped to 0, and with probability $(1 - q)$ it is retained as it is.

[Figure: the network now maps the corrupted input $\tilde{x}_i$ to $h$ and then to $\hat{x}_i$]
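This masking corruption is a couple of lines in NumPy (a sketch; $q = 0.25$ is an illustrative noise level):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 0.25                           # corruption probability (illustrative)
x = rng.random(8)                  # an input x_i
mask = rng.random(x.shape) >= q    # keep each entry with probability 1 - q
x_tilde = x * mask                 # entries flipped to 0 with probability q
```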
How does this help? It helps because the objective is still to reconstruct the original (uncorrupted) $x_i$:

$$\arg\min_{\theta}\ \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n}(\hat{x}_{ij} - x_{ij})^2$$

It no longer makes sense for the model to copy the corrupted $\tilde{x}_i$ into $h(\tilde{x}_i)$ and then into $\hat{x}_i$ (the objective function will not be minimized by doing so). Instead, the model will now have to capture the characteristics of the data correctly. For example, it will have to learn to reconstruct a corrupted $\tilde{x}_{ij}$ correctly by relying on its interactions with the other elements of $x_i$.
We will now see a practical application in which AEs are used, and then compare denoising autoencoders with regular autoencoders.
Task: Hand-written digit recognition (MNIST data; $|x_i| = 784 = 28 \times 28$)

[Figure: basic approach, where the raw pixels are used directly as input features for a classifier over the classes 0, 1, 2, ..., 9]
[Figure: AE approach, where we first learn the important characteristics of the data as a hidden representation $h \in \mathbb{R}^d$ (with $\hat{x}_i \in \mathbb{R}^{784}$), and then train a classifier on top of this hidden representation]
We will now see a way of visualizing AEs and use this visualization to compare different AEs.
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration $x_i$. For example,

$$h_1 = \sigma(W_1^Tx_i) \quad [\text{ignoring bias } b]$$

where $W_1$ is the trained vector of weights connecting the input to the first hidden neuron. What values of $x_i$ will cause $h_1$ to be maximum (or maximally activated)? Suppose we assume that our inputs are normalized so that $\|x_i\| = 1$. We then want to solve

$$\max_{x_i}\ \{W_1^Tx_i\} \quad \text{s.t. } \|x_i\|^2 = x_i^Tx_i = 1$$

Solution: $x_i = \frac{W_1}{\sqrt{W_1^TW_1}}$

Thus the inputs

$$x_i = \frac{W_1}{\sqrt{W_1^TW_1}},\ \frac{W_2}{\sqrt{W_2^TW_2}},\ \dots,\ \frac{W_n}{\sqrt{W_n^TW_n}}$$

will respectively cause hidden neurons 1 to $n$ to maximally fire.

Let us plot these images ($x_i$'s) which maximally activate the first $k$ neurons of the hidden representations learned by a vanilla autoencoder and by different denoising autoencoders. These $x_i$'s are computed by the above formula using the weights $(W_1, W_2, \dots, W_k)$ learned by the respective autoencoders.
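A sketch of how these maximally activating inputs can be computed and reshaped into images (assuming MNIST-sized inputs and an encoder weight matrix with one row $W_l$ per hidden neuron; the random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 16, 784                       # 16 hidden neurons, 28x28 inputs
W = rng.normal(size=(k, n))          # stands in for trained encoder weights

# Input that maximally activates neuron l: W_l / sqrt(W_l^T W_l)
filters = W / np.linalg.norm(W, axis=1, keepdims=True)
images = filters.reshape(k, 28, 28)  # one 28x28 "filter image" per neuron
```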
[Figure: maximally activating inputs for a vanilla AE (no noise), a 25% denoising AE (q = 0.25), and a 50% denoising AE (q = 0.5)]

The vanilla AE does not learn many meaningful patterns. The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0', a '2', a '3', an '8' or a '9'). As the noise increases, the filters become wider because each neuron has to rely on more adjacent pixels to feel confident about a stroke.
We saw one form of $P(\tilde{x}_{ij}\,|\,x_{ij})$ which flips a fraction $q$ of the inputs to zero. Another way of corrupting the inputs is to add Gaussian noise to the input:

$$\tilde{x}_{ij} = x_{ij} + \mathcal{N}(0, 1)$$

We will now use such a denoising AE on a different dataset and see its performance.
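The corresponding corruption in code (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)                        # an input x_i
x_tilde = x + rng.normal(size=x.shape)   # add N(0, 1) noise per dimension
```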
[Figure: natural image data, the filters learned by the denoising AE, and the filters learned with weight decay]

The hidden neurons essentially behave like edge detectors. PCA does not give such edge detectors.
A hidden neuron with sigmoid activation will have values between 0 and 1. We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0. A sparse autoencoder tries to ensure that each neuron is inactive most of the time.
The average value of the activation of a neuron $l$ is given by

$$\hat{\rho}_l = \frac{1}{m}\sum_{i=1}^{m}h(x_i)_l$$

If the neuron $l$ is sparse (i.e., mostly inactive) then $\hat{\rho}_l \to 0$. A sparse autoencoder uses a sparsity parameter $\rho$ (typically very close to 0, say 0.005) and tries to enforce the constraint $\hat{\rho}_l = \rho$. One way of ensuring this is to add the following term to the objective function:

$$\Omega(\theta) = \sum_{l=1}^{k}\rho\log\frac{\rho}{\hat{\rho}_l} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}_l}$$

When will this term reach its minimum value, and what is the minimum value? Let us plot it and check.
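In code the penalty is a one-liner; it is in fact the KL divergence between a Bernoulli($\rho$) and a Bernoulli($\hat{\rho}_l$) distribution, summed over the $k$ hidden neurons (a sketch; $\rho = 0.005$ follows the slide and the $\hat{\rho}_l$ values are illustrative):

```python
import numpy as np

rho = 0.005                               # target sparsity (from the slide)
rho_hat = np.array([0.02, 0.004, 0.10])   # illustrative average activations

omega = np.sum(rho * np.log(rho / rho_hat)
               + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
# omega is 0 exactly when rho_hat == rho, and positive otherwise
```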
[Figure: plot of $\Omega(\theta)$ versus $\hat{\rho}_l$ for $\rho = 0.2$]

The function reaches its minimum value (of 0) when $\hat{\rho}_l = \rho$.
Now,

$$\hat{\mathscr{L}}(\theta) = \mathscr{L}(\theta) + \Omega(\theta)$$

where $\mathscr{L}(\theta)$ is the squared error loss or cross-entropy loss and $\Omega(\theta)$ is the sparsity constraint. We already know how to calculate $\frac{\partial \mathscr{L}(\theta)}{\partial W}$; let us see how to calculate $\frac{\partial \Omega(\theta)}{\partial W}$.

$$\Omega(\theta) = \sum_{l=1}^{k}\rho\log\frac{\rho}{\hat{\rho}_l} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}_l}$$

can be re-written as

$$\Omega(\theta) = \sum_{l=1}^{k}\rho\log\rho - \rho\log\hat{\rho}_l + (1 - \rho)\log(1 - \rho) - (1 - \rho)\log(1 - \hat{\rho}_l)$$

By the chain rule:

$$\frac{\partial \Omega(\theta)}{\partial W} = \frac{\partial \Omega(\theta)}{\partial \hat{\rho}}\cdot\frac{\partial \hat{\rho}}{\partial W}, \qquad \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} = \Big[\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_1},\ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_2},\ \dots,\ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_k}\Big]^T$$

For each neuron $l \in 1 \dots k$ in the hidden layer, we have

$$\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l} = -\frac{\rho}{\hat{\rho}_l} + \frac{1 - \rho}{1 - \hat{\rho}_l} \qquad \text{and} \qquad \frac{\partial \hat{\rho}_l}{\partial W} = x_i\big(g'(W^Tx_i + b)\big)^T \quad \text{(see the derivation below)}$$

Finally,

$$\frac{\partial \hat{\mathscr{L}}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial W} + \frac{\partial \Omega(\theta)}{\partial W}$$

and we know how to calculate both terms on the RHS.
Derivation

$$\frac{\partial \hat{\rho}}{\partial W} = \Big[\frac{\partial \hat{\rho}_1}{\partial W}\ \ \frac{\partial \hat{\rho}_2}{\partial W}\ \dots\ \frac{\partial \hat{\rho}_k}{\partial W}\Big]$$

For each element in the above equation we can calculate $\frac{\partial \hat{\rho}_l}{\partial W}$ (which is the partial derivative of a scalar w.r.t. a matrix, i.e., a matrix). For a single element $W_{jl}$ of the matrix:

$$\frac{\partial \hat{\rho}_l}{\partial W_{jl}} = \frac{\partial\big[\frac{1}{m}\sum_{i=1}^{m}g\big(W_{:,l}^Tx_i + b_l\big)\big]}{\partial W_{jl}} = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial\, g\big(W_{:,l}^Tx_i + b_l\big)}{\partial W_{jl}} = \frac{1}{m}\sum_{i=1}^{m}g'\big(W_{:,l}^Tx_i + b_l\big)\,x_{ij}$$

So in matrix notation we can write it as:

$$\frac{\partial \hat{\rho}_l}{\partial W} = x_i\big(g'(W^Tx_i + b)\big)^T$$
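A sketch assembling these pieces for a single training example (assuming a sigmoid $g$, so that $g' = g(1-g)$, and following the slide's convention that column $W_{:,l}$ feeds hidden neuron $l$; the $\hat{\rho}_l$ values are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 8, 3
W = rng.normal(scale=0.1, size=(n, k))    # column l connects input to neuron l
b = np.zeros(k)
x = rng.random(n)
rho, rho_hat = 0.005, np.full(k, 0.02)    # target and (illustrative) averages

a = W.T @ x + b                           # hidden pre-activations
g = sigmoid(a)
g_prime = g * (1 - g)                     # g' for the sigmoid encoder

dOmega_drho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)
dRho_dW = np.outer(x, g_prime)                   # x_i (g'(W^T x_i + b))^T
dOmega_dW = np.outer(x, g_prime * dOmega_drho)   # chain rule, column by column
```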
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function. It does so by adding the following regularization term to the loss function:

$$\Omega(\theta) = \|J_x(h)\|_F^2$$

where $J_x(h)$ is the Jacobian of the encoder. Let us see what it looks like.
If the input has $n$ dimensions and the hidden layer has $k$ dimensions, then

$$J_x(h) = \begin{bmatrix}
\frac{\partial h_1}{\partial x_1} & \dots & \frac{\partial h_1}{\partial x_n}\\
\frac{\partial h_2}{\partial x_1} & \dots & \frac{\partial h_2}{\partial x_n}\\
\vdots & & \vdots\\
\frac{\partial h_k}{\partial x_1} & \dots & \frac{\partial h_k}{\partial x_n}
\end{bmatrix}$$

In other words, the $(l, j)$ entry of the Jacobian captures the variation in the output of the $l$-th neuron with a small variation in the $j$-th input.

$$\|J_x(h)\|_F^2 = \sum_{j=1}^{n}\sum_{l=1}^{k}\Big(\frac{\partial h_l}{\partial x_j}\Big)^2$$
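For a one-layer sigmoid encoder $h = g(Wx + b)$ the Jacobian has the closed form $J_x(h) = \operatorname{diag}(h \odot (1-h))\,W$, which makes the penalty cheap to compute. This closed form is standard for such an encoder, though the slides do not derive it; the sketch below assumes it:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 8, 3
W = rng.normal(scale=0.1, size=(k, n))
b = np.zeros(k)
x = rng.random(n)

h = sigmoid(W @ x + b)
J = (h * (1 - h))[:, None] * W    # J_x(h): entry (l, j) = dh_l / dx_j
omega = np.sum(J ** 2)            # ||J_x(h)||_F^2
```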
What is the intuition behind this? Consider $\frac{\partial h_1}{\partial x_1}$: what does it mean if $\frac{\partial h_1}{\partial x_1} = 0$? It means that this neuron is not very sensitive to variations in the input $x_1$.

But doesn't this contradict our other goal of minimizing $\mathscr{L}(\theta)$, which requires $h$ to capture variations in the input?
Indeed it does, and that's the idea. By putting these two contradicting objectives against each other, we ensure that $h$ is sensitive to only very important variations as observed in the training data.
- $\mathscr{L}(\theta)$: capture important variations in the data
- $\Omega(\theta)$: do not capture variations in the data
- Tradeoff: capture only very important variations in the data
Let us try to understand this with the help of an illustration.
[Figure: data points in the $x$-$y$ plane, with most of the variation along direction $u_1$ and little along $u_2$]

Consider the variations in the data along directions $u_1$ and $u_2$. It makes sense to make a neuron sensitive to variations along $u_1$. At the same time, it makes sense to inhibit a neuron from being sensitive to variations along $u_2$ (as the variation there seems to be small noise, unimportant for reconstruction). By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?

Mitesh M. Khapra, CS7015 (Deep Learning): Lecture 7