CS7015 (Deep Learning) : Lecture 7
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising
autoencoders, Sparse autoencoders, Contractive autoencoders
Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 7.1: Introduction to Autoencoders
(Figure: an autoencoder network with input xi, encoder weights W, hidden representation h, decoder weights W∗ and reconstruction x̂i)

An autoencoder is a special type of feed forward neural network which does the following
Encodes its input xi into a hidden representation h: h = g(Wxi + b)
Decodes the input again from this hidden representation: x̂i = f(W∗h + c)
The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon)
Let us consider the case where dim(h) < dim(xi)
If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?
h is a loss-free encoding of xi. It captures all the important characteristics of xi
Do you see an analogy with PCA?
An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder
Let us consider the case when dim(h) ≥ dim(xi)
In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i
Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data
An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder
The Road Ahead
Choice of f(xi) and g(xi)
Choice of loss function
(Figure: an autoencoder on a binary input, e.g. xi = [0 1 1 0 1], with h = g(Wxi + b) and x̂i = f(W∗h + c))

Suppose all our inputs are binary (each xij ∈ {0, 1})
Which of the following functions would be most apt for the decoder?
x̂i = tanh(W∗h + c)
x̂i = W∗h + c
x̂i = logistic(W∗h + c)
Logistic, as it naturally restricts all outputs to be between 0 and 1
g is typically chosen as the sigmoid function
(Figure: an autoencoder on a real valued input, e.g. xi = [0.25 0.5 1.25 3.5 4.5])

Suppose all our inputs are real (each xij ∈ R)
Which of the following functions would be most apt for the decoder?
x̂i = tanh(W∗h + c)
x̂i = W∗h + c
x̂i = logistic(W∗h + c)
What will logistic and tanh do?
They will restrict the reconstructed x̂i to lie between [0, 1] or [-1, 1] whereas we want x̂i ∈ R^n, so a linear decoder x̂i = W∗h + c is the most apt choice
Again, g is typically chosen as the sigmoid function
Consider the case when the inputs are real valued
The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible
This can be formalized using the following objective function:
min_{W,W∗,c,b} (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)²
i.e., min_{W,W∗,c,b} (1/m) Σ_{i=1}^{m} (x̂i − xi)^T (x̂i − xi)
We can then train the autoencoder just like a regular feedforward network using backpropagation
All we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W, which we will see now
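As a concrete illustration (this sketch is not part of the lecture; the layer sizes, optimizer and dummy data are assumptions), such an autoencoder with a sigmoid encoder, a linear decoder and the squared error loss can be trained with backpropagation as follows in Python:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n=784, k=64):              # n = dim(xi), k = dim(h); illustrative sizes
        super().__init__()
        self.enc = nn.Linear(n, k)                # holds W and b
        self.dec = nn.Linear(k, n)                # holds W* and c
    def forward(self, x):
        h = torch.sigmoid(self.enc(x))            # h = g(W xi + b) with g = sigmoid
        return self.dec(h)                        # x̂i = W* h + c (linear decoder for real inputs)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                            # squared error, averaged over the batch

x = torch.randn(32, 784)                          # dummy batch standing in for real data
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), x)                   # reconstruct the input itself
    loss.backward()                               # backpropagation computes dL/dW and dL/dW*
    opt.step()

Here loss.backward() computes exactly the derivatives ∂L(θ)/∂W∗ and ∂L(θ)/∂W that are derived by hand next.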
L(θ) = (x̂i − xi)^T (x̂i − xi)

(Figure: the autoencoder drawn as a feedforward network with layers h0 = xi, a1, h1, a2, h2 = x̂i and weights W, W∗)

Note that the loss function is shown for only one training example.

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗
∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the expressions in the boxes when we learnt backpropagation

∂L(θ)/∂h2 = ∂L(θ)/∂x̂i = ∇_{x̂i} {(x̂i − xi)^T (x̂i − xi)} = 2(x̂i − xi)
Consider the case when the inputs are binary
We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities.
For a single n-dimensional ith input we can use the following (cross-entropy) loss function
min {− Σ_{j=1}^{n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))}
What value of x̂ij will minimize this function?
If xij = 1?
If xij = 0?
Indeed the above function will be minimized when x̂ij = xij!
Again, all we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W to use backpropagation
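A quick numerical check of this claim (a hedged sketch with a made-up binary input) shows the cross-entropy is essentially zero when x̂ij = xij and larger otherwise:

import numpy as np

def cross_entropy(x, x_hat):
    # reconstruction loss for binary x and sigmoid outputs x_hat
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([0., 1., 1., 0., 1.])                     # a binary input, as in the figure
print(cross_entropy(x, np.clip(x, 1e-7, 1 - 1e-7)))    # ~0: minimized when x̂ij = xij
print(cross_entropy(x, np.full(5, 0.5)))               # ~3.47: larger for an uninformative x̂i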
L(θ) = − Σ_{j=1}^{n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))

∂L(θ)/∂W∗ = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂W∗
∂L(θ)/∂W = ∂L(θ)/∂h2 · ∂h2/∂a2 · ∂a2/∂h1 · ∂h1/∂a1 · ∂a1/∂W

We have already seen how to calculate the expressions in the square boxes when we learnt backpropagation

The first two terms on the RHS can be computed, for each output dimension j, as:
∂L(θ)/∂h2j = − xij/x̂ij + (1 − xij)/(1 − x̂ij)
∂h2j/∂a2j = σ(a2j)(1 − σ(a2j))
Stacking these entries gives the vector ∂L(θ)/∂h2 = [∂L(θ)/∂h21, ∂L(θ)/∂h22, ..., ∂L(θ)/∂h2n]^T
Module 7.2: Link between PCA and Autoencoders
(Figure: the autoencoder xi → h → x̂i shown as equivalent (≡) to PCA, which finds the principal directions u1, u2 satisfying P^T X^T X P = D)

We will now see that the encoder part of an autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use squared error loss function
normalize the inputs to x̂ij = (1/√m) (xij − (1/m) Σ_{k=1}^{m} xkj)
First let us consider the implication of normalizing the inputs to
x̂ij = (1/√m) (xij − (1/m) Σ_{k=1}^{m} xkj)
The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean)
Let X0 be this zero mean data matrix; then what the above normalization gives us is X = (1/√m) X0
Now X^T X = (1/m) (X0)^T X0 is the covariance matrix (recall that the covariance matrix plays an important role in PCA)
First we will show that if we use a linear decoder and a squared error loss function then
the optimal solution to the following objective function
(1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (xij − x̂ij)²
is obtained when we use a linear encoder.
min_θ Σ_{i=1}^{m} Σ_{j=1}^{n} (xij − x̂ij)²    (1)

This is equivalent to
min_{W∗, H} ‖X − HW∗‖²_F
where ‖A‖_F = √(Σ_{i=1}^{m} Σ_{j=1}^{n} a²ij) is the Frobenius norm
(just writing the expression (1) in matrix form and using the definition of ‖A‖_F) (we are ignoring the biases)

From SVD we know that the optimal solution to the above problem is given by
HW∗ = U_{·,≤k} Σ_{k,k} V^T_{·,≤k}
By matching variables one possible solution is
H = U_{·,≤k} Σ_{k,k}
W∗ = V^T_{·,≤k}
We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U_{·,≤k} Σ_{k,k}
  = (XX^T)(XX^T)^{-1} U_{·,≤k} Σ_{k,k}                  (pre-multiplying (XX^T)(XX^T)^{-1} = I)
  = (XVΣ^T U^T)(UΣV^T VΣ^T U^T)^{-1} U_{·,≤k} Σ_{k,k}    (using X = UΣV^T)
  = XVΣ^T U^T (UΣΣ^T U^T)^{-1} U_{·,≤k} Σ_{k,k}          (V^T V = I)
  = XVΣ^T U^T U(ΣΣ^T)^{-1} U^T U_{·,≤k} Σ_{k,k}          ((ABC)^{-1} = C^{-1} B^{-1} A^{-1})
  = XVΣ^T (ΣΣ^T)^{-1} U^T U_{·,≤k} Σ_{k,k}               (U^T U = I)
  = XVΣ^T (Σ^T)^{-1} Σ^{-1} U^T U_{·,≤k} Σ_{k,k}          ((AB)^{-1} = B^{-1} A^{-1})
  = XVΣ^{-1} I_{·,≤k} Σ_{k,k}                             (U^T U_{·,≤k} = I_{·,≤k})
  = XV I_{·,≤k}                                            (Σ^{-1} I_{·,≤k} = Σ^{-1}_{k,k})
H = XV_{·,≤k}

Thus H is a linear transformation of X and W = V_{·,≤k}
We have encoder W = V_{·,≤k}
From SVD, we know that V is the matrix of eigenvectors of X^T X
From PCA, we know that P is the matrix of the eigenvectors of the covariance matrix
We saw earlier that, if entries of X are normalized by
x̂ij = (1/√m) (xij − (1/m) Σ_{k=1}^{m} xkj)
then X^T X is indeed the covariance matrix
Thus, the encoder matrix for the linear autoencoder (W) and the projection matrix (P) for PCA could indeed be the same. Hence proved.
Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to x̂ij = (1/√m) (xij − (1/m) Σ_{k=1}^{m} xkj)
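This equivalence can also be checked numerically; the following sketch (random data, sizes chosen arbitrarily) compares the optimal linear encoder W = V_{·,≤k} obtained from the SVD of the normalized X with the top-k eigenvectors of the covariance matrix used by PCA:

import numpy as np

m, n, k = 500, 10, 3
X0 = np.random.randn(m, n)
X0 = X0 - X0.mean(axis=0)              # zero mean along each dimension j
X = X0 / np.sqrt(m)                    # the normalization described above

U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                           # linear AE encoder: top-k right singular vectors of X

eigvals, vecs = np.linalg.eigh(X.T @ X)             # X^T X is the covariance matrix here
P = vecs[:, np.argsort(eigvals)[::-1][:k]]          # PCA projection: top-k eigenvectors

# Columns agree up to sign, so |W^T P| should be (close to) the identity matrix
print(np.round(np.abs(W.T @ P), 3))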
Module 7.3: Regularization in autoencoders (Motivation)
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders
Here, (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i
To avoid poor generalization, we need to introduce regularization
The simplest solution is to add an L2-regularization term to the objective function
min_{θ = {W, W∗, b, c}} (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)² + λ‖θ‖²
This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters)
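In practice (an illustrative note, not from the lecture) this is the familiar weight decay option of most optimizers; for the PyTorch sketch shown earlier it is a one-line change:

import torch
# λ is passed as weight_decay; model is the autoencoder from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)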
Another trick is to tie the weights of the encoder and decoder, i.e., W∗ = W^T
This effectively reduces the capacity of the autoencoder and acts as a regularizer
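A sketch of weight tying (assuming, as before, a sigmoid encoder and a linear decoder; sizes are illustrative) keeps a single matrix W and reuses its transpose as W∗:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    def __init__(self, n=784, k=64):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(k, n))    # shared weight matrix W
        self.b = nn.Parameter(torch.zeros(k))               # encoder bias b
        self.c = nn.Parameter(torch.zeros(n))               # decoder bias c
    def forward(self, x):
        h = torch.sigmoid(F.linear(x, self.W, self.b))      # h = g(W xi + b)
        return F.linear(h, self.W.t(), self.c)               # x̂i = W^T h + c, i.e. W* = W^T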
Module 7.4: Denoising Autoencoders
(Figure: a denoising autoencoder: xi is corrupted into x̃i, which is encoded into h and decoded into x̂i)

A denoising autoencoder simply corrupts the input data using a probabilistic process (P(x̃ij | xij)) before feeding it to the network
A simple P(x̃ij | xij) used in practice is the following
P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q
In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is
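A sketch of this corruption step (and of the additive Gaussian variant mentioned later in this module) might look as follows; the value of q and the dummy data are illustrative:

import numpy as np

rng = np.random.default_rng(0)

def mask_noise(x, q=0.25):
    # with probability q each x_ij is set to 0, otherwise it is retained as it is
    keep = rng.random(x.shape) >= q
    return x * keep

def gaussian_noise(x):
    # alternative corruption: x̃_ij = x_ij + N(0, 1)
    return x + rng.normal(0.0, 1.0, size=x.shape)

x = rng.random((4, 784))            # dummy batch
x_tilde = mask_noise(x, q=0.25)     # x̃ is fed to the encoder; the loss still compares x̂ with the clean x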
How does this help?
This helps because the objective is still to reconstruct the original (uncorrupted) xi
arg min_θ (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)²
It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so)
Instead the model will now have to capture the characteristics of the data correctly.
For example, it will have to learn to reconstruct a corrupted xij correctly by relying on its interactions with other elements of xi
We will now see a practical application in which AEs are used and then compare Denoising Autoencoders with regular autoencoders
Task: Hand-written digit recognition (MNIST data, digits 0 to 9; |xi| = 784 = 28 × 28)
(Figure: Basic approach: we use the raw 784-dimensional data as input features for the classifier)
(Figure: AE approach: first train an autoencoder (h ∈ R^d, x̂i ∈ R^784) to learn the important characteristics of the data, and then train a classifier on top of this hidden representation)
We will now see a way of visualizing AEs and use this visualization to compare different AEs
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi
For example,
h1 = σ(W1^T xi)    [ignoring bias b]
where W1 is the trained vector of weights connecting the input to the first hidden neuron
What values of xi will cause h1 to be maximum (or maximally activated)?
Suppose we assume that our inputs are normalized so that ‖xi‖ = 1
max_{xi} {W1^T xi}   s.t.   ‖xi‖² = xi^T xi = 1
Solution: xi = W1 / √(W1^T W1)
Thus the inputs
xi = W1/√(W1^T W1), W2/√(W2^T W2), ..., Wn/√(Wn^T Wn)
will respectively cause hidden neurons 1 to n to maximally fire
Let us plot these images (xi's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders
These xi's are computed by the above formula using the weights (W1, W2, ..., Wk) learned by the respective autoencoders
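A sketch of how these images could be computed and displayed (assuming a learned encoder weight matrix W of shape k × 784; here a random W stands in for the learned one):

import numpy as np
import matplotlib.pyplot as plt

def max_activating_inputs(W):
    # row l of W is W_l; the maximally activating input is W_l / sqrt(W_l^T W_l)
    return W / np.sqrt((W ** 2).sum(axis=1, keepdims=True))

W = np.random.randn(64, 784)                      # stand-in for learned encoder weights
filters = max_activating_inputs(W)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.ravel(), filters):
    ax.imshow(f.reshape(28, 28), cmap="gray")     # view each filter as a 28 x 28 image
    ax.axis("off")
plt.show()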
(Figures: the maximally activating inputs for a vanilla AE (no noise), a 25% denoising AE (q = 0.25) and a 50% denoising AE (q = 0.5))

The vanilla AE does not learn many meaningful patterns
The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or a '8' or a '9')
As the noise increases the filters become more wide because the neuron has to rely on more adjacent pixels to feel confident about a stroke
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero
Another way of corrupting the inputs is to add a Gaussian noise to the input
x̃ij = xij + N(0, 1)
We will now use such a denoising AE on a different dataset and see its performance
(Figures: the data, the filters learned by the AE, and the filters learned with weight decay)

The hidden neurons essentially behave like edge detectors
PCA does not give such edge detectors
Module 7.5: Sparse Autoencoders
A hidden neuron with sigmoid activation will have values between 0 and 1
We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.
A sparse autoencoder tries to ensure the neuron is inactive most of the times.
The average value of the activation of a neuron l is given by
ρ̂l = (1/m) Σ_{i=1}^{m} h(xi)l
If the neuron l is sparse (i.e. mostly inactive) then ρ̂l → 0
A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say, 0.005) and tries to enforce the constraint ρ̂l = ρ
One way of ensuring this is to add the following term to the objective function
Ω(θ) = Σ_{l=1}^{k} [ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l))]
When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
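As a sketch (the batch of activations and the value of ρ below are made up), Ω(θ) is a sum of KL divergences between a Bernoulli(ρ) and a Bernoulli(ρ̂l) distribution, one per hidden neuron:

import numpy as np

def sparsity_penalty(H, rho=0.005):
    # H: (m, k) matrix whose rows are the hidden activations h(xi)
    rho_hat = H.mean(axis=0)                       # ρ̂_l for each hidden neuron l
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

H = np.random.uniform(0.001, 0.02, size=(100, 64))   # mostly inactive sigmoid activations
print(sparsity_penalty(H))                            # small, since every ρ̂_l is already near ρ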
(Figure: plot of Ω(θ) as a function of ρ̂l, for ρ = 0.2)
The function will reach its minimum value(s) when ρ̂l = ρ.
44/55
Now,
ˆL (θ) = L (θ) + Ω(θ)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55
Now,
ˆL (θ) = L (θ) + Ω(θ)
L (θ) is the squared error loss or
cross entropy loss and Ω(θ) is the
sparsity constraint.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55
Now,
ˆL (θ) = L (θ) + Ω(θ)
L (θ) is the squared error loss or
cross entropy loss and Ω(θ) is the
sparsity constraint.
We already know how to calculate
∂L (θ)
∂W
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55
Now,
ˆL (θ) = L (θ) + Ω(θ)
L (θ) is the squared error loss or
cross entropy loss and Ω(θ) is the
sparsity constraint.
We already know how to calculate
∂L (θ)
∂W
Let us see how to calculate ∂Ω(θ)
∂W .
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55
Ω(θ) =
k
l=1
ρlog
ρ
ˆρl
+ (1 − ρ)log
1 − ρ
1 − ˆρl
Now,
ˆL (θ) = L (θ) + Ω(θ)
L (θ) is the squared error loss or
cross entropy loss and Ω(θ) is the
sparsity constraint.
We already know how to calculate
∂L (θ)
∂W
Let us see how to calculate ∂Ω(θ)
∂W .
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55
Ω(θ) =
k
l=1
ρlog
ρ
ˆρl
+ (1 − ρ)log
1 − ρ
1 − ˆρl
Can be re-written as
Ω(θ) =
k
l=1
ρlogρ−ρlogˆρl+(1−ρ)log(1−ρ)−(1−ρ)log(1−ˆρl)
Now,
ˆL (θ) = L (θ) + Ω(θ)
L(θ) is the squared error loss or cross entropy loss and Ω(θ) is the sparsity constraint.
We already know how to calculate \frac{\partial L(\theta)}{\partial W}
Let us see how to calculate \frac{\partial \Omega(\theta)}{\partial W}.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55

\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}

Can be re-written as

\Omega(\theta) = \sum_{l=1}^{k} \left[ \rho \log \rho - \rho \log \hat{\rho}_l + (1 - \rho) \log (1 - \rho) - (1 - \rho) \log (1 - \hat{\rho}_l) \right]

By chain rule:

\frac{\partial \Omega(\theta)}{\partial W} = \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} \cdot \frac{\partial \hat{\rho}}{\partial W}

\frac{\partial \Omega(\theta)}{\partial \hat{\rho}} = \left[ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_1}, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_2}, \dots, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_k} \right]^T

For each neuron l ∈ 1, ..., k in the hidden layer, we have

\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l} = -\frac{\rho}{\hat{\rho}_l} + \frac{1 - \rho}{1 - \hat{\rho}_l}

and

\frac{\partial \hat{\rho}_l}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T \quad \text{(see next slide)}

Finally,

\frac{\partial \hat{L}(\theta)}{\partial W} = \frac{\partial L(\theta)}{\partial W} + \frac{\partial \Omega(\theta)}{\partial W}

(and we know how to calculate both terms on the R.H.S.)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
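To make the two pieces above concrete, here is a minimal NumPy sketch (not from the lecture; the shapes and names are illustrative assumptions) that computes Ω(θ) and the vector ∂Ω(θ)/∂ρ̂ for a sigmoid encoder, assuming the data matrix X has shape (m, n) and the encoder weights W have shape (n, k) so that the pre-activations are XW + b:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed shapes (illustrative): X is (m, n), W is (n, k), b is (k,)
def sparsity_penalty(X, W, b, rho=0.05):
    H = sigmoid(X @ W + b)          # hidden activations, shape (m, k)
    rho_hat = H.mean(axis=0)        # estimated average activation of each neuron, shape (k,)

    # Omega(theta) = sum_l [ rho log(rho/rho_hat_l) + (1-rho) log((1-rho)/(1-rho_hat_l)) ]
    omega = np.sum(rho * np.log(rho / rho_hat)
                   + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

    # dOmega / d rho_hat_l = -rho / rho_hat_l + (1 - rho) / (1 - rho_hat_l)
    d_omega_d_rho_hat = -rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat)
    return omega, d_omega_d_rho_hat
```

Here d_omega_d_rho_hat is exactly the k-dimensional vector [∂Ω/∂ρ̂_1, ..., ∂Ω/∂ρ̂_k]^T from the slide; combining it with ∂ρ̂/∂W (derived on the next slide) gives ∂Ω(θ)/∂W.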
45/55
Derivation

\frac{\partial \hat{\rho}}{\partial W} = \left[ \frac{\partial \hat{\rho}_1}{\partial W} \;\; \frac{\partial \hat{\rho}_2}{\partial W} \;\; \dots \;\; \frac{\partial \hat{\rho}_k}{\partial W} \right]

For each element in the above equation we can calculate \frac{\partial \hat{\rho}_l}{\partial W} (the partial derivative of a scalar w.r.t. a matrix, which is itself a matrix). For a single element W_{jl} of the matrix:

\frac{\partial \hat{\rho}_l}{\partial W_{jl}} = \frac{\partial \left[ \frac{1}{m} \sum_{i=1}^{m} g\left( W_{:,l}^T x_i + b_l \right) \right]}{\partial W_{jl}}
= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial\, g\left( W_{:,l}^T x_i + b_l \right)}{\partial W_{jl}}
= \frac{1}{m} \sum_{i=1}^{m} g'\left( W_{:,l}^T x_i + b_l \right) x_{ij}

So in matrix notation, the contribution of a single example x_i can be written as

\frac{\partial \hat{\rho}}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T

whose (j, l) entry is g'(W_{:,l}^T x_i + b_l)\, x_{ij}; averaging this over the m examples gives \frac{\partial \hat{\rho}_l}{\partial W_{jl}} as above.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
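The chain rule and the derivation above can also be checked numerically. The sketch below (same assumed shapes as before: X of shape (m, n), W of shape (n, k)) assembles ∂Ω(θ)/∂W element-wise as ∂Ω/∂ρ̂_l · ∂ρ̂_l/∂W_{jl} and compares one entry against a finite-difference estimate; the toy data and the choice of (j, l) are arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def omega(X, W, b, rho=0.05):
    rho_hat = sigmoid(X @ W + b).mean(axis=0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def omega_grad_W(X, W, b, rho=0.05):
    m = X.shape[0]
    H = sigmoid(X @ W + b)                                          # (m, k)
    rho_hat = H.mean(axis=0)                                        # (k,)
    d_omega_d_rho_hat = -rho / rho_hat + (1 - rho) / (1 - rho_hat)  # (k,)
    g_prime = H * (1 - H)                                           # g'(a) for the sigmoid, (m, k)
    # dOmega/dW_{jl} = dOmega/drho_hat_l * (1/m) sum_i g'(a_il) x_ij
    return X.T @ (g_prime * d_omega_d_rho_hat) / m                  # (n, k)

# Finite-difference check on one randomly chosen weight
rng = np.random.default_rng(0)
X, W, b = rng.normal(size=(8, 5)), rng.normal(size=(5, 3)), rng.normal(size=3)
j, l, eps = 2, 1, 1e-6
W_plus = W.copy(); W_plus[j, l] += eps
W_minus = W.copy(); W_minus[j, l] -= eps
numeric = (omega(X, W_plus, b) - omega(X, W_minus, b)) / (2 * eps)
analytic = omega_grad_W(X, W, b)[j, l]
print(abs(numeric - analytic))   # should be very small
```

A mismatch here usually means the 1/m averaging or the sigmoid derivative g'(a) = g(a)(1 - g(a)) was dropped somewhere.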
46/55
Module 7.6: Contractive Autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
47/55
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.
It does so by adding the following regularization term to the loss function

\Omega(\theta) = \| J_x(h) \|_F^2

where J_x(h) is the Jacobian of the encoder.
Let us see what it looks like.
x → h → ˆx
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
48/55
If the input has n dimensions and the hidden layer has k dimensions then

J_x(h) = \begin{bmatrix}
\frac{\partial h_1}{\partial x_1} & \dots & \frac{\partial h_1}{\partial x_n} \\
\frac{\partial h_2}{\partial x_1} & \dots & \frac{\partial h_2}{\partial x_n} \\
\vdots & & \vdots \\
\frac{\partial h_k}{\partial x_1} & \dots & \frac{\partial h_k}{\partial x_n}
\end{bmatrix}

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the lth neuron with a small variation in the jth input.

\| J_x(h) \|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
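For a sigmoid encoder the Jacobian entries have a closed form, ∂h_l/∂x_j = h_l (1 - h_l) W_{jl} (with the convention, assumed here, that the pre-activation is W^T x + b with W of shape (n, k)), so the penalty can be computed without any explicit differentiation. A minimal sketch under that assumption:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed shapes (illustrative): x is (n,), W is (n, k), b is (k,),
# so the hidden representation is h = sigmoid(W^T x + b)
def contractive_penalty(x, W, b):
    h = sigmoid(x @ W + b)                  # hidden representation, shape (k,)
    # For the sigmoid, dh_l/dx_j = h_l * (1 - h_l) * W_{jl}
    J = (h * (1.0 - h))[:, None] * W.T      # Jacobian J_x(h), shape (k, n)
    return np.sum(J ** 2)                   # squared Frobenius norm ||J_x(h)||_F^2
```

In a deep-learning framework one would typically obtain the same quantity from automatic differentiation; the closed form above is specific to a single sigmoid layer.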
49/55
What is the intuition behind this?
Consider \frac{\partial h_1}{\partial x_1}; what does it mean if \frac{\partial h_1}{\partial x_1} = 0?
It means that this neuron is not very sensitive to variations in the input x_1.
But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?

\| J_x(h) \|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2

x → h → ˆx
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
50/55
Indeed it does, and that's the idea.
By putting these two contradicting objectives against each other we ensure that h is sensitive to only very important variations as observed in the training data.
L(θ): capture important variations in the data
Ω(θ): do not capture variations in the data
Tradeoff: capture only the very important variations in the data

\| J_x(h) \|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2

x → h → ˆx
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
51/55
Let us try to understand this with the help of an illustration.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
52/55
(figure: data points in the x-y plane, with directions of variation u1 and u2)
Consider the variations in the data along directions u1 and u2.
It makes sense to encourage a neuron to be sensitive to variations along u1.
At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2 (as the variation along u2 appears to be small noise, unimportant for reconstruction).
By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.
What does this remind you of?
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
53/55
Module 7.7 : Summary
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
54/55
x → h → ˆx  ≡  PCA: P^T X^T X P = D
(figure: data in the x-y plane with principal directions u1 and u2)

\min_{\theta} \| X - HW^* \|_F^2

whose optimal solution is given by the SVD U \Sigma V^T
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
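The equivalence summarized here can be verified numerically on toy data: with zero-mean inputs scaled by 1/√m, the optimal linear encoding comes from the truncated SVD (H = U_{:,≤k} Σ_{k,k}, W* = V^T_{:,≤k}), and projecting X onto the top-k eigenvectors of X^T X (PCA) gives the same reconstruction. A small sketch under these assumptions (the data and k are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(size=(100, 5))
X = (X0 - X0.mean(axis=0)) / np.sqrt(X0.shape[0])   # zero-mean, 1/sqrt(m)-scaled data

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
H = U[:, :k] * S[:k]          # optimal hidden codes (linear encoder output)
W_star = Vt[:k, :]            # optimal linear decoder
X_hat = H @ W_star            # best rank-k reconstruction of X

# PCA: the top-k eigenvectors of X^T X span the same subspace as the rows of W_star
_, P = np.linalg.eigh(X.T @ X)
P_topk = P[:, ::-1][:, :k]    # top-k principal directions (eigh returns ascending order)
# Projecting X onto the principal subspace reproduces the SVD reconstruction
print(np.allclose(X_hat, X @ P_topk @ P_topk.T))
```

The print statement is expected to output True, since the top-k right singular vectors of X are the top-k eigenvectors of X^T X.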
55/55
(figure: denoising autoencoder, where x_i is corrupted to ˜x_i using P(˜x_{ij} | x_{ij}), encoded to h and reconstructed as ˆx_i)

Regularization:

Weight decay: \Omega(\theta) = \lambda \| \theta \|^2

Sparse: \Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}

Contractive: \Omega(\theta) = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
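As a compact recap of the three regularizers listed above, the following sketch evaluates each Ω(θ) once on made-up toy data for a sigmoid encoder. The shapes, the toy data, and the choice to average the contractive penalty over examples are assumptions for illustration; in training, each Ω would be added to the reconstruction loss with its own weighting hyperparameter:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
m, n, k, rho = 20, 6, 3, 0.05
X = rng.uniform(size=(m, n))                 # toy inputs in [0, 1]
W = rng.normal(scale=0.1, size=(n, k))       # encoder weights (pre-activation X W + b)
b = np.zeros(k)
H = sigmoid(X @ W + b)                       # hidden activations, shape (m, k)

# Weight decay: Omega(theta) = ||theta||^2 (encoder weights only, for illustration)
omega_decay = np.sum(W ** 2)

# Sparse: Omega(theta) = sum_l KL(rho || rho_hat_l)
rho_hat = H.mean(axis=0)
omega_sparse = np.sum(rho * np.log(rho / rho_hat)
                      + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Contractive: ||J_x(h)||_F^2 per example, averaged over the batch;
# for the sigmoid encoder dh_l/dx_j = h_l (1 - h_l) W_{jl}
omega_contractive = np.mean([np.sum(((h * (1 - h))[:, None] * W.T) ** 2) for h in H])

print(omega_decay, omega_sparse, omega_contractive)
```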

Deep learning

  • 1. 1/55 CS7015 (Deep Learning) : Lecture 7 Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders Mitesh M. Khapra Department of Computer Science and Engineering Indian Institute of Technology Madras Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 2. 2/55 Module 7.1: Introduction to Autoencoders Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 3. 3/55 xi W h W∗ ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 4. 3/55 xi W h W∗ ˆxi An autoencoder is a special type of feed forward neural network which does the following Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 5. 3/55 xi W h W∗ ˆxi An autoencoder is a special type of feed forward neural network which does the following Encodes its input xi into a hidden representation h Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 6. 3/55 xi W h W∗ ˆxi h = g(Wxi + b) An autoencoder is a special type of feed forward neural network which does the following Encodes its input xi into a hidden representation h Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 7. 3/55 xi W h W∗ ˆxi h = g(Wxi + b) An autoencoder is a special type of feed forward neural network which does the following Encodes its input xi into a hidden representation h Decodes the input again from this hidden representation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 8. 3/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) An autoencoder is a special type of feed forward neural network which does the following Encodes its input xi into a hidden representation h Decodes the input again from this hidden representation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 9. 3/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) An autoencoder is a special type of feed forward neural network which does the following Encodes its input xi into a hidden representation h Decodes the input again from this hidden representation The model is trained to minimize a certain loss function which will ensure that ˆxi is close to xi (we will see some such loss functions soon) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 10. 4/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 11. 4/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case where dim(h) < dim(xi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 12. 4/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case where dim(h) < dim(xi) If we are still able to reconstruct ˆxi perfectly from h, then what does it say about h? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 13. 4/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case where dim(h) < dim(xi) If we are still able to reconstruct ˆxi perfectly from h, then what does it say about h? h is a loss-free encoding of xi. It cap- tures all the important characteristics of xi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 14. 4/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case where dim(h) < dim(xi) If we are still able to reconstruct ˆxi perfectly from h, then what does it say about h? h is a loss-free encoding of xi. It cap- tures all the important characteristics of xi Do you see an analogy with PCA? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 15. 4/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) An autoencoder where dim(h) < dim(xi) is called an under complete autoencoder Let us consider the case where dim(h) < dim(xi) If we are still able to reconstruct ˆxi perfectly from h, then what does it say about h? h is a loss-free encoding of xi. It cap- tures all the important characteristics of xi Do you see an analogy with PCA? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 16. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 17. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 18. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 19. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 20. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 21. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 22. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 23. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 24. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 25. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 26. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 27. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 28. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Such an identity encoding is useless in practice as it does not really tell us anything about the important char- acteristics of the data Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 29. 5/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) An autoencoder where dim(h) ≥ dim(xi) is called an over complete autoencoder Let us consider the case when dim(h) ≥ dim(xi) In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into ˆxi Such an identity encoding is useless in practice as it does not really tell us anything about the important char- acteristics of the data Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 30. 6/55 The Road Ahead Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 31. 6/55 The Road Ahead Choice of f(xi) and g(xi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 32. 6/55 The Road Ahead Choice of f(xi) and g(xi) Choice of loss function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 33. 7/55 The Road Ahead Choice of f(xi) and g(xi) Choice of loss function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 34. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 35. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Suppose all our inputs are binary (each xij ∈ {0, 1}) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 36. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Suppose all our inputs are binary (each xij ∈ {0, 1}) Which of the following functions would be most apt for the decoder? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 37. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Suppose all our inputs are binary (each xij ∈ {0, 1}) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 38. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Suppose all our inputs are binary (each xij ∈ {0, 1}) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 39. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Suppose all our inputs are binary (each xij ∈ {0, 1}) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 40. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Suppose all our inputs are binary (each xij ∈ {0, 1}) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) Logistic as it naturally restricts all outputs to be between 0 and 1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 41. 8/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) g is typically chosen as the sigmoid function Suppose all our inputs are binary (each xij ∈ {0, 1}) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) Logistic as it naturally restricts all outputs to be between 0 and 1 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 42. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 43. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 44. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 45. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 46. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 47. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 48. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) What will logistic and tanh do? Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 49. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) What will logistic and tanh do? They will restrict the reconstruc- ted ˆxi to lie between [0,1] or [-1,1] whereas we want ˆxi ∈ Rn Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 50. 9/55 0.25 0.5 1.25 3.5 4.5 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (real valued inputs) Again, g is typically chosen as the sigmoid function Suppose all our inputs are real (each xij ∈ R) Which of the following functions would be most apt for the decoder? ˆxi = tanh(W∗ h + c) ˆxi = W∗ h + c ˆxi = logistic(W∗ h + c) What will logistic and tanh do? They will restrict the reconstruc- ted ˆxi to lie between [0,1] or [-1,1] whereas we want ˆxi ∈ Rn Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 51. 10/55 The Road Ahead Choice of f(xi) and g(xi) Choice of loss function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 52. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 53. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Consider the case when the inputs are real valued Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 54. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Consider the case when the inputs are real valued The objective of the autoencoder is to recon- struct ˆxi to be as close to xi as possible Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 55. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Consider the case when the inputs are real valued The objective of the autoencoder is to recon- struct ˆxi to be as close to xi as possible This can be formalized using the following objective function: min W,W ∗,c,b 1 m m i=1 n j=1 (ˆxij − xij)2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 56. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Consider the case when the inputs are real valued The objective of the autoencoder is to recon- struct ˆxi to be as close to xi as possible This can be formalized using the following objective function: min W,W ∗,c,b 1 m m i=1 n j=1 (ˆxij − xij)2 i.e., min W,W ∗,c,b 1 m m i=1 (ˆxi − xi)T (ˆxi − xi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 57. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Consider the case when the inputs are real valued The objective of the autoencoder is to recon- struct ˆxi to be as close to xi as possible This can be formalized using the following objective function: min W,W ∗,c,b 1 m m i=1 n j=1 (ˆxij − xij)2 i.e., min W,W ∗,c,b 1 m m i=1 (ˆxi − xi)T (ˆxi − xi) We can then train the autoencoder just like a regular feedforward network using back- propagation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 58. 11/55 xi W h W∗ ˆxi h = g(Wxi + b) ˆxi = f(W∗ h + c) Consider the case when the inputs are real valued The objective of the autoencoder is to recon- struct ˆxi to be as close to xi as possible This can be formalized using the following objective function: min W,W ∗,c,b 1 m m i=1 n j=1 (ˆxij − xij)2 i.e., min W,W ∗,c,b 1 m m i=1 (ˆxi − xi)T (ˆxi − xi) We can then train the autoencoder just like a regular feedforward network using back- propagation All we need is a formula for ∂L (θ) ∂W ∗ and ∂L (θ) ∂W which we will see now Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 59. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 60. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 61. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 62. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 63. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expres- sion in the boxes when we learnt backpropagation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 64. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expres- sion in the boxes when we learnt backpropagation ∂L (θ) ∂h2 = ∂L (θ) ∂ˆxi Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 65. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expres- sion in the boxes when we learnt backpropagation ∂L (θ) ∂h2 = ∂L (θ) ∂ˆxi = ˆxi {(ˆxi − xi)T (ˆxi − xi)} Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 66. 12/55 L (θ) = (ˆxi − xi)T (ˆxi − xi) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Note that the loss function is shown for only one training example. ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expres- sion in the boxes when we learnt backpropagation ∂L (θ) ∂h2 = ∂L (θ) ∂ˆxi = ˆxi {(ˆxi − xi)T (ˆxi − xi)} = 2(ˆxi − xi) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 67. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 68. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Consider the case when the inputs are binary Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 69. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 70. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. For a single n-dimensional ith input we can use the following loss function min{− n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij))} Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 71. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) What value of ˆxij will minimize this function? Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. For a single n-dimensional ith input we can use the following loss function min{− n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij))} Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 72. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) What value of ˆxij will minimize this function? If xij = 1 ? Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. For a single n-dimensional ith input we can use the following loss function min{− n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij))} Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 73. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) What value of ˆxij will minimize this function? If xij = 1 ? If xij = 0 ? Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. For a single n-dimensional ith input we can use the following loss function min{− n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij))} Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 74. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) What value of ˆxij will minimize this function? If xij = 1 ? If xij = 0 ? Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. For a single n-dimensional ith input we can use the following loss function min{− n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij))} Again we need is a formula for ∂L (θ) ∂W ∗ and ∂L (θ) ∂W to use backpropagation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 75. 13/55 0 1 1 0 1 xi h = g(Wxi + b) W W∗ ˆxi = f(W∗ h + c) (binary inputs) What value of ˆxij will minimize this function? If xij = 1 ? If xij = 0 ? Indeed the above function will be minimized when ˆxij = xij ! Consider the case when the inputs are binary We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities. For a single n-dimensional ith input we can use the following loss function min{− n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij))} Again we need is a formula for ∂L (θ) ∂W ∗ and ∂L (θ) ∂W to use backpropagation Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 76. 14/55 L (θ) = − n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij)) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 77. 14/55 L (θ) = − n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij)) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 78. 14/55 L (θ) = − n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij)) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 79. 14/55 L (θ) = − n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij)) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expressions in the square boxes when we learnt BP Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 80. 14/55 L (θ) = − n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij)) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expressions in the square boxes when we learnt BP The first two terms on RHS can be computed as: ∂L (θ) ∂h2j = − xij ˆxij + 1 − xij 1 − ˆxij ∂h2j ∂a2j = σ(a2j)(1 − σ(a2j)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 81. 14/55 L (θ) = − n j=1 (xij log ˆxij + (1 − xij) log(1 − ˆxij)) h0 = xi h1 a1 h2 = ˆxi a2 W W∗ ∂L (θ) ∂h2 =        ∂L (θ) ∂h2n ... ∂L (θ) ∂h22 ∂L (θ) ∂h21        ∂L (θ) ∂W∗ = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂W∗ ∂L (θ) ∂W = ∂L (θ) ∂h2 ∂h2 ∂a2 ∂a2 ∂h1 ∂h1 ∂a1 ∂a1 ∂W We have already seen how to calculate the expressions in the square boxes when we learnt BP The first two terms on RHS can be computed as: ∂L (θ) ∂h2j = − xij ˆxij + 1 − xij 1 − ˆxij ∂h2j ∂a2j = σ(a2j)(1 − σ(a2j)) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 82. 15/55 Module 7.2: Link between PCA and Autoencoders Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 83. 16/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 We will now see that the encoder part of an autoencoder is equivalent to PCA if we Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 84. 16/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 We will now see that the encoder part of an autoencoder is equivalent to PCA if we use a linear encoder Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 85. 16/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 We will now see that the encoder part of an autoencoder is equivalent to PCA if we use a linear encoder use a linear decoder Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 86. 16/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 We will now see that the encoder part of an autoencoder is equivalent to PCA if we use a linear encoder use a linear decoder use squared error loss function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 87. 16/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 We will now see that the encoder part of an autoencoder is equivalent to PCA if we use a linear encoder use a linear decoder use squared error loss function normalize the inputs to ˆxij = 1 √ m xij − 1 m m k=1 xkj Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 88. 17/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First let us consider the implication of normalizing the inputs to ˆxij = 1 √ m xij − 1 m m k=1 xkj Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 89. 17/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First let us consider the implication of normalizing the inputs to ˆxij = 1 √ m xij − 1 m m k=1 xkj The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 90. 17/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First let us consider the implication of normalizing the inputs to ˆxij = 1 √ m xij − 1 m m k=1 xkj The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean) Let X be this zero mean data mat- rix then what the above normaliza- tion gives us is X = 1√ m X Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 91. 17/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First let us consider the implication of normalizing the inputs to ˆxij = 1 √ m xij − 1 m m k=1 xkj The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean) Let X be this zero mean data mat- rix then what the above normaliza- tion gives us is X = 1√ m X Now (X)T X = 1 m (X )T X is the co- variance matrix (recall that covari- ance matrix plays an important role in PCA) Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 92. 18/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 93. 18/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First we will show that if we use lin- ear decoder and a squared error loss function then Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 94. 18/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First we will show that if we use lin- ear decoder and a squared error loss function then The optimal solution to the following objective function Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 95. 18/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First we will show that if we use lin- ear decoder and a squared error loss function then The optimal solution to the following objective function 1 m m i=1 n j=1 (xij − ˆxij)2 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
  • 96. 18/55 xi h ˆxi ≡ PCA PT XT XP = D y x u1 u2 First we will show that if we use lin- ear decoder and a squared error loss function then The optimal solution to the following objective function 1 m m i=1 n j=1 (xij − ˆxij)2 is obtained when we use a linear en- coder. Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
19/55
min_θ Σ_{i=1}^{m} Σ_{j=1}^{n} (xij − x̂ij)²   (1)
This is equivalent to
min_{W*, H} ||X − HW*||²_F
where ||A||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} a²ij )
(just writing expression (1) in matrix form and using the definition of ||A||_F; we are ignoring the biases)
From SVD we know that the optimal solution to the above problem is given by
HW* = U_{.,≤k} Σ_{k,k} V_{.,≤k}ᵀ
By matching variables one possible solution is
H = U_{.,≤k} Σ_{k,k}
W* = V_{.,≤k}ᵀ
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
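The claim that the truncated SVD minimizes this Frobenius reconstruction error (the Eckart–Young result used above) can be checked numerically. A minimal sketch, assuming NumPy and illustrative matrix sizes:

```python
# Verify that H = U_k Sigma_k, W* = V_k^T attains the minimum Frobenius
# reconstruction error among rank-k factorizations (Eckart-Young).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
k = 5

U, S, Vt = np.linalg.svd(X, full_matrices=False)
H = U[:, :k] * S[:k]              # U_{.,<=k} Sigma_{k,k}
W_star = Vt[:k, :]                # V_{.,<=k}^T

svd_error = np.linalg.norm(X - H @ W_star, 'fro') ** 2
optimal = np.sum(S[k:] ** 2)      # sum of the discarded squared singular values
print(np.isclose(svd_error, optimal))  # True
```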
20/55
We will now show that H is a linear encoding and find an expression for the encoder weights W
H = U_{.,≤k} Σ_{k,k}
  = (XXᵀ)(XXᵀ)⁻¹ U_{.,≤k} Σ_{k,k}                    (pre-multiplying by (XXᵀ)(XXᵀ)⁻¹ = I)
  = (X V ΣᵀUᵀ)(U Σ Vᵀ V ΣᵀUᵀ)⁻¹ U_{.,≤k} Σ_{k,k}      (using X = UΣVᵀ)
  = X V ΣᵀUᵀ (U Σ ΣᵀUᵀ)⁻¹ U_{.,≤k} Σ_{k,k}            (VᵀV = I)
  = X V ΣᵀUᵀ U (ΣΣᵀ)⁻¹ Uᵀ U_{.,≤k} Σ_{k,k}            ((ABC)⁻¹ = C⁻¹B⁻¹A⁻¹)
  = X V Σᵀ (ΣΣᵀ)⁻¹ Uᵀ U_{.,≤k} Σ_{k,k}                (UᵀU = I)
  = X V Σᵀ (Σᵀ)⁻¹ Σ⁻¹ Uᵀ U_{.,≤k} Σ_{k,k}             ((AB)⁻¹ = B⁻¹A⁻¹)
  = X V Σ⁻¹ I_{.,≤k} Σ_{k,k}                          (Uᵀ U_{.,≤k} = I_{.,≤k})
  = X V I_{.,≤k}                                      (Σ⁻¹ I_{.,≤k} = I_{.,≤k} Σ⁻¹_{k,k})
H = X V_{.,≤k}
Thus H is a linear transformation of X and W = V_{.,≤k}
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
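The conclusion H = X V_{.,≤k} is also easy to verify numerically: the codes obtained from the SVD factors coincide with the linear encoding of X by the first k right singular vectors. A small sketch (illustrative sizes, not the lecture's data):

```python
# Check that U_{.,<=k} Sigma_{k,k} equals X V_{.,<=k}, i.e. H is a linear encoding of X.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 30))
k = 4

U, S, Vt = np.linalg.svd(X, full_matrices=False)
H_from_svd = U[:, :k] * S[:k]     # U_{.,<=k} Sigma_{k,k}
H_linear = X @ Vt[:k, :].T        # X V_{.,<=k}  -> linear encoder with W = V_{.,<=k}

print(np.allclose(H_from_svd, H_linear))  # True
```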
21/55
We have encoder W = V_{.,≤k}
From SVD, we know that V is the matrix of eigenvectors of XᵀX
From PCA, we know that P is the matrix of eigenvectors of the covariance matrix
We saw earlier that, if the entries of X are normalized by
x̂ij = (1/√m) (xij − (1/m) Σ_{k=1}^{m} xkj)
then XᵀX is indeed the covariance matrix
Thus, the encoder matrix of the linear autoencoder (W) and the projection matrix of PCA (P) could indeed be the same. Hence proved.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
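A quick numerical sanity check of this equivalence (a sketch with synthetic data, not from the lecture): the right singular vectors of the normalized data matrix match, up to sign, the eigenvectors of the covariance matrix that PCA would use.

```python
# Right singular vectors of the normalized X vs. eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(2)
m, n = 1000, 6
X_raw = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))   # correlated data

X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)       # the normalization from the slides
_, _, Vt = np.linalg.svd(X, full_matrices=False)

eigvals, P = np.linalg.eigh(X.T @ X)                # eigenvectors of the covariance matrix
P = P[:, np.argsort(eigvals)[::-1]]                 # sort by decreasing eigenvalue

# Columns of P and rows of Vt agree up to a sign flip
print(np.allclose(np.abs(P), np.abs(Vt.T), atol=1e-6))
```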
22/55
Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to
x̂ij = (1/√m) (xij − (1/m) Σ_{k=1}^{m} xkj)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
23/55
Module 7.3: Regularization in autoencoders (Motivation)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
24/55
xi W h W∗ x̂i
While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders
Here (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i
To avoid poor generalization, we need to introduce regularization
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
25/55
xi W h W∗ x̂i
The simplest solution is to add an L2-regularization term to the objective function
min_θ (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)² + λ||θ||²   where θ = {W, W∗, b, c}
This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
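A minimal sketch of this regularized objective and the extra gradient term (names and shapes are illustrative, not from the lecture's code):

```python
import numpy as np

def l2_regularized_loss(X, X_hat, params, lam):
    """Mean squared reconstruction error plus lam * ||theta||^2."""
    reconstruction = np.mean(np.sum((X_hat - X) ** 2, axis=1))
    l2 = sum(np.sum(p ** 2) for p in params)       # ||theta||^2 over W, W*, b, c
    return reconstruction + lam * l2

def l2_gradient_term(W, lam):
    # Derivative of lam * ||W||^2 w.r.t. W; the slide writes the added term
    # simply as lam * W (the constant factor absorbed into lambda).
    return 2 * lam * W
```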
26/55
xi W h W∗ x̂i
Another trick is to tie the weights of the encoder and decoder, i.e., W∗ = Wᵀ
This effectively reduces the capacity of the autoencoder and acts as a regularizer
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
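A small sketch of weight tying, assuming sigmoid activations for both the encoder g and the decoder f (the lecture leaves f and g unspecified here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """Decoder reuses the transpose of the encoder weights, so only W, b, c are learned."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_hidden, n_in)) * 0.01  # shared weights
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias

    def forward(self, x):
        h = sigmoid(self.W @ x + self.b)         # h = g(W x + b)
        x_hat = sigmoid(self.W.T @ h + self.c)   # x_hat = f(W^T h + c), i.e. W* = W^T
        return h, x_hat
```

Tying roughly halves the number of weight parameters, which is exactly the capacity reduction the slide refers to.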
27/55
Module 7.4: Denoising Autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
28/55
xi → x̃i → h → x̂i,  corruption P(x̃ij | xij)
A denoising autoencoder simply corrupts the input data using a probabilistic process P(x̃ij | xij) before feeding it to the network
A simple P(x̃ij | xij) used in practice is the following
P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q
In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
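A minimal sketch of this masking corruption (the function name and RNG handling are illustrative):

```python
import numpy as np

def corrupt(x, q, rng=np.random.default_rng()):
    """Return x_tilde: each entry of x is set to 0 independently with probability q."""
    keep = rng.random(x.shape) >= q    # True (keep) with probability 1 - q
    return x * keep

x = np.array([0.8, 0.1, 0.5, 0.9])
print(corrupt(x, q=0.25))              # e.g. [0.8, 0. , 0.5, 0.9]
```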
29/55
xi → x̃i → h → x̂i,  corruption P(x̃ij | xij)
How does this help?
This helps because the objective is still to reconstruct the original (uncorrupted) xi
arg min_θ (1/m) Σ_{i=1}^{m} Σ_{j=1}^{n} (x̂ij − xij)²
It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so)
Instead the model will now have to capture the characteristics of the data correctly.
For example, it will have to learn to reconstruct a corrupted x̃ij correctly by relying on its interactions with the other elements of xi
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
30/55
We will now see a practical application in which AEs are used and then compare Denoising Autoencoders with regular autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
31/55
Task: Hand-written digit recognition
Figure: MNIST Data
Figure: Basic approach (we use raw data as input features); |xi| = 784 = 28 × 28 raw pixels fed to a classifier over the digits 0, 1, 2, 3, …, 9
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
32/55
Task: Hand-written digit recognition
Figure: MNIST Data
Figure: AE approach (first learn important characteristics of data); |xi| = 784 = 28 × 28, x̂i ∈ R⁷⁸⁴, h ∈ Rᵈ
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
33/55
Task: Hand-written digit recognition
Figure: MNIST Data
Figure: AE approach (and then train a classifier on top of this hidden representation); |xi| = 784 = 28 × 28, h ∈ Rᵈ, classifier outputs 0, 1, 2, 3, …, 9
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
34/55
We will now see a way of visualizing AEs and use this visualization to compare different AEs
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
35/55
xi → h → x̂i
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi
For example, h1 = σ(W1ᵀ xi)   [ignoring bias b]
where W1 is the trained vector of weights connecting the input to the first hidden neuron
What values of xi will cause h1 to be maximum (or maximally activated)?
Suppose we assume that our inputs are normalized so that ||xi|| = 1
max_{xi} {W1ᵀ xi}   s.t.   ||xi||² = xiᵀxi = 1
Solution: xi = W1 / √(W1ᵀ W1)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
36/55
xi → h → x̂i
max_{xi} {W1ᵀ xi}   s.t.   ||xi||² = xiᵀxi = 1
Solution: xi = W1 / √(W1ᵀ W1)
Thus the inputs
xi = W1 / √(W1ᵀ W1), W2 / √(W2ᵀ W2), …, Wn / √(Wnᵀ Wn)
will respectively cause hidden neurons 1 to n to maximally fire
Let us plot these images (xi's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders
These xi's are computed by the above formula using the weights (W1, W2, …, Wk) learned by the respective autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
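A short sketch of how such filter images could be computed from a weight matrix (the weight matrix below is random, standing in for the weights learned by the respective autoencoders):

```python
import numpy as np

W = np.random.randn(784, 100)     # columns = incoming weights of 100 hidden neurons

def max_activating_input(W, l):
    """The unit-norm input that maximally activates hidden neuron l: W_l / sqrt(W_l^T W_l)."""
    w = W[:, l]
    return w / np.sqrt(w @ w)

# Reshape each maximizing input to 28x28 so it can be displayed as an image
filters = np.stack([max_activating_input(W, l).reshape(28, 28) for l in range(100)])
print(filters.shape)              # (100, 28, 28) -> plot as a grid of filter images
```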
37/55
Figure: Vanilla AE (No noise)   Figure: 25% Denoising AE (q = 0.25)   Figure: 50% Denoising AE (q = 0.5)
The vanilla AE does not learn many meaningful patterns
The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or an '8' or a '9')
As the noise increases the filters become wider because each neuron has to rely on more adjacent pixels to feel confident about a stroke
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
38/55
xi → x̃i → h → x̂i,  corruption P(x̃ij | xij)
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero
Another way of corrupting the inputs is to add Gaussian noise to the input
x̃ij = xij + N(0, 1)
We will now use such a denoising AE on a different dataset and see its performance
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
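A minimal sketch of this additive Gaussian corruption (following the slide's unit-variance noise; smaller variances are also common in practice):

```python
import numpy as np

def corrupt_gaussian(x, rng=np.random.default_rng()):
    """Return x_tilde where x_tilde_ij = x_ij + noise drawn from N(0, 1)."""
    return x + rng.standard_normal(x.shape)
```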
39/55
Figure: Data   Figure: AE filters   Figure: Weight decay filters
The hidden neurons essentially behave like edge detectors
PCA does not give such edge detectors
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
40/55
Module 7.5: Sparse Autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
41/55
xi → h → x̂i
A hidden neuron with sigmoid activation will have values between 0 and 1
We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0
A sparse autoencoder tries to ensure that the neuron is inactive most of the time
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
42/55
xi → h → x̂i
The average value of the activation of a neuron l is given by
ρ̂l = (1/m) Σ_{i=1}^{m} h(xi)l
If the neuron l is sparse (i.e., mostly inactive) then ρ̂l → 0
A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say 0.005) and tries to enforce the constraint ρ̂l = ρ
One way of ensuring this is to add the following term to the objective function
Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ / ρ̂l) + (1 − ρ) log((1 − ρ) / (1 − ρ̂l)) ]
When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
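A minimal sketch of computing ρ̂ and this penalty from a matrix of hidden activations (names and the clipping constant are illustrative, not from the lecture):

```python
import numpy as np

def sparsity_penalty(H, rho=0.005):
    """H: (m, k) matrix of hidden activations h(x_i); returns Omega(theta)."""
    rho_hat = H.mean(axis=0)                        # rho_hat_l = (1/m) sum_i h(x_i)_l
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)      # guard against log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```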
43/55
Figure: plot of Ω(θ) versus ρ̂l for ρ = 0.2
The function will reach its minimum value(s) when ρ̂l = ρ.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
44/55
Now, L̂(θ) = L(θ) + Ω(θ)
L(θ) is the squared error loss or cross entropy loss and Ω(θ) is the sparsity constraint.
We already know how to calculate ∂L(θ)/∂W
Let us see how to calculate ∂Ω(θ)/∂W.
Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ / ρ̂l) + (1 − ρ) log((1 − ρ) / (1 − ρ̂l)) ]
can be re-written as
Ω(θ) = Σ_{l=1}^{k} [ ρ log ρ − ρ log ρ̂l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂l) ]
By the chain rule:
∂Ω(θ)/∂W = (∂Ω(θ)/∂ρ̂) · (∂ρ̂/∂W)
∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂1, ∂Ω(θ)/∂ρ̂2, …, ∂Ω(θ)/∂ρ̂k ]ᵀ
For each neuron l ∈ 1 … k in the hidden layer, we have
∂Ω(θ)/∂ρ̂l = −ρ/ρ̂l + (1 − ρ)/(1 − ρ̂l)
and ∂ρ̂l/∂W = xi (g′(Wᵀxi + b))ᵀ   (see next slide)
Finally, ∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W (and we know how to calculate both terms on the R.H.S.)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
45/55
Derivation
∂ρ̂/∂W = [ ∂ρ̂1/∂W, ∂ρ̂2/∂W, …, ∂ρ̂k/∂W ]
For each element in the above equation we can calculate ∂ρ̂l/∂W (which is the partial derivative of a scalar w.r.t. a matrix, i.e., a matrix). For a single element Wjl of the matrix:
∂ρ̂l/∂Wjl = ∂[ (1/m) Σ_{i=1}^{m} g(W:,lᵀ xi + bl) ] / ∂Wjl
         = (1/m) Σ_{i=1}^{m} ∂ g(W:,lᵀ xi + bl) / ∂Wjl
         = (1/m) Σ_{i=1}^{m} g′(W:,lᵀ xi + bl) xij
So in matrix notation we can write it as
∂ρ̂l/∂W = xi (g′(Wᵀxi + b))ᵀ   (this is the contribution of a single example xi; the contributions are averaged over the m examples)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
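Putting the two pieces of the chain rule together, here is a sketch of the gradient of the sparsity term for a sigmoid encoder. A single example is shown and the 1/m averaging over examples is left to the caller; the shapes follow the slide's column-per-neuron convention for W, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparsity_grad_W(x, W, b, rho_hat, rho=0.005):
    """x: (n,), W: (n, k) with one column per hidden neuron, b: (k,), rho_hat: (k,).
    Returns the per-example contribution to dOmega/dW, shape (n, k)."""
    dOmega_drho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)    # (k,)
    a = W.T @ x + b                                             # pre-activations W_{:,l}^T x + b_l
    g_prime = sigmoid(a) * (1 - sigmoid(a))                     # g'(W^T x + b) for the sigmoid
    drho_dW = np.outer(x, g_prime)                              # x_i (g'(W^T x_i + b))^T, shape (n, k)
    return drho_dW * dOmega_drho                                # chain rule, broadcast over columns
```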
46/55
Module 7.6: Contractive Autoencoders
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
47/55
x → h → x̂
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.
It does so by adding the following regularization term to the loss function
Ω(θ) = ||Jx(h)||²_F
where Jx(h) is the Jacobian of the encoder.
Let us see what it looks like.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
48/55
If the input has n dimensions and the hidden layer has k dimensions, then the Jacobian is the k × n matrix
Jx(h) =
[ ∂h1/∂x1 … ∂h1/∂xn ]
[ ∂h2/∂x1 … ∂h2/∂xn ]
[    ⋮          ⋮    ]
[ ∂hk/∂x1 … ∂hk/∂xn ]
In other words, the (l, j) entry of the Jacobian captures the variation in the output of the lth neuron with a small variation in the jth input.
||Jx(h)||²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
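For a sigmoid encoder h = g(Wx + b), each Jacobian entry has the closed form ∂hl/∂xj = hl(1 − hl)Wlj, so the penalty can be computed without explicit differentiation. A minimal sketch under that assumption (the lecture does not fix g at this point):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """x: (n,), W: (k, n), b: (k,). Returns ||J_x(h)||_F^2 for h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)                       # (k,)
    dh_sq = (h * (1 - h)) ** 2                   # (h_l (1 - h_l))^2 for each neuron l
    return np.sum(dh_sq[:, None] * (W ** 2))     # sum_j sum_l (dh_l / dx_j)^2
```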
49/55
x → h → x̂
||Jx(h)||²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²
What is the intuition behind this?
Consider ∂h1/∂x1: what does it mean if ∂h1/∂x1 = 0?
It means that this neuron is not very sensitive to variations in the input x1.
But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
50/55
x → h → x̂
||Jx(h)||²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²
Indeed it does, and that's the idea
By putting these two contradicting objectives against each other we ensure that h is sensitive to only very important variations as observed in the training data.
L(θ) - capture important variations in the data
Ω(θ) - do not capture variations in the data
Tradeoff - capture only very important variations in the data
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
51/55
Let us try to understand this with the help of an illustration.
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
52/55
Figure: data points in the (x, y) plane with principal directions u1 and u2
Consider the variations in the data along directions u1 and u2
It makes sense to make a neuron maximally sensitive to variations along u1
At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2 (as these variations seem to be small noise, unimportant for reconstruction)
By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.
What does this remind you of?
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
53/55
Module 7.7: Summary
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
54/55
x → h → x̂ ≡ PCA (PᵀXᵀXP = D)
min_θ ||X − HW∗||²_F,  with the optimal HW∗ obtained from the SVD X = UΣVᵀ
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
55/55
xi → x̃i → h → x̂i,  corruption P(x̃ij | xij)
Regularization:
Ω(θ) = λ||θ||²   (weight decay)
Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ / ρ̂l) + (1 − ρ) log((1 − ρ) / (1 − ρ̂l)) ]   (sparse)
Ω(θ) = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²   (contractive)
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7