Deep Learning
Marc'Aurelio Ranzato
Facebook A.I. Research
ICVSS – 15 July 2014
www.cs.toronto.edu/~ranzato
https://sites.google.com/site/deeplearningcvpr2014/
6
Traditional Pattern Recognition: fixed feature extractor → unsupervised mid-level features → supervised classifier
VISION: image → SIFT/HOG → K-Means/pooling → classifier → “car”
SPEECH: waveform (“ˈd ē p”) → MFCC → Mixture of Gaussians → classifier
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic → n-grams → classifier → “+”
Ranzato
7
Hierarchical Compositionality (DEEP)
VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/.. → clause → sentence → story
Ranzato
9
What is Deep Learning?
Cascade of non-linear transformations
End-to-end learning
General framework (any hierarchical model is deep)
Ranzato
10
Ranzato
THE SPACE OF
MACHINE LEARNING METHODS
11
Perceptron, Neural Net, Boosting, SVM, GMM, ΣΠ, BayesNP,
Convolutional Neural Net, Recurrent Neural Net, Autoencoder Neural Net, Sparse Coding,
Restricted BM, Deep Belief Net, Deep (sparse/denoising) Autoencoder
Disclaimer: showing only a
subset of the known methods
12
(Same map of methods, now arranged along a SHALLOW ↔ DEEP axis: Perceptron, SVM, Boosting, GMM, BayesNP, etc. on the shallow side; Convolutional/Recurrent/Autoencoder Neural Nets, Deep Belief Nets and Deep Autoencoders on the deep side.)
13
(Same map of methods, with a second axis added: UNSUPERVISED ↔ SUPERVISED, in addition to SHALLOW ↔ DEEP.)
14
(Same map of methods along the SHALLOW ↔ DEEP and UNSUPERVISED ↔ SUPERVISED axes, with the PROBABILISTIC methods highlighted: GMM, ΣΠ, BayesNP, Sparse Coding, Restricted BM, Deep Belief Net.)
15
Deep Learning is B I G
Main types of deep architectures:
Feed-forward: Neural nets, Conv Nets
Feed-back: Hierar. Sparse Coding, Deconv Nets
Bi-directional: Stacked Auto-encoders, DBM
Recurrent: Recurrent Neural nets, Recursive Nets, LISTA
Ranzato
17
Ranzato
Deep Learning is B I G
Main types of learning protocols
Purely supervised
Backprop + SGD
Good when there is lots of labeled data.
Layer-wise unsupervised + supervised linear classifier
Train each layer in sequence using regularized auto-encoders or
RBMs
Keep the feature extractor fixed, train a linear classifier on the features
Good when labeled data is scarce but there is lots of unlabeled
data.
Layer-wise unsupervised + supervised backprop
Train each layer in sequence
Backprop through the whole system
Good when the learning problem is very difficult.
19
Outline Part I
Ranzato
Supervised Neural Networks
Convolutional Neural Networks
Examples
Tips
20
Neural Networks
Ranzato
Assumptions (for the next few slides):
The input image is vectorized (disregard the spatial layout of pixels)
The target label is discrete (classification)
Question: what class of functions shall we consider to map the input
into the output?
Answer: composition of simpler functions.
Follow-up questions: Why not a linear combination? What are the
“simpler” functions? What is the interpretation?
Answer: later...
21
Neural Networks: example
x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = W3 h2
input → 1st layer hidden units → 2nd layer hidden units → output
Example of a 2 hidden layer neural network (or 4 layer network,
counting also input and output).
Ranzato
22
Forward Propagation
Ranzato
Def.: Forward propagation is the process of computing the
output of the network given its input.
23
Forward Propagation
Ranzato
h1 = max(0, W1 x + b1)
x ∈ R^D, W1 ∈ R^(N1 x D), b1 ∈ R^N1, h1 ∈ R^N1
W1: 1st layer weight matrix (weights); b1: 1st layer biases
The non-linearity u = max(0, v) is called ReLU in the DL literature.
Each output hidden unit takes as input all the units at the previous
layer: each such layer is called “fully connected”.
Network: x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = W3 h2
24
Forward Propagation
Ranzato
h2 = max(0, W2 h1 + b2)
h1 ∈ R^N1, W2 ∈ R^(N2 x N1), b2 ∈ R^N2, h2 ∈ R^N2
W2: 2nd layer weight matrix (weights); b2: 2nd layer biases
Network: x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = W3 h2
25
Forward Propagation
Ranzato
o=max 0,W 3
h2
b3

h2
∈R
N 2
W 3
∈R
N 3
×N 2
b3
∈R
N 3
o∈R
N 3
x o
3-rd layer weight matrix or weightsW 3
3-rd layer biasesb3
h2
h1
max 0,W
1
x max0,W
2
h
1
 W
3
h
2
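As a concrete illustration, here is a minimal NumPy sketch of this forward pass (not part of the original slides; the layer sizes are made-up, and the output layer is kept linear so it can feed the softmax loss introduced a few slides later):

```python
import numpy as np

def relu(v):
    # ReLU non-linearity: element-wise max(0, v)
    return np.maximum(0.0, v)

def fprop(x, params):
    """Forward propagation for the 2-hidden-layer network of the slides.
    params = [(W1, b1), (W2, b2), (W3, b3)]."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(W1 @ x + b1)   # 1st layer hidden units
    h2 = relu(W2 @ h1 + b2)  # 2nd layer hidden units
    o = W3 @ h2 + b3         # output scores (kept linear here)
    return o, h1, h2

# Illustrative sizes: D=784 inputs, N1=N2=128 hidden units, N3=10 outputs.
rng = np.random.RandomState(0)
params = [(0.01 * rng.randn(128, 784), np.zeros(128)),
          (0.01 * rng.randn(128, 128), np.zeros(128)),
          (0.01 * rng.randn(10, 128), np.zeros(10))]
o, h1, h2 = fprop(rng.randn(784), params)
```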
26
Alternative Graphical Representation
Ranzato
h^(k+1) = max(0, W^(k+1) h^k)
The same layer can be drawn as a block diagram (h^k → [max(0, W^(k+1) ·)] → h^(k+1)) or as a graph of individual units h_i^k connected to units h_j^(k+1) by weights w_(j,i)^(k+1) (e.g., w_(1,1)^(k+1), w_(3,4)^(k+1)).
27
Interpretation
Ranzato
Question: Why can't the mapping between layers be linear?
Answer: Because composition of linear functions is a linear function.
Neural network would reduce to (1 layer) logistic regression.
Question: What do ReLU layers accomplish?
Answer: Piece-wise linear tiling: mapping is locally linear.
Montufar et al. “On the number of linear regions of DNNs” arXiv 2014
28
Ranzato
(Figure: each ReLU unit is either on [1] or off [0]; the pattern of active units selects one linear piece of the input-output mapping.)
ReLU layers do local linear approximation. The number of planes grows
exponentially with the number of hidden units. Multiple layers yield
exponential savings in number of parameters (parameter sharing).
Montufar et al. “On the number of linear regions of DNNs” arXiv 2014
29
Interpretation
Ranzato
Question: Why do we need many layers?
Answer: When input has hierarchical structure, the use of a
hierarchical architecture is potentially more efficient because
intermediate computations can be re-used. DL architectures are
efficient also because they use distributed representations which
are shared across classes.
[0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 … ]
Exponentially more efficient than a
1-of-N representation (a la k-means)
truck feature
30
Interpretation
Ranzato
[0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 … ]
[1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1… ] motorbike
truck
31
Interpretation
Ranzato
Input image
low level
parts
prediction of class
mid-level
parts
high-level
parts
distributed representations
feature sharing
compositionality
...
Lee et al. “Convolutional DBN's ...” ICML 2009
32
Interpretation
Ranzato
Question: How many layers? How many hidden units?
Answer: Cross-validation or hyper-parameter search methods are
the answer. In general, the wider and the deeper the network the
more complicated the mapping.
Question: What does a hidden unit do?
Answer: It can be thought of as a classifier or feature detector.
Question: How do I set the weight matrices?
Answer: Weight matrices and biases are learned.
First, we need to define a measure of quality of the current mapping.
Then, we need to define a procedure to adjust the parameters.
33
How Good is a Network?
x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = W3 h2 → Loss (with target y)
Probability of class k given the input (softmax):
p(c_k = 1 | x) = exp(o_k) / Σ_{j=1..C} exp(o_j)
(Per-sample) loss; e.g., negative log-likelihood (good for classification
of a small number of classes):
L(x, y; θ) = − Σ_j y_j log p(c_j | x)
y = [0 0 .. 0 1 0 .. 0] (1-of-C target vector, with the 1 at position k)
Ranzato
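A minimal NumPy sketch of the softmax and the per-sample negative log-likelihood above (illustrative code, not from the slides; the max-subtraction is only for numerical stability):

```python
import numpy as np

def softmax(o):
    # p(c_k = 1 | x) = exp(o_k) / sum_j exp(o_j); subtract max(o) for stability
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def nll_loss(o, y):
    # L(x, y) = - sum_j y_j log p(c_j | x), with y a 1-of-C target vector
    p = softmax(o)
    return -np.sum(y * np.log(p))

o = np.array([2.0, -1.0, 0.5])   # output scores for C = 3 classes
y = np.array([1.0, 0.0, 0.0])    # ground-truth class is class 0
print(nll_loss(o, y))            # small loss, since class 0 has the largest score
```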
34
Training
∗
=arg min ∑n=1
P
Lxn
, yn
; 
Learning consists of minimizing the loss (plus some
regularization term) w.r.t. parameters over the whole training set.
Question: How to minimize a complicated function of the
parameters?
Answer: Chain rule, a.k.a. Backpropagation! That is the procedure
to compute gradients of the loss w.r.t. parameters in a multi-layer
neural network.
Rumelhart et al. “Learning internal representations by back-propagating..” Nature 1986
35
Derivative w.r.t. Input of Softmax
L(x, y; θ) = − Σ_j y_j log p(c_j | x)
p(c_k = 1 | x) = exp(o_k) / Σ_j exp(o_j)
By substituting the first formula into the second and taking the
derivative w.r.t. o, we get:
∂L/∂o = p(c | x) − y
HOMEWORK: prove it!
Ranzato
y = [0 0 .. 0 1 0 .. 0] (1-of-C target vector, with the 1 at position k)
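One way to do the “homework” empirically is to compare the analytic gradient p(c|x) − y against finite differences (a NumPy sketch, assuming the softmax/NLL definitions of the previous slide):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def nll(o, y):
    return -np.sum(y * np.log(softmax(o)))

o = np.array([1.5, -0.3, 0.2, 0.9])
y = np.array([0.0, 0.0, 1.0, 0.0])

analytic = softmax(o) - y                     # the claimed dL/do
numeric = np.zeros_like(o)
eps = 1e-6
for k in range(o.size):                       # central finite differences
    d = np.zeros_like(o); d[k] = eps
    numeric[k] = (nll(o + d, y) - nll(o - d, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))     # ~1e-9: the two gradients agree
```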
36
Backward Propagation
Network: x → h1 → h2 → o → Loss (with target y)
Given ∂L/∂o and assuming we can easily compute the
Jacobian of each module, we have:
∂L/∂W3 = ∂L/∂o · ∂o/∂W3
37
Backward Propagation
Network: x → h1 → h2 → o → Loss (with target y)
Given ∂L/∂o and assuming we can easily compute the
Jacobian of each module, we have:
∂L/∂W3 = ∂L/∂o · ∂o/∂W3 = (p(c|x) − y) h2^T
38
Backward Propagation
Network: x → h1 → h2 → o → Loss (with target y)
Given ∂L/∂o and assuming we can easily compute the
Jacobian of each module, we have:
∂L/∂W3 = ∂L/∂o · ∂o/∂W3 = (p(c|x) − y) h2^T
∂L/∂h2 = ∂L/∂o · ∂o/∂h2
39
Backward Propagation
Network: x → h1 → h2 → o → Loss (with target y)
Given ∂L/∂o and assuming we can easily compute the
Jacobian of each module, we have:
∂L/∂W3 = ∂L/∂o · ∂o/∂W3 = (p(c|x) − y) h2^T
∂L/∂h2 = ∂L/∂o · ∂o/∂h2 = W3^T (p(c|x) − y)
40
Backward Propagation
Network: x → h1 → h2 → o → Loss (with target y)
Given ∂L/∂h2 we can now compute:
∂L/∂h1 = ∂L/∂h2 · ∂h2/∂h1
∂L/∂W2 = ∂L/∂h2 · ∂h2/∂W2
Ranzato
41
Backward Propagation
Network: x → h1 → h2 → o → Loss (with target y)
Given ∂L/∂h1 we can now compute:
∂L/∂W1 = ∂L/∂h1 · ∂h1/∂W1
Ranzato
42
Backward Propagation
Ranzato
Question: Does BPROP work with ReLU layers only?
Answer: Nope, any a.e. differentiable transformation works.
Question: What's the computational cost of BPROP?
Answer: About twice FPROP (need to compute gradients w.r.t. input
and parameters at every layer).
Note: FPROP and BPROP are dual of each other. E.g., a SUM (branch merging) in FPROP corresponds to a COPY (branch splitting) in BPROP, and vice versa.
43
Optimization
Stochastic Gradient Descent (on mini-batches):
θ ← θ − η ∂L/∂θ,  η ∈ (0, 1)
Stochastic Gradient Descent with Momentum:
Δ ← 0.9 Δ + ∂L/∂θ
θ ← θ − η Δ
Ranzato
Note: there are many other variants...
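A minimal sketch of these two update rules (illustrative NumPy code; the momentum coefficient 0.9 and the sign conventions follow the slide):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Plain SGD: theta <- theta - lr * dL/dtheta
    return theta - lr * grad

def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    # Momentum: delta <- 0.9 * delta + dL/dtheta ; theta <- theta - lr * delta
    velocity = momentum * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity

# Usage: keep one velocity buffer per parameter tensor, updated every mini-batch.
theta = np.zeros(5)
velocity = np.zeros_like(theta)
grad = np.ones(5)                    # pretend gradient from one mini-batch
theta, velocity = sgd_momentum_step(theta, grad, velocity)
```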
44
Outline
Ranzato
Supervised Neural Networks
Convolutional Neural Networks
Examples
Tips
45
Example: 200x200 image
40K hidden units
~2B parameters!!!
- Spatial correlation is local
- Waste of resources + we do not have enough
training samples anyway..
Fully Connected Layer
Ranzato
46
Locally Connected Layer
Example: 200x200 image
40K hidden units
Filter size: 10x10
4M parameters
Ranzato
Note: This parameterization is good
when input image is registered (e.g.,
face recognition).
47
STATIONARITY? Statistics is similar at
different locations
Ranzato
Note: This parameterization is good
when input image is registered (e.g.,
face recognition).
Locally Connected Layer
Example: 200x200 image
40K hidden units
Filter size: 10x10
4M parameters
48
Convolutional Layer
Share the same parameters across
different locations (assuming input is
stationary):
Convolutions with learned kernels
Ranzato
Convolutional Layer
(Animation across several slides: the learned kernel slides over the input image; each location produces one value of the output feature map.)
Mathieu et al. “Fast training of CNNs through FFTs” ICLR 2014
Example: input image * [-1 0 1; -1 0 1; -1 0 1] = feature map (a filter responding to horizontal intensity gradients, i.e., vertical edges)
Ranzato
66
Learn multiple filters.
E.g.: 200x200 image
100 Filters
Filter size: 10x10
10K parameters
Ranzato
Convolutional Layer
67
h_j^n = max(0, Σ_{k=1..K} h_k^{n-1} * w_{kj}^n)
Conv. layer: input feature maps h_1^{n-1}, h_2^{n-1}, h_3^{n-1} → output feature maps h_1^n, h_2^n
(each output feature map sums the input feature maps convolved with its own kernels)
Ranzato
Convolutional Layer
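A deliberately naive NumPy sketch of this layer (for clarity, not speed; real implementations use im2col, FFTs or cuDNN). Shapes follow the slide: K input feature maps, each convolved with its own kernel and summed into each output map, then passed through the ReLU. The “valid” convolution and the loop structure are illustrative assumptions:

```python
import numpy as np

def conv_layer(h_in, w):
    """h_in: (K, D, D) input feature maps; w: (K, J, k, k) kernels.
    Returns h_out: (J, D-k+1, D-k+1) output feature maps.
    Note: this is cross-correlation (no kernel flip), as is standard in DL code."""
    K, D, _ = h_in.shape
    _, J, k, _ = w.shape
    out = D - k + 1
    h_out = np.zeros((J, out, out))
    for j in range(J):                  # each output feature map
        for kk in range(K):             # sum over input feature maps
            for y in range(out):
                for x in range(out):
                    patch = h_in[kk, y:y + k, x:x + k]
                    h_out[j, y, x] += np.sum(patch * w[kk, j])
    return np.maximum(0.0, h_out)       # ReLU

h_in = np.random.randn(3, 16, 16)       # 3 input maps of size 16x16
w = 0.1 * np.random.randn(3, 2, 5, 5)   # kernels for 2 output maps, 5x5
print(conv_layer(h_in, w).shape)        # (2, 12, 12) = 2@(16-5+1)x(16-5+1)
```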
70
Ranzato
Question: What is the size of the output? What's the computational
cost?
Answer: It is proportional to the number of filters and depends on the
stride. If kernels have size KxK, input has size DxD, stride is 1, and
there are M input feature maps and N output feature maps then:
- the input has size M@DxD
- the output has size N@(D-K+1)x(D-K+1)
- the kernels have MxNxKxK coefficients (which have to be learned)
- cost: M*K*K*N*(D-K+1)*(D-K+1)
Question: How many feature maps? What's the size of the filters?
Answer: Usually, there are more output feature maps than input
feature maps. Convolutional layers can increase the number of
hidden units by big factors (and are expensive to compute).
The size of the filters has to match the size/scale of the patterns we
want to detect (task dependent).
Convolutional Layer
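A quick sanity check of these size and cost formulas for one illustrative configuration (made-up numbers, not from the slides):

```python
# M input maps of size DxD, N output maps, KxK kernels, stride 1
M, N, D, K = 3, 96, 224, 11
out = D - K + 1                        # output spatial size: (D-K+1)
params = M * N * K * K                 # kernel coefficients to learn
flops = M * K * K * N * out * out      # multiply-accumulates
print(out, params, flops)              # 214, ~34.8K params, ~1.6G flops
```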
71
A standard neural net applied to images:
- scales quadratically with the size of the input
- does not leverage stationarity
Solution:
- connect each hidden unit to a small patch of the input
- share the weight across space
This is called: convolutional layer.
A network with convolutional layers is called convolutional network.
LeCun et al. “Gradient-based learning applied to document recognition” IEEE 1998
Key Ideas
72
Let us assume filter is an “eye” detector.
Q.: how can we make the detection robust to
the exact location of the eye?
Pooling Layer
Ranzato
73
By “pooling” (e.g., taking max) filter
responses at different locations we gain
robustness to the exact spatial location
of features.
Ranzato
Pooling Layer
74
Ranzato
Pooling Layer: Examples
Max-pooling: h_j^n(x, y) = max_{x' ∈ N(x), y' ∈ N(y)} h_j^{n-1}(x', y')
Average-pooling: h_j^n(x, y) = (1/K) Σ_{x' ∈ N(x), y' ∈ N(y)} h_j^{n-1}(x', y')
L2-pooling: h_j^n(x, y) = sqrt( Σ_{x' ∈ N(x), y' ∈ N(y)} h_j^{n-1}(x', y')^2 )
L2-pooling over features: h_j^n(x, y) = sqrt( Σ_{k ∈ N(j)} h_k^{n-1}(x, y)^2 )
75
Ranzato
Pooling Layer
Question: What is the size of the output? What's the computational
cost?
Answer: The size of the output depends on the stride between the
pools. For instance, if pools do not overlap and have size KxK, and
the input has size DxD with M input feature maps, then:
- output is M@(D/K)x(D/K)
- the computational cost is proportional to the size of the input
(negligible compared to a convolutional layer)
Question: How should I set the size of the pools?
Answer: It depends on how much “invariant” or robust to distortions
we want the representation to be. It is best to pool slowly (via a few
stacks of conv-pooling layers).
76
Ranzato
Pooling Layer: Interpretation
Task: detect orientation L/R
Conv layer:
linearizes manifold
77
Ranzato
Pooling Layer: Interpretation
Conv layer:
linearizes manifold
Pooling layer:
collapses manifold
Task: detect orientation L/R
78
Ranzato
Pooling Layer: Receptive Field Size
h^{n-1} → Conv. layer → h^n → Pool. layer → h^{n+1}
If convolutional filters have size KxK and stride 1, and pooling layer
has pools of size PxP, then each unit in the pooling layer depends
upon a patch (at the input of the preceding conv. layer) of size:
(P+K-1)x(P+K-1)
80
Local Contrast Normalization
h^{i+1}(x, y) = ( h^i(x, y) − m^i(N(x, y)) ) / max( ε, σ^i(N(x, y)) )
where m^i(N(x, y)) and σ^i(N(x, y)) are the mean and standard deviation of h^i over a local neighborhood N(x, y), and ε is a small constant.
Ranzato
81
We want the same response.
Ranzato
Local Contrast Normalization
h^{i+1}(x, y) = ( h^i(x, y) − m^i(N(x, y)) ) / max( ε, σ^i(N(x, y)) )
82
Performed also across features
and in the higher layers..
Effects:
– improves invariance
– improves optimization
– increases sparsity
Note: computational cost is
negligible w.r.t. conv. layer.
Ranzato
Local Contrast Normalization
h^{i+1}(x, y) = ( h^i(x, y) − m^i(N(x, y)) ) / max( ε, σ^i(N(x, y)) )
83
ConvNets: Typical Stage
Convol. → LCN → Pooling
One stage (zoom)
courtesy of
K. Kavukcuoglu Ranzato
84
Convol. → LCN → Pooling
One stage (zoom)
Conceptually similar to: SIFT, HoG, etc.
Ranzato
ConvNets: Typical Stage
85
courtesy of K. Kavukcuoglu Ranzato
Note: after one stage the number of feature maps is usually increased
(conv. layer) and the spatial resolution is usually decreased (stride in
conv. and pooling layers). Receptive field gets bigger.
Reasons:
- gain invariance to spatial translation (pooling layer)
- increase specificity of features (approaching object specific units)
86
One stage (zoom): Convol. → LCN → Pooling
Whole system: Input Image → 1st stage → 2nd stage → 3rd stage → Fully Conn. Layers → Class Labels
Ranzato
ConvNets: Typical Architecture
87
Conceptually similar to:
SIFT → K-Means → Pyramid Pooling → SVM (Lazebnik et al. “...Spatial Pyramid Matching...” CVPR 2006)
SIFT → Fisher Vect. → Pooling → SVM (Sanchez et al. “Image classification with F.V.: Theory and practice” IJCV 2012)
Whole system: Input Image → 1st stage → 2nd stage → 3rd stage → Fully Conn. Layers → Class Labels
Ranzato
ConvNets: Typical Architecture
88
ConvNets: Training
Algorithm:
Given a small mini-batch
- F-PROP
- B-PROP
- PARAMETER UPDATE
All layers are differentiable (a.e.).
We can use standard back-propagation.
Ranzato
89
Ranzato
Note: After several stages of convolution-pooling, the spatial resolution
is greatly reduced (usually to about 5x5) and the number of feature
maps is large (several hundreds depending on the application).
It would not make sense to convolve again (there is no translation
invariance and support is too small). Everything is vectorized and fed
into several fully connected layers.
If the input of the fully connected layers is of size Nx5x5, the first fully
connected layer can be seen as a conv. layer with 5x5 kernels.
The next fully connected layer can be seen as a conv. layer with 1x1
kernels.
90
Ranzato
Input to the fully connected part: N feature maps of size MxM (M small).
A fully connected layer with H hidden units = a conv. layer with H kernels of size NxMxM, producing Hx1x1 feature maps.
91
Input: N feature maps of size MxM (M small).
Fully conn. layer with H hidden units = conv. layer with H kernels of size NxMxM (output: Hx1x1 feature maps).
Next fully conn. layer with K hidden units = conv. layer with K kernels of size Hx1x1 (output: Kx1x1 feature maps).
92
Viewing fully connected layers as convolutional layers enables efficient
use of convnets on bigger images (no need to slide windows but unroll
network over space as needed to re-use computation).
(Figure: at TRAINING TIME the CNN is applied to a fixed-size input image; at TEST TIME the same CNN is applied convolutionally to a larger image, producing outputs over spatial positions x, y.)
93
Viewing fully connected layers as convolutional layers enables efficient
use of convnets on bigger images (no need to slide windows but unroll
network over space as needed to re-use computation).
(Same figure as the previous slide: training on a fixed-size image, testing convolutionally over a larger image.)
Unrolling is orders of magnitude more efficient than sliding windows!
CNNs work on any image size!
94
ConvNets: Test
At test time, run only in forward mode (FPROP).
Ranzato
95
Fancier Architectures: Multi-Scale
Farabet et al. “Learning hierarchical features for scene labeling” PAMI 2013
96
Fancier Architectures: Multi-Modal
Frome et al. “Devise: a deep visual semantic embedding model” NIPS 2013
CNN
Text
Embedding
tiger
Matching
shared representation
97
Fancier Architectures: Multi-Task
Zhang et al. “PANDA..” CVPR 2014
(Figure: image → shared trunk of [Conv → Norm → Pool] stages → several Fully Conn. branches, one per attribute: Attr. 1, Attr. 2, ..., Attr. N.)
98
Fancier Architectures: Multi-Task
Osadchy et al. “Synergistic face detection and pose estimation..” JMLR 2007
99
Fancier Architectures: Generic DAG
Any DAG of differentiable
modules is allowed!
100
Fancier Architectures: Generic DAG
If there are cycles (RNN), one needs to un-roll it.
Graves “Offline Arabic handwriting recognition..” Springer 2012
101
Outline
Ranzato
Supervised Neural Networks
Convolutional Neural Networks
Examples
Tips
102
CONV NETS: EXAMPLES
- OCR / House number & Traffic sign classification
Ciresan et al. “MCDNN for image classification” CVPR 2012
Wan et al. “Regularization of neural networks using dropconnect” ICML 2013
Goodfellow et al. “Multi-digit number recognition from StreetView...” ICLR 2014
Jaderberg et al. “Synthetic data and ANN for natural scene text recognition” arXiv 2014
103
CONV NETS: EXAMPLES
- Texture classification
Sifre et al. “Rotation, scaling and deformation invariant scattering...” CVPR 2013
104
CONV NETS: EXAMPLES
- Pedestrian detection
Sermanet et al. “Pedestrian detection with unsupervised multi-stage..” CVPR 2013
105
CONV NETS: EXAMPLES
- Scene Parsing
Farabet et al. “Learning hierarchical features for scene labeling” PAMI 2013
Ranzato
Pinheiro et al. “Recurrent CNN for scene parsing” arxiv 2013
106
CONV NETS: EXAMPLES
- Segmentation 3D volumetric images
Ciresan et al. “DNN segment neuronal membranes...” NIPS 2012
Turaga et al. “Maximin learning of image segmentation” NIPS 2009 Ranzato
107
CONV NETS: EXAMPLES
- Action recognition from videos
Taylor et al. “Convolutional learning of spatio-temporal features” ECCV 2010
Karpathy et al. “Large-scale video classification with CNNs” CVPR 2014
Simonyan et al. “Two-stream CNNs for action recognition in videos” arXiv 2014
108
CONV NETS: EXAMPLES
- Robotics
Sermanet et al. “Mapping and planning ...with long range perception” IROS 2008
109
CONV NETS: EXAMPLES
- Denoising
Burger et al. “Can plain NNs compete with BM3D?” CVPR 2012
original noised denoised
Ranzato
110
CONV NETS: EXAMPLES
- Dimensionality reduction / learning embeddings
Hadsell et al. “Dimensionality reduction by learning an invariant mapping” CVPR 2006
111
CONV NETS: EXAMPLES
- Object detection
Sermanet et al. “OverFeat: Integrated recognition, localization, ...” arxiv 2013
Szegedy et al. “DNN for object detection” NIPS 2013 Ranzato
Girshick et al. “Rich feature hierarchies for accurate object detection...” arxiv 2013
Dataset: ImageNet 2012
Deng et al. “Imagenet: a large scale hierarchical image database” CVPR 2009
ImageNet
Examples of hammer:
114
Architecture for Classification
input → CONV → LOCAL CONTRAST NORM → MAX POOLING → CONV → LOCAL CONTRAST NORM → MAX POOLING → CONV → CONV → CONV → MAX POOLING → FULLY CONNECTED → FULLY CONNECTED → LINEAR → category prediction
Krizhevsky et al. “ImageNet Classification with deep CNNs” NIPS 2012
Ranzato
115
(Same architecture: input → conv/norm/pool stages → fully connected layers → linear → category prediction.)
Total nr. params: 60M (per parameterized layer, from the output layer back to the first conv layer: 4M, 16M, 37M, 442K, 1.3M, 884K, 307K, 35K)
Total nr. flops: 832M (per parameterized layer, from the output layer back to the first conv layer: 4M, 16M, 37M, 74M, 224M, 149M, 223M, 105M)
Krizhevsky et al. “ImageNet Classification with deep CNNs” NIPS 2012
Ranzato
Architecture for Classification
116
Optimization
SGD with momentum:
Learning rate = 0.01
Momentum = 0.9
Improving generalization by:
Weight sharing (convolution)
Input distortions
Dropout = 0.5
Weight decay = 0.0005
Ranzato
117
Results: ILSVRC 2012
Ranzato
Krizhevsky et al. “ImageNet Classification with deep CNNs” NIPS 2012
118
119
(Figure: a TEST IMAGE and its nearest-neighbor RETRIEVED IMAGES in the learned feature space.)
120
Outline
Ranzato
Supervised Neural Networks
Convolutional Neural Networks
Examples
Tips
121
Choosing The Architecture
Task dependent
Cross-validation
[Convolution → LCN → pooling]* + fully connected layer
The more data: the more layers and the more kernels
Look at the number of parameters at each layer
Look at the number of flops at each layer
Computational resources
Be creative :)
Ranzato
122
How To Optimize
SGD (with momentum) usually works very well
Pick learning rate by running on a subset of the data
Bottou “Stochastic Gradient Tricks” Neural Networks 2012
Start with large learning rate and divide by 2 until loss does not diverge
Decay learning rate by a factor of ~1000 or more by the end of training
Use non-linearity
Initialize parameters so that each feature across layers has
similar variance. Avoid units in saturation.
Ranzato
123
Improving Generalization
Weight sharing (greatly reduce the number of parameters)
Data augmentation (e.g., jittering, noise injection, etc.)
Dropout
Hinton et al. “Improving Nns by preventing co-adaptation of feature detectors”
arxiv 2012
Weight decay (L2, L1)
Sparsity in the hidden units
Multi-task (unsupervised learning)
Ranzato
124
ConvNets: till 2012
(Figure: loss as a function of a parameter.)
Common wisdom: training does not work
because we “get stuck in local minima”
125
ConvNets: today
(Figure: loss as a function of a parameter.)
Local minima are all similar, there are long plateaus,
it can take a long time to break symmetries.
(Figure annotations: weights w whose input/output is invariant to permutations; breaking ties between parameters; saturating units.)
Dauphin et al. “Identifying and attacking the saddle point problem..” arXiv 2014
126
Like walking on a ridge between valleys
Neural Net Optimization is...
127
ConvNets: today
(Figure: loss as a function of a parameter.)
Local minima are all similar, there are long
plateaus, it can take long to break symmetries.
Optimization is not the real problem when:
– dataset is large
– units do not saturate too much
– there is a normalization layer
128
ConvNets: today
(Figure: loss as a function of a parameter.)
Today's belief is that the challenge is about:
– generalization
How many training samples to fit 1B parameters?
How many parameters/samples to model spaces with 1M dim.?
– scalability
129
(Figure: data, flops/s, and model capacity all grow over TIME; axes roughly 1K → 1M → 1B samples and 1M → 100M → 10T flops/s.)
ConvNets: Why so successful today?
As time goes by, we get more data and more
flops/s. The capacity of ML models should grow
accordingly.
Ranzato
130
(Same figure: data, flops/s and capacity vs. TIME.)
ConvNets: Why so successful today?
CNNs were in many ways
premature: we did not have
enough data and flops/s to
train them.
They would overfit and be
too slow to train (apparent
local minima).
NOTE: methods have to
be easily scalable!
Ranzato
131
Good To Know
Check gradients numerically by finite differences
Visualize features (feature maps need to be uncorrelated)
and have high variance.
samples
hidden unit
Good training: hidden units are sparse across samples
and across features. Ranzato
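The first bullet (“check gradients numerically by finite differences”) can be done with a few lines of NumPy; a generic sketch, usable for any scalar loss of a parameter vector (names are illustrative):

```python
import numpy as np

def grad_check(loss_fn, grad_fn, theta, eps=1e-6):
    """Compare the analytic gradient grad_fn(theta) against central differences."""
    analytic = grad_fn(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        numeric[i] = (loss_fn(theta + d) - loss_fn(theta - d)) / (2 * eps)
    # relative error; values around 1e-7 or smaller usually mean BPROP is correct
    return np.max(np.abs(analytic - numeric)) / (np.max(np.abs(numeric)) + 1e-12)

# Example with a quadratic loss, where the gradient is known exactly:
loss = lambda t: 0.5 * np.sum(t ** 2)
grad = lambda t: t
print(grad_check(loss, grad, np.random.randn(10)))
```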
132
Check gradients numerically by finite differences
Visualize features (feature maps need to be uncorrelated)
and have high variance.
samples
hidden unit
Bad training: many hidden units ignore the input and/or
exhibit strong correlations. Ranzato
Good To Know
133
Check gradients numerically by finite differences
Visualize features (feature maps need to be uncorrelated)
and have high variance.
Visualize parameters
Good training: learned filters exhibit structure and are uncorrelated.
(Figure: GOOD filters vs BAD filters that are too noisy, too correlated, or lack structure.)
Ranzato
Good To Know
Zeiler, Fergus “Visualizing and understanding CNNs” arXiv 2013
Simonyan, Vedaldi, Zisserman “Deep inside CNNs: visualizing image classification models..” ICLR 2014
134
Check gradients numerically by finite differences
Visualize features (feature maps need to be uncorrelated)
and have high variance.
Visualize parameters
Measure error on both training and validation set.
Test on a small subset of the data and check the error → 0.
Ranzato
Good To Know
135
What If It Does Not Work?
Training diverges:
Learning rate may be too large → decrease learning rate
BPROP is buggy → numerical gradient checking
Parameters collapse / loss is minimized but accuracy is low
Check loss function:
Is it appropriate for the task you want to solve?
Does it have degenerate solutions? Check “pull-up” term.
Network is underperforming
Compute flops and nr. params. → if too small, make net larger
Visualize hidden units/params → fix optimization
Network is too slow
Compute flops and nr. params. → GPU,distrib. framework, make
net smaller
Ranzato
136
Summary Part I
Deep Learning = learning hierarchical models. ConvNets are the
most successful example. Leverage large labeled datasets.
Optimization
Don't we get stuck in local minima? No, they are all the same!
In large scale applications, local minima are even less of an issue.
Scaling
GPUs
Distributed framework (Google)
Better optimization techniques
Generalization on small datasets (curse of dimensionality):
data augmentation
weight decay
dropout
unsupervised learning
multi-task learning
Ranzato
137
SOFTWARE
Torch7: learning library that supports neural net training
http://www.torch.ch
http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet)
https://github.com/sermanet/OverFeat
Python-based learning library (U. Montreal)
- http://deeplearning.net/software/theano/ (does automatic differentiation)
Caffe (Yangqing Jia)
– http://caffe.berkeleyvision.org
Efficient CUDA kernels for ConvNets (Krizhevsky)
– code.google.com/p/cuda-convnet
Ranzato
138
REFERENCES
Convolutional Nets
– LeCun, Bottou, Bengio and Haffner: Gradient-Based Learning Applied to Document
Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
- Krizhevsky, Sutskever, Hinton “ImageNet Classification with deep convolutional
neural networks” NIPS 2012
– Jarrett, Kavukcuoglu, Ranzato, LeCun: What is the Best Multi-Stage Architecture for
Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09),
IEEE, 2009
- Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu, LeCun: Learning Convolutional
Feature Hierarchies for Visual Recognition, Advances in Neural Information
Processing Systems (NIPS 2010), 23, 2010
– see yann.lecun.com/exdb/publis for references on many different kinds of
convnets.
– see http://www.cmap.polytechnique.fr/scattering/ for scattering networks (similar to
convnets but with less learning and stronger mathematical foundations)
– see http://www.idsia.ch/~juergen/ for other references to ConvNets and LSTMs.
Ranzato
139
REFERENCES
Applications of Convolutional Nets
– Farabet, Couprie, Najman, LeCun. Scene Parsing with Multiscale Feature Learning,
Purity Trees, and Optimal Covers”, ICML 2012
– Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala and Yann LeCun:
Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR 2013
- D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks
Segment Neuronal Membranes in Electron Microscopy Images. NIPS 2012
- Raia Hadsell, Pierre Sermanet, Marco Scoffier, Ayse Erkan, Koray Kavukcuoglu, Urs
Muller and Yann LeCun. Learning Long-Range Vision for Autonomous Off-Road
Driving, Journal of Field Robotics, 26(2):120-144, 2009
– Burger, Schuler, Harmeling. Image Denoising: Can Plain Neural Networks Compete
with BM3D?, CVPR 2012
– Hadsell, Chopra, LeCun. Dimensionality reduction by learning an invariant mapping,
CVPR 2006
– Bergstra et al. Making a science of model search: hyperparameter optimization in
hundreds of dimensions for vision architectures, ICML 2013
Ranzato
140
REFERENCES
Latest and Greatest Convolutional Nets
– Girshick, Donahue, Darrell, Malik. “Rich feature hierarchies for accurate object
detection and semantic segmentation”, arXiv 2014
– Simonyan, Zisserman “Two-stream CNNs for action recognition in videos” arXiv
2014
- Cadieu, Hong, Yamins, Pinto, Ardila, Solomon, Majaj, DiCarlo. “DNN rival in
representation of primate IT cortex for core visual object recognition”. arXiv 2014
- Erhan, Szegedy, Toshev, Anguelov “Scalable object detection using DNN” CVPR
2014
- Razavian, Azizpour, Sullivan, Carlsson “CNN features off-the-shelf: an astounding
baseline for recognition” arXiv 2014
- Krizhevsky “One weird trick for parallelizing CNNs” arXiv 2014
Ranzato
141
REFERENCES
Deep Learning in general
– deep learning tutorial @ CVPR 2014 https://sites.google.com/site/deeplearningcvpr2014/
– deep learning tutorial slides at ICML 2013: icml.cc/2013/?page_id=39
– Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in
Machine Learning, 2(1), pp.1-127, 2009.
– LeCun, Chopra, Hadsell, Ranzato, Huang: A Tutorial on Energy-Based Learning, in
Bakir, G. and Hofman, T. and Schölkopf, B. and Smola, A. and Taskar, B. (Eds),
Predicting Structured Data, MIT Press, 2006
Ranzato
“Theory” of Deep Learning
– Mallat: Group Invariant Scattering, Comm. In Pure and Applied Math. 2012
– Pascanu, Montufar, Bengio: On the number of inference regions of DNNs with piece
wise linear activations, ICLR 2014
– Pascanu, Dauphin, Ganguli, Bengio: On the saddle-point problem for non-convex
optimization, arXiv 2014
- Delalleau, Bengio: Shallow vs deep Sum-Product Networks, NIPS 2011
142
THE END - PART I
Ranzato
143
Outline Part II
Theory: Energy-Based Models
Energy function
Loss function
Examples:
Supervised learning: neural nets
Unsupervised learning: sparse coding
Unsupervised learning: gated MRF
Conclusions
Ranzato
144
Energy: E(y; θ) (unsupervised) or E(y, x; θ) (supervised)
LeCun et al. “Tutorial on Energy-based learning ...” Predicting Structured Data 2006
Ranzato et al. “A unified energy-based framework for unsup. learning” AISTATS 2007
Energy-Based Models: Energy Function
145
Energy: E(y; θ) (unsupervised) or E(y, x; θ) (supervised)
LeCun et al. “Tutorial on Energy-based learning ...” Predicting Structured Data 2006
Ranzato et al. “A unified energy-based framework for unsup. learning” AISTATS 2007
Energy-Based Models: Energy Function
y can be discrete or continuous.
146
Energy: E(y; θ) (unsupervised) or E(y, x; θ) (supervised)
LeCun et al. “Tutorial on Energy-based learning ...” Predicting Structured Data 2006
Ranzato et al. “A unified energy-based framework for unsup. learning” AISTATS 2007
Energy-Based Models: Energy Function
y can be discrete or continuous.
We will refer to the unsupervised/continuous case, but much of the following applies to the other cases as well.
147
Energy should be lower for desired output
E
y
Energy-Based Models: Energy Function
Ranzato
148
Make energy lower at the desired output
E
yy∗
BEFORE TRAINING
Energy-Based Models: Energy Function
Ranzato
149
Make energy lower at the desired output
E
yy∗
AFTER TRAINING
Energy-Based Models: Energy Function
Ranzato
150
Examples of energy function:
PCA: E(y) = ||y − W W^T y||_2^2
Linear binary classifier: E(y; x) = −y W^T x,  y ∈ {−1, 1}
Neural net binary classifier: E(y; x) = −y W_2^T f(x; W_1)
Energy-Based Models: Energy Function
151
Energy-Based Models: Loss Function
The loss is a function of the energy.
Minimizing the loss over the training set yields
the desired energy landscape:
θ* = argmin_θ Σ_p L(E(y^p; θ))
Examples of loss function:
PCA: E(y) = ||y − W W^T y||_2^2,  L(E(y)) = E(y)
Logistic regression classifier: E(y; x) = −y W^T x,  L(E(y; x)) = log(1 + exp(E(y; x)))
152
Energy-Based Models: Loss Function
The loss is a function of the energy.
Minimizing the loss over the training set yields
the desired energy landscape:
θ* = argmin_θ Σ_p L(E(y^p; θ))
How to design good loss functions?
Ranzato
153
Energy-Based Models: Loss Function
θ* = argmin_θ Σ_p L(E(y^p; θ))
How to design good loss functions?
Candidate: L = E(y) (energy loss: simply pull the energy down at the training samples).
155
Energy-Based Models: Loss Function
θ* = argmin_θ Σ_p L(E(y^p; θ))
How to design good loss functions?
L = E(y) alone is a BAD LOSS here: the energy becomes degenerate, low everywhere!
156
Energy-Based Models: Loss Function
θ* = argmin_θ Σ_p L(E(y^p; θ))
How to design good loss functions?
Candidate: L = E(y) + log Σ_y' exp(−E(y')) (pull down at the data, pull up everywhere else).
157
Energy-Based Models: Loss Function
θ* = argmin_θ Σ_p L(E(y^p; θ))
How to design good loss functions?
L = E(y) + log Σ_y' exp(−E(y')) is a GOOD LOSS, but potentially very expensive to compute.
158
Strategies to Shape E: #1
Pull down the correct answer and pull up
everywhere else (Negative Log-Likelihood):
L = E(y) + log Σ_y' exp(−E(y'))
p(y) = exp(−E(y)) / Σ_u exp(−E(u))
PROS
It produces calibrated probabilities.
CONS
Expensive to compute when y is discrete and high dimensional.
Generally intractable when y is continuous.
159
Strategies to Shape E: #2
Pull down the correct answer and pull up
carefully chosen points:
L = max(0, m − (E(ȳ) − E(y))),  e.g. ȳ = argmin_{y' ≠ y} E(y')
E.g.: Contrastive Divergence, Ratio Matching, Noise Contrastive
Estimation, Minimum Probability Flow...
Hinton et al. “A fast learning algorithm for DBNs” Neural Comp. 2006
Gutmann et al. “Noise contrastive estimation of unnormalized...” JMLR 2012
Hyvarinen “Some extensions of score matching” Comp Stats 2007
Sohl-Dickstein et al. “Minimum probability flow learning” ICML 2011
160
Strategies to Shape E: #2
Pull down the correct answer and pull up
carefully chosen points:
L = max(0, m − (E(ȳ) − E(y))),  e.g. ȳ = argmin_{y' ≠ y} E(y')
PROS
Efficient.
CONS
The criterion to pick where to pull up is tricky (especially in high
dimensional spaces): it trades off computational and statistical efficiency.
161
Strategies to Shape E: #3
Pull down the correct answer and increase local
curvature (Score Matching):
L = (∂E(y)/∂y)^2 − ∂^2 E(y)/∂y^2
Hyvarinen “Estimation of non-normalized statistical models using score matching”
JMLR 2005
163
Strategies to Shape E: #3
Pull down the correct answer and increase local
curvature (Score Matching):
L = (∂E(y)/∂y)^2 − ∂^2 E(y)/∂y^2
PROS
Efficient in continuous but not too high dimensional spaces.
CONS
Very complicated to compute and not practical in very high
dimensional spaces. Not applicable in discrete spaces.
164
Strategies to Shape E: #4
Pull down the correct answer and put a global
constraint on the energy: only few minima exist.
L = E(y)
Examples: PCA, ICA, sparse coding, ...
PROS
Efficient in continuous, high dimensional spaces.
CONS
Need to design good global constraints. Used in unsup. learning only.
Ranzato et al. “A unified energy-based framework for unsup. learning” AISTATS 2007
166
Strategies to Shape E: #4
Pull down the correct answer and put a global
constraint on the energy: only few minima exist.
Typical methods (unsup. learning):
Use a compact internal representation (PCA)
Have a finite number of internal states (K-Means)
Use sparse codes (ICA, sparse coding)
Ranzato
167
(Figure: INPUT SPACE and FEATURE SPACE, connected by an encoder mapping f: x → h and a decoder mapping g: h → x.
Legend: training sample; input data point which is not a training sample; feature (code).)
Ranzato
168
E.g. K-means: h is 1-of-N and the decoder reconstructs x as Wh.
(Figure: INPUT SPACE and FEATURE SPACE with the mapping f: x → h.)
Since there are very few “codes” available and the
energy (MSE) is minimized on the training set, the
energy must be higher elsewhere.
Ranzato
169
Strategies to Shape E: #5
Make the observed data an attractor state of some
energy function F; we only need to define a dynamics (Denoising AE).
L = E(y)
E.g.: ỹ = W_2 σ(W_1 (y + n)),  n ∼ N(0, σ^2 I)
L = (1/2) ||ỹ − y||^2
Training samples must be
stable points (local minima)
and attractors.
Vincent et al. “Extracting and composing robust features with denoising autoencoders”
ICML 2008
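A minimal NumPy sketch of one training step of such a denoising auto-encoder (illustrative assumptions: a sigmoid encoder non-linearity, separate weights W1/W2, plain SGD, no biases):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def dae_step(y, W1, W2, sigma=0.3, lr=0.01):
    """One SGD step on L = 0.5 * ||y_rec - y||^2, with y_rec = W2 sigmoid(W1 (y + n))."""
    n = sigma * np.random.randn(*y.shape)   # corrupt the input: n ~ N(0, sigma^2 I)
    a = W1 @ (y + n)
    h = sigmoid(a)
    y_rec = W2 @ h                          # reconstruction of the CLEAN y
    err = y_rec - y
    loss = 0.5 * np.sum(err ** 2)
    # backprop through decoder and encoder
    dW2 = np.outer(err, h)
    dh = W2.T @ err
    da = dh * h * (1.0 - h)                 # sigmoid derivative
    dW1 = np.outer(da, y + n)
    return loss, W1 - lr * dW1, W2 - lr * dW2

y = np.random.randn(64)
W1 = 0.1 * np.random.randn(32, 64)
W2 = 0.1 * np.random.randn(64, 32)
loss, W1, W2 = dae_step(y, W1, W2)
```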
172
Strategies to Shape E: #5
Make the observed data an attractor state of some
energy function; we only need to define a dynamics (Denoising AE).
PROS
Efficient in high dimensional spaces.
CONS
Need to pick the noise distribution. May need to pick many noisy points.
Kamyshanska et al. “On autoencoder scoring” ICML 2013
173
Loss: summary
Goal of loss: make energy lower for correct answer.
Different losses choose differently how to “pull-up”
Pull-up all points
Pull up one or a few points only
Make observations minima & increase curvature
Add global constraints/penalties on internal states
Define a dynamics with observations at the minima
Choice of loss depends on desired landscape,
computational budget, domain of input
(discrete/continuous), task, etc.
Ranzato
174
Final Notes on EBMs
EBMs apply to any predictor (shallow & deep).
EBMs subsume graphical models (e.g., use
strategy #1 or #2 to “pull-up”).
EBM is general framework to design good loss
functions.
Ranzato
175
Outline
Ranzato
Theory: Energy-Based Models
Energy function
Loss function
Examples:
Supervised learning: neural nets
Unsupervised learning: sparse coding
Unsupervised learning: gated MRF
Conclusions
176
(Map of methods along SHALLOW ↔ DEEP, UNSUPERVISED ↔ SUPERVISED, PROBABILISTIC: supervised neural nets and CNNs are trained with loss type #1.)
Loss: type #1
177
Neural Nets for Classification
x → h1 = max(0, W_1 x) → h2 = max(0, W_2 h1) → h3 = W_3 h2 → Loss (with target y)
E(y; x) = −y h3
L(E(y; x)) = log(1 + exp(E(y; x)))
{W_1*, W_2*, W_3*} = argmin_{W_1, W_2, W_3} L(E(y; x))
Loss: type #1
Ranzato
178
(Map of methods along SHALLOW ↔ DEEP, UNSUPERVISED ↔ SUPERVISED, PROBABILISTIC: sparse coding and related unsupervised methods are trained with loss type #4.)
Loss: type #4
179
Energy & latent variables
The energy may have latent variables h.
Two major approaches to obtain E(y) from E(y, h):
Marginalization (intractable if the space is large): E(y) = −log Σ_h exp(−E(y, h))
Minimization: E(y) = min_h E(y, h)
Ranzato
180
Sparse Coding
E(x, h; W) = (1/2) ||x − W h||_2^2 + λ ||h||_1
E(x; W) = min_h E(x, h; W)
L = E(x; W)
Loss type #4: energy loss with (sparsity) constraints.
Olshausen & Field, Nature 1996; Ranzato et al. NIPS 2006
181
Inference
Prediction of the latent variables:
h* = argmin_h E(x, h; W)
E(x, h; W) = (1/2) ||x − W h||_2^2 + λ ||h||_1
Inference is an iterative process.
Ranzato
182
Inference
Prediction of the latent variables:
h* = argmin_h E(x, h; W)
E(x, h; W) = (1/2) ||x − W h||_2^2 + λ ||h||_1
Inference is an iterative process.
Q.: Is it possible to make inference more efficient?
A.: Yes, by training another module to directly predict h*.
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
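The iterative inference can be written, for example, as ISTA (iterative shrinkage-thresholding), one standard way to solve this argmin; a NumPy sketch under illustrative settings (fixed step size, fixed number of iterations):

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of the L1 penalty: shrink towards zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, W, lam=0.1, n_iter=100):
    """Approximately solve h* = argmin_h 0.5 ||x - W h||^2 + lam ||h||_1."""
    L = np.linalg.norm(W, 2) ** 2        # Lipschitz constant of the smooth part
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ h - x)         # gradient of the reconstruction term
        h = soft_threshold(h - grad / L, lam / L)
    return h

W = np.random.randn(64, 256)             # overcomplete dictionary (256 > 64)
x = np.random.randn(64)
h = ista(x, W)
print(np.mean(h != 0))                   # only a fraction of the code is non-zero
```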
183
Inference in Sparse Coding
E(x, h) = (1/2) ||x − W_2 h||_2^2 + λ ||h||_1
(Figure: a block mapping x to h by computing argmin_h E, i.e., by iterative optimization using the decoder W_2.)
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
Ranzato
184
E x ,h=
1
2
∥x−W 2 h∥2
2
∥h∥1
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
W 2 h
∥x−W 2 h∥2
2
∥h∥1
h
x
Inference in Sparse Coding
Ranzato
185
Learning To Perform Fast Inference
E(x, h) = (1/2) ||x − W_2 h||_2^2 + λ ||h||_1 + (1/2) ||h − g(x; W_1)||_2^2
(Figure: an encoder g(x; W_1) is trained so that its prediction matches the optimal code h, via the term ||h − g(x; W_1)||_2^2.)
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
Ranzato
186
Predictive Sparse Decomposition
E(x, h) = (1/2) ||x − W_2 h||_2^2 + λ ||h||_1 + (1/2) ||h − g(x; W_1)||_2^2
(Figure: the full factor graph: encoder prediction term ||h − g(x; W_1)||_2^2, reconstruction term ||x − W_2 h||_2^2, and sparsity term λ||h||_1, all attached to the code h.)
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
Ranzato
187
Sparse Auto-Encoders
Example: Predictive Sparse Decomposition
E(x, h) = (1/2) ||x − W_2 h||_2^2 + λ ||h||_1 + (1/2) ||h − g(x; W_1)||_2^2
TRAINING:
For every sample:
Initialize h = g(x; W_1)
(E) Infer the optimal latent variables: h* = argmin_h E(x, h; W)
(M) Update the parameters W_1, W_2
Inference at test time (fast): h* ≈ g(x; W_1)
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
Ranzato
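A rough sketch of this E/M-style training loop (illustrative, not the authors' exact recipe: the inference step reuses a few ISTA-like iterations initialized at the encoder prediction, the encoder form is a simple ReLU choice, and coefficients and step sizes are made-up):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def psd_step(x, W1, W2, lam=0.1, lr=0.01, n_inf=20):
    """One PSD training step: infer h*, then update encoder W1 and decoder W2."""
    h = np.maximum(0.0, W1 @ x)                  # initialize h = g(x; W1)
    L = np.linalg.norm(W2, 2) ** 2 + 1.0
    for _ in range(n_inf):                       # (E) infer h* by minimizing E(x, h)
        grad = W2.T @ (W2 @ h - x) + (h - np.maximum(0.0, W1 @ x))
        h = soft_threshold(h - grad / L, lam / L)
    # (M) update parameters: decoder fits x from h*, encoder predicts h* from x
    rec_err = W2 @ h - x
    dW2 = np.outer(rec_err, h)
    pred = np.maximum(0.0, W1 @ x)
    dW1 = np.outer((pred - h) * (pred > 0), x)
    return W1 - lr * dW1, W2 - lr * dW2

x = np.random.randn(64)
W1 = 0.1 * np.random.randn(128, 64)
W2 = 0.1 * np.random.randn(64, 128)
W1, W2 = psd_step(x, W1, W2)
```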
188
E x ,h=
1
2
∥x−W 2 h∥2
2
∥h∥1
1
2
∥h−g x ;W 1∥2
2
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
x
h
W x h
alternative graphical representations
Predictive Sparse Decomposition
Ranzato
189
LISTA
(Figure: the iterative sparse-coding inference is unrolled into a feed-forward network: an encoder W_1 followed by a few recurrent stages with shared weights W_3 and shrinkage non-linearities, attached to the reconstruction term ||x − W_2 h||_2^2, the prediction term ||h − g(x; W_1)||_2^2, and the sparsity term λ||h||_1.)
Gregor et al. “Learning fast approximations of sparse coding” ICML 2010
Ranzato
190
KEY IDEAS
Inference can require expensive optimization
We may approximate exact inference well by using a non-
linear function (learn optimal approximation to perform fast
inference)
The original model and the fast predictor can be trained
jointly
Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008
Rolfe et al. “Discriminative Recurrent Sparse Autoencoders” ICLR 2013
Szlam et al. “Fast approximations to structured sparse coding...” ECCV 2012
Gregor et al. “Structured sparse coding via lateral inhibition” NIPS 2011
Kavukcuoglu et al. “Learning convolutional feature hierarchies..” NIPS 2010
Ranzato
191
Outline
Theory: Energy-Based Models
Energy function
Loss function
Examples:
Supervised learning: neural nets
Supervised learning: convnets
Unsupervised learning: sparse coding
Unsupervised learning: gated MRF
Other examples
Practical tricks
Ranzato
192
(Map of methods along SHALLOW ↔ DEEP, UNSUPERVISED ↔ SUPERVISED, PROBABILISTIC: probabilistic models such as RBMs and the gated MRF are trained with loss type #2.)
Loss: type #2
193
Probabilistic Models of Natural Images
(Figure: INPUT SPACE and LATENT SPACE; a training sample is mapped to a latent vector and back through p(x|h).)
p(x | h) = N(mean(h), D)
- examples: PPCA, Factor Analysis, ICA, Gaussian RBM
Ranzato
194
Probabilistic Models of Natural Images
(Figure: input image → p(h|x) → latent variables → p(x|h) → generated image.)
This model does not represent dependencies well, only the mean intensity.
p(x | h) = N(mean(h), D)
- examples: PPCA, Factor Analysis, ICA, Gaussian RBM
Ranzato
195
Probabilistic Models of Natural Images
(Figure: INPUT SPACE and LATENT SPACE, as before.)
p(x | h) = N(0, Covariance(h))
- examples: PoT, cRBM
Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
Ranzato
196
Probabilistic Models of Natural Images
(Figure: input image → p(h|x) → latent variables → p(x|h) → generated image; example image: Andy Warhol 1960.)
This model does not represent the mean intensity well, only the dependencies.
p(x | h) = N(0, Covariance(h))
- examples: PoT, cRBM
Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
197
Probabilistic Models of Natural Images
(Figure: INPUT SPACE and LATENT SPACE, as before.)
p(x | h) = N(mean(h), Covariance(h))
- this is what we propose: mcRBM, mPoT
Ranzato et al. CVPR 10, Ranzato et al. NIPS 2010, Ranzato et al. CVPR 11
Ranzato
199
PoT: N(0, Σ);  PPCA: N(m, I);  Our model: N(m, Σ)
Ranzato
200
Deep Gated MRF
Layer 1 (pair-wise MRF over pixels x_p, x_q):
E(x, h^c, h^m) = (1/2) x' Σ^{-1} x
p(x, h^c, h^m) ∝ exp(−E(x, h^c, h^m))
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
Ranzato
201
Deep Gated MRF
Layer 1 (pair-wise MRF, parameterized with filters C):
E(x, h^c, h^m) = (1/2) x' C C' x
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
Ranzato
202
Deep Gated MRF
Layer 1 (gated MRF: hidden units h^c gate the pair-wise interactions):
E(x, h^c, h^m) = (1/2) x' C [diag(h^c)] C' x
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
Ranzato
203
Deep Gated MRF
Layer 1 (gated MRF with covariance hidden units h^c and mean hidden units h^m):
E(x, h^c, h^m) = (1/2) x' C [diag(h^c)] C' x + (1/2) x' x − x' W h^m
p(x) ∝ ∫_{h^c} ∫_{h^m} exp(−E(x, h^c, h^m))
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
204
Deep Gated MRF
Layer 1:
E(x, h^c, h^m) = (1/2) x' C [diag(h^c)] C' x + (1/2) x' x − x' W h^m
Inference of the latent variables: just a forward pass
p(h_k^c = 1 | x) = σ( −(1/2) P_k (C' x)^2 + b_k )
p(h_j^m = 1 | x) = σ( W_j' x + b_j )
Generation: sampling from a multi-variate Gaussian
p(x | h^c, h^m) = N( Σ W h^m, Σ ),  Σ^{-1} = C diag(h^c) C' + I
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
205
Deep Gated MRF: Training
Layer 1:
E(x, h^c, h^m) = (1/2) x' C [diag(h^c)] C' x + (1/2) x' x − x' W h^m
Training by maximum likelihood (PCD), loss type #2:
p(x) = ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) / ∫_{x, h^m, h^c} exp(−E(x, h^m, h^c))
Since integrating over the input space is intractable we use:
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
206
Free energy: F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c))
p(x) = exp(−F(x)) / ∫_x exp(−F(x)),  so F(x) = −log p(x) + K
(Figure: the gradient ∂F/∂w is evaluated at a training sample.)
207
(Figure: starting from the training sample, HMC dynamics are run to draw a sample from the model; F(x) = −log p(x) + K as above.)
213
Weight update:
w ← w − η ( ∂F/∂w |_data − ∂F/∂w |_sample )
where ∂F/∂w |_data is evaluated at the data point and ∂F/∂w |_sample at the HMC sample;
F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)),  p(x) = exp(−F(x)) / ∫_x exp(−F(x)),  F(x) = −log p(x) + K.
215
w  w− − 
∂ F
∂ w
∣sample
∂ F
∂ w
∣data
Start dynamics from previous point
data
point
sample
F= - log(p(x)) + K
p x=
e−F x
∫x
e−F  x
F x=−log∫h
m
, h
c e−E x , h
m
, h
c

free energy
216
Deep Gated MRF
(Figure: Layer 2 hidden units h2 are stacked on top of Layer 1, which sees the input.)
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
Ranzato
217
Deep Gated MRF
(Figure: Layer 3 hidden units h3 stacked on Layer 2 (h2), stacked on Layer 1, which sees the input.)
Ranzato et al. “Modeling natural images with gated MRFs” PAMI 2013
Ranzato
218
Sampling High-Resolution Images
(Figure: samples from a Gaussian model, a marginal wavelet model (from Simoncelli 2005), and a pair-wise MRF / FoE (from Schmidt, Gao, Roth CVPR 2010).)
219
Sampling High-Resolution Images
(Same comparison figure, now including samples from the gated MRF with 1 layer.)
gMRF: 1 layer
Ranzato et al. PAMI 2013
222
Sampling High-Resolution Images
(Same comparison figure, now with samples from the gated MRF with 3 layers.)
gMRF: 3 layers
Ranzato et al. PAMI 2013
226
Sampling After Training on Face Images
Original Input 1st
layer 2nd
layer 3rd
layer 4th
layer 10 times
unconstrained samples
conditional (on the left part of the face) samples
Ranzato et al. PAMI 2013 Ranzato
227
Expression Recognition Under Occlusion
Ranzato et al. PAMI 2013 Ranzato
228
Tang et al. “Robust BM for recognition and denoising” CVPR 2012
Ranzato
229
Pros:
Feature extraction is fast
Unprecedented generation quality
Advances models of natural images
Trains without labeled data
Cons:
Training is inefficient
Slow
Tricky
Sampling scales badly with dimensionality
What's the use case of generative models?
Conclusion
If generation is not required, other feature learning methods are
more efficient (e.g., sparse auto-encoders).
What's the use case of generative models?
Given enough labeled data, unsup. learning methods have
not produced more useful features.
230
Modeling Sequences: RNNs
http://www.cs.toronto.edu/~graves/handwriting.html
231
Structured Prediction: GTN
LeCun et al. “Gradient-based learning applied to document recognition” IEEE 1998
232
Summary Part II
Energy-based model: framework to understand and design loss
functions
Training is about shaping the energy. Dealing with high-
dimensional (continuous) spaces: explicit pull-up VS global
constraints (e.g., sparsity).
Unsupervised learning: a field in its infancy
Humans use it a lot
How to discover good compositional feature hierarchies?
How to scale to large dimensional spaces (curse of dimensionality)?
Hot research topic:
Scaling up to even larger datasets
Unsupervised learning
Multi-modal & multi-task learning
Video
Structured prediction
Ranzato
233
Questions
A ReLU layer computes: h = max(0, W x + b), where x is a D dimensional vector, h
and b are N dimensional vectors and W is a matrix of size NxD. If dL/dh is the
derivative of the loss w.r.t. h, then backpropagation computes (in matlab notation):
a) the derivatives dL/dx = W' (dL/dh .* 1(h)), where 1(h) = 1 if h>0 o/w it is 0
b) the derivatives dL/dW = (dL/dh .* 1(h)) x' and dL/db = dL/dh .* 1(h)
c) both the derivatives in a) and b)
Ranzato
A convolutional network:
a) is well defined only when layers are everywhere differentiable
b) requires unsupervised learning even if the labeled dataset is very large
c) has usually many convolution/pooling/normalization layers followed by fully
connected layers
Sparse coding:
a) aims at extracting sparse features that are good at reconstructing the input
and it does not need labels for training; the dimensionality of the feature vector is
usually higher than the input.
b) requires to have lots of images with their corresponding labels
c) can be interpreted as a graphical model, and inference of the latent variables
(sparse codes) is done by marginalization.
234
THANK YOU
Ranzato

Lecture 06 marco aurelio ranzato - deep learning

  • 1. Deep Learning Marc'Aurelio Ranzato Facebook A.I. Research ICVSS – 15 July 2014www.cs.toronto.edu/~ranzato
  • 3.
  • 4.
  • 5.
  • 6. 6 fixed unsupervised supervised classifier Mixture of Gaussians MFCC ˈd ē p fixed unsupervised supervised classifier K-Means/ pooling SIFT/HOG “car” fixed unsupervised supervised classifiern-grams Parse Tree Syntactic “+” This burrito place is yummy and fun! Traditional Pattern Recognition VISION SPEECH NLP Ranzato
  • 7. 7 Hierarchical Compositionality (DEEP) VISION SPEECH NLP pixels edge texton motif part object sample spectral band formant motif phone word character NP/VP/.. clause sentence storyword Ranzato
  • 8. 8 fixed unsupervised supervised classifier Mixture of Gaussians MFCC ˈd ē p fixed unsupervised supervised classifier K-Means/ pooling SIFT/HOG “car” fixed unsupervised supervised classifiern-grams Parse Tree Syntactic “+” This burrito place is yummy and fun! Traditional Pattern Recognition VISION SPEECH NLP Ranzato
  • 9. 9 Deep Learning “car” Cascade of non-linear transformations End to end learning General framework (any hierarchical model is deep) What is Deep Learning Ranzato
  • 11. 11 PerceptronNeural Net Boosting SVM GMMΣΠ BayesNP Convolutional Neural Net Recurrent Neural Net Autoencoder Neural Net Sparse Coding Restricted BMDeep Belief Net Deep (sparse/denoising) Autoencoder Disclaimer: showing only a subset of the known methods
  • 12. 12 PerceptronNeural Net Boosting SVM GMMΣΠ BayesNP Convolutional Neural Net Recurrent Neural Net Autoencoder Neural Net Sparse Coding Restricted BMDeep Belief Net Deep (sparse/denoising) Autoencoder SHALLOW DEEP
  • 13. 13 PerceptronNeural Net Boosting SVM GMMΣΠ BayesNP Convolutional Neural Net Recurrent Neural Net Autoencoder Neural Net Sparse Coding Restricted BMDeep Belief Net Deep (sparse/denoising) Autoencoder UNSUPERVISED SUPERVISED DEEP SHALLOW
  • 14. 14 PerceptronNeural Net Boosting SVM Convolutional Neural Net Recurrent Neural Net Autoencoder Neural Net Deep (sparse/denoising) Autoencoder UNSUPERVISED SUPERVISED DEEP SHALLOWΣΠ BayesNP Deep Belief Net GMM Sparse Coding Restricted BM PROBABILISTIC
  • 15. 15 Main types of deep architectures Ranzato Deep Learning is B I G input input input feed-forward Feed-back Bi-directional Neural nets Conv Nets Hierar. Sparse Coding Deconv Nets Stacked Auto-encoders DBM input Recurrent Recurrent Neural nets Recursive Nets LISTA
  • 16. 16 Ranzato Deep Learning is B I G input input input feed-forward Feed-back Bi-directional Neural nets Conv Nets Hierar. Sparse Coding Deconv Nets Stacked Auto-encoders DBM input Recurrent Recurrent Neural nets Recursive Nets LISTA Main types of deep architectures
  • 17. 17 Ranzato Deep Learning is B I G Main types of learning protocols Purely supervised Backprop + SGD Good when there is lots of labeled data. Layer-wise unsupervised + superv. linear classifier Train each layer in sequence using regularized auto-encoders or RBMs Hold fix the feature extractor, train linear classifier on features Good when labeled data is scarce but there is lots of unlabeled data. Layer-wise unsupervised + supervised backprop Train each layer in sequence Backprop through the whole system Good when learning problem is very difficult.
  • 18. 18 Ranzato Deep Learning is B I G Main types of learning protocols Purely supervised Backprop + SGD Good when there is lots of labeled data. Layer-wise unsupervised + superv. linear classifier Train each layer in sequence using regularized auto-encoders or RBMs Hold fix the feature extractor, train linear classifier on features Good when labeled data is scarce but there is lots of unlabeled data. Layer-wise unsupervised + supervised backprop Train each layer in sequence Backprop through the whole system Good when learning problem is very difficult.
  • 19. 19 Outline Part I Ranzato Supervised Neural Networks Convolutional Neural Networks Examples Tips
  • 20. 20 Neural Networks Ranzato Assumptions (for the next few slides): The input image is vectorized (disregard the spatial layout of pixels) The target label is discrete (classification) Question: what class of functions shall we consider to map the input into the output? Answer: composition of simpler functions. Follow-up questions: Why not a linear combination? What are the “simpler” functions? What is the interpretation? Answer: later...
  • 21. 21 Neural Networks: example h2 h1 x max 0,W 1 x max0,W 2 h 1  W 3 h 2 Ranzato input 1-st layer hidden units 2-nd layer hidden units output Example of a 2 hidden layer neural network (or 4 layer network, counting also input and output). x h1 h2 o o
  • 22. 22 Forward Propagation Ranzato Def.: Forward propagation is the process of computing the output of the network given its input.
  • 23. 23 Forward Propagation Ranzato h1 =max0,W 1 xb1  x∈ RD W 1 ∈R N 1 ×D b1 ∈R N 1 h 1 ∈R N 1 x 1-st layer weight matrix or weightsW 1 1-st layer biasesb1 o The non-linearity is called ReLU in the DL literature. Each output hidden unit takes as input all the units at the previous layer: each such layer is called “fully connected”. u=max 0,v h2 h1 max 0,W 1 x max0,W 2 h 1  W 3 h 2
  • 24. 24 Forward Propagation Ranzato h2 =max 0,W 2 h1 b2  h1 ∈R N 1 W 2 ∈R N 2 ×N 1 b2 ∈R N 2 h 2 ∈R N 2 x o 2-nd layer weight matrix or weightsW 2 2-nd layer biasesb2 h2 h1 max 0,W 1 x max0,W 2 h 1  W 3 h 2
  • 25. 25 Forward Propagation Ranzato o=max 0,W 3 h2 b3  h2 ∈R N 2 W 3 ∈R N 3 ×N 2 b3 ∈R N 3 o∈R N 3 x o 3-rd layer weight matrix or weightsW 3 3-rd layer biasesb3 h2 h1 max 0,W 1 x max0,W 2 h 1  W 3 h 2
  • 26. 26 Alternative Graphical Representation Ranzato h k1 h k max 0,W k1 h k  hk1 h k W k 1 h1 k h2 k h3 k h4 k h1 k 1 h2 k 1 h3 k 1 w1,1 k1 w3,4 k1 h k hk1 W k 1
  • 27. 27 Interpretation Ranzato Question: Why can't the mapping between layers be linear? Answer: Because composition of linear functions is a linear function. Neural network would reduce to (1 layer) logistic regression. Question: What do ReLU layers accomplish? Answer: Piece-wise linear tiling: mapping is locally linear. Montufar et al. “On the number of linear regions of DNNs” arXiv 2014
  • 28. 28 Ranzato [0/1] [0/1] [0/1] [0/1] [0/1] [0/1] [0/1] [0/1] ReLU layers do local linear approximation. Number of planes grows exponentially with number of hidden units. Multiple layers yeild exponential savings in number of parameters (parameter sharing). Montufar et al. “On the number of linear regions of DNNs” arXiv 2014
  • 29. 29 Interpretation Ranzato Question: Why do we need many layers? Answer: When input has hierarchical structure, the use of a hierarchical architecture is potentially more efficient because intermediate computations can be re-used. DL architectures are efficient also because they use distributed representations which are shared across classes. [0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 … ] Exponentially more efficient than a 1-of-N representation (a la k-means) truck feature
  • 30. 30 Interpretation Ranzato [0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 … ] [1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1… ] motorbike truck
  • 31. 31 Interpretation Ranzato Input image low level parts prediction of class mid-level parts high-level parts distributed representations feature sharing compositionality ... Lee et al. “Convolutional DBN's ...” ICML 2009
  • 32. 32 Interpretation Ranzato Question: How many layers? How many hidden units? Answer: Cross-validation or hyper-parameter search methods are the answer. In general, the wider and the deeper the network the more complicated the mapping. Question: What does a hidden unit do? Answer: It can be thought of as a classifier or feature detector. Question: How do I set the weight matrices? Answer: Weight matrices and biases are learned. First, we need to define a measure of quality of the current mapping. Then, we need to define a procedure to adjust the parameters.
  • 33. 33 h 2 h 1 x o Loss max 0,W 1 x max0,W 2 h 1  W 3 h 2 Lx , y ; =−∑j y j log pc j∣x pck =1∣x= e ok ∑j=1 C eo j Probability of class k given input (softmax): (Per-sample) Loss; e.g., negative log-likelihood (good for classification of small number of classes): Ranzato How Good is a Network? y=[00..010..0] k1 C
  • 34. 34 Training ∗ =arg min ∑n=1 P Lxn , yn ;  Learning consists of minimizing the loss (plus some regularization term) w.r.t. parameters over the whole training set. Question: How to minimize a complicated function of the parameters? Answer: Chain rule, a.k.a. Backpropagation! That is the procedure to compute gradients of the loss w.r.t. parameters in a multi-layer neural network. Rumelhart et al. “Learning internal representations by back-propagating..” Nature 1986
  • 35. 35 Derivative w.r.t. Input of Softmax Lx , y ; =−∑j y j log pc j∣x pck =1∣x= e ok ∑j e o j By substituting the fist formula in the second, and taking the derivative w.r.t. we get:o ∂ L ∂o = pc∣x− y HOMEWORK: prove it! Ranzato y=[00..010..0] k1 C
  • 36. 36 Backward Propagation h 2 h 1 x Loss y Given and assuming we can easily compute the Jacobian of each module, we have: ∂ L/∂ o ∂ L ∂ o max 0,W 1 x max0,W 2 h 1  W 3 h 2 ∂ L ∂W 3 = ∂ L ∂ o ∂ o ∂W 3
  • 37. 37 Backward Propagation h 2 h 1 x Loss y Given and assuming we can easily compute the Jacobian of each module, we have: ∂ L/∂ o ∂ L ∂W 3 = ∂ L ∂ o ∂ o ∂W 3 ∂ L ∂ o max 0,W 1 x max0,W 2 h 1  W 3 h 2 ∂ L ∂W 3 =  pc∣x−y h 2 T
  • 38. 38 Backward Propagation h 2 h 1 x Loss y Given and assuming we can easily compute the Jacobian of each module, we have: ∂ L/∂ o ∂ L ∂ h 2 = ∂ L ∂ o ∂ o ∂ h 2 ∂ L ∂W 3 = ∂ L ∂ o ∂ o ∂W 3 ∂ L ∂ o max 0,W 1 x max0,W 2 h 1  W 3 h 2 ∂ L ∂W 3 =  pc∣x−y h 2 T
  • 39. 39 Backward Propagation h 2 h 1 x Loss y Given and assuming we can easily compute the Jacobian of each module, we have: ∂ L/∂ o ∂ L ∂ h 2 = ∂ L ∂ o ∂ o ∂ h 2 ∂ L ∂W 3 = ∂ L ∂ o ∂ o ∂W 3 ∂ L ∂ o max 0,W 1 x max0,W 2 h 1  W 3 h 2 ∂ L ∂W 3 =  pc∣x−y h 2 T ∂ L ∂ h 2 = W 3 T  pc∣x−y
  • 40. 40 Backward Propagation h 1 x Loss y Given we can compute now: ∂ L ∂ h 2 ∂ L ∂ h 1 = ∂ L ∂ h 2 ∂ h2 ∂ h 1 ∂ L ∂W 2 = ∂ L ∂ h 2 ∂ h 2 ∂ W 2 ∂ L ∂ o ∂ L ∂ h2 Ranzato max 0,W 1 x max0,W 2 h 1  W 3 h2
  • 41. 41 Backward Propagation x Loss y Given we can compute now: ∂ L ∂ h 1 ∂ L ∂W 1 = ∂ L ∂ h 1 ∂ h1 ∂ W 1 ∂ L ∂ h 1 Ranzato max 0,W 1 x max0,W 2 h 1  ∂ L ∂ o ∂ L ∂ h 2 W 3 h2
  • 42. 42 Backward Propagation Ranzato Question: Does BPROP work with ReLU layers only? Answer: Nope, any a.e. differentiable transformation works. Question: What's the computational cost of BPROP? Answer: About twice FPROP (need to compute gradients w.r.t. input and parameters at every layer). Note: FPROP and BPROP are dual of each other. E.g.,: + + FPROP BPROP SUMCOPY
  • 43. 43 Optimization Stochastic Gradient Descent (on mini-batches):   − ∂ L ∂ ,∈0,1 Stochastic Gradient Descent with Momentum:  0.9  ∂ L ∂   − Ranzato Note: there are many other variants...
  • 45. 45 Example: 200x200 image 40K hidden units ~2B parameters!!! - Spatial correlation is local - Waste of resources + we have not enough training samples anyway.. Fully Connected Layer Ranzato
  • 46. 46 Locally Connected Layer Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Ranzato Note: This parameterization is good when input image is registered (e.g., face recognition).
  • 47. 47 STATIONARITY? Statistics is similar at different locations Ranzato Note: This parameterization is good when input image is registered (e.g., face recognition). Locally Connected Layer Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters
  • 48. 48 Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels Ranzato
  • 64. Convolutional Layer RanzatoMathieu et al. “Fast training of CNNs through FFTs” ICLR 2014
  • 65. Convolutional Layer * -1 0 1 -1 0 1 -1 0 1 Ranzato =
  • 66. 66 Learn multiple filters. E.g.: 200x200 image 100 Filters Filter size: 10x10 10K parameters Ranzato Convolutional Layer
  • 67. 67 h j n =max 0,∑k =1 K hk n−1 ∗wkj n  Ranzato Conv. layer h1 n−1 h2 n−1 h3 n−1 h1 n h2 n output feature map input feature map kernel Convolutional Layer
  • 68. 68 h j n =max 0,∑k =1 K hk n−1 ∗wkj n  Ranzato h1 n−1 h2 n−1 h3 n−1 h1 n h2 n output feature map input feature map kernel Convolutional Layer
  • 69. 69 h j n =max 0,∑k =1 K hk n−1 ∗wkj n  Ranzato h1 n−1 h2 n−1 h3 n−1 h1 n h2 n output feature map input feature map kernel Convolutional Layer
  • 70. 70 Ranzato Question: What is the size of the output? What's the computational cost? Answer: It is proportional to the number of filters and depends on the stride. If kernels have size KxK, input has size DxD, stride is 1, and there are M input feature maps and N output feature maps then: - the input has size M@DxD - the output has size N@(D-K+1)x(D-K+1) - the kernels have MxNxKxK coefficients (which have to be learned) - cost: M*K*K*N*(D-K+1)*(D-K+1) Question: How many feature maps? What's the size of the filters? Answer: Usually, there are more output feature maps than input feature maps. Convolutional layers can increase the number of hidden units by big factors (and are expensive to compute). The size of the filters has to match the size/scale of the patterns we want to detect (task dependent). Convolutional Layer
  • 71. 71 A standard neural net applied to images: - scales quadratically with the size of the input - does not leverage stationarity Solution: - connect each hidden unit to a small patch of the input - share the weight across space This is called: convolutional layer. A network with convolutional layers is called convolutional network. LeCun et al. “Gradient-based learning applied to document recognition” IEEE 1998 Key Ideas
  • 72. 72 Let us assume filter is an “eye” detector. Q.: how can we make the detection robust to the exact location of the eye? Pooling Layer Ranzato
  • 73. 73 By “pooling” (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features. Ranzato Pooling Layer
  • 74. 74 Ranzato Pooling Layer: Examples h j n  x , y=maxx∈N x , y∈N y  hj n−1 x ,y Max-pooling: h j n  x , y=1/ K ∑x∈N  x , y ∈N  y hj n−1 x ,y Average-pooling: h j n  x , y=∑x ∈N  x, y∈N  y hj n−1 x ,y2 L2-pooling: h j n  x , y=∑k ∈N  j hk n−1  x , y2 L2-pooling over features:
  • 75. 75 Ranzato Pooling Layer Question: What is the size of the output? What's the computational cost? Answer: The size of the output depends on the stride between the pools. For instance, if pools do not overlap and have size KxK, and the input has size DxD with M input feature maps, then: - output is M@(D/K)x(D/K) - the computational cost is proportional to the size of the input (negligible compared to a convolutional layer) Question: How should I set the size of the pools? Answer: It depends on how much “invariant” or robust to distortions we want the representation to be. It is best to pool slowly (via a few stacks of conv-pooling layers).
  • 76. 76 Ranzato Pooling Layer: Interpretation Task: detect orientation L/R Conv layer: linearizes manifold
  • 77. 77 Ranzato Pooling Layer: Interpretation Conv layer: linearizes manifold Pooling layer: collapses manifold Task: detect orientation L/R
  • 78. 78 Ranzato Pooling Layer: Receptive Field Size Conv. layer hn−1 h n Pool. layer h n1 If convolutional filters have size KxK and stride 1, and pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size: (P+K-1)x(P+K-1)
  • 79. 79 Ranzato Pooling Layer: Receptive Field Size Conv. layer hn−1 h n Pool. layer h n1 If convolutional filters have size KxK and stride 1, and pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size: (P+K-1)x(P+K-1)
  • 80. 80 Local Contrast Normalization h i1  x , y= hi x , y−mi N x , y max  , i N x , y Ranzato
  • 81. 81 We want the same response. Ranzato Local Contrast Normalization h i1  x , y= hi x , y−mi N x , y max  , i N x , y
  • 82. 82 Performed also across features and in the higher layers.. Effects: – improves invariance – improves optimization – increases sparsity Note: computational cost is negligible w.r.t. conv. layer. Ranzato Local Contrast Normalization h i1  x , y= hi x , y−mi N x , y max  , i N x , y
  • 83. 83 ConvNets: Typical Stage Convol. LCN Pooling One stage (zoom) courtesy of K. Kavukcuoglu Ranzato
  • 84. 84 Convol. LCN Pooling One stage (zoom) Conceptually similar to: SIFT, HoG, etc. Ranzato ConvNets: Typical Stage
  • 85. 85courtesy of K. Kavukcuoglu Ranzato Note: after one stage the number of feature maps is usually increased (conv. layer) and the spatial resolution is usually decreased (stride in conv. and pooling layers). Receptive field gets bigger. Reasons: - gain invariance to spatial translation (pooling layer) - increase specificity of features (approaching object specific units)
  • 86. 86 Convol. LCN Pooling One stage (zoom) Fully Conn. Layers Whole system 1st stage 2nd stage 3rd stage Input Image Class Labels Ranzato ConvNets: Typical Architecture
  • 87. 87 SIFT → K-Means → Pyramid Pooling → SVM SIFT → Fisher Vect. → Pooling → SVM Lazebnik et al. “...Spatial Pyramid Matching...” CVPR 2006 Sanchez et al. “Image classifcation with F.V.: Theory and practice” IJCV 2012 Conceptually similar to: Ranzato Fully Conn. Layers Whole system 1st stage 2nd stage 3rd stage Input Image Class Labels ConvNets: Typical Architecture
  • 88. 88 ConvNets: Training Algorithm: Given a small mini-batch - F-PROP - B-PROP - PARAMETER UPDATE All layers are differentiable (a.e.). We can use standard back-propagation. Ranzato
  • 89. 89 Ranzato Note: After several stages of convolution-pooling, the spatial resolution is greatly reduced (usually to about 5x5) and the number of feature maps is large (several hundreds depending on the application). It would not make sense to convolve again (there is no translation invariance and support is too small). Everything is vectorized and fed into several fully connected layers. If the input of the fully connected layers is of size Nx5x5, the first fully connected layer can be seen as a conv. layer with 5x5 kernels. The next fully connected layer can be seen as a conv. layer with 1x1 kernels.
  • 90. 90 Ranzato NxMxM, M small H hidden units / Hx1x1 feature maps Fully conn. layer / Conv. layer (H kernels of size NxMxM)
  • 91. 91 NxMxM, M small H hidden units / Hx1x1 feature maps Fully conn. layer / Conv. layer (H kernels of size NxMxM) K hidden units / Kx1x1 feature maps Fully conn. layer / Conv. layer (K kernels of size Hx1x1)
  • 92. 92 Viewing fully connected layers as convolutional layers enables efficient use of convnets on bigger images (no need to slide windows but unroll network over space as needed to re-use computation). CNNInput Image CNN Input Image Input Image TRAINING TIME TEST TIME x y
  • 93. 93 Viewing fully connected layers as convolutional layers enables efficient use of convnets on bigger images (no need to slide windows but unroll network over space as needed to re-use computation). CNNInput Image CNN Input Image TRAINING TIME TEST TIME x y Unrolling is order of magnitudes more eficient than sliding windows! CNNs work on any image size!
  • 94. 94 ConvNets: Test At test time, run only is forward mode (FPROP). Ranzato
  • 95. 95 Fancier Architectures: Multi-Scale Farabet et al. “Learning hierarchical features for scene labeling” PAMI 2013
  • 96. 96 Fancier Architectures: Multi-Modal Frome et al. “Devise: a deep visual semantic embedding model” NIPS 2013 CNN Text Embedding tiger Matching shared representation
  • 97. 97 Fancier Architectures: Multi-Task Zhang et al. “PANDA..” CVPR 2014 Conv Norm Pool Conv Norm Pool Conv Norm Pool Conv Norm Pool Fully Conn. Fully Conn. Fully Conn. Fully Conn. ... Attr. 1 Attr. 2 Attr. N image
  • 98. 98 Fancier Architectures: Multi-Task Osadchy et al. “Synergistic face detection and pose estimation..” JMLR 2007
  • 99. 99 Fancier Architectures: Generic DAG Any DAG of differentialble modules is allowed!
  • 100. 100 Fancier Architectures: Generic DAG If there are cycles (RNN), one needs to un-roll it. Graves “Offline Arabic handwriting recognition..” Springer 2012
  • 102. 102 CONV NETS: EXAMPLES - OCR / House number & Traffic sign classification Ciresan et al. “MCDNN for image classification” CVPR 2012 Wan et al. “Regularization of neural networks using dropconnect” ICML 2013 Goodfellow et al. “Multi-digit nuber recognition from StreetView...” ICLR 2014 Jaderberg et al. “Synthetic data and ANN for natural scene text recognition” arXiv 2014
  • 103. 103 CONV NETS: EXAMPLES - Texture classification Sifre et al. “Rotation, scaling and deformation invariant scattering...” CVPR 2013
  • 104. 104 CONV NETS: EXAMPLES - Pedestrian detection Sermanet et al. “Pedestrian detection with unsupervised multi-stage..” CVPR 2013
  • 105. 105 CONV NETS: EXAMPLES - Scene Parsing Farabet et al. “Learning hierarchical features for scene labeling” PAMI 2013 RanzatoPinheiro et al. “Recurrent CNN for scene parsing” arxiv 2013
  • 106. 106 CONV NETS: EXAMPLES - Segmentation 3D volumetric images Ciresan et al. “DNN segment neuronal membranes...” NIPS 2012 Turaga et al. “Maximin learning of image segmentation” NIPS 2009 Ranzato
  • 107. 107 CONV NETS: EXAMPLES - Action recognition from videos Taylor et al. “Convolutional learning of spatio-temporal features” ECCV 2010 Karpathy et al. “Large-scale video classification with CNNs” CVPR 2014 Simonyan et al. “Two-stream CNNs for action recognition in videos” arXiv 2014
  • 108. 108 CONV NETS: EXAMPLES - Robotics Sermanet et al. “Mapping and planning ...with long range perception” IROS 2008
  • 109. 109 CONV NETS: EXAMPLES - Denoising Burger et al. “Can plain NNs compete with BM3D?” CVPR 2012 original noised denoised Ranzato
  • 110. 110 CONV NETS: EXAMPLES - Dimensionality reduction / learning embeddings Hadsell et al. “Dimensionality reduction by learning an invariant mapping” CVPR 2006
  • 111. 111 CONV NETS: EXAMPLES - Object detection Sermanet et al. “OverFeat: Integrated recognition, localization, ...” arxiv 2013 Szegedy et al. “DNN for object detection” NIPS 2013 Ranzato Girshick et al. “Rich feature hierarchies for accurate object detection...” arxiv 2013
  • 112. Dataset: ImageNet 2012 Deng et al. “Imagenet: a large scale hierarchical image database” CVPR 2009
  • 114. 114 Architecture for Classification CONV LOCAL CONTRAST NORM MAX POOLING FULLY CONNECTED LINEAR CONV LOCAL CONTRAST NORM MAX POOLING CONV CONV CONV MAX POOLING FULLY CONNECTED Krizhevsky et al. “ImageNet Classification with deep CNNs” NIPS 2012 category prediction input Ranzato
  • 115. 115CONV LOCAL CONTRAST NORM MAX POOLING FULLY CONNECTED LINEAR CONV LOCAL CONTRAST NORM MAX POOLING CONV CONV CONV MAX POOLING FULLY CONNECTED Total nr. params: 60M 4M 16M 37M 442K 1.3M 884K 307K 35K Total nr. flops: 832M 4M 16M 37M 74M 224M 149M 223M 105M Krizhevsky et al. “ImageNet Classification with deep CNNs” NIPS 2012 category prediction input Ranzato Architecture for Classification
  • 116. 116 Optimization SGD with momentum: Learning rate = 0.01 Momentum = 0.9 Improving generalization by: Weight sharing (convolution) Input distortions Dropout = 0.5 Weight decay = 0.0005 Ranzato
  • 117. 117 Results: ILSVRC 2012 RanzatoKrizhevsky et al. “ImageNet Classification with deep CNNs” NIPS 2012
  • 118. 118
  • 121. 121 Choosing The Architecture Task dependent Cross-validation [Convolution → LCN → pooling]* + fully connected layer The more data: the more layers and the more kernels Look at the number of parameters at each layer Look at the number of flops at each layer Computational resources Be creative :) Ranzato
  • 122. 122 How To Optimize SGD (with momentum) usually works very well Pick learning rate by running on a subset of the data Bottou “Stochastic Gradient Tricks” Neural Networks 2012 Start with large learning rate and divide by 2 until loss does not diverge Decay learning rate by a factor of ~1000 or more by the end of training Use non-linearity Initialize parameters so that each feature across layers has similar variance. Avoid units in saturation. Ranzato
  • 123. 123 Improving Generalization Weight sharing (greatly reduce the number of parameters) Data augmentation (e.g., jittering, noise injection, etc.) Dropout Hinton et al. “Improving Nns by preventing co-adaptation of feature detectors” arxiv 2012 Weight decay (L2, L1) Sparsity in the hidden units Multi-task (unsupervised learning) Ranzato
  • 124. 124 ConvNets: till 2012 Loss parameter Common wisdom: training does not work because we “get stuck in local minima”
  • 125. 125 ConvNets: today Loss parameter Local minima are all similar, there are long plateaus, it can take long time to break symmetries. w w input/output invariant to permutations breaking ties between parameters W T X 1 Saturating units Dauphin et al. “Identifying and attacking the saddle point problem..” arXiv 2014
  • 126. 126 Like walking on a ridge between valleys Neural Net Optimization is...
  • 127. 127 ConvNets: today Loss parameter Local minima are all similar, there are long plateaus, it can take long to break symmetries. Optimization is not the real problem when: – dataset is large – unit do not saturate too much – normalization layer
  • 128. 128 ConvNets: today Loss parameter Today's belief is that the challenge is about: – generalization How many training samples to fit 1B parameters? How many parameters/samples to model spaces with 1M dim.? – scalability
  • 129. 129 data flops/s capacity T IM E ConvNets: Why so successful today? TIME As time goes by, we get more data and more flops/s. The capacity of ML models should grow accordingly. 1K 1M 1B 1M 100M 10T Ranzato
  • 130. 130 data capacity T IM E ConvNets: Why so successful today? CNN were in many ways premature, we did not have enough data and flops/s to train them. They would overfit and be too slow to tran (apparent local minima). flops/s 1B NOTE: methods have to be easily scalable! Ranzato
  • 131. 131 Good To Know Check gradients numerically by finite differences Visualize features (feature maps need to be uncorrelated) and have high variance. samples hidden unit Good training: hidden units are sparse across samples and across features. Ranzato
  • 132. 132 Check gradients numerically by finite differences Visualize features (feature maps need to be uncorrelated) and have high variance. samples hidden unit Bad training: many hidden units ignore the input and/or exhibit strong correlations. Ranzato Good To Know
  • 133. 133 Check gradients numerically by finite differences Visualize features (feature maps need to be uncorrelated) and have high variance. Visualize parameters Good training: learned filters exhibit structure and are uncorrelated. GOOD BADBAD BAD too noisy too correlated lack structure Ranzato Good To Know Zeiler, Fergus “Visualizing and understanding CNNs” arXiv 2013 Simonyan, Vedaldi, Zisserman “Deep inside CNNs: visualizing image classification models..” ICLR 2014
  • 134. 134 Check gradients numerically by finite differences Visualize features (feature maps need to be uncorrelated) and have high variance. Visualize parameters Measure error on both training and validation set. Test on a small subset of the data and check the error → 0. Ranzato Good To Know
  • 135. 135 What If It Does Not Work? Training diverges: Learning rate may be too large → decrease learning rate BPROP is buggy → numerical gradient checking Parameters collapse / loss is minimized but accuracy is low Check loss function: Is it appropriate for the task you want to solve? Does it have degenerate solutions? Check “pull-up” term. Network is underperforming Compute flops and nr. params. → if too small, make net larger Visualize hidden units/params → fix optmization Network is too slow Compute flops and nr. params. → GPU,distrib. framework, make net smaller Ranzato
  • 136. 136 Summary Part I Deep Learning = learning hierarchical models. ConvNets are the most successful example. Leverage large labeled datasets. Optimization: don't we get stuck in local minima? No, they are all similar! In large-scale applications, local minima are even less of an issue. Scaling: GPUs, distributed frameworks (Google), better optimization techniques. Generalization on small datasets (curse of dimensionality): data augmentation, weight decay, dropout, unsupervised learning, multi-task learning. Ranzato
  • 137. 137 SOFTWARE Torch7: learning library that supports neural net training – http://www.torch.ch – http://code.cogbits.com/wiki/doku.php (tutorial with demos by C. Farabet) – https://github.com/sermanet/OverFeat Theano: Python-based learning library (U. Montreal) – http://deeplearning.net/software/theano/ (does automatic differentiation) Caffe (Yangqing Jia) – http://caffe.berkeleyvision.org Efficient CUDA kernels for ConvNets (Krizhevsky) – code.google.com/p/cuda-convnet Ranzato
  • 138. 138 REFERENCES Convolutional Nets – LeCun, Bottou, Bengio and Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998 – Krizhevsky, Sutskever, Hinton "ImageNet Classification with deep convolutional neural networks" NIPS 2012 – Jarrett, Kavukcuoglu, Ranzato, LeCun: What is the Best Multi-Stage Architecture for Object Recognition?, Proc. International Conference on Computer Vision (ICCV'09), IEEE, 2009 – Kavukcuoglu, Sermanet, Boureau, Gregor, Mathieu, LeCun: Learning Convolutional Feature Hierarchies for Visual Recognition, Advances in Neural Information Processing Systems (NIPS 2010), 23, 2010 – see yann.lecun.com/exdb/publis for references on many different kinds of convnets. – see http://www.cmap.polytechnique.fr/scattering/ for scattering networks (similar to convnets but with less learning and stronger mathematical foundations) – see http://www.idsia.ch/~juergen/ for other references to ConvNets and LSTMs. Ranzato
  • 139. 139 REFERENCES Applications of Convolutional Nets – Farabet, Couprie, Najman, LeCun. "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers", ICML 2012 – Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala and Yann LeCun: Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR 2013 – D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. NIPS 2012 – Raia Hadsell, Pierre Sermanet, Marco Scoffier, Ayse Erkan, Koray Kavukcuoglu, Urs Muller and Yann LeCun. Learning Long-Range Vision for Autonomous Off-Road Driving, Journal of Field Robotics, 26(2):120-144, 2009 – Burger, Schuler, Harmeling. Image Denoising: Can Plain Neural Networks Compete with BM3D?, CVPR 2012 – Hadsell, Chopra, LeCun. Dimensionality reduction by learning an invariant mapping, CVPR 2006 – Bergstra et al. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures, ICML 2013 Ranzato
  • 140. 140 REFERENCES Latest and Greatest Convolutional Nets – Girshick, Donahue, Darrell, Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation", arXiv 2014 – Simonyan, Zisserman "Two-stream CNNs for action recognition in videos" arXiv 2014 – Cadieu, Hong, Yamins, Pinto, Ardila, Solomon, Majaj, DiCarlo. "DNNs rival the representation of primate IT cortex for core visual object recognition". arXiv 2014 – Erhan, Szegedy, Toshev, Anguelov "Scalable object detection using DNNs" CVPR 2014 – Razavian, Azizpour, Sullivan, Carlsson "CNN features off-the-shelf: an astounding baseline for recognition" arXiv 2014 – Krizhevsky "One weird trick for parallelizing CNNs" arXiv 2014 Ranzato
  • 141. 141 REFERENCES Deep Learning in general – deep learning tutorial @ CVPR 2014 https://sites.google.com/site/deeplearningcvpr2014/ – deep learning tutorial slides at ICML 2013: icml.cc/2013/?page_id=39 – Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp.1-127, 2009. – LeCun, Chopra, Hadsell, Ranzato, Huang: A Tutorial on Energy-Based Learning, in Bakir, G. and Hofman, T. and Schölkopf, B. and Smola, A. and Taskar, B. (Eds), Predicting Structured Data, MIT Press, 2006 "Theory" of Deep Learning – Mallat: Group Invariant Scattering, Comm. in Pure and Applied Math. 2012 – Pascanu, Montufar, Bengio: On the number of inference regions of DNNs with piecewise linear activations, ICLR 2014 – Pascanu, Dauphin, Ganguli, Bengio: On the saddle-point problem for non-convex optimization, arXiv 2014 – Delalleau, Bengio: Shallow vs deep Sum-Product Networks, NIPS 2011 Ranzato
  • 142. 142 THE END - PART I Ranzato
  • 143. 143 Outline Part II Theory: Energy-Based Models Energy function Loss function Examples: Supervised learning: neural nets Unsupervised learning: sparse coding Unsupervised learning: gated MRF Conclusions Ranzato
  • 144. 144 Energy-Based Models: Energy Function Energy: E(y; θ) (unsupervised) or E(y, x; θ) (supervised) LeCun et al. "A Tutorial on Energy-Based Learning", Predicting Structured Data 2006 Ranzato et al. "A unified energy-based framework for unsup. learning" AISTATS 2007
  • 145. 145 Energy-Based Models: Energy Function Energy: E(y; θ) (unsupervised) or E(y, x; θ) (supervised); y can be discrete or continuous. LeCun et al. "A Tutorial on Energy-Based Learning", Predicting Structured Data 2006 Ranzato et al. "A unified energy-based framework for unsup. learning" AISTATS 2007
  • 146. 146 Energy-Based Models: Energy Function Energy: E(y; θ) (unsupervised) or E(y, x; θ) (supervised); y can be discrete or continuous. We will refer to the unsupervised/continuous case, but much of the following applies to the other cases as well. LeCun et al. "A Tutorial on Energy-Based Learning", Predicting Structured Data 2006 Ranzato et al. "A unified energy-based framework for unsup. learning" AISTATS 2007
  • 147. 147 Energy-Based Models: Energy Function The energy should be lower for the desired output. (figure: energy E as a function of y) Ranzato
  • 148. 148 Energy-Based Models: Energy Function Make the energy lower at the desired output y*. (figure: energy E(y) around y*, BEFORE TRAINING) Ranzato
  • 149. 149 Energy-Based Models: Energy Function Make the energy lower at the desired output y*. (figure: energy E(y) around y*, AFTER TRAINING) Ranzato
  • 150. 150 Energy-Based Models: Energy Function Examples of energy functions: PCA: E(y) = ||y − W W^T y||_2^2 Linear binary classifier: E(y; x) = −y W^T x, y ∈ {−1, 1} Neural net binary classifier: E(y; x) = −y W_2^T f(x; W_1)
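A small NumPy sketch of these example energies, just to make the notation concrete; the shapes, random data and the choice of a ReLU feature map f are illustrative assumptions.

```python
import numpy as np

def pca_energy(y, W):
    """E(y) = ||y - W W^T y||^2 : reconstruction error of a linear subspace."""
    r = y - W @ (W.T @ y)
    return float(r @ r)

def linear_classifier_energy(y, x, w):
    """E(y; x) = -y * w^T x, with y in {-1, +1}."""
    return float(-y * (w @ x))

def neural_net_energy(y, x, W1, w2):
    """E(y; x) = -y * w2^T f(x; W1), with f a one-hidden-layer ReLU feature map."""
    f = np.maximum(0.0, W1 @ x)
    return float(-y * (w2 @ f))

# usage on random toy data
rng = np.random.default_rng(0)
y_vec, W = rng.standard_normal(5), rng.standard_normal((5, 2))
x, w = rng.standard_normal(10), rng.standard_normal(10)
W1, w2 = rng.standard_normal((20, 10)), rng.standard_normal(20)
print(pca_energy(y_vec, W), linear_classifier_energy(+1, x, w), neural_net_energy(-1, x, W1, w2))
```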
  • 151. 151 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) Examples of loss functions: PCA: E(y) = ||y − W W^T y||_2^2, L(E(y)) = E(y) Logistic regression classifier: E(y; x) = −y W^T x, L(E(y; x)) = log(1 + exp(E(y; x)))
  • 152. 152 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) How do we design good loss functions? Ranzato
  • 153. 153 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) How do we design good loss functions? Example: L = E(y) (figure: energy E(y) pulled down at the training points)
  • 154. 154 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) How do we design good loss functions? Example: L = E(y) (figure: energy E(y) pulled down further as training proceeds)
  • 155. 155 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) How do we design good loss functions? Example: L = E(y) — BAD LOSS: the energy becomes degenerate, low everywhere!
  • 156. 156 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) How do we design good loss functions? Example: L = E(y) + log Σ_{y'} exp(−E(y'))
  • 157. 157 Energy-Based Models: Loss Function The loss is a function of the energy. Minimizing the loss over the training set yields the desired energy landscape: θ* = argmin_θ Σ_p L(E(y^p; θ)) How do we design good loss functions? Example: L = E(y) + log Σ_{y'} exp(−E(y')) — GOOD LOSS, but potentially very expensive.
  • 158. 158 Strategies to Shape E: #1 Pull down the correct answer and pull up everywhere else (Negative Log-Likelihood): L = E(y) + log Σ_u exp(−E(u)), i.e. p(y) = exp(−E(y)) / Σ_u exp(−E(u)) PROS It produces calibrated probabilities. CONS Expensive to compute when y is discrete and high dimensional. Generally intractable when y is continuous.
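A tiny NumPy sketch of strategy #1 for a discrete y: the negative log-likelihood loss computed from a vector of energies via a numerically stable log-sum-exp; the energy values below are illustrative.

```python
import numpy as np

def nll_from_energies(energies, target):
    """L = E(y_target) + log sum_u exp(-E(u)); equivalent to a softmax over -E."""
    e = np.asarray(energies, dtype=float)
    m = (-e).max()                              # shift for numerical stability
    log_z = m + np.log(np.sum(np.exp(-e - m)))  # log sum_u exp(-E(u))
    return e[target] + log_z

# three candidate outputs; the correct one (index 0) has the lowest energy
print(nll_from_energies([-2.0, 1.0, 0.5], target=0))
```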
  • 159. 159 Strategies to Shape E: #2 Pull down the correct answer and pull up carefully chosen points: L = max(0, m − (E(ȳ) − E(y))), e.g. ȳ = argmin_{y' ≠ y} E(y') E.g.: Contrastive Divergence, Ratio Matching, Noise Contrastive Estimation, Minimum Probability Flow... Hinton et al. "A fast learning algorithm for DBNs" Neural Comp. 2006 Gutmann et al. "Noise contrastive estimation of unnormalized..." JMLR 2012 Hyvarinen "Some extensions of score matching" Comp Stats 2007 Sohl-Dickstein et al. "Minimum probability flow learning" ICML 2011
  • 160. 160 Strategies to Shape E: #2 Pull down the correct answer and pull up carefully chosen points: L = max(0, m − (E(ȳ) − E(y))), e.g. ȳ = argmin_{y' ≠ y} E(y') PROS Efficient. CONS The criterion to pick where to pull up is tricky (especially in high-dimensional spaces): it trades off computational and statistical efficiency.
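A short NumPy sketch of strategy #2 for a discrete y: a margin loss that pulls up only the most offending incorrect answer; the margin value is an illustrative assumption.

```python
import numpy as np

def margin_loss(energies, target, m=1.0):
    """L = max(0, m - (E(y_bar) - E(y))), with y_bar the lowest-energy wrong answer."""
    e = np.asarray(energies, dtype=float)
    wrong = np.delete(e, target)            # energies of all incorrect answers
    e_bar = wrong.min()                     # most offending incorrect answer
    return max(0.0, m - (e_bar - e[target]))

# the correct answer (index 0) is already more than m below the best wrong answer -> zero loss
print(margin_loss([-2.0, 1.0, 0.5], target=0))
```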
  • 161. 161 Strategies to Shape E: #3 Pull down the correct answer and increase local curvature (Score Matching): L = ½ (∂E(y)/∂y)^2 − ∂^2E(y)/∂y^2 (summed over the components of y) Hyvarinen "Estimation of non-normalized statistical models using score matching" JMLR 2005
  • 162. 162 Strategies to Shape E: #3 Pull down the correct answer and increase local curvature (Score Matching): L = ½ (∂E(y)/∂y)^2 − ∂^2E(y)/∂y^2 (figure: the energy flattens at the data point while its curvature increases) Hyvarinen "Estimation of non-normalized statistical models using score matching" JMLR 2005
  • 163. 163 Strategies to Shape E: #3 Pull down the correct answer and increase local curvature (Score Matching): L = ½ (∂E(y)/∂y)^2 − ∂^2E(y)/∂y^2 PROS Efficient in continuous but not too high dimensional spaces. CONS Very complicated to compute and not practical in very high dimensional spaces. Not applicable in discrete spaces.
  • 164. 164 Strategies to Shape E: #4 Pull down the correct answer and put a global constraint on the energy, so that only few minima can exist: L = E(y) (PCA, ICA, sparse coding, ...) PROS Efficient in continuous, high-dimensional spaces. CONS Need to design good global constraints. Used in unsup. learning only. Ranzato et al. "A unified energy-based framework for unsup. learning" AISTATS 2007
  • 165. 165 Strategies to Shape E: #4 Pull down the correct answer and put a global constraint on the energy, so that only few minima can exist: L = E(y) (PCA, ICA, sparse coding, ...) PROS Efficient in continuous, high-dimensional spaces. CONS Need to design good global constraints. Used in unsup. learning only. Ranzato et al. "A unified energy-based framework for unsup. learning" AISTATS 2007
  • 166. 166 Strategies to Shape E: #4 Pull down the correct answer and put a global constraint on the energy, so that only few minima can exist. Typical methods (unsup. learning): use a compact internal representation (PCA); have a finite number of internal states (K-Means); use sparse codes (ICA, sparse coding) Ranzato
  • 167. 167 (figure: INPUT SPACE and FEATURE SPACE, connected by mappings f(x, h) and g(x, h) between input x and feature/code h; legend: training sample, input data point which is not a training sample, feature (code)) Ranzato
  • 168. 168 (figure: INPUT SPACE and FEATURE SPACE; the decoder reconstructs x as W h) E.g. K-means: h is 1-of-N. Since there are very few "codes" available and the energy (MSE) is minimized on the training set, the energy must be higher elsewhere. Ranzato
  • 169. 169 Strategies to Shape E: #5 Make the observed data an attractor state of some energy function; we only need to define a dynamics (Denoising AE). E.g.: ỹ = W_2 W_1 (y + n), n ∼ N(0, I), L = ½ ||ỹ − y||^2 Training samples must be stable points (local minima) and attractors. Vincent et al. "Extracting and composing robust features with denoising autoencoders" ICML 2008
  • 170. 170 Strategies to Shape E: #5 Make the observed data an attractor state of some energy function; we only need to define a dynamics (Denoising AE). E.g.: ỹ = W_2 W_1 (y + n), n ∼ N(0, I), L = ½ ||ỹ − y||^2 Training samples must be stable points (local minima) and attractors. Kamyshanska et al. "On autoencoder scoring" ICML 2013 Ranzato
  • 171. 171 Strategies to Shape E: #5 Make the observed data an attractor state of some energy function; we only need to define a dynamics (Denoising AE). E.g.: ỹ = W_2 W_1 (y + n), n ∼ N(0, I), L = ½ ||ỹ − y||^2 Training samples must be stable points (local minima) and attractors. Kamyshanska et al. "On autoencoder scoring" ICML 2013 Ranzato
  • 172. 172 Strategies to Shape E: #5 Make the observed data an attractor state of some energy function; we only need to define a dynamics (Denoising AE). PROS Efficient in high-dimensional spaces. CONS Need to pick the noise distribution. May need to pick many noisy points. Kamyshanska et al. "On autoencoder scoring" ICML 2013
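A compact NumPy sketch of the denoising auto-encoder idea behind strategy #5: corrupt the input with Gaussian noise, reconstruct it, and minimize the squared reconstruction error by SGD. The single linear encoder/decoder pair, noise level and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, lr, sigma = 20, 10, 0.01, 0.3
W1 = 0.1 * rng.standard_normal((H, D))   # encoder
W2 = 0.1 * rng.standard_normal((D, H))   # decoder

def dae_step(y):
    """One SGD step on L = 0.5 * ||W2 W1 (y + n) - y||^2 with n ~ N(0, sigma^2 I)."""
    global W1, W2
    y_noisy = y + sigma * rng.standard_normal(D)
    h = W1 @ y_noisy
    y_rec = W2 @ h
    err = y_rec - y                      # dL/dy_rec
    gW2 = np.outer(err, h)
    gW1 = np.outer(W2.T @ err, y_noisy)
    W2 -= lr * gW2
    W1 -= lr * gW1
    return 0.5 * float(err @ err)

# train on samples that lie near a low-dimensional subspace
basis = rng.standard_normal((D, 3))
for t in range(2000):
    y = basis @ rng.standard_normal(3)
    loss = dae_step(y)
print("final reconstruction loss:", loss)
```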
  • 173. 173 Loss: summary Goal of the loss: make the energy lower for the correct answer. Different losses choose differently how to "pull up": pull up all points; pull up one or a few points only; make observations minima and increase curvature; add global constraints/penalties on internal states; define a dynamics with observations at the minima. The choice of loss depends on the desired landscape, computational budget, domain of the input (discrete/continuous), task, etc. Ranzato
  • 174. 174 Final Notes on EBMs EBMs apply to any predictor (shallow & deep). EBMs subsume graphical models (e.g., use strategy #1 or #2 to "pull up"). EBMs are a general framework for designing good loss functions. Ranzato
  • 175. 175 Outline Ranzato Theory: Energy-Based Models Energy function Loss function Examples: Supervised learning: neural nets Unsupervised learning: sparse coding Unsupervised learning: gated MRF Conclusions
  • 176. 176 Perceptron Neural Net Boosting SVM GMM ΣΠ BayesNP CNN Recurrent Neural Net Autoencoder Neural Net Sparse Coding Restricted BMDBN Deep (sparse/denoising) Autoencoder UNSUPERVISED SUPERVISED DEEP SHALLOWPROBABILISTIC Loss: type #1
  • 177. 177 Neural Nets for Classification (Loss: type #1) h_1 = max(0, W_1 x), h_2 = max(0, W_2 h_1), h_3 = W_3 h_2 E(y; x) = −y h_3 L(E(y; x)) = log(1 + exp(E(y; x))) {W_1*, W_2*, W_3*} = argmin_{W_1, W_2, W_3} L(E(y; x)) Ranzato
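A minimal NumPy sketch of this type-#1 setup: a two-hidden-layer ReLU network whose scalar output defines the energy E(y; x) = −y h3, trained with the logistic loss log(1 + exp(E)); the layer sizes, learning rate and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 50, 30
W1 = 0.1 * rng.standard_normal((H, D))
W2 = 0.1 * rng.standard_normal((H, H))
w3 = 0.1 * rng.standard_normal(H)
lr = 0.05

def train_step(x, y):
    """y in {-1, +1}; one SGD step on L = log(1 + exp(-y * h3))."""
    global W1, W2, w3
    h1 = np.maximum(0.0, W1 @ x)
    h2 = np.maximum(0.0, W2 @ h1)
    h3 = w3 @ h2                          # scalar score
    E = -y * h3
    L = np.logaddexp(0.0, E)              # stable log(1 + exp(E))
    dE = 1.0 / (1.0 + np.exp(-E))         # dL/dE = sigmoid(E)
    dh3 = -y * dE
    gw3 = dh3 * h2
    dh2 = dh3 * w3 * (h2 > 0)             # back through ReLU of layer 2
    gW2 = np.outer(dh2, h1)
    dh1 = (W2.T @ dh2) * (h1 > 0)         # back through ReLU of layer 1
    gW1 = np.outer(dh1, x)
    W1 -= lr * gW1; W2 -= lr * gW2; w3 -= lr * gw3
    return float(L)

# toy linearly separable problem
for t in range(1000):
    x = rng.standard_normal(D)
    y = 1.0 if x[0] + x[1] > 0 else -1.0
    loss = train_step(x, y)
print("final loss:", loss)
```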
  • 178. 178 Perceptron Neural Net Boosting SVM GMM ΣΠ BayesNP CNN Recurrent Neural Net Autoencoder Neural Net Sparse Coding Restricted BMDBN Deep (sparse/denoising) Autoencoder UNSUPERVISED SUPERVISED DEEP SHALLOWPROBABILISTIC Loss: type #4
  • 179. 179 Energy & latent variables Energy may have latent variables. Two major approaches: Marginalization (intractable if space is large) Minimization E y=minh E y ,h E y=−log ∑h exp−E y ,h Ranzato
  • 180. 180 Sparse Coding Loss type #4: energy loss with (sparsity) constraints. E(x, h; W) = ½ ||x − W h||_2^2 + λ ||h||_1 E(x; W) = min_h E(x, h; W) L = E(x; W) Olshausen & Field, Nature 1996 Ranzato et al. NIPS 2006
  • 181. 181 Inference Prediction of the latent variables: h* = argmin_h E(x, h; W), with E(x, h; W) = ½ ||x − W h||_2^2 + λ ||h||_1 Inference is an iterative process. Ranzato
  • 182. 182 Inference Prediction of the latent variables: h* = argmin_h E(x, h; W), with E(x, h; W) = ½ ||x − W h||_2^2 + λ ||h||_1 Inference is an iterative process. Q.: Is it possible to make inference more efficient? A.: Yes, by training another module to directly predict the optimal latent variables. Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008
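A short NumPy sketch of the iterative inference step for sparse coding, using ISTA (proximal gradient descent on the energy above); the step size, sparsity weight and random dictionary are illustrative assumptions, and ISTA is one of several solvers that could be used here.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_inference(x, W, lam=0.1, n_steps=100):
    """Minimize E(x, h) = 0.5 ||x - W h||^2 + lam ||h||_1 over h by ISTA."""
    L = np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the smooth part
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = W.T @ (W @ h - x)           # gradient of the reconstruction term
        h = soft_threshold(h - grad / L, lam / L)
    return h

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
W /= np.linalg.norm(W, axis=0)             # unit-norm dictionary atoms
h_true = np.zeros(128); h_true[rng.choice(128, 5, replace=False)] = rng.standard_normal(5)
x = W @ h_true + 0.01 * rng.standard_normal(64)
h = ista_inference(x, W)
print("nonzeros in inferred code:", int(np.sum(np.abs(h) > 1e-3)))
```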
  • 183. 183 Inference in Sparse Coding E(x, h) = ½ ||x − W_2 h||_2^2 + λ ||h||_1 (figure: inference as h = argmin_h E, mapping the input x to its code h through the dictionary W_2) Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008 Ranzato
  • 184. 184 Inference in Sparse Coding E(x, h) = ½ ||x − W_2 h||_2^2 + λ ||h||_1 (figure: block diagram — the code h feeds the decoder W_2 h; the energy sums the reconstruction term ||x − W_2 h||_2^2 and the sparsity term ||h||_1) Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008 Ranzato
  • 185. 185 Learning To Perform Fast Inference E(x, h) = ½ ||x − W_2 h||_2^2 + λ ||h||_1 + ½ ||h − g(x; W_1)||_2^2 (figure: an encoder g(x; W_1) is trained to predict the code h through the term ||h − g(x; W_1)||_2^2) Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008 Ranzato
  • 186. 186 Predictive Sparse Decomposition E(x, h) = ½ ||x − W_2 h||_2^2 + λ ||h||_1 + ½ ||h − g(x; W_1)||_2^2 (figure: block diagram combining the encoder g(x; W_1), the decoder W_2 h, and the reconstruction, prediction and sparsity terms) Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008 Ranzato
  • 187. 187 Sparse Auto-Encoders Example: Predictive Sparse Decomposition E(x, h) = ½ ||x − W_2 h||_2^2 + λ ||h||_1 + ½ ||h − g(x; W_1)||_2^2 TRAINING: For every sample: Initialize h = g(x; W_1) (E) Infer the optimal latent variables: h* = argmin_h E(x, h; W) (M) Update the parameters W_1, W_2 Inference at test time (fast): h* ≈ g(x; W_1) Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008 Ranzato
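A rough NumPy sketch of this alternating (E/M-style) training loop for Predictive Sparse Decomposition, reusing the ISTA solver sketched earlier for the inference step (warm-started from the encoder) and taking simple gradient steps on the decoder and encoder; the learning rates, sizes and the linear encoder form are illustrative assumptions, not the exact recipe from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, lam, lr = 64, 128, 0.1, 0.01
W1 = 0.1 * rng.standard_normal((K, D))   # encoder g(x; W1) = W1 x (illustrative choice)
W2 = rng.standard_normal((D, K))
W2 /= np.linalg.norm(W2, axis=0)         # decoder / dictionary with unit-norm atoms

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def infer(x, h0, n_steps=50):
    """ISTA on 0.5||x - W2 h||^2 + lam||h||_1 + 0.5||h - h0||^2, warm-started at h0."""
    L = np.linalg.norm(W2, 2) ** 2 + 1.0
    h = h0.copy()
    for _ in range(n_steps):
        grad = W2.T @ (W2 @ h - x) + (h - h0)
        h = soft_threshold(h - grad / L, lam / L)
    return h

def psd_train_step(x):
    global W1, W2
    h0 = W1 @ x                           # initialize with the encoder prediction
    h = infer(x, h0)                      # (E) optimal code for the current parameters
    # (M) gradient steps on decoder and encoder, holding the code h fixed
    W2 -= lr * np.outer(W2 @ h - x, h)
    W1 -= lr * np.outer(h0 - h, x)
    W2 /= np.maximum(np.linalg.norm(W2, axis=0), 1e-8)   # keep atoms unit norm

for t in range(500):
    psd_train_step(rng.standard_normal(D))
```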
  • 188. 188 E x ,h= 1 2 ∥x−W 2 h∥2 2 ∥h∥1 1 2 ∥h−g x ;W 1∥2 2 Kavukcuoglu et al. “Predictive Sparse Decomposition” ArXiv 2008 x h W x h alternative graphical representations Predictive Sparse Decomposition Ranzato
  • 189. 189 LISTA (figure: the iterative solver unrolled into a feed-forward network — the input passes through W_1, then repeated blocks sharing a matrix W_3 and a shrinkage nonlinearity produce the code h entering the terms ||x − W_2 h||_2^2, ||h − g(x; W_1)||_2^2 and ||h||_1) Gregor et al. "Learning fast approximations of sparse coding" ICML 2010 Ranzato
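A minimal sketch of the LISTA idea: unroll a fixed number of ISTA-like iterations into a small feed-forward network with a soft-threshold nonlinearity, using W1 for the input filter and W3 for the lateral/recurrent term to match the symbols on the slide. The initialization from a dictionary W2 shown below is the ISTA-derived one; in the trained method W1, W3 and the thresholds are learned to match codes from exact inference, which this sketch does not do.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lista_forward(x, W1, W3, theta, n_layers=3):
    """Unrolled ISTA: h_{t+1} = shrink(W1 x + W3 h_t, theta)."""
    b = W1 @ x
    h = soft_threshold(b, theta)
    for _ in range(n_layers - 1):
        h = soft_threshold(b + W3 @ h, theta)
    return h

# untrained example, initialized from a dictionary W2 as plain ISTA would suggest
rng = np.random.default_rng(0)
D, K, lam = 64, 128, 0.1
W2 = rng.standard_normal((D, K)); W2 /= np.linalg.norm(W2, axis=0)
L = np.linalg.norm(W2, 2) ** 2
W1 = W2.T / L                      # input filter
W3 = np.eye(K) - (W2.T @ W2) / L   # lateral/recurrent term
h = lista_forward(rng.standard_normal(D), W1, W3, theta=lam / L)
print("code length:", h.shape[0], "nonzeros:", int(np.sum(h != 0)))
```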
  • 190. 190 KEY IDEAS Inference can require expensive optimization. We may approximate exact inference well by using a non-linear function (learn the optimal approximation to perform fast inference). The original model and the fast predictor can be trained jointly. Kavukcuoglu et al. "Predictive Sparse Decomposition" arXiv 2008 Rolfe et al. "Discriminative Recurrent Sparse Autoencoders" ICLR 2013 Szlam et al. "Fast approximations to structured sparse coding..." ECCV 2012 Gregor et al. "Structured sparse coding via lateral inhibition" NIPS 2011 Kavukcuoglu et al. "Learning convolutional feature hierarchies.." NIPS 2010 Ranzato
  • 191. 191 Outline Theory: Energy-Based Models Energy function Loss function Examples: Supervised learning: neural nets Supervised learning: convnets Unsupervised learning: sparse coding Unsupervised learning: gated MRF Other examples Practical tricks Ranzato
  • 192. 192 Perceptron Neural Net Boosting SVM GMM ΣΠ BayesNP CNN Recurrent Neural Net Autoencoder Neural Net Sparse Coding Restricted BMDBN Deep (sparse/denoising) Autoencoder UNSUPERVISED SUPERVISED DEEP SHALLOWPROBABILISTIC Loss: type #2
  • 193. 193 Probabilistic Models of Natural Images (figure: INPUT SPACE and LATENT SPACE, a training sample mapped to a latent vector through p(x|h)) p(x|h) = N(mean(h), D) – examples: PPCA, Factor Analysis, ICA, Gaussian RBM Ranzato
  • 194. 194 Probabilistic Models of Natural Images p(x|h) = N(mean(h), D) – examples: PPCA, Factor Analysis, ICA, Gaussian RBM (figure: input image → p(h|x) → latent variables → p(x|h) → generated image) The model does not represent dependencies well, only the mean intensity. Ranzato
  • 195. 195 Probabilistic Models of Natural Images (figure: INPUT SPACE and LATENT SPACE, a training sample mapped to a latent vector through p(x|h)) p(x|h) = N(0, Covariance(h)) – examples: PoT, cRBM Welling et al. NIPS 2003, Ranzato et al. AISTATS 10 Ranzato
  • 196. 196 Probabilistic Models of Natural Images p(x|h) = N(0, Covariance(h)) – examples: PoT, cRBM (figure: input image, "Andy Warhol 1960" → p(h|x) → latent variables → p(x|h) → generated image) The model does not represent the mean intensity well, only the dependencies. Welling et al. NIPS 2003, Ranzato et al. AISTATS 10
  • 197. 197 Probabilistic Models of Natural Images (figure: INPUT SPACE and LATENT SPACE, a training sample mapped to a latent vector through p(x|h)) p(x|h) = N(mean(h), Covariance(h)) – this is what we propose: mcRBM, mPoT Ranzato et al. CVPR 10, Ranzato et al. NIPS 2010, Ranzato et al. CVPR 11 Ranzato
  • 198. 198 Probabilistic Models of Natural Images (figure: INPUT SPACE and LATENT SPACE, a training sample mapped to a latent vector through p(x|h)) p(x|h) = N(mean(h), Covariance(h)) – this is what we propose: mcRBM, mPoT Ranzato et al. CVPR 10, Ranzato et al. NIPS 2010, Ranzato et al. CVPR 11 Ranzato
  • 199. 199 (comparison: PoT models p(x|h) = N(0, Σ); PPCA models N(m, I); our model captures both, N(m, Σ)) Ranzato
  • 200. 200 Deep Gated MRF Layer 1: E(x, h^c, h^m) = ½ x' Σ^{-1} x (pair-wise MRF between pixels x_p and x_q) p(x, h^c, h^m) ∝ exp(−E(x, h^c, h^m)) Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013 Ranzato
  • 201. 201 Deep Gated MRF Layer 1: E(x, h^c, h^m) = ½ x' C C' x (pair-wise MRF, with the precision factored through filters C) Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013 Ranzato
  • 202. 202 Deep Gated MRF Layer 1: E(x, h^c, h^m) = ½ x' C diag(h^c) C' x (gated MRF: the pair-wise interactions are gated by the covariance hiddens h^c_k) Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013 Ranzato
  • 203. 203 Deep Gated MRF Layer 1: E(x, h^c, h^m) = ½ x' C diag(h^c) C' x + ½ x' x − x' W h^m (gated MRF with covariance hiddens h^c_k and mean hiddens h^m_j) p(x) ∝ ∫_{h^c} ∫_{h^m} exp(−E(x, h^c, h^m)) Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013
  • 204. 204 Deep Gated MRF Layer 1: E(x, h^c, h^m) = ½ x' C diag(h^c) C' x + ½ x' x − x' W h^m Inference of the latent variables: just a forward pass p(h^c_k = 1 | x) = σ(−½ P_k (C' x)^2 + b_k) p(h^m_j = 1 | x) = σ(W_j' x + b_j) Generation: sampling from a multivariate Gaussian p(x | h^c, h^m) = N(W h^m, Σ), with Σ^{-1} = C diag(h^c) C' + I Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013
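A small NumPy sketch of the inference forward pass implied by these formulas: the posteriors of the covariance and mean hiddens are simple functions of filter responses. The shapes, pooling matrix P and biases below are illustrative assumptions; the exact parameterization is the one in the PAMI paper, not this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_mrf_inference(x, C, P, W, b_c, b_m):
    """Posteriors of the gated-MRF hiddens given an image x (single forward pass).
    p(h^c_k = 1 | x) = sigmoid(-0.5 * P_k (C'x)^2 + b^c_k)
    p(h^m_j = 1 | x) = sigmoid(W_j' x + b^m_j)
    """
    filt = C.T @ x                                  # filter responses C'x
    hc = sigmoid(-0.5 * (P @ filt ** 2) + b_c)      # covariance hiddens (pooled squared responses)
    hm = sigmoid(W.T @ x + b_m)                     # mean hiddens
    return hc, hm

# toy shapes: a 16x16 patch, 64 covariance filters pooled into 32 units, 32 mean filters
rng = np.random.default_rng(0)
D, F, K, M = 256, 64, 32, 32
x = rng.standard_normal(D)
C = rng.standard_normal((D, F)); P = np.abs(rng.standard_normal((K, F)))
W = rng.standard_normal((D, M)); b_c = np.zeros(K); b_m = np.zeros(M)
hc, hm = gated_mrf_inference(x, C, P, W, b_c, b_m)
print(hc.shape, hm.shape)
```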
  • 205. 205 Deep Gated MRF: Training Layer 1: E(x, h^c, h^m) = ½ x' C diag(h^c) C' x + ½ x' x − x' W h^m Training by maximum likelihood (loss type #2): p(x) = ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) / ∫_{x, h^m, h^c} exp(−E(x, h^m, h^c)) Since integrating over the input space is intractable, we use PCD. Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013
  • 206. 206 (figure: free-energy landscape with ∂F/∂w evaluated at a training sample) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 207. 207 (figure: an HMC dynamics starts from the training sample) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 208. 208 (figure: the HMC dynamics keeps moving across the free-energy landscape) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 209. 209 (figure: the HMC dynamics keeps moving across the free-energy landscape) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 210. 210 (figure: the HMC dynamics keeps moving across the free-energy landscape) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 211. 211 (figure: the HMC dynamics keeps moving across the free-energy landscape) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 212. 212 (figure: the HMC dynamics reaches a sample point) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 213. 213 (figure: gradients ∂F/∂w at the data point and at the sample) Weight update: w ← w − η (∂F/∂w |_data − ∂F/∂w |_sample) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 214. 214 (figure: the update lowers the free energy at the data point and raises it at the sample) Weight update: w ← w − η (∂F/∂w |_data − ∂F/∂w |_sample) F = −log p(x) + K, p(x) = exp(−F(x)) / ∫_x exp(−F(x)), F(x) = −log ∫_{h^m, h^c} exp(−E(x, h^m, h^c)) (free energy)
  • 215. 215 Start the dynamics from the previous point (persistent chains). (figure: data point and sample on the free-energy landscape) w ← w − η (∂F/∂w |_data − ∂F/∂w |_sample) F = −log p(x) + K
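A rough sketch of the PCD-style update loop these slides animate, written for a generic differentiable free energy. Here a simple Langevin step stands in for the HMC dynamics used in the paper, and the toy energy, step sizes and chain count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy free energy: F(x; w) = 0.5 * ||x - w||^2, i.e. a Gaussian centered at w
def free_energy_grad_w(x, w):
    return -(x - w)                    # dF/dw

def free_energy_grad_x(x, w):
    return (x - w)                     # dF/dx, used by the sampler

D, lr, step, n_steps = 5, 0.05, 0.1, 10
w = np.zeros(D)
chain = rng.standard_normal(D)         # persistent chain, kept across updates
data = 2.0 + 0.1 * rng.standard_normal((500, D))

for x in data:
    # negative phase: continue the dynamics from the previous sample (PCD)
    for _ in range(n_steps):
        chain = (chain - step * free_energy_grad_x(chain, w)
                 + np.sqrt(2 * step) * rng.standard_normal(D))   # Langevin step
    # weight update: w <- w - lr * (dF/dw|_data - dF/dw|_sample)
    w -= lr * (free_energy_grad_w(x, w) - free_energy_grad_w(chain, w))

print("learned center ~", w.round(2))   # should approach the data mean, ~2
```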
  • 216. 216 Deep Gated MRF (figure: layer 1 on the input, layer 2 producing features h2) Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013 Ranzato
  • 217. 217 Deep Gated MRF (figure: layer 1 on the input, layer 2 producing h2, layer 3 producing h3) Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013 Ranzato
  • 218. 218 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010)
  • 219. 219 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 1 layer) Ranzato et al. PAMI 2013
  • 220. 220 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 1 layer) Ranzato et al. PAMI 2013
  • 221. 221 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 1 layer) Ranzato et al. PAMI 2013
  • 222. 222 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 3 layers) Ranzato et al. PAMI 2013
  • 223. 223 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 3 layers) Ranzato et al. PAMI 2013
  • 224. 224 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 3 layers) Ranzato et al. PAMI 2013
  • 225. 225 Sampling High-Resolution Images (comparison panels: Gaussian model; marginal wavelet model from Simoncelli 2005; pair-wise MRF; FoE from Schmidt, Gao, Roth CVPR 2010; gMRF: 3 layers) Ranzato et al. PAMI 2013
  • 226. 226 Sampling After Training on Face Images (panels: original input; 1st layer; 2nd layer; 3rd layer; 4th layer; 10 times; unconstrained samples and samples conditioned on the left part of the face) Ranzato et al. PAMI 2013 Ranzato
  • 227. 227 Expression Recognition Under Occlusion Ranzato et al. PAMI 2013 Ranzato
  • 228. 228 Tang et al. "Robust BMs for recognition and denoising" CVPR 2012 Ranzato
  • 229. 229 Conclusion Pros: feature extraction is fast; unprecedented generation quality; advances models of natural images; trains without labeled data. Cons: training is inefficient (slow, tricky); sampling scales badly with dimensionality. What's the use case of generative models? If generation is not required, other feature learning methods are more efficient (e.g., sparse auto-encoders). Given enough labeled data, unsup. learning methods have not produced more useful features.
  • 231. 231 Structured Prediction: GTN LeCun et al. “Gradient-based learning applied to document recognition” IEEE 1998
  • 232. 232 Summary Part II Energy-based models: a framework to understand and design loss functions. Training is about shaping the energy. Dealing with high-dimensional (continuous) spaces: explicit pull-up VS global constraints (e.g., sparsity). Unsupervised learning: a field in its infancy. Humans use it a lot. How to discover good compositional feature hierarchies? How to scale to large dimensional spaces (curse of dimensionality)? Hot research topics: scaling up to even larger datasets; unsupervised learning; multi-modal & multi-task learning; video; structured prediction. Ranzato
  • 233. 233 Questions A ReLU layer computes: h = max(0, W x + b), where x is a D dimensional vector, h and b are N dimensional vectors and W is a matrix of size NxD. If dL/dh is the derivative of the loss w.r.t. h, then backpropagation computes (in matlab notation): a) the derivatives dL/dx = W' (dL/dh .* 1(h)), where 1(h) = 1 if h>0 o/w it is 0 b) the derivatives dL/dW = (dL/dh .* 1(h)) x' and dL/db = dL/dh .* 1(h) c) both the derivatives in a) and b) Ranzato A convolutional network: a) is well defined only when layers are everywhere differentiable b) requires unsupervised learning even if the labeled dataset is very large c) has usually many convolution/pooling/normalization layers followed by fully connected layers Sparse coding: a) aims at extracting sparse features that are good at reconstructing the input and it does not need labels for training; the dimensionality of the feature vector is usually higher than the input. b) requires to have lots of images with their corresponding labels c) can be interpreted as a graphical model, and inference of the latent variables (sparse codes) is done by marginalization.
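To make the ReLU question above concrete, here is a small NumPy sketch that implements the derivatives listed in options a) and b) and checks the input gradient against finite differences; the shapes and random data are arbitrary, and the "loss" is simply the fixed linearization dL/dh · h so that dL/dh stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 3
x, b = rng.standard_normal(D), rng.standard_normal(N)
W = rng.standard_normal((N, D))
dL_dh = rng.standard_normal(N)           # pretend gradient coming from the loss

def layer(x, W, b):
    return np.maximum(0.0, W @ x + b)    # h = max(0, W x + b)

h = layer(x, W, b)
gate = (h > 0).astype(float)             # 1(h) in the question
dL_dx = W.T @ (dL_dh * gate)             # option a)
dL_dW = np.outer(dL_dh * gate, x)        # option b)
dL_db = dL_dh * gate                     # option b)

# finite-difference check of dL/dx for L = dL_dh . h
eps = 1e-6
num = np.array([(dL_dh @ (layer(x + eps * e, W, b) - layer(x - eps * e, W, b))) / (2 * eps)
                for e in np.eye(D)])
print(np.allclose(num, dL_dx, atol=1e-5))   # True: a) checks out, and b) follows by the same chain rule
```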