5. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
How to teach a machine?
[Figure: classical pipeline: image → edges (or any other hand-crafted features) → classifier → 'Person']
6. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
How to teach a machine?
[Figure: the same pipeline: image → edges (or any other hand-crafted features) → classifier → 'Person', annotated: not a good representation]
7. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
What is deep learning?
• Representation learning method: learning good features automatically from raw data
• Learning representations of data with multiple levels of abstraction
[Image: Google's cat detection neural network]
8. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Construction of higher levels of abstraction
[Figure: a single artificial neuron: inputs weighted by w1, w2, w3, plus a bias b (on a constant input 1), followed by a "non-linear" transformation, i.e. y = f(w1*x1 + w2*x2 + w3*x3 + b)]
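As a minimal sketch (not from the slides), the neuron in the figure fits in a few lines of NumPy; the weight values and the tanh non-linearity are illustrative choices:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial unit: weighted sum of the inputs plus bias,
    passed through a non-linear transformation (tanh here)."""
    return np.tanh(np.dot(w, x) + b)

# illustrative values, not from the slide
x = np.array([0.5, -1.2, 0.3])   # inputs x1, x2, x3
w = np.array([0.8, 0.1, -0.4])   # weights w1, w2, w3
b = 0.2                          # bias
print(neuron(x, w, b))
```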
9. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Going deeper in the network
Input: 'Pixels'
1st and 2nd Layers: 'Edges'
3rd Layer: 'Object Parts'
4th Layer: 'Objects'
[Figure: layer-wise feature visualizations reproduced from Lee et al., "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations", ICML'09: the learned first-layer bases are oriented, localized edge filters, while higher-layer bases combine them into object parts and whole objects]
10. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Deep Learning Methods
Unsupervised Methods
• Restricted Boltzmann Machines
• Deep Belief Networks
• Autoencoders: unsupervised feature extraction/learning
[Figure: autoencoder (encode → decode)]
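A minimal sketch of the encode/decode idea, assuming PyTorch; the layer sizes and the reconstruction loss are illustrative choices:

```python
import torch
import torch.nn as nn

# Autoencoder sketch: compress the input to a small code (encode)
# and reconstruct the input from it (decode). Sizes are illustrative.
class AutoEncoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_in, n_code), nn.ReLU())
        self.decode = nn.Sequential(nn.Linear(n_code, n_in), nn.Sigmoid())

    def forward(self, x):
        return self.decode(self.encode(x))

model = AutoEncoder()
x = torch.rand(8, 784)                 # a batch of flattened images
loss = ((model(x) - x) ** 2).mean()    # reconstruction error to minimize
```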
11. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Deep Learning Methods
Supervised Methods
• Deep Neural Networks
• Recurrent Neural Networks
• Convolutional Neural Networks
[Figure: image captioning pipeline: Vision (Deep CNN) feeding Language (Generating RNN); example caption: "A group of people shopping at an outdoor market. There are many vegetables at the fruit stand."]
12. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
How to train a deep network?
Stochastic Gradient Descent (supervised learning):
• show the input vectors for a few examples (a mini-batch)
• compute the outputs and the errors
• compute the average gradient over the batch
• update the weights accordingly
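As a minimal sketch of this loop (not from the slides), here is mini-batch SGD for a linear model with squared error in NumPy; all names and values are illustrative:

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.1):
    """One SGD step for a linear model with squared error."""
    pred = X_batch @ w                       # compute the outputs
    err = pred - y_batch                     # ... and the errors
    grad = X_batch.T @ err / len(y_batch)    # average gradient over the batch
    return w - lr * grad                     # update the weights

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
for i in range(0, 100, 10):                  # mini-batches of 10 examples
    w = sgd_step(w, X[i:i+10], y[i:i+10])
```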
13. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Convolutional Neural Networks
• CNNs are designed to process data in the form of multiple arrays (e.g. 2D images, 3D video/volumetric images)
• The typical architecture is composed of a series of stages: convolutional layers and pooling layers
• Each unit is connected to local patches in the feature maps of the previous layer
[Figure: example network: conv1 → pool1 → conv2 → pool2 → fully connected layers; feature maps shrink from 15x54 to 8x27 to 4x14 while the number of channels grows from 20 to 50]
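A minimal sketch of such a stage-wise architecture, assuming PyTorch; the channel counts (20 and 50) follow the figure above, while kernel sizes and the input resolution are illustrative:

```python
import torch
import torch.nn as nn

# Alternating convolution and pooling stages, then fully connected layers.
net = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5),   # conv1: local patches, shared weights
    nn.MaxPool2d(2),                   # pool1: downsample the feature maps
    nn.Conv2d(20, 50, kernel_size=5),  # conv2
    nn.MaxPool2d(2),                   # pool2
    nn.Flatten(),
    nn.LazyLinear(500), nn.ReLU(),     # fully connected layer
    nn.LazyLinear(10),                 # class scores
)
scores = net(torch.rand(1, 1, 28, 28))  # e.g. one 28x28 grayscale image
```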
14. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Key Idea behind Convolutional Networks
Convolutional networks take advantage of the properties of natural signals:
• local connections
15. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Key Idea behind Convolutional Networks
Convolutional networks take advantage of the properties of natural signals:
• local connections
• shared weights
16. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Key Idea behind Convolutional Networks
Convolutional networks take advantage of the properties of natural signals:
• local connections
• shared weights
• pooling
17. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Key Idea behind Convolutional Networks
Convolutional networks take advantage of the properties of natural signals:
• local connections
• shared weights
• pooling
• the use of many layers
[Figure: deep network classifying an image as 'Person']
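To make "local connections" and "shared weights" concrete, here is a small sketch (not from the slides) of one 3x3 filter slid over an image; every output unit looks only at a local patch, and all positions reuse the same nine weights:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: each output unit sees only a local patch
    (local connections) and every patch uses the same kernel
    (shared weights)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge_filter = np.array([[1, 0, -1]] * 3, dtype=float)  # crude vertical-edge filter
response = conv2d(np.random.rand(8, 8), edge_filter)
```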
18. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Pros & Cons
Pros:
• Best-performing method in many computer vision tasks
• No need for hand-crafted features
• Most applicable method for large-scale problems, e.g. classification over 1000 classes
• Easy parallelization on GPUs
Cons:
• Needs a huge amount of training data
• Hard to train (local minima, hyper-parameter tuning)
• Difficult to analyse (still an open problem)
20. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
Handwritten Digit Recognition
21. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)
[Figure 4 (left) from the AlexNet paper: eight ILSVRC-2010 test images and the five labels considered most probable by the model; the correct label is written under each image, together with the probability assigned to it]
22. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
FlowNet: Learning Optical Flow with Convolutional Networks
in collaboration with the University of Freiburg (lmb.informatik.uni-freiburg.de)
23. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
FlowNet: Learning Optical Flow with Convolutional Networks
24. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
FlowNet: Learning Optical Flow with Convolutional Networks
[Figure: the two FlowNet architectures. FlowNetSimple stacks the two input images (384 x 512, 6 channels) and processes them in a single contracting stream of convolutions (conv1 with a 7x7 kernel, conv2 with 5x5, then conv3...conv6 with 3x3 kernels and channels growing through 64, 128, 256, 512 up to 1024), followed by a refinement network that upsamples back to the flow prediction. FlowNetCorr instead runs the two images through separate conv1-conv3 streams, fuses them with a correlation layer (corr, 441 channels) plus a 1x1 conv_redir shortcut (32 channels, 473 after concatenation), and continues with conv3_1...conv6 and the same refinement and prediction stages.]
25. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
FlowNet: Learning Optical Flow with Convolutional Networks
[Figure: close-up of the FlowNetCorr fusion. The two conv3 feature streams (256 channels each) are compared by the correlation layer (corr, 441 output channels); a 1x1 conv_redir shortcut keeps 32 channels of the first stream; the concatenated 473 channels feed conv3_1 and the subsequent conv4, conv4_1, conv5 layers.]
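The correlation layer is the distinctive ingredient of FlowNetCorr: instead of learning how to match the two images from scratch, it explicitly compares feature vectors at displaced positions. Below is a naive single-pixel-patch sketch of the idea in NumPy; the real layer works on patches with striding and is heavily optimized, so treat this only as an illustration:

```python
import numpy as np

def correlation(f1, f2, max_disp=10):
    """Naive correlation sketch: for every displacement (dy, dx) within
    max_disp, take the dot product of feature vectors from the two maps.
    Output has (2*max_disp+1)**2 channels (441 for max_disp=10, matching
    the figure). f1, f2: feature maps of shape (C, H, W)."""
    C, H, W = f1.shape
    D = 2 * max_disp + 1
    out = np.zeros((D * D, H, W))
    f2_pad = np.pad(f2, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    for k, (dy, dx) in enumerate((dy, dx) for dy in range(D) for dx in range(D)):
        shifted = f2_pad[:, dy:dy + H, dx:dx + W]
        out[k] = (f1 * shifted).sum(axis=0) / C  # normalized dot product
    return out

feats = correlation(np.random.rand(256, 48, 64), np.random.rand(256, 48, 64))
print(feats.shape)  # (441, 48, 64)
```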
26. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
FlowNet: Learning Optical Flow with Convolutional Networks
27. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
From Image to Caption
[Figure 3, "From image to text": captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolutional neural network (CNN) from a test image, with the RNN trained to 'translate' high-level representations of images into captions (top). When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), it exploits this to achieve better 'translation' of images into captions. Example captions: "A woman is throwing a frisbee in a park." / "A little girl sitting on a bed with a teddy bear." / "A group of people sitting on a boat in the water." / "A giraffe standing in a forest with trees in the background." / "A dog is standing on a hardwood floor." / "A stop sign is on a road with a mountain in the background."]
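To make the Vision → Language pairing concrete, here is a structural sketch assuming PyTorch; the tiny stand-in CNN, the GRU decoder, and the greedy word-by-word loop are illustrative choices, not the reviewed system:

```python
import torch
import torch.nn as nn

# Encoder-decoder captioning sketch: a CNN summarizes the image into a
# feature vector, which initializes an RNN that emits one word per step.
class Captioner(nn.Module):
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(              # stand-in "Deep CNN" encoder
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRUCell(dim, dim)        # "Generating RNN" decoder
        self.out = nn.Linear(dim, vocab)

    def forward(self, image, max_len=10):
        h = self.cnn(image)                    # image representation = initial state
        word = torch.zeros(image.size(0), dtype=torch.long)  # <start> token (id 0)
        caption = []
        for _ in range(max_len):
            h = self.rnn(self.embed(word), h)
            word = self.out(h).argmax(dim=1)   # greedy choice of the next word
            caption.append(word)
        return torch.stack(caption, dim=1)

ids = Captioner()(torch.rand(1, 3, 64, 64))    # sequence of word ids
```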
28. Deep Learning in Computer Vision
Questions?
End of Presentation
Caner Hazırbaş | hazirbas@cs.tum.edu
29. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
References
• Building High-level Features Using Large Scale Unsupervised Learning. Quoc V. Le, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng. ICML'12.
• Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng. ICML'09.
• ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. NIPS'12.
• Gradient-Based Learning Applied to Document Recognition. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Proceedings of the IEEE, 1998.
• FlowNet: Learning Optical Flow with Convolutional Networks. Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox.
30. Deep Learning in Computer Vision | Caner Hazırbaş | vision.in.tum.de
References
• Google's cat detection neural network: http://www.resnap.com/image-selection-technology/deep-learning-image-classification/
• Example autoencoder: http://nghiaho.com/?p=1765
• SGD: http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/