Lecture by Xavier Giro-i-Nieto (UPC) at the Master in Computer Vision Barcelona (March 30, 2016).
http://pagines.uab.cat/mcv/
This lecture provides an overview of computer vision analysis of images at a global scale using deep learning techniques. The session is structured in two blocks: the first addressing end-to-end learning, and the second focusing on applications that use off-the-shelf features.
Please submit your feedback as comments on the GDrive source slides:
https://docs.google.com/presentation/d/1ms9Fczkep__9pMCjxtVr41OINMklcHWc74kwANj7KKI/edit?usp=sharing
11. Outline for this session in M5...
Part I: End-to-end learning (E2E): Learned Representation → Task A (e.g. image classification)
Part II: Off-the-shelf features: Learned Representation → Task B (e.g. image retrieval)
12. E2E: Classification: LeNet-5
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
13. E2E: Classification: LeNet-5
Demo: 3D Visualization of a Convolutional Neural Network
Harley, Adam W. "An Interactive Node-Link Visualization of Convolutional Neural Networks." In Advances in Visual Computing, pp. 867-877. Springer International Publishing, 2015.
14. E2E: Classification: Similar to LeNet-5
Demo: Classify MNIST digits with a Convolutional Neural Network
“ConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser. Open a tab and you're training. No software requirements, no compilers, no installations, no GPUs, no sweat.”
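The digit classifiers in these demos reduce to stacks of convolutions and non-linearities. As a rough illustration (toy image and hand-picked edge filter of my own choosing, not trained weights), a single convolution + ReLU stage in NumPy:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution (strictly, cross-correlation, as in most
    deep learning libraries), returning an (H-kh+1, W-kw+1) feature map."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# A toy 6x6 "image" and a 3x3 vertical-edge filter.
img = np.arange(36, dtype=float).reshape(6, 6)
k = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])
fmap = relu(conv2d_valid(img, k))
print(fmap.shape)  # (4, 4)
```

A real network stacks several such stages (with pooling in between) and learns the kernels by back-propagation instead of fixing them by hand.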
15. E2E: Classification: Databases
Li Fei-Fei, “How we’re teaching computers to understand pictures,” TED Talks 2014.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]
17. E2E: Classification: Databases
● 205 scene classes (categories).
● Images:
○ 2.5M train
○ 20.5k validation
○ 41k test
Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using places database." In Advances in Neural Information Processing Systems, pp. 487-495. 2014. [web]
21. E2E: Classification: AlexNet (SuperVision)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
Krizhevsky, A., Sutskever, I., & Hinton, G. E. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).
22. E2E: Classification: AlexNet (SuperVision)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
23. E2E: Classification: AlexNet (SuperVision)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
30. E2E: Classification: Visualize: ZF
The development of better convnets is reduced to trial-and-error. Visualization can help in proposing better architectures.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing.
31. E2E: Classification: Visualize: ZF
“A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite.”
Zeiler, Matthew D., Graham W. Taylor, and Rob Fergus. "Adaptive deconvolutional networks for mid and high level feature learning." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
32. E2E: Classification: Visualize: ZF
Figure: a DeconvNet layer attached to a ConvNet layer.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing.
36. E2E: Classification: Visualize: ZF
“To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer.”
38. E2E: Classification: Visualize: ZF
“(i) Unpool: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables.”
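The switch mechanism can be sketched in NumPy; the 4x4 input and the function names below are illustrative, not taken from the paper's code:

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """2x2 max pooling that records the location (switch) of each maximum."""
    H, W = x.shape
    pooled = np.zeros((H // size, W // size))
    switches = {}
    for i in range(0, H, size):
        for j in range(0, W, size):
            window = x[i:i+size, j:j+size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i // size, j // size] = window[r, c]
            switches[(i // size, j // size)] = (i + r, j + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Approximate inverse: route each pooled value back to the position of
    its recorded maximum; every other position stays zero."""
    out = np.zeros(shape)
    for (pi, pj), (i, j) in switches.items():
        out[i, j] = pooled[pi, pj]
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 0., 5., 6.],
              [1., 2., 0., 0.]])
p, s = max_pool_with_switches(x)
u = unpool(p, s, x.shape)
```

The reconstruction is only approximate: non-maximal values inside each pooling window are lost, but the maxima return to their original locations.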
39. E2E: Classification: Visualize: ZF
“(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive.”
40. E2E: Classification: Visualize: ZF
“(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.”
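A shape-level sketch of this filtering step (illustrative NumPy, not the authors' implementation): the deconvnet correlates with the flipped filter using full zero padding, which maps a feature map back to the spatial size of the layer input:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_valid(x, k):
    """'valid' cross-correlation, as in the convnet forward pass."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def deconv_filter(y, k):
    """Deconvnet filtering step: correlate with the filter flipped both
    vertically and horizontally, with full zero padding, so the output
    recovers the spatial size of the layer input."""
    kh, kw = k.shape
    y_pad = np.pad(y, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return conv_valid(y_pad, k[::-1, ::-1])

x = rng.random((6, 6))
k = rng.random((3, 3))
y = conv_valid(x, k)          # (4, 4) feature map
x_back = deconv_filter(y, k)  # (6, 6): back to the input's spatial size
```

This is the same operation used to back-propagate gradients through a convolutional layer; the deconvnet reuses it to project activations down to pixel space.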
42. E2E: Classification: Visualize: ZF
Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Corresponding image patches.
46. E2E: Classification: Visualize: ZF
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing.
47. E2E: Classification: Visualize: ZF
AlexNet (Layer 1) vs. Clarifai (Layer 1): the smaller stride (2 vs. 4) and filter size (7x7 vs. 11x11) result in more distinctive features and fewer “dead” features.
48. E2E: Classification: Visualize: ZF
AlexNet (Layer 2) vs. Clarifai (Layer 2): cleaner features in Clarifai, without the aliasing artifacts caused by the stride of 4 used in AlexNet.
49. E2E: Classification: Dropout: ZF
Regularization with dropout: reduction of overfitting by setting to zero the output of a portion (typically 50%) of each intermediate neuron.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
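As a sketch, the now-common "inverted" formulation of dropout, which rescales the surviving activations at training time so that nothing changes at test time (the original paper instead halves the weights at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, train=True):
    """'Inverted' dropout: at training time, zero a fraction p_drop of the
    activations and rescale the survivors by 1/(1 - p_drop), so the layer
    needs no change at test time."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(10000)
d = dropout(a, p_drop=0.5)
# Roughly half the outputs are zero; the expected value is preserved.
```

Because each neuron can be dropped, no neuron can rely on the presence of any particular other neuron, which is the co-adaptation the paper aims to prevent.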
56. E2E: Classification: GoogLeNet
● 22 layers, but 12 times fewer parameters than AlexNet.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." CVPR 2015.
57. E2E: Classification: GoogLeNet
● Challenges of going deeper:
○ Overfitting, due to the increased number of parameters.
○ Inefficient computation if most weights end up close to zero.
Solution: sparsity. How? Inception modules.
61. E2E: Classification: GoogLeNet (NiN)
3x3 and 5x5 convolutions deal with different scales.
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
62. E2E: Classification: GoogLeNet (NiN)
1x1 convolutions perform dimensionality reduction (c3 < c2) and include rectified linear units (ReLU).
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
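A 1x1 convolution is just a per-position linear map across channels. A NumPy sketch with illustrative sizes (c2 = 64 input channels, c3 = 16 output channels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature maps of shape (channels, height, width): c2 input channels.
c2, c3, H, W = 64, 16, 8, 8
fmaps = rng.random((c2, H, W))

# A 1x1 convolution is a (c3, c2) matrix applied at every spatial position,
# reducing the channel dimension from c2 to c3 (c3 < c2).
w = rng.random((c3, c2))
reduced = np.maximum(np.einsum('oc,chw->ohw', w, fmaps), 0.0)  # + ReLU

print(fmaps.shape, '->', reduced.shape)  # (64, 8, 8) -> (16, 8, 8)
```

In the Inception module this cheap reduction is placed before the expensive 3x3 and 5x5 convolutions, which is what keeps the parameter count low.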
63. E2E: Classification: GoogLeNet (NiN)
In NiN, the cascaded 1x1 convolutions compute reductions after the convolutions.
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
66. E2E: Classification: GoogLeNet
3x3 max pooling introduces some spatial invariance, and adding it as an alternative parallel path has proven beneficial.
67. E2E: Classification: GoogLeNet
Two softmax classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time.
...and no fully connected layers needed!
69. E2E: Classification: GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." CVPR 2015. [video] [slides] [poster]
70. E2E: Classification: VGG
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." International Conference on Learning Representations (2015). [video] [slides] [project]
71. E2E: Classification: VGG
72. E2E: Classification: VGG: 3x3 Stacks
73. E2E: Classification: VGG
● No poolings between some convolutional layers.
● Convolution strides of 1 (no skipping).
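Why stacks of 3x3 filters: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 layer, with fewer parameters and an extra non-linearity in between. A quick check (assuming C input and output channels and ignoring biases):

```python
# Two stacked 3x3 convolutions see a 5x5 receptive field (three see 7x7),
# but need fewer parameters than a single large filter.
C = 256
params_5x5 = 5 * 5 * C * C           # one 5x5 layer
params_3x3_x2 = 2 * (3 * 3 * C * C)  # two 3x3 layers, same receptive field
print(params_3x3_x2 / params_5x5)    # 0.72: a 28% parameter saving

def receptive_field(num_3x3_layers):
    """Receptive field of a stack of stride-1 3x3 convolutions."""
    return 1 + 2 * num_3x3_layers
```

The same arithmetic gives three 3x3 layers for a 7x7 receptive field, at 27/49 of the parameters of a single 7x7 layer.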
80. E2E: Classification: Humans
● Binary ground truth annotation from the crowd (crowdsourcing): “Is this a Border terrier?” Yes / No.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]
81. E2E: Classification: Humans
● Annotation problems: crowdsource loss (0.3%); more than 5 object classes.
Carlier, Axel, Amaia Salvador, Ferran Cabezas, Xavier Giro-i-Nieto, Vincent Charvillat, and Oge Marques. "Assessment of crowdsourcing and gamification loss in user-assisted object segmentation." Multimedia Tools and Applications (2015): 1-28.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]
82. E2E: Classification: Humans
● Test data collection from one human. [interface]
Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)
83. E2E: Classification: Humans
● Test data collection from one human. [interface]
“Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?”
Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)
102. E2E: Saliency: JuntingNet
Loss function: Mean Square Error (MSE)
Weight initialization: Gaussian distribution
Learning rate: 0.03 to 0.0001
Mini-batch size: 128
Training time: 7h (SALICON) / 4h (iSUN)
Acceleration: SGD + Nesterov momentum (0.9)
Regularization: Maxout norm
GPU: NVIDIA GTX 980
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional Networks for Saliency Prediction." CVPR 2016.
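The "SGD + Nesterov momentum" entry above can be sketched as a simple update rule. A toy quadratic loss stands in for the network's MSE objective; the learning rate and momentum values are taken from the table:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.03, momentum=0.9):
    """One SGD step with Nesterov momentum: the gradient is evaluated at
    the look-ahead point w + momentum * v, then the velocity and the
    weights are updated."""
    grad = grad_fn(w + momentum * v)
    v = momentum * v - lr * grad
    return w + v, v

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
grad_fn = lambda w: w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = nesterov_step(w, v, grad_fn)
print(np.linalg.norm(w))  # close to 0: the iterates reach the minimum
```

The look-ahead gradient is what distinguishes Nesterov momentum from classical momentum, where the gradient would be taken at the current point `w`.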
103. E2E: Saliency: JuntingNet
● Back-propagation with the Euclidean distance.
● Training curve for the SALICON database (loss vs. number of iterations / training time).
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional Networks for Saliency Prediction." CVPR 2016.
104. E2E: Saliency: JuntingNet: iSUN
Qualitative examples: Pixels / Ground Truth / JuntingNet.
105. E2E: Saliency: JuntingNet: iSUN
Qualitative examples: Pixels / Ground Truth / JuntingNet.
106. E2E: Saliency: JuntingNet: iSUN
Results from the CVPR LSUN Challenge 2015.
107. E2E: Saliency: JuntingNet: SALICON
Qualitative examples: Pixels / Ground Truth / JuntingNet.
108. E2E: Saliency: JuntingNet: SALICON
Qualitative examples: Pixels / Ground Truth / JuntingNet.
109. E2E: Saliency: JuntingNet: SALICON
Results from the CVPR LSUN Challenge 2015.
110. E2E: Saliency: JuntingNet
111. Outline for this session in M5...
Part I: End-to-end learning (E2E): Learned Representation for Domain A.
Part I': End-to-End Fine-Tuning (FT): the Learned Representation is transferred to Domain B and fine-tuned.
112. E2E: Fine-tuning
Fine-tuning a pre-trained network.
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
113. E2E: Fine-tuning
Fine-tuning a pre-trained network.
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
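A minimal sketch of the fine-tuning recipe (hypothetical layer names and sizes, not any framework's API): freeze the early, generic layers and update only the re-initialized task-specific head, typically with a small learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer "network": fine-tuning freezes the early, generic
# layers (learning rate 0) and re-learns only the task-specific head.
layers = {
    'conv1':  {'w': rng.random((8, 3)), 'lr': 0.0},    # frozen, pre-trained
    'fc_new': {'w': rng.random((2, 8)), 'lr': 0.001},  # re-initialized head
}

def apply_grads(layers, grads):
    """One SGD step with a per-layer learning rate (0 = frozen)."""
    for name, layer in layers.items():
        layer['w'] -= layer['lr'] * grads[name]

grads = {name: np.ones_like(l['w']) for name, l in layers.items()}
before_frozen = layers['conv1']['w'].copy()
before_head = layers['fc_new']['w'].copy()
apply_grads(layers, grads)
```

"Layer-wise CNN surgery" amounts to choosing how many layers keep a zero (or small) learning rate versus how many are re-learned for the new domain.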
114. E2E: Fine-tuning: Sentiments
Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. "Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction." In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM, 2015.
115. E2E: Fine-tuning: Sentiments
Visualizations with fully convolutional networks: true positive, true negative, false positive, false negative.
Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. “From pixels to sentiments” (submitted).
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
116. E2E: Fine-tuning: Cultural events
ChaLearn Workshop
A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X. “Cultural Event Recognition with Visual ConvNets and Temporal Models,” CVPR ChaLearn Looking at People Workshop 2015. [slides]
117. Outline for this session in M5...
Part I: End-to-end learning (E2E): Learned Representation → Task A (e.g. image classification)
Part II: Off-The-Shelf features (OTS): Learned Representation → Task B (e.g. image retrieval)
118. Off-The-Shelf (OTS) Features
Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." CVPRW 2014.
119. Off-The-Shelf (OTS) Features
● Intermediate features can be used as regular visual descriptors for any task.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
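A sketch of using such descriptors for retrieval (random vectors stand in for, e.g., flattened FC7 activations of a pretrained network): L2-normalize, then rank by dot product, which equals cosine similarity on unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for off-the-shelf descriptors: one row per database image
# (e.g. FC7 activations of a pretrained network).
db = rng.random((100, 4096))
query = rng.random(4096)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# After L2 normalization, the dot product is the cosine similarity, so
# retrieval is one matrix-vector product plus a sort.
db_n = l2_normalize(db)
scores = db_n @ l2_normalize(query)
ranking = np.argsort(-scores)  # best matches first
```

No training on the target task is involved: the network only acts as a fixed feature extractor, which is the "off-the-shelf" idea.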
120. OTS: Classification: Razavian
Pascal VOC 2007.
Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." CVPRW 2014.
121. OTS: Classification: Return of devil
L2-normalization of the features before the classifier: accuracy +5%.
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. "Return of the devil in the details: Delving deep into convolutional nets." BMVC 2014.
122. OTS: Classification: Return of devil
Three representative architectures considered: AlexNet, ZF, OverFeat.
Training time: 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU.
123. OTS: Classification: Return of devil
Data augmentation.
125. OTS: Classification: Return of devil
Color vs. gray scale (GS): accuracy -2.5%.
126. OTS: Classification: Return of devil
Dimensionality reduction by retraining the last layer to smaller sizes: descriptors x32 smaller for a -2% drop in accuracy.
127. OTS: Retrieval
Ranking
Summary of the paper by Amaia Salvador on Bitsearch.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599.
129. OTS: Retrieval: FC layers
Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet.
Summary of the paper by Amaia Salvador on Bitsearch.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
130. OTS: Retrieval: FC layers
Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to hand-crafted encodings (FV, VLAD, sparse coding, ...).
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
131. OTS: Retrieval: Conv layers
Convolutional layers have shown better performance than fully connected ones.
Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks," ICLR 2015.
132. OTS: Retrieval: Conv layers
Spatial search (extracting N local descriptors from predefined locations) increases performance at a computational cost.
Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks," ICLR 2015.
133. OTS: Retrieval: Conv layers
Medium memory footprints.
Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks," ICLR 2015.
134. OTS: Retrieval: Bag of Words (BoW)
Instance retrieval (instance: object, building, person, place, ...).
Mohedano et al., "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016.
135. OTS: Retrieval: Bag of Words (BoW)
An assignment map of BoW-quantized features can be defined over the spatial locations of the convolutional feature maps.
Mohedano et al., "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016.
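The assignment map can be sketched as follows; the features and codebook here are random placeholders for illustration (in practice the codebook is learned, typically with k-means, over many training descriptors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Convolutional feature maps: each of the H*W spatial locations carries a
# C-dimensional local descriptor (illustrative sizes).
C, H, W = 32, 10, 10
fmaps = rng.random((C, H, W))
local_feats = fmaps.reshape(C, H * W).T  # (H*W, C) local descriptors

# A visual codebook of K centroids (random here, learned in practice).
K = 25
codebook = rng.random((K, C))

# Assignment map: the id of the nearest centroid at every spatial location.
dists = np.linalg.norm(local_feats[:, None, :] - codebook[None, :, :], axis=2)
assignment_map = dists.argmin(axis=1).reshape(H, W)

# BoW encoding of the image: a histogram of the assignments.
bow = np.bincount(assignment_map.ravel(), minlength=K)
```

Because the assignment map keeps the spatial layout of the feature maps, spatial reranking (e.g. over image regions) comes almost for free by histogramming only a sub-window of the map.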
136. OTS: Retrieval: Bag of Words (BoW)
Mohedano et al., "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016.
137. OTS: Summarization
Clustering based on Euclidean distance over FC7 features from AlexNet.
Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. "Visual Summary of Egocentric Photostreams by Representative Keyframes." In: IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015. Turin, Italy: 2015.
138. OTS: Reinforcement learning: Deep Q
(Google) DeepMind
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
139. OTS: Reinforcement learning: Deep Q
Demo: Deep Q learning Demo
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
140. OTS: Reinforcement learning
The object is localized based on visual features from AlexNet FC6.
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]
Other: Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
141. ConvNets: Learn more
The New York Times: “The Race Is On to Control Artificial Intelligence, and Tech’s Future” (25/03/2016)
145. ConvNets: Learn more
● Reading Group on Tuesdays (UPC) and Wednesdays (UB) at 11am. Schedule, list of papers, slides & videos: https://github.com/imatge-upc/readcv
Open to MCV students. E-mail me at xavier.giro@upc.edu before attending :)
146. ConvNets: Learn more
● Summer Course “Deep learning for computer vision” (4-8 July 2016). (Very) tentative program at: https://github.com/imatge-upc/etsetb-2016-dlcv
E-mail me at xavier.giro@upc.edu for a free spot (subject to availability).
147. ConvNets: Learn more
● Broader and more friendly slides for dissemination. [Available on slideshare.net]