Lecture by Xavier Giro-i-Nieto (UPC) at the Master in Computer Vision Barcelona (March 30, 2016).
http://pagines.uab.cat/mcv/
This lecture provides an overview of computer vision analysis of images at a global scale using deep learning techniques. The session is structured in two blocks: the first addressing end-to-end learning, and the second focusing on applications that use off-the-shelf features.
Please submit your feedback as comments on the GDrive source slides:
https://docs.google.com/presentation/d/1ms9Fczkep__9pMCjxtVr41OINMklcHWc74kwANj7KKI/edit?usp=sharing
11. Outline for this session in M5...
Part I: End-to-end learning (E2E): Learned Representation → Task A (e.g. image classification)
Part II: Off-the-shelf features: Learned Representation → Task B (e.g. image retrieval)
12. E2E: Classification: LeNet-5
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
13. E2E: Classification: LeNet-5
Demo: 3D Visualization of a Convolutional Neural Network
Harley, Adam W. "An Interactive Node-Link Visualization of Convolutional Neural Networks." In Advances in Visual Computing, pp. 867-877. Springer International Publishing, 2015.
14. E2E: Classification: Similar to LeNet-5
Demo: Classify MNIST digits with a Convolutional Neural Network
“ConvNetJS is a Javascript library for training Deep Learning models (mainly Neural Networks) entirely in your browser. Open a tab and you're training. No software requirements, no compilers, no installations, no GPUs, no sweat.”
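The digit classifiers in these demos reduce to stacks of convolutions and non-linearities. As a rough illustration (toy image and hand-picked edge filter of my own choosing, not trained weights), a single convolution + ReLU stage in NumPy:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution (strictly, cross-correlation, as in most
    deep learning libraries), returning an (H-kh+1, W-kw+1) feature map."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# A toy 6x6 "image" and a 3x3 vertical-edge filter.
img = np.arange(36, dtype=float).reshape(6, 6)
k = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])
fmap = relu(conv2d_valid(img, k))
print(fmap.shape)  # (4, 4)
```

A real network stacks several such stages (with pooling in between) and learns the kernels by back-propagation instead of fixing them by hand.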
15. E2E: Classification: Databases
Li Fei-Fei, “How we’re teaching computers to understand pictures,” TED Talks 2014.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]
17. E2E: Classification: Databases
● 205 scene classes (categories).
● Images:
○ 2.5M train
○ 20.5k validation
○ 41k test
Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using places database." In Advances in Neural Information Processing Systems, pp. 487-495. 2014. [web]
21. E2E: Classification: AlexNet (SuperVision)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
Krizhevsky, A., Sutskever, I., & Hinton, G. E. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).
22. E2E: Classification: AlexNet (SuperVision)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
23. E2E: Classification: AlexNet (SuperVision)
Slide credit: Junting Pan, “Visual Saliency Prediction using Deep Learning Techniques” (ETSETB-UPC 2015)
30. E2E: Classification: Visualize: ZF
The development of better convnets is reduced to trial-and-error. Visualization can help in proposing better architectures.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing.
31. E2E: Classification: Visualize: ZF
“A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite.”
Zeiler, Matthew D., Graham W. Taylor, and Rob Fergus. "Adaptive deconvolutional networks for mid and high level feature learning." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
32. E2E: Classification: Visualize: ZF
Figure: a DeconvNet layer attached to a ConvNet layer.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing.
36. E2E: Classification: Visualize: ZF
“To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer.”
38. E2E: Classification: Visualize: ZF
“(i) Unpool: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables.”
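The switch mechanism can be sketched in NumPy; the 4x4 input and the function names below are illustrative, not taken from the paper's code:

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """2x2 max pooling that records the location (switch) of each maximum."""
    H, W = x.shape
    pooled = np.zeros((H // size, W // size))
    switches = {}
    for i in range(0, H, size):
        for j in range(0, W, size):
            window = x[i:i+size, j:j+size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i // size, j // size] = window[r, c]
            switches[(i // size, j // size)] = (i + r, j + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Approximate inverse: route each pooled value back to the position of
    its recorded maximum; every other position stays zero."""
    out = np.zeros(shape)
    for (pi, pj), (i, j) in switches.items():
        out[i, j] = pooled[pi, pj]
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 0., 5., 6.],
              [1., 2., 0., 0.]])
p, s = max_pool_with_switches(x)
u = unpool(p, s, x.shape)
```

The reconstruction is only approximate: non-maximal values inside each pooling window are lost, but the maxima return to their original locations.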
39. E2E: Classification: Visualize: ZF
“(ii) Rectification: The convnet uses ReLU non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive.”
40. E2E: Classification: Visualize: ZF
“(iii) Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.”
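A shape-level sketch of this filtering step (illustrative NumPy, not the authors' implementation): the deconvnet correlates with the flipped filter using full zero padding, which maps a feature map back to the spatial size of the layer input:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_valid(x, k):
    """'valid' cross-correlation, as in the convnet forward pass."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def deconv_filter(y, k):
    """Deconvnet filtering step: correlate with the filter flipped both
    vertically and horizontally, with full zero padding, so the output
    recovers the spatial size of the layer input."""
    kh, kw = k.shape
    y_pad = np.pad(y, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    return conv_valid(y_pad, k[::-1, ::-1])

x = rng.random((6, 6))
k = rng.random((3, 3))
y = conv_valid(x, k)          # (4, 4) feature map
x_back = deconv_filter(y, k)  # (6, 6): back to the input's spatial size
```

This is the same operation used to back-propagate gradients through a convolutional layer; the deconvnet reuses it to project activations down to pixel space.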
42. E2E: Classification: Visualize: ZF
Top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Corresponding image patches.
46. E2E: Classification: Visualize: ZF
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833). Springer International Publishing.
47. E2E: Classification: Visualize: ZF
AlexNet (Layer 1) vs. Clarifai (Layer 1): the smaller stride (2 vs. 4) and filter size (7x7 vs. 11x11) result in more distinctive features and fewer “dead” features.
48. E2E: Classification: Visualize: ZF
AlexNet (Layer 2) vs. Clarifai (Layer 2): cleaner features in Clarifai, without the aliasing artifacts caused by the stride of 4 used in AlexNet.
49. E2E: Classification: Dropout: ZF
Regularization with dropout: reduction of overfitting by setting to zero the output of a portion (typically 50%) of each intermediate neuron.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
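As a sketch, the now-common "inverted" formulation of dropout, which rescales the surviving activations at training time so that nothing changes at test time (the original paper instead halves the weights at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, train=True):
    """'Inverted' dropout: at training time, zero a fraction p_drop of the
    activations and rescale the survivors by 1/(1 - p_drop), so the layer
    needs no change at test time."""
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(10000)
d = dropout(a, p_drop=0.5)
# Roughly half the outputs are zero; the expected value is preserved.
```

Because each neuron can be dropped, no neuron can rely on the presence of any particular other neuron, which is the co-adaptation the paper aims to prevent.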
56. E2E: Classification: GoogLeNet
● 22 layers, but 12 times fewer parameters than AlexNet.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." CVPR 2015.
57. E2E: Classification: GoogLeNet
● Challenges of going deeper:
○ Overfitting, due to the increased number of parameters.
○ Inefficient computation if most weights end up close to zero.
Solution: sparsity. How? Inception modules.
61. E2E: Classification: GoogLeNet (NiN)
3x3 and 5x5 convolutions deal with different scales.
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
62. E2E: Classification: GoogLeNet (NiN)
1x1 convolutions perform dimensionality reduction (c3 < c2) and include rectified linear units (ReLU).
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
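A 1x1 convolution is just a per-position linear map across channels. A NumPy sketch with illustrative sizes (c2 = 64 input channels, c3 = 16 output channels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature maps of shape (channels, height, width): c2 input channels.
c2, c3, H, W = 64, 16, 8, 8
fmaps = rng.random((c2, H, W))

# A 1x1 convolution is a (c3, c2) matrix applied at every spatial position,
# reducing the channel dimension from c2 to c3 (c3 < c2).
w = rng.random((c3, c2))
reduced = np.maximum(np.einsum('oc,chw->ohw', w, fmaps), 0.0)  # + ReLU

print(fmaps.shape, '->', reduced.shape)  # (64, 8, 8) -> (16, 8, 8)
```

In the Inception module this cheap reduction is placed before the expensive 3x3 and 5x5 convolutions, which is what keeps the parameter count low.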
63. E2E: Classification: GoogLeNet (NiN)
In NiN, the cascaded 1x1 convolutions compute reductions after the convolutions.
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014. [Slides]
66. E2E: Classification: GoogLeNet
3x3 max pooling introduces some spatial invariance, and adding it as an alternative parallel path has proven beneficial.
67. E2E: Classification: GoogLeNet
Two softmax classifiers at intermediate layers combat the vanishing gradient while providing regularization at training time.
...and no fully connected layers needed!
69. E2E: Classification: GoogLeNet
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." CVPR 2015. [video] [slides] [poster]
70. E2E: Classification: VGG
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." International Conference on Learning Representations (2015). [video] [slides] [project]
71. E2E: Classification: VGG
72. E2E: Classification: VGG: 3x3 Stacks
73. E2E: Classification: VGG
● No poolings between some convolutional layers.
● Convolution strides of 1 (no skipping).
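Why stacks of 3x3 filters: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 layer, with fewer parameters and an extra non-linearity in between. A quick check (assuming C input and output channels and ignoring biases):

```python
# Two stacked 3x3 convolutions see a 5x5 receptive field (three see 7x7),
# but need fewer parameters than a single large filter.
C = 256
params_5x5 = 5 * 5 * C * C           # one 5x5 layer
params_3x3_x2 = 2 * (3 * 3 * C * C)  # two 3x3 layers, same receptive field
print(params_3x3_x2 / params_5x5)    # 0.72: a 28% parameter saving

def receptive_field(num_3x3_layers):
    """Receptive field of a stack of stride-1 3x3 convolutions."""
    return 1 + 2 * num_3x3_layers
```

The same arithmetic gives three 3x3 layers for a 7x7 receptive field, at 27/49 of the parameters of a single 7x7 layer.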
80. E2E: Classification: Humans
● Binary ground truth annotation from the crowd (crowdsourcing): “Is this a Border terrier?” Yes / No.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]
81. E2E: Classification: Humans
● Annotation problems: crowdsource loss (0.3%); more than 5 object classes.
Carlier, Axel, Amaia Salvador, Ferran Cabezas, Xavier Giro-i-Nieto, Vincent Charvillat, and Oge Marques. "Assessment of crowdsourcing and gamification loss in user-assisted object segmentation." Multimedia Tools and Applications (2015): 1-28.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. [web]
82. E2E: Classification: Humans
● Test data collection from one human. [interface]
Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)
83. E2E: Classification: Humans
● Test data collection from one human. [interface]
“Aww, a cute dog! Would you like to spend 5 minutes scrolling through 120 breeds of dog to guess what species it is?”
Andrej Karpathy, “What I learned from competing against a computer on ImageNet” (2014)
102. E2E: Saliency: JuntingNet
Loss function: Mean Square Error (MSE)
Weight initialization: Gaussian distribution
Learning rate: 0.03 to 0.0001
Mini-batch size: 128
Training time: 7h (SALICON) / 4h (iSUN)
Acceleration: SGD + Nesterov momentum (0.9)
Regularization: Maxout norm
GPU: NVIDIA GTX 980
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional Networks for Saliency Prediction." CVPR 2016.
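The "SGD + Nesterov momentum" entry above can be sketched as a simple update rule. A toy quadratic loss stands in for the network's MSE objective; the learning rate and momentum values are taken from the table:

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.03, momentum=0.9):
    """One SGD step with Nesterov momentum: the gradient is evaluated at
    the look-ahead point w + momentum * v, then the velocity and the
    weights are updated."""
    grad = grad_fn(w + momentum * v)
    v = momentum * v - lr * grad
    return w + v, v

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
grad_fn = lambda w: w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = nesterov_step(w, v, grad_fn)
print(np.linalg.norm(w))  # close to 0: the iterates reach the minimum
```

The look-ahead gradient is what distinguishes Nesterov momentum from classical momentum, where the gradient would be taken at the current point `w`.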
103. E2E: Saliency: JuntingNet
● Back-propagation with the Euclidean distance.
● Training curve for the SALICON database (loss vs. number of iterations / training time).
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel O'Connor, and Xavier Giro-i-Nieto. "Shallow and Deep Convolutional Networks for Saliency Prediction." CVPR 2016.
104. E2E: Saliency: JuntingNet: iSUN
Qualitative examples: Pixels / Ground Truth / JuntingNet.
105. E2E: Saliency: JuntingNet: iSUN
Qualitative examples: Pixels / Ground Truth / JuntingNet.
106. E2E: Saliency: JuntingNet: iSUN
Results from the CVPR LSUN Challenge 2015.
107. E2E: Saliency: JuntingNet: SALICON
Qualitative examples: Pixels / Ground Truth / JuntingNet.
108. E2E: Saliency: JuntingNet: SALICON
Qualitative examples: Pixels / Ground Truth / JuntingNet.
109. E2E: Saliency: JuntingNet: SALICON
Results from the CVPR LSUN Challenge 2015.
110. E2E: Saliency: JuntingNet
111. Outline for this session in M5...
Part I: End-to-end learning (E2E): Learned Representation for Domain A.
Part I': End-to-End Fine-Tuning (FT): the Learned Representation is transferred to Domain B and fine-tuned.
112. E2E: Fine-tuning
Fine-tuning a pre-trained network.
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
113. E2E: Fine-tuning
Fine-tuning a pre-trained network.
Slide credit: Victor Campos, “Layer-wise CNN surgery for Visual Sentiment Prediction” (ETSETB 2015)
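A minimal sketch of the fine-tuning recipe (hypothetical layer names and sizes, not any framework's API): freeze the early, generic layers and update only the re-initialized task-specific head, typically with a small learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer "network": fine-tuning freezes the early, generic
# layers (learning rate 0) and re-learns only the task-specific head.
layers = {
    'conv1':  {'w': rng.random((8, 3)), 'lr': 0.0},    # frozen, pre-trained
    'fc_new': {'w': rng.random((2, 8)), 'lr': 0.001},  # re-initialized head
}

def apply_grads(layers, grads):
    """One SGD step with a per-layer learning rate (0 = frozen)."""
    for name, layer in layers.items():
        layer['w'] -= layer['lr'] * grads[name]

grads = {name: np.ones_like(l['w']) for name, l in layers.items()}
before_frozen = layers['conv1']['w'].copy()
before_head = layers['fc_new']['w'].copy()
apply_grads(layers, grads)
```

"Layer-wise CNN surgery" amounts to choosing how many layers keep a zero (or small) learning rate versus how many are re-learned for the new domain.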
114. E2E: Fine-tuning: Sentiments
Campos, Victor, Amaia Salvador, Xavier Giro-i-Nieto, and Brendan Jou. "Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction." In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 57-62. ACM, 2015.
115. E2E: Fine-tuning: Sentiments
Visualizations with fully convolutional networks: true positive, true negative, false positive, false negative.
Campos, Victor, Xavier Giro-i-Nieto, and Brendan Jou. “From pixels to sentiments” (submitted).
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation." CVPR 2015.
116. E2E: Fine-tuning: Cultural events
ChaLearn Workshop
A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X. “Cultural Event Recognition with Visual ConvNets and Temporal Models,” CVPR ChaLearn Looking at People Workshop 2015. [slides]
117. Outline for this session in M5...
Part I: End-to-end learning (E2E): Learned Representation → Task A (e.g. image classification)
Part II: Off-The-Shelf features (OTS): Learned Representation → Task B (e.g. image retrieval)
118. Off-The-Shelf (OTS) Features
Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." CVPRW 2014.
119. Off-The-Shelf (OTS) Features
● Intermediate features can be used as regular visual descriptors for any task.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
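A sketch of using such descriptors for retrieval (random vectors stand in for, e.g., flattened FC7 activations of a pretrained network): L2-normalize, then rank by dot product, which equals cosine similarity on unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for off-the-shelf descriptors: one row per database image
# (e.g. FC7 activations of a pretrained network).
db = rng.random((100, 4096))
query = rng.random(4096)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# After L2 normalization, the dot product is the cosine similarity, so
# retrieval is one matrix-vector product plus a sort.
db_n = l2_normalize(db)
scores = db_n @ l2_normalize(query)
ranking = np.argsort(-scores)  # best matches first
```

No training on the target task is involved: the network only acts as a fixed feature extractor, which is the "off-the-shelf" idea.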
120. OTS: Classification: Razavian
Pascal VOC 2007.
Razavian, Ali, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. "CNN features off-the-shelf: an astounding baseline for recognition." CVPRW 2014.
121. OTS: Classification: Return of devil
L2-normalization of the features before the classifier: accuracy +5%.
Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. "Return of the devil in the details: Delving deep into convolutional nets." BMVC 2014.
122. OTS: Classification: Return of devil
Three representative architectures considered: AlexNet, ZF, OverFeat.
Training time: 5 days (fast) to 3 weeks (slow) on an NVIDIA GTX Titan GPU.
123. OTS: Classification: Return of devil
Data augmentation.
125. OTS: Classification: Return of devil
Color vs. gray scale (GS): accuracy -2.5%.
126. OTS: Classification: Return of devil
Dimensionality reduction by retraining the last layer to smaller sizes: descriptors x32 smaller for a -2% drop in accuracy.
127. OTS: Retrieval
Ranking
Summary of the paper by Amaia Salvador on Bitsearch.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 584-599.
129. OTS: Retrieval: FC layers
Pooled from the network of Krizhevsky et al., pretrained with images from ImageNet.
Summary of the paper by Amaia Salvador on Bitsearch.
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
130. OTS: Retrieval: FC layers
Off-the-shelf CNN descriptors from fully connected layers prove useful, but not superior to hand-crafted encodings (FV, VLAD, sparse coding, ...).
Babenko, Artem, et al. "Neural codes for image retrieval." Computer Vision–ECCV 2014.
131. OTS: Retrieval: Conv layers
Convolutional layers have shown better performance than fully connected ones.
Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks," ICLR 2015.
132. OTS: Retrieval: Conv layers
Spatial search (extracting N local descriptors from predefined locations) increases performance at a computational cost.
Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks," ICLR 2015.
133. OTS: Retrieval: Conv layers
Medium memory footprints.
Razavian et al., "A baseline for visual instance retrieval with deep convolutional networks," ICLR 2015.
134. OTS: Retrieval: Bag of Words (BoW)
Instance retrieval (instance: object, building, person, place, ...).
Mohedano et al., "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016.
135. OTS: Retrieval: Bag of Words (BoW)
An assignment map of BoW-quantized features can be defined over the spatial locations of the convolutional feature maps.
Mohedano et al., "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016.
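The assignment map can be sketched as follows; the features and codebook here are random placeholders for illustration (in practice the codebook is learned, typically with k-means, over many training descriptors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Convolutional feature maps: each of the H*W spatial locations carries a
# C-dimensional local descriptor (illustrative sizes).
C, H, W = 32, 10, 10
fmaps = rng.random((C, H, W))
local_feats = fmaps.reshape(C, H * W).T  # (H*W, C) local descriptors

# A visual codebook of K centroids (random here, learned in practice).
K = 25
codebook = rng.random((K, C))

# Assignment map: the id of the nearest centroid at every spatial location.
dists = np.linalg.norm(local_feats[:, None, :] - codebook[None, :, :], axis=2)
assignment_map = dists.argmin(axis=1).reshape(H, W)

# BoW encoding of the image: a histogram of the assignments.
bow = np.bincount(assignment_map.ravel(), minlength=K)
```

Because the assignment map keeps the spatial layout of the feature maps, spatial reranking (e.g. over image regions) comes almost for free by histogramming only a sub-window of the map.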
136. OTS: Retrieval: Bag of Words (BoW)
Mohedano et al., "Bags of Local Convolutional Features for Scalable Instance Search," ICMR 2016.
137. OTS: Summarization
Clustering based on Euclidean distance over FC7 features from AlexNet.
Bolaños M, Mestre R, Talavera E, Giró-i-Nieto X, Radeva P. "Visual Summary of Egocentric Photostreams by Representative Keyframes." In: IEEE International Workshop on Wearable and Ego-vision Systems for Augmented Experience (WEsAX) 2015. Turin, Italy: 2015.
138. OTS: Reinforcement learning: Deep Q
(Google) DeepMind
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
139. OTS: Reinforcement learning: Deep Q
Demo: Deep Q learning Demo
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
140. OTS: Reinforcement learning
The object is localized based on visual features from AlexNet FC6.
Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." ICCV 2015. [Slides by Miriam Bellver]
Other: Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
141. ConvNets: Learn more
The New York Times: “The Race Is On to Control Artificial Intelligence, and Tech’s Future” (25/03/2016)
145. ConvNets: Learn more
● Reading Group on Tuesdays (UPC) and Wednesdays (UB) at 11am. Schedule, list of papers, slides & videos: https://github.com/imatge-upc/readcv
Open to MCV students. E-mail me at xavier.giro@upc.edu before attending :)
146. ConvNets: Learn more
● Summer Course “Deep learning for computer vision” (4-8 July 2016). (Very) tentative program at: https://github.com/imatge-upc/etsetb-2016-dlcv
E-mail me at xavier.giro@upc.edu for a free spot (subject to availability).
147. ConvNets: Learn more
● Broader and more friendly slides for dissemination. [Available on slideshare.net]