The document summarizes research on introducing visual saliency predictions into image classification networks to improve performance. Specifically, it explores concatenating saliency maps generated from a separate saliency prediction model into various layers of a pre-trained image classification network (AlexNet) to determine if this additional visual attention information can boost classification accuracy. Several strategies for concatenation are tested, including at different convolutional and fully-connected layers. The methodology aims to leverage saliency predictions to help focus image classification networks on the most relevant image regions.
8. Introduction - ImageNet
ILSVRC - Evolution since 2010
Slide credit: Kaiming He (FAIR)
Some models have already reached human-level performance. Is it still the Olympic games of computer vision?
9. Introduction - ImageNet
ILSVRC - Evolution since 2010
2012: introduction of Convolutional Neural Networks (CNNs) in the competition with AlexNet (-9.4% error).
10. Introduction - AlexNet
Ref: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems (NIPS), 2012.
12. Introduction - CNN
Ref: LeCun, Yann, et al. "Gradient-based Learning Applied to Document Recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
13. Introduction - CNN
CNNs are very useful in computer vision:
● Reduction of parameters (shared filters)
● Spatial coherence
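The parameter-reduction point can be made concrete with a back-of-the-envelope count. The layer sizes below are illustrative (not AlexNet's), but the contrast between a convolutional layer with shared filters and a fully-connected layer producing the same output holds in general:

```python
# Illustrative sizes: a 32x32 RGB input mapped to a 30x30x16 feature map.
in_h, in_w, in_c = 32, 32, 3
out_h, out_w, out_c = 30, 30, 16
k = 3  # kernel size

# Convolutional layer: one 3x3x3 filter per output channel, shared across
# all spatial positions, plus one bias per output channel.
conv_params = out_c * (k * k * in_c + 1)

# Fully-connected layer with the same output size: every output unit gets
# its own weight for every input value, plus a bias.
fc_params = (out_h * out_w * out_c) * (in_h * in_w * in_c + 1)

print(conv_params)               # 448
print(fc_params)                 # tens of millions
print(fc_params // conv_params)  # the reduction factor from weight sharing
```

Weight sharing cuts the count by roughly five orders of magnitude here, which is why CNNs scale to large images at all.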
22. State-of-the-art - Saliency prediction
SalNet, trained on SALICON.
Ref: Pan, Junting, Kevin McGuinness, Elisa Sayrol, Xavier Giro-i-Nieto, and Noel E. O'Connor. "Shallow and Deep Convolutional Networks for Saliency Prediction." CVPR 2016.
24. Saliency prediction
Applications of saliency:
● Image retrieval
○ Finding the last appearance of an object.
Ref: Reyes, Cristian, et al. "Where is my Phone? Personal Object Retrieval from Egocentric Images." 2016.
25. Saliency prediction
Applications of saliency:
● Image retrieval
○ Finding the last appearance of an object.
● Object recognition
○ Health care
Ref: Reyes, Cristian, et al. "Where is my Phone? Personal Object Retrieval from Egocentric Images." 2016.
Ref: Pérez de San Roman, Philippe, et al. "Saliency Driven Object Recognition in Egocentric Videos with Deep CNN." 2016.
43. How to introduce saliency predictions?
Two fusion operations, multiplication and concatenation, applied either directly to the pre-trained CNN (AlexNet) or through a Fan-in network. It makes sense to use the baseline, which is already trained.
[Figure: diagrams of the multiplication, concatenation, and Fan-in strategies built on the pre-trained AlexNet.]
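The two fusion operations can be sketched in NumPy. Shapes and values here are illustrative, not taken from the slides; the only fixed assumption is that the saliency map is a single-channel map aligned with the image:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 2 RGB images with their predicted saliency maps
# (single-channel, same spatial size as the image, values in [0, 1]).
rgb = rng.random((2, 3, 8, 8)).astype(np.float32)       # (batch, channels, H, W)
saliency = rng.random((2, 1, 8, 8)).astype(np.float32)  # (batch, 1, H, W)

# Multiplication: modulate every RGB channel by the saliency map, which
# suppresses non-salient regions. The output keeps 3 channels, so the
# pre-trained network can be reused unchanged.
multiplied = rgb * saliency  # broadcasts over the channel axis

# Concatenation: append the saliency map as an extra input channel (RGBS),
# letting the network learn how to exploit it. The output has 4 channels,
# so the first layer's filters must be extended.
rgbs = np.concatenate([rgb, saliency], axis=1)

print(multiplied.shape)  # (2, 3, 8, 8)
print(rgbs.shape)        # (2, 4, 8, 8)
```

The trade-off the slides explore follows directly from these shapes: multiplication hard-codes how saliency is used, while concatenation leaves that decision to learned weights.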
45. Multiplication vs. Concatenation
Three strategies for each of them: RGBS.
[Figure: AlexNet pipeline (Conv 1-Conv 5 with batch norm. and max-pooling, fully-connected layers with dropout, FC 3 output) taking RGB and saliency inputs.]
46. Multiplication vs. Concatenation
Three strategies for each of them: RGBS, RGB-1S-2S.
[Figure: network diagrams for the RGBS and RGB-1S-2S strategies.]
47. Multiplication vs. Concatenation
Three strategies for each of them: RGBS, RGB-1S-2S, RGBS-1S-2S.
[Figure: network diagrams for the RGBS, RGB-1S-2S, and RGBS-1S-2S strategies.]
71. Fan-in architecture
71
The best option is concatenation:
● Fan-in C2.1
● Fan-in C2
Surprising result for
Fan-in C2 since it
has less parameters
than the baseline
More experiments
12.4%
83. ● CNNs trained to predict saliency maps can be used to improve other
computer vision tasks such as image classification
83
Conclusions
84. ● CNNs trained to predict saliency maps can be used to improve other
computer vision tasks such as image classification
84
Conclusions
Fan-in Network
85. ● CNNs trained to predict saliency maps can be used to improve other
computer vision tasks such as image classification
85
Conclusions
Fan-in Network
86. ● The best way to introduce the saliency maps to a CNN is with a Fan-in
architecture, that provides freedom to the network to decide how to introduce
the saliency maps
86
Conclusions
87. ● The best way to introduce the saliency maps to a CNN is with a Fan-in
architecture, that provides freedom to the network to decide how to introduce
the saliency maps
87
Conclusions
Fan-in C2.1
Conv 1
Conv 2
Conv 3
Conv 4
Conv 5
FC 1
FC 1
FC 3 - Output
Drop Out
Drop Out
Batch Norm.
Batch Norm.
Max-Pooling
Max-Pooling
Max-Pooling
RGB
Saliency
Conv 1
Batch Norm.
Max-Pooling
Conv 2
Batch Norm.
Max-Pooling
Fan-in NetworkConcatenation
RGBS
Conv 1
Conv 2
Conv 3
Conv 4
Conv 5
FC 1
FC 1
FC 3 - Output
Drop Out
Drop Out
Batch Norm.
Batch Norm.
Max-Pooling
Max-Pooling
Max-Pooling
RGB
Saliency
88. ● The best way to introduce the saliency maps to a CNN is with a Fan-in
architecture, that provides freedom to the network to decide how to introduce
the saliency maps
88
Conclusions
89. ● The methodology of downsampling the images provides accurate results on
the improvements of the CNN in larger images
89
Conclusions
227 x 227
128 x 128
91. Future work
91
● Several experiments:
○ Fan-in:
■ Fan-in C2 without saliency maps
■ Concatenating instead of multiplying
○ Concatenation only in the first convolutional layer
○ Multiplication and training from scratch
● Once we have a reasonable model try with other saliency models
92. Future work
92
● Several experiments:
○ Fan-in:
■ Fan-in C2 without saliency maps
■ Concatenating instead of multiplying
○ Concatenation only in the first convolutional layer
○ Multiplication and training from scratch
● Once we have a reasonable model try with other saliency models
Thank you