Deep Learning in Computer Vision
Deep Learning in Computer Vision Applications
1. Basics of Convolutional Neural Networks
2. Optimization Methods (Momentum, AdaGrad, RMSProp, Adam, etc.)
3. Semantic Segmentation
4. Class Activation Map
5. Object Detection
6. Recurrent Neural Network
7. Visual Question Answering
8. Word2Vec (Word embedding)
9. Image Captioning
  1. 1. Introduction to Deep Learning Presenter: Sungjoon Choi (sungjoon.choi@cpslab.snu.ac.kr)
  2. 2. Contents: Optimization methods, CNN basics, Semantic segmentation, Weakly supervised localization, Image detection, RNN, Visual QnA, Word2Vec, Image Captioning
  3. 3. What is deep learning? 3 Wikipedia says: “Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.” Key ideas: machine learning, high-level abstraction, network.
  4. 4. Is it brand new? 4 Neural Nets (McCulloch & Pitts, 1943), Perceptron (Rosenblatt, 1958), RNN (Grossberg, 1973), CNN (Fukushima, 1979), RBM (Hinton, 1999), DBN (Hinton, 2006), D-AE (Vincent, 2008), AlexNet (Krizhevsky, 2012), GoogLeNet (Szegedy, 2015)
  5. 5. Deep architectures 5 Feed-Forward: multilayer neural nets, convolutional nets Feed-Back: Stacked Sparse Coding, Deconvolutional Nets Bi-Directional: Deep Boltzmann Machines, Stacked Auto-Encoders Recurrent: Recurrent Nets, Long-Short Term Memory
  6. 6. CNN basics
  7. 7. CNN 7 CNNs are basically layers of convolutions followed by subsampling and fully connected layers. Intuitively speaking, the convolution and subsampling layers work as feature extractors, while the fully connected layers classify which category the current input belongs to using the extracted features.
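A minimal sketch of that conv → subsample → fully-connected pipeline, in PyTorch; the layer sizes here are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(                 # feature extraction
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                           # subsampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected

    def forward(self, x):                              # x: (B, 3, 32, 32)
        h = self.features(x)
        return self.classifier(h.flatten(1))           # class scores

logits = TinyCNN()(torch.randn(4, 3, 32, 32))          # -> shape (4, 10)
```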
  8.–16. (figure-only slides)
  17. 17. Optimization methods
  18. 18. Gradient descent?
  19. 19. Gradient descent? There are three variants of gradient descent. They differ in how much data we use to compute the gradient: we make a trade-off between accuracy and computing time.
  20. 20. Batch gradient descent In batch gradient descent, we use the entire training dataset to compute the gradient.
  21. 21. Stochastic gradient descent In stochastic gradient descent (SGD), the gradient is computed from each training sample, one by one.
  22. 22. Mini-batch gradient descent In mini-batch gradient descent, we take the best of both worlds. Common mini-batch sizes range between 50 and 256 (but can vary).
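A sketch of the three variants from the slides above; `grad_fn` is an assumed helper that returns the average gradient of the loss over the rows of `X`, `y` it is given:

```python
import numpy as np

def batch_gd_step(theta, X, y, grad_fn, lr=0.01):
    return theta - lr * grad_fn(theta, X, y)            # entire training set

def sgd_epoch(theta, X, y, grad_fn, lr=0.01):
    for i in np.random.permutation(len(X)):             # one sample at a time
        theta = theta - lr * grad_fn(theta, X[i:i+1], y[i:i+1])
    return theta

def minibatch_epoch(theta, X, y, grad_fn, lr=0.01, batch=64):
    idx = np.random.permutation(len(X))
    for s in range(0, len(X), batch):                    # e.g. 50-256 samples
        b = idx[s:s+batch]
        theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta
```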
  23. 23. Challenges Choosing a proper learning rate is cumbersome → learning rate schedules. Avoiding getting trapped in suboptimal local minima.
  24. 24. Momentum
  25. 25. Nesterov accelerated gradient
  26. 26. Adagrad Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters: $\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$
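A minimal NumPy sketch of this update (variable names are ours):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """One Adagrad update. G accumulates squared gradients per parameter,
    so frequently-updated parameters get smaller and smaller steps."""
    G = G + g**2
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G
```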
  27. 27. Adadelta Adadelta is an extension of Adagrad that seeks to reduce its monotonically decreasing learning rate. It restricts the window of accumulated past gradients to some fixed size $w$: $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$, $E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma) \Delta\theta_t^2$, $\theta_{t+1} = \theta_t - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$. No learning rate!
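A corresponding Adadelta sketch, following the slide's equations (γ = 0.9 is an assumed default):

```python
import numpy as np

def adadelta_step(theta, g, Eg2, Edx2, gamma=0.9, eps=1e-6):
    """One Adadelta update: no global learning rate, per the slide."""
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2            # running avg of g^2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = gamma * Edx2 + (1 - gamma) * dx**2         # running avg of updates
    return theta + dx, Eg2, Edx2
```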
  28. 28. Exponential moving average 28
  29. 29. RMSprop RMSprop is an unpublished adaptive learning rate method proposed by Geoff Hinton in his lecture: $E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t$
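A minimal sketch of the same two equations:

```python
import numpy as np

def rmsprop_step(theta, g, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update, following the slide's equations."""
    Eg2 = gamma * Eg2 + (1 - gamma) * g**2
    return theta - lr * g / np.sqrt(Eg2 + eps), Eg2
```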
  30. 30. Adam Adaptive Moment Estimation (Adam) stores both an exponentially decaying average of past gradients (a momentum term) and of past squared gradients (a running average of gradient squares): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \, m_t$
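A sketch using the standard bias-corrected form, which is equivalent to the folded expression above up to the placement of ε; the defaults are the usual ones, not taken from the slides:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t starts at 1)."""
    m = b1 * m + (1 - b1) * g            # momentum term
    v = b2 * v + (1 - b2) * g**2         # running average of gradient squares
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```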
  32. 32. Visualization
  33. 33. Semantic segmentation
  34. 34. Semantic Segmentation? Image Classification (lion, dog, giraffe) vs. Object Detection (bicycle, person, ball, dog) vs. Semantic Segmentation (per-pixel labels: person, bicycle)
  35. 35. Semantic segmentation 35
  36.–44. (figure-only slides)
  45. 45. Results 45
  46.–72. (figure-only result slides)
  73. 73. Results 73
  74. 74. Results 74
  75. 75. Weakly supervised localization
  76. 76. Weakly supervised localization 76
  77. 77. Weakly supervised localization 77
  78. 78. Weakly Supervised Object Localization 78 Supervised learning of localization is usually annotated with bounding boxes. What if localization were possible with image-level labels alone, without bounding-box annotations? Today's seminar: “Learning Deep Features for Discriminative Localization,” Zhou et al., CVPR 2016 (arXiv:1512.04150v1)
  79. 79. Architecture 79 AlexNet + GAP + places205: a 227x227x3 input yields 11x11x512 conv feature maps; 11x11 average pooling (Global Average Pooling, GAP) reduces them to a 512-d vector, followed by a 205-way output (e.g. “living room”).
  80. 80. Class activation map (CAM) 80 • Identify important image regions by projecting the weights of the output layer back onto the convolutional feature maps. • CAMs can be generated for each class in a single image. • The highlighted regions differ per category in a given image • e.g. palace, dome, church, …
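A minimal CAM sketch; the shapes follow the architecture slide above, and all names are illustrative:

```python
import numpy as np

def class_activation_map(feat, W, cls):
    """feat: (C, H, W) last conv feature maps; W: (num_classes, C) weights of
    the fc layer after GAP; cls: class index. Returns an (H, W) map."""
    cam = np.tensordot(W[cls], feat, axes=1)   # weighted sum over channels
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)            # normalize to [0, 1] for display

feat = np.random.rand(512, 11, 11)             # 11x11x512, as in the slide
W = np.random.rand(205, 512)                   # 205 scene categories
cam = class_activation_map(feat, W, cls=42)    # upsample to 227x227 to overlay
```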
  81. 81. Results 81 • CAM on top 5 predictions on an image • CAM for one object class in images
  82. 82. GAP vs. GMP 82 • Oquab et al., CVPR 2015: “Is object localization for free? Weakly-supervised learning with convolutional neural networks” uses global max pooling (GMP). • Intuitive difference between GMP and GAP? • The GAP loss encourages the network to identify the full extent of an object, while the GMP loss encourages it to identify just one discriminative part. • With GAP, the average of a map is maximized by finding all discriminative parts of an object; if all activations are low, the output of that map is reduced. • With GMP, low scores for all regions except the most discriminative part do not impact the score, since only the max is pooled.
  83. 83. GAP & GMP 83 • GAP (upper) vs. GMP (lower): GAP outperforms GMP. • GAP highlights more complete object regions with less background noise. • The average-pooling loss benefits when the network identifies all discriminative regions of an object.
  84. (figure-only slide)
  85. 85. Concept localization 85 Concept localization in weakly labeled images • Positive set: short phrases from text captions • Negative set: randomly selected images • The model catches the concept even though phrases are much more abstract than object names. Weakly supervised text detector • Positive set: 350 Google Street View images that contain text • Negative set: outdoor scene images from the SUN dataset • Text is highlighted without bounding-box annotations.
  86. 86. Image detection
  87.–101. (figure-only slides)
  102. 102. Results 102
  103. 103. SPPnet
  104.–112. (figure-only slides)
  113. 113. Results 113
  114. 114. Results 114
  115. 115. Fast R-CNN
  116.–125. (figure-only slides)
  126. 126. Faster R-CNN
  127.–138. (figure-only slides)
  139. 139. Results 139
  140. 140. Results 140
  141. 141. R-CNN 141 Image → Regions → Resize → Convolution → Features → Classify
  142. 142. SPP net 142 Image → Convolution → Features → SPP (Regions) → Classify
  143. 143. R-CNN vs. SPP net 143 R-CNN SPP net
  144. 144. Fast R-CNN 144 Image → Convolution Features; Regions → RoI Pooling Layer → Class Label + Confidence (one RoI pooling branch per region)
  145. 145. R-CNN vs. SPP net vs. Fast R-CNN 145 R-CNN SPP net Fast R-CNN
  146. 146. Faster R-CNN 146 Image → Fully Convolutional Features → Bounding Box Regression + BB Classification → Fast R-CNN
  147. 147. R-CNN vs. SPP net vs. Fast R-CNN 147 R-CNN SPP net Fast R-CNN Faster R-CNN
  148. 148. Results 148
  149.–152. (figure-only slides)
  153. 153. RNN
  154. 154. Recurrent Neural Network 155 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  155. 155. Recurrent Neural Network 156 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  156. 156. LSTM comes in! 157 Long Short Term Memory This is just a standard RNN. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  157. 157. LSTM comes in! 158 Long Short Term Memory This is just a standard RNN. This is the LSTM! http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  158. 158. Overall Architecture 159 (Cell) state and hidden state; forget gate, input gate, output gate; next (cell) state and next hidden state; input and output, where output = hidden state. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
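A minimal LSTM cell sketch with the gates named on the slide; the gate ordering and weight layout are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """x: (D,) input; h: (H,) hidden state; c: (H,) cell state;
    W: (4H, D+H) stacked gate weights; b: (4H,) bias."""
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)                    # gate pre-activations
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    c = f * c + i * np.tanh(g)                     # next cell state
    h = o * np.tanh(c)                             # next hidden state
    return h, c                                    # output = hidden state
```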
  159. 159. The Core Idea 160 http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  160. 160. Visual QnA
  161. 161. VQA: Dataset and Problem definition 162 VQA dataset - Example Q: How many dogs are seen? Q: What animal is this? Q: What color is the car? Q: What is the mustache made of? Q: Is this vegetarian pizza?
  162. 162. Solving VQA 163 Approach Various methods have been proposed [Malinowski et al., 2015; Ren et al., 2015; Andres et al., 2015; Ma et al., 2015; Jiang et al., 2015].
  163. 163. DPPnet 164 Motivation The common pipeline for using deep learning in vision: take a CNN trained on ImageNet, switch the final layer, and fine-tune for the new task. Observation: in VQA, the task is determined by the question.
  164. 164. DPPnet 165 Main Idea Switching the parameters of a layer based on the question, via a Dynamic Parameter Layer and a question Parameter Prediction Network.
  165. 165. DPPnet 166 Parameter Explosion Number of parameters for the fc-layer (R): the Dynamic Parameter Layer has N = Q×P weights (Q: input dimension, P: output dimension), each predicted from an M-dimensional question feature (the hidden state), so R = Q×P×M. For example, Q = 1000, P = 1000, M = 500 gives R = 500,000,000, i.e. 1.86 GB for a single layer; for comparison, the number of parameters for all of VGG19 is 144,000,000.
  166. 166. DPPnet 167 Parameter Explosion Solution: instead of predicting all N = Q×P weights directly (R = Q×P×M), predict an N-dimensional output with N < Q×P, so that R = N×M; we can control N.
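A quick back-of-envelope check of these numbers; the reduced N below is an illustrative choice, not a value from the paper:

```python
Q, P, M = 1000, 1000, 500            # fc input, fc output, question-feature dim
R_naive = Q * P * M                  # predict every fc weight directly
print(R_naive)                       # 500_000_000 parameters
print(R_naive * 4 / 2**30)           # ~1.86 GB in float32, for a single layer
N = 40_000                           # illustrative choice with N < Q*P
print(N * M)                         # 20_000_000 parameters instead
```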
  167. 167. DPPnet 168 Weight Sharing with Hashing Trick Weights of the Dynamic Parameter Layer are picked from the candidate weights by hashing [Chen et al., 2015].
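A toy sketch of the hashing trick; the hash function here is illustrative, not the one used by [Chen et al., 2015]:

```python
import numpy as np

def hashed_weight_matrix(candidates, Q, P):
    """Fill a (Q, P) weight matrix by hashing each (row, col) position to an
    index into the small candidate-weight vector."""
    rows, cols = np.meshgrid(np.arange(Q), np.arange(P), indexing="ij")
    idx = (rows * 2654435761 + cols * 40503) % len(candidates)  # toy hash
    return candidates[idx]

candidates = np.random.randn(40_000)             # N predicted candidate weights
W_dynamic = hashed_weight_matrix(candidates, 1000, 1000)
```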
  168. 168. DPPnet 169 Final Architecture End-to-End Fine-tuning is possible (Fully-differentiable)
  169. 169. DPPnet 170 Qualitative Results Q: What is the boy holding? DPPnet: surfboard DPPnet: bat
  170. 170. DPPnet 171 Qualitative Results Q: What animal is shown? DPPnet: giraffe DPPnet: elephant
  171. 171. DPPnet 172 Qualitative Results Q: How does the woman feel? DPPnet: happy Q: What type of hat is she wearing? DPPnet: cowboy
  172. 172. DPPnet 173 Qualitative Results Q: How many cranes are in the image? DPPnet: 2 (3) Q: How many people are on the bench? DPPnet: 2 (1)
  173. 173. How to combine image and question? 174
  174. 174. How to combine image and question? 175
  175. 175. How to combine image and question? 176
  176. 176. How to combine image and question? 177
  177. 177. How to combine image and question? 178
  178. 178. How to combine image and question? 179
  179. 179. How to combine image and question? 180
  180. 180. How to combine image and question? 181
  181. 181. Multimodal Compact Bilinear Pooling 182
  182. 182. Multimodal Compact Bilinear Pooling 183
  183. 183. Multimodal Compact Bilinear Pooling 184
  184. 184. Multimodal Compact Bilinear Pooling 185
  185. 185. MCB without Attention 186
  186. 186. MCB with Attention 187
  187. 187. Results 188
  188. 188. Results 189
  189. 189. Results 190
  190. 190. Results 191
  191. 191. Results 192
  192. 192. Results 193
  193. 193. Word2Vec
  194. 194. Word2vec? 195
  195.–207. (figure-only slides)
  208. 208. Image Captioning
  209. 209. Image Captioning? 210
  210. 210. Overall Architecture 211
  211. 211. Language Model 212
  212. 212. Language Model 213
  213. 213. Language Model 214
  214. 214. Language Model 215
  215. 215. Language Model 216
  216. 216. Training phase 217
  217. 217. Training phase 218
  218. 218. Training phase 219
  219. 219. Training phase 220
  220. 220. Training phase 221
  221. 221. Training phase 222
  222. 222. Test phase 223
  223. 223. Test phase 224
  224. 224. Test phase 225
  225. 225. Test phase 226
  226. 226. Test phase 227
  227. 227. Test phase 228
  228. 228. Test phase 229
  229. 229. Test phase 230
  230. 230. Test phase 231
  231. 231. Results 232
  232. 232. Results 233
  233. 233. But not always… 234
  234. (figure-only slide)
  235. 235. Show, attend and tell 236
  236.–239. (figure-only slides)
  240. 240. Results 241
  241. 241. Results 242
  242. 242. Results (mistakes) 243
  243. 243. Neural Art
  244. 244. Preliminaries 245 “Understanding Deep Image Representations by Inverting Them” (CVPR 2015); “Texture Synthesis Using Convolutional Neural Networks” (NIPS 2015)
  245. 245. A Neural Algorithm of Artistic Style 246
  246. 246. A Neural Algorithm of Artistic Style 247
  247. 247. 248 “Texture Synthesis Using Convolutional Neural Networks” (NIPS 2015), Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
  248. 248. Texture? 249
  249. 249. Visual texture synthesis 250 Which one do you think is real? The right one is real. The goal of texture synthesis is to produce (arbitrarily many) new samples from an example texture.
  250. 250. Results of this work 251 The right-hand images are the given sources!
  251. 251. How? 252
  252. 252. Texture Model 253 Inputs $X_a$ and $X_b$ are passed through the CNN, giving feature maps $F_a^1, F_a^2, F_a^3$ and $F_b^1, F_b^2, F_b^3$, one per layer (the width of each is the number of filters).
  253. 253. Feature Correlations 254 For each layer, the feature correlations are summarized by the Gram matrix, e.g. $G_a^2 = (F_a^2)^T F_a^2$.
  254. 254. Feature Correlations 255 With $F_a^2$ reshaped to (W·H) × (number of filters), the Gram matrix $G_a^2 = (F_a^2)^T F_a^2$ has size (number of filters) × (number of filters).
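A minimal Gram-matrix sketch matching these shapes:

```python
import numpy as np

def gram(feat):
    """feat: (N_filters, H, W) -> (N_filters, N_filters) Gram matrix."""
    F = feat.reshape(feat.shape[0], -1).T     # (W*H, N_filters)
    return F.T @ F                            # filter-response correlations
```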
  255. 255. Texture Generation 256 Pass both inputs through the network and compute the Gram matrices $G_a^l$ and $G_b^l$ at every layer $l$.
  256. 256. Texture Generation 257 Match $G_b^l$ to $G_a^l$ with an element-wise squared loss at each layer, and sum over layers for the total layer-wise loss function.
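A sketch of that loss, assuming uniform layer weights (the paper allows per-layer weights):

```python
import numpy as np

def gram(feat):                                    # as in the sketch above
    F = feat.reshape(feat.shape[0], -1)
    return F @ F.T

def texture_loss(feats_a, feats_b):
    """feats_*: list of per-layer feature maps, each (N_filters, H, W)."""
    loss = 0.0
    for Fa, Fb in zip(feats_a, feats_b):
        Ga, Gb = gram(Fa), gram(Fb)
        loss += np.sum((Ga - Gb) ** 2) / Ga.size   # element-wise squared loss
    return loss                                    # summed over layers
```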
  257. 257. Results 258
  258. 258. Results 259
  259. 259. 260 “Understanding Deep Image Representations by Inverting Them” (CVPR 2015), Aravindh Mahendran, Andrea Vedaldi (VGG group)
  260. 260. Reconstruction from feature map 261
  261. 261. Reconstruction from feature map 262 An input is passed through the CNN, giving feature maps $F^1, F^2, F^3$ (one per layer, with the number of filters as width). Let's make the features of a second input similar, by changing the input image itself!
  262. 262. Receptive Field 263
  263. 263. 264 A Neural Algorithm of Artistic Style Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
  264. 264. How? 265 Style Image + Content Image → Mixed Image (Neural Art)
  265. 265. How? 266 Style Image + Content Image → Mixed Image (Neural Art), combining “Texture Synthesis Using Convolutional Neural Networks” (style) and “Understanding Deep Image Representations by Inverting Them” (content).
  266. 266. How? 267 Gram matrix
  267. 267. Neural Art 268 $p$: original photo, $a$: original artwork, $x$: image to be generated. Total loss = content loss + style loss: $\mathcal{L}_{\text{total}} = \alpha \, \mathcal{L}_{\text{content}}(p, x) + \beta \, \mathcal{L}_{\text{style}}(a, x)$
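A sketch of the combined loss under these definitions; the content term matches raw features against the photo and the style term matches Gram matrices against the artwork, with the α and β values here being assumptions:

```python
import numpy as np

def gram(F):
    f = F.reshape(F.shape[0], -1)
    return f @ f.T

def neural_art_loss(feats_p, feats_a, feats_x, alpha=1.0, beta=1e3):
    """p: photo, a: artwork, x: generated image; feats_* are per-layer maps."""
    content = sum(np.sum((Fx - Fp) ** 2)               # raw-feature match
                  for Fx, Fp in zip(feats_x, feats_p))
    style = sum(np.sum((gram(Fx) - gram(Fa)) ** 2) / gram(Fa).size
                for Fx, Fa in zip(feats_x, feats_a))   # Gram match
    return alpha * content + beta * style
```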
  268. 268. Results 269
  269. 269. Results 270
  270. (figure-only slide)
