4. 1. VQA: Visual Question Answering
VQA is a new dataset containing open-ended questions about images.
Answering these questions requires an understanding of vision, language,
and commonsense knowledge.
VQA Challenge
Antol et al., ICCV 2015
5. 1. VQA: Visual Question Answering
Examples of human answers given the image and the question, or the question only
VQA Challenge
Antol et al., ICCV 2015
6. 1. VQA: Visual Question Answering
# images: 204,721 (MS COCO)
# questions: 760K (3 per image)
# answers: 10M (10 per question, plus additional answers)
Numbers
             Images   Questions   Answers
Training     80K      240K        2.4M
Validation   40K      120K        1.2M
Test         80K      240K        -
Antol et al., ICCV 2015
7. 1. VQA: Visual Question Answering
Test Dataset
80K test images, divided into four splits of 20K images each:
- Test-dev (development): debugging and validation; up to 10 submissions per day to the evaluation server.
- Test-standard (publications): used to score entries for the public leaderboard.
- Test-challenge (competitions): used to rank challenge participants.
- Test-reserve (check overfitting): used to estimate overfitting; scores on this set are never released.
Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015
9. 2. Vision Part
ResNet: The Dominant Convolutional Neural Network
He et al., CVPR 2016
1st place on the ImageNet 2015 classification task
1st place on the ImageNet 2015 detection task
1st place on the ImageNet 2015 localization task
1st place on the COCO object detection task
1st place on the COCO object segmentation task
10. 2. Vision Part
ResNet-152
He et al., CVPR 2016
[Figure: ResNet-152, a 152-layer convolutional neural network. Input size
3x224x224; the final stage stacks Conv 1x1 (512), Conv 3x3 (512), and
Conv 1x1 (2048), average-pools the 7x7x2048 activation down to 1x1x2048,
and ends with a Linear layer and a Softmax over the output classes.]
11. 2. Vision Part
ResNet-152
He et al., CVPR 2016
[Figure: the same ResNet-152 pipeline, used as a visual feature extractor.
The 1x1x2048 average-pooled activation, taken before the final Linear and
Softmax classifier, serves as the image feature.]
12. 2. Vision Part
ResNet-152
He et al., CVPR 2016
Pre-trained models are available!
For Torch, TensorFlow, Theano, Caffe, Lasagne, Neon, and MatConvNet
https://github.com/KaimingHe/deep-residual-networks
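The repositories above target 2016-era frameworks. As a rough modern equivalent, here is a minimal PyTorch/torchvision sketch (my own, not from the slides) of using a pre-trained ResNet-152 as a 2048-d feature extractor; the image filename is hypothetical:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained ResNet-152 and drop the final classifier, keeping
# everything up to and including the 7x7 average pooling.
resnet = models.resnet152(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# Standard ImageNet preprocessing for a 3x224x224 input.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical image file
with torch.no_grad():
    x = preprocess(image).unsqueeze(0)   # 1 x 3 x 224 x 224
    v = extractor(x).flatten(1)          # 1 x 2048 visual feature
```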
14. 3. Question Part
Word-Embedding
What color are her eyes?
→ preprocessing → what color are her eyes ?
→ indexing → 53 7 44 127 2 6177
→ lookup → w53 | w7 | w44 | w127 | w2 | w6177
[Figure: lookup table holding one word vector wi per vocabulary entry.]
The word vectors {wi} are learnable parameters, updated by the
back-propagation algorithm.
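A minimal sketch of this preprocessing → indexing → lookup pipeline, with a toy vocabulary (the slide's indices 53, 7, 44, ... come from the real VQA vocabulary; the mapping here is illustrative):

```python
import torch
import torch.nn as nn

# Toy vocabulary; the real indices (53, 7, 44, ...) come from the full vocab.
vocab = {"what": 0, "color": 1, "are": 2, "her": 3, "eyes": 4, "?": 5}

def preprocess(question):
    # Lowercase and split the question mark off as its own token.
    return question.lower().replace("?", " ?").split()

tokens = preprocess("What color are her eyes?")     # preprocessing
indices = torch.tensor([vocab[t] for t in tokens])  # indexing

# Lookup table: one learnable row w_i per vocabulary word.
lookup = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)
word_vectors = lookup(indices)  # 6 x 300; rows are updated by back-propagation
```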
16. 3. Question Part
Choice of RNN: Gated Recurrent Units (GRU)
Cho et al., EMNLP 2014
Chung et al., arXiv 2014
[Figure: GRU cell, showing the previous state st-1, the candidate h, and the
new state st.]
z = σ(xt Uz + st-1 Wz)
r = σ(xt Ur + st-1 Wr)
h = tanh(xt Uh + (st-1 ∘ r) Wh)
st = (1 - z) ∘ h + z ∘ st-1
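These update equations transcribe directly into code. A NumPy sketch with illustrative sizes (300-d inputs, 512-d state, neither taken from the paper):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, s_prev, U, W):
    """One GRU step following the slide's equations.
    U and W are dicts of weight matrices for the z, r, h gates."""
    z = sigmoid(x_t @ U["z"] + s_prev @ W["z"])        # update gate
    r = sigmoid(x_t @ U["r"] + s_prev @ W["r"])        # reset gate
    h = np.tanh(x_t @ U["h"] + (s_prev * r) @ W["h"])  # candidate state
    return (1.0 - z) * h + z * s_prev                  # new state s_t

# Illustrative sizes: 300-d word vectors, 512-d hidden state.
rng = np.random.default_rng(0)
U = {k: rng.normal(scale=0.1, size=(300, 512)) for k in "zrh"}
W = {k: rng.normal(scale=0.1, size=(512, 512)) for k in "zrh"}
s = np.zeros(512)
for x in rng.normal(size=(6, 300)):  # e.g., the six question tokens
    s = gru_step(x, s, U, W)
```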
17. 3. Question Part
Skip-Thought Vectors
Pre-trained model for word-embedding and question-embedding
Kiros et al., NIPS 2015
Given the middle sentence, the model tries to reconstruct the previous
sentence and the next sentence:
"I got back home." ← "I could see the cat on the steps." → "This was strange."
Trained on the BookCorpus dataset (Zhu et al., arXiv 2015).
18. 3. Question Part
Skip-Thought Vectors
Pre-trained model for word-embedding and question-embedding
The trained encoder is used as a sentence-to-vector (Sent2Vec) model
Kiros et al., NIPS 2015
[Figure: question words pass through the lookup table of word vectors into
a pre-trained GRU (Gated Recurrent Units) encoder.]
19. 3. Question Part
Skip-Thought Vectors
Pre-trained model (Theano) and porting code (Torch) are available!
https://github.com/ryankiros/skip-thoughts
https://github.com/HyeonwooNoh/DPPnet/tree/master/003_skipthoughts_porting
Noh et al., CVPR 2016
Kiros et al., NIPS 2015
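Going from the ryankiros/skip-thoughts README, encoding questions with the pre-trained model looks roughly like this (Python 2 / Theano era; verify the exact API against the repository):

```python
# Requires the pre-trained model files downloaded per the repo's README.
import skipthoughts

model = skipthoughts.load_model()
questions = ["What color are her eyes ?",
             "What kind of animals are these ?"]
vectors = skipthoughts.encode(model, questions)  # one vector per question
```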
21. 4. Multimodal Residual Learning
Idea 1: Deep Residual Learning
Extend the idea of deep residual learning for multimodal learning
He et al., CVPR 2016
[Figure 2: residual learning, a building block. The input x passes through
two weight layers with ReLU activations to produce F(x); an identity
shortcut adds x, giving F(x) + x before the final ReLU.]
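A minimal sketch of this building block, using fully-connected layers for brevity (the original blocks are convolutional):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), where F = weight layer -> ReLU -> weight layer."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))  # F(x)
        return self.relu(f + x)               # identity shortcut: F(x) + x
```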
22. 4. Multimodal Residual Learning
Idea 2: Hadamard product for Joint Residual Mapping
One modality is directly involved in the gradient with respect to the other
modality
https://github.com/VT-vision-lab/VQA_LSTM_CNN
[Figure: the baseline joint embedding. vQ and vI each pass through tanh and
are combined by an element-wise (Hadamard) product ◉, followed by softmax.]
∂(σ(x) ∘ σ(y)) / ∂x = diag(σ′(x) ∘ σ(y))
Scaling problem?
Wu et al., NIPS 2016
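The gradient identity is easy to confirm numerically. In this PyTorch check (illustrative values), y's activation shows up as a factor in x's gradient, which is how one modality directly shapes the other's updates:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.randn(4)

out = torch.sigmoid(x) * torch.sigmoid(y)  # Hadamard joint embedding
out.sum().backward()

# d/dx [sigma(x) o sigma(y)] = sigma'(x) o sigma(y)
sx = torch.sigmoid(x)
expected = sx * (1 - sx) * torch.sigmoid(y)
assert torch.allclose(x.grad, expected)  # sigma(y) appears in x's gradient
```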
23. 4. Multimodal Residual Learning
Multimodal Residual Networks
Kim et al., NIPS 2016
[Figure: Multimodal Residual Networks overview. The question Q ("What kind
of animals are these ?") is processed by word embedding and an RNN; the
image V is processed by a CNN; the two streams are fused by Hadamard
products with question shortcuts, and a softmax predicts the answer A
("sheep").]
word2vec
(Mikolov et al., 2013)
skip-thought vectors
(Kiros et al., 2015)
ResNet
(He et al., 2015)
24. 4. Multimodal Residual Learning
Multimodal Residual Networks
Kim et al., NIPS 2016
[Figure: MRN architecture with three learning blocks. In each block l, the
current question state H(l-1) (Q for the first block) passes through
Linear+Tanh, and the visual feature V passes through
Linear+Tanh+Linear+Tanh; the two are fused by an element-wise product ⊙
and added (⊕) to a linear shortcut mapping of H(l-1), producing Hl. A
final Linear and Softmax over H3 predict the answer A.]
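A sketch of one learning block as I read the figure; the layer shapes and the linear question shortcut are assumptions taken from the diagram, so check them against the paper:

```python
import torch.nn as nn

class MRNBlock(nn.Module):
    """H_l = Linear(H_{l-1}) + Tanh(Linear(H_{l-1})) (*) V-embedding."""
    def __init__(self, q_dim, v_dim, dim):
        super().__init__()
        self.q_embed = nn.Sequential(nn.Linear(q_dim, dim), nn.Tanh())
        self.v_embed = nn.Sequential(nn.Linear(v_dim, dim), nn.Tanh(),
                                     nn.Linear(dim, dim), nn.Tanh())
        self.shortcut = nn.Linear(q_dim, dim)  # question shortcut mapping

    def forward(self, h, v):
        joint = self.q_embed(h) * self.v_embed(v)  # Hadamard product (⊙)
        return self.shortcut(h) + joint            # residual addition (⊕)
```

Stacking three such blocks, feeding V into each, and finishing with a Linear layer and Softmax over H3 reproduces the figure.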
26. 5. Results
[Figure 3: alternative models (a)-(e). The variants differ in how Hl-1 and
V are embedded before the element-wise product ⊙ (e.g., whether V gets one
or two Linear+Tanh stages, and whether the fused term passes through an
extra Linear+Tanh), and in the shortcut mapping ⊕ (identity if l = 1 and
Linear otherwise, or Linear if l = 1 and none otherwise).]
Table 1: The results of alternative models (a)-(e) on the test-dev.
        Open-Ended
     All    Y/N    Num.   Other
(a)  60.17  81.83  38.32  46.61
(b)  60.53  82.53  38.34  46.78
(c)  60.19  81.91  37.87  46.70
(d)  59.69  81.67  37.23  46.00
(e)  60.20  81.98  38.25  46.57
Table 2: The effect of the visual features and # of target answers on the
test-dev results. Vgg stands for VGG-19 and Res for ResNet-152 features,
described in Section 4.
          Open-Ended
         All    Y/N    Num.   Other
Vgg, 1k  60.53  82.53  38.34  46.78
Vgg, 2k  60.79  82.13  38.87  47.52
Vgg, 3k  60.68  82.40  38.69  47.10
Res, 1k  61.45  82.36  38.40  48.81
Res, 2k  61.68  82.28  38.82  49.25
Res, 3k  61.47  82.28  39.09  48.76
5 Results
The VQA Challenge, which released the VQA dataset, provides evaluation
servers for the test-dev and test-standard splits. For test-dev, the
evaluation server permits unlimited submissions for validation, while
test-standard permits limited submissions for the competition. We report
accuracies in percentage.
Alternative Models The test-dev results of the alternative models for the
Open-Ended task are shown in Table 1. (a) shows a significant improvement
over SAN; however, (b) is only marginally better than (a). Compared to (b),
(c) deteriorates the performance: an extra embedding for a question vector
may easily cause overfitting, leading to overall degradation. Identity
shortcuts, as in (d), cause the degradation problem, too; the extra
parameters of the linear mappings may effectively support the task.
(e) shows reasonable performance; however, the extra shortcut is not
essential.
A Appendix
A.1 VQA test-dev Results
Table 1: The effects of various options for VQA test-dev. Here, the model of
Figure 3a is used, since these experiments were conducted preliminarily.
VGG-19 features and 1k target answers are used. "s" stands for the usage of
Skip-Thought Vectors [6] to initialize the question-embedding GRU, "b" for
the usage of Bayesian Dropout [3], and "c" for postprocessing with an image
captioning model [5].
           Open-Ended                    Multiple-Choice
          All    Y/N    Num.   Other    All    Y/N    Num.   Other
baseline  58.97  81.11  37.63  44.90    63.53  81.13  38.91  54.06
s         59.38  80.65  38.30  45.98    63.71  80.68  39.73  54.65
s,b       59.74  81.75  38.13  45.84    64.15  81.77  39.54  54.67
s,b,c     59.91  81.75  38.13  46.19    64.18  81.77  39.51  54.72
Table 2: The results for VQA test-dev. The precision of some accuracies
[11, 2, 10] is one less than that of the others, so they are zero-filled to
match.
27. 5. Results
Table 3: The effects of shortcut connections of MRN for VQA test-dev.
ResNet-152 features and 2k target answers are used. MN stands for Multimodal
Networks without residual learning, which have no shortcut connections.
Dim. stands for the common embedding vector's dimension. The number of
parameters for word embedding (9.3M) and question embedding (21.8M) is
subtracted from the total number of parameters in this table.
                           Open-Ended
      L   Dim.   #params   All    Y/N    Num.   Other
MN    1   4604   33.9M     60.33  82.50  36.04  46.89
MN    2   2350   33.9M     60.90  81.96  37.16  48.28
MN    3   1559   33.9M     59.87  80.55  37.53  47.25
MRN   1   3355   33.9M     60.09  81.78  37.09  46.78
MRN   2   1766   33.9M     61.05  81.81  38.43  48.43
MRN   3   1200   33.9M     61.68  82.28  38.82  49.25
MRN   4   851    33.9M     61.02  82.06  39.02  48.04
A.2 More Examples
28. 5. Results
Table 3: The VQA test-standard results. The precision of some accuracies
[30, 1] is one less than that of the others, so they are zero-filled to
match.
                   Open-Ended                    Multiple-Choice
               All    Y/N    Num.   Other    All    Y/N    Num.   Other
DPPnet [21]    57.36  80.28  36.92  42.24    62.69  80.35  38.79  52.79
D-NMN [1]      58.00  -      -      -        -      -      -      -
Deep Q+I [11]  58.16  80.56  36.53  43.73    63.09  80.59  37.70  53.64
SAN [30]       58.90  -      -      -        -      -      -      -
ACK [27]       59.44  81.07  37.12  45.83    -      -      -      -
FDA [9]        59.54  81.34  35.67  46.10    64.18  81.25  38.30  55.20
DMN+ [28]      60.36  80.43  36.82  48.33    -      -      -      -
MRN            61.84  82.39  38.23  49.41    66.33  82.41  39.57  58.40
Human [2]      83.30  95.77  83.39  72.67    -      -      -      -
5.1 Visualization
31. 6. Discussions
Examples:
[Figure: example images with questions and MRN's answers, panels (a)-(h).]
What kind of animals are these ? sheep    |  What animal is the picture ? elephant
What is this animal ? zebra               |  What game is this person playing ? tennis
How many cats are here ? 2                |  What color is the bird ? yellow
What sport is this ? surfing              |  Is the horse jumping ? yes
33. Acknowledgments
This work was supported by Naver Corp.
and partly by the Korea government (IITP-R0126-16-1072-
SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-
RISF, ADD-UD130070ID-BMRR).
36. Low-rank Bilinear Pooling
Kim et al., arXiv 2016
Bilinear model:
fi = Σj=1..N Σk=1..M wijk xj yk + bi = xᵀ Wi y + bi
Low-rank restriction: rank(Wi) ≤ min(N, M). Factorizing Wi = Ui Viᵀ, with
Ui of size N×D and Viᵀ of size D×M:
fi = xᵀ Wi y + bi = xᵀ Ui Viᵀ y + bi = 1ᵀ(Uiᵀ x ∘ Viᵀ y) + bi
Stacking all outputs with a pooling matrix P:
f = Pᵀ(Uᵀ x ∘ Vᵀ y) + b
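The chain of identities can be verified numerically. In this PyTorch check the sizes are arbitrary, and U, V, P, b are random stand-ins for learned parameters:

```python
import torch

N, M, D, C = 8, 8, 4, 5  # input dims, rank D, output size (all arbitrary)
x, y = torch.randn(N), torch.randn(M)
U, V = torch.randn(N, D), torch.randn(M, D)

# Full bilinear form with the low-rank factorization W_i = U_i V_i^T ...
W = U @ V.T
full = x @ W @ y
# ... equals the Hadamard form 1^T (U^T x o V^T y).
hadamard = torch.sum((U.T @ x) * (V.T @ y))
assert torch.allclose(full, hadamard, atol=1e-4)

# Vectorized over C outputs with a pooling matrix P: f = P^T (U^T x o V^T y) + b
P, b = torch.randn(D, C), torch.randn(C)
f = P.T @ ((U.T @ x) * (V.T @ y)) + b  # C-dimensional output
```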
37. Low-rank Bilinear Pooling
Kim et al., arXiv 2016
The same low-rank bilinear model, f = Pᵀ(Uᵀ x ∘ Vᵀ y) + b.
[Figure: the joint embedding from the earlier slide (vQ ◉ vI). The Hadamard
product of the two projected modalities is exactly a low-rank bilinear
pooling with rank(Wi) ≤ min(N, M).]
41. MLB Attention Networks
Kim et al., arXiv 2016
[Figure: MLB attention network. The question Q passes through Linear+Tanh
and is replicated over spatial positions; the visual feature map V passes
through Conv+Tanh; the two are fused, and a Conv followed by Softmax yields
spatial attention weights. The attended visual feature and the question
embedding are fused once more through Linear+Tanh on each side, and a final
Linear with Softmax predicts the answer A.]
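A rough sketch of the attention branch as I read this figure; the single attention glimpse, the layer sizes, and the softmax over spatial positions are my assumptions from the diagram:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLBAttention(nn.Module):
    """Low-rank bilinear attention over a spatial feature map."""
    def __init__(self, q_dim, v_dim, dim):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, dim)     # question side (replicated)
        self.v_proj = nn.Conv2d(v_dim, dim, 1)  # 1x1 conv on the feature map
        self.att = nn.Conv2d(dim, 1, 1)         # attention logits per position

    def forward(self, q, v):
        # q: B x q_dim, v: B x v_dim x H x W
        qs = torch.tanh(self.q_proj(q))[:, :, None, None]  # replicate spatially
        vs = torch.tanh(self.v_proj(v))
        logits = self.att(qs * vs)                     # fuse by Hadamard product
        alpha = F.softmax(logits.flatten(2), dim=2)    # softmax over H*W positions
        attended = (v.flatten(2) * alpha).sum(dim=2)   # B x v_dim attended feature
        return attended, alpha
```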
42. MLB Attention Networks (MLB)
Kim et al., arXiv 2016
Table 2: The VQA test-standard results compared with the state of the art.
Notice that these results are trained on the provided VQA train and
validation splits, without any data augmentation.
                                              Open-Ended                  MC
MODEL                                         ALL    Y/N    NUM    ETC    ALL
iBOWIMG (Zhou et al., 2015)                   55.89  76.76  34.98  42.62  61.97
DPPnet (Noh et al., 2015)                     57.36  80.28  36.92  42.24  62.69
Deeper LSTM+Normalized CNN (Lu et al., 2015)  58.16  80.56  36.53  43.73  63.09
SMem (Xu & Saenko, 2016)                      58.24  80.80  37.53  43.48  -
Ask Your Neuron (Malinowski et al., 2016)     58.43  78.24  36.27  46.32  -
SAN (Yang et al., 2015)                       58.85  79.11  36.41  46.42  -
D-NMN (Andreas et al., 2016)                  59.44  80.98  37.48  45.81  -
ACK (Wu et al., 2016b)                        59.44  81.07  37.12  45.83  -
FDA (Ilievski et al., 2016)                   59.54  81.34  35.67  46.10  64.18
HYBRID (Kafle & Kanan, 2016b)                 60.06  80.34  37.82  47.56  -
DMN+ (Xiong et al., 2016)                     60.36  80.43  36.82  48.33  -
MRN (Kim et al., 2016b)                       61.84  82.39  38.23  49.41  66.33
HieCoAtt (Lu et al., 2016)                    62.06  79.95  38.22  51.95  66.07
RAU (Noh & Han, 2016)                         63.2   81.7   38.2   52.8   67.3
MLB (ours)                                    65.07  84.02  37.90  54.77  68.89
The rate of divided answers is approximately 16.40%, and only 0.23% of
questions in the VQA dataset have more than two divided answers. We assume
that this eases the difficulty of convergence without severe degradation of
performance.
43. MLB Attention Networks (MLB)
Kim et al., arXiv 2016
On the Open-Ended task of VQA, the major improvements come from yes-or-no
(Y/N) and other (ETC)-type answers. In Table 3, we also report the accuracy
of our ensemble model against the other ensemble models on VQA
test-standard, which won 1st to 5th places in VQA Challenge 2016. We beat
the previous state of the art by a margin of 0.42%.
Table 3: The VQA test-standard results for ensemble models, compared with
the state of the art. For unpublished entries, team names are used instead
of model names. Some of their figures were updated after the challenge.
                              Open-Ended                  MC
MODEL                         ALL    Y/N    NUM    ETC    ALL
RAU (Noh & Han, 2016)         64.12  83.33  38.02  53.37  67.34
MRN (Kim et al., 2016b)       63.18  83.16  39.14  51.33  67.54
DLAIT (not published)         64.83  83.23  40.80  54.32  68.30
Naver Labs (not published)    64.79  83.31  38.70  54.79  69.26
MCB (Fukui et al., 2016)      66.47  83.24  39.47  58.00  70.10
MLB (ours)                    66.89  84.61  39.07  57.79  70.29
Human (Antol et al., 2015)    83.30  95.77  83.39  72.67  91.54
7 RELATED WORKS
7.1 COMPACT BILINEAR POOLING