4. 1. VQA: Visual Question Answering
VQA is a new dataset containing open-ended questions about images.
Answering these questions requires an understanding of vision, language,
and commonsense knowledge.
VQA Challenge
Antol et al., ICCV 2015
5. 1. VQA: Visual Question Answering
Examples of human answers given the image and the question, or the question only
VQA Challenge
Antol et al., ICCV 2015
6. 1. VQA: Visual Question Answering
# images: 204,721 (MS COCO)
# questions: 760K (3 per image)
# answers: 10M (10 per question, plus additional answers)
Numbers
             Images   Questions   Answers
Training     80K      240K        2.4M
Validation   40K      120K        1.2M
Test         80K      240K        -
Antol et al., ICCV 2015
7. 1. VQA: Visual Question Answering
Test Dataset
80K test images, divided into four splits of 20K images each:
- Test-dev (development): debugging and validation; up to 10 submissions per day to the evaluation server.
- Test-standard (publications): used to score entries for the public leaderboard.
- Test-challenge (competitions): used to rank challenge participants.
- Test-reserve (check overfitting): used to estimate overfitting; scores on this set are never released.
Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015
9. 2. Vision Part
ResNet: The Dominant Convolutional Neural Network
He et al., CVPR 2016
1st place on the ImageNet 2015 classification task
1st place on the ImageNet 2015 detection task
1st place on the ImageNet 2015 localization task
1st place on the COCO object detection task
1st place on the COCO object segmentation task
10. 2. Vision Part
ResNet-152
He et al., CVPR 2016
[Figure: ResNet-152, a 152-layer convolutional neural network. Input size
3x224x224; the final stage stacks Conv 1x1 (512), Conv 3x3 (512), and
Conv 1x1 (2048), average-pools the 7x7x2048 activation down to 1x1x2048,
and ends with a Linear layer and a Softmax over the output classes.]
11. 2. Vision Part
ResNet-152
He et al., CVPR 2016
[Figure: the same ResNet-152 pipeline, used as a visual feature extractor.
The 1x1x2048 average-pooled activation, taken before the final Linear and
Softmax classifier, serves as the image feature.]
12. 2. Vision Part
ResNet-152
He et al., CVPR 2016
Pre-trained models are available!
For Torch, TensorFlow, Theano, Caffe, Lasagne, Neon, and MatConvNet
https://github.com/KaimingHe/deep-residual-networks
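The repositories above target 2016-era frameworks. As a rough modern equivalent, here is a minimal PyTorch/torchvision sketch (my own, not from the slides) of using a pre-trained ResNet-152 as a 2048-d feature extractor; the image filename is hypothetical:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained ResNet-152 and drop the final classifier, keeping
# everything up to and including the 7x7 average pooling.
resnet = models.resnet152(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# Standard ImageNet preprocessing for a 3x224x224 input.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")  # hypothetical image file
with torch.no_grad():
    x = preprocess(image).unsqueeze(0)   # 1 x 3 x 224 x 224
    v = extractor(x).flatten(1)          # 1 x 2048 visual feature
```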
14. 3. Question Part
Word-Embedding
What color are her eyes?
→ preprocessing → what color are her eyes ?
→ indexing → 53 7 44 127 2 6177
→ lookup → w53 | w7 | w44 | w127 | w2 | w6177
[Figure: lookup table holding one word vector wi per vocabulary entry.]
The word vectors {wi} are learnable parameters, updated by the
back-propagation algorithm.
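A minimal sketch of this preprocessing → indexing → lookup pipeline, with a toy vocabulary (the slide's indices 53, 7, 44, ... come from the real VQA vocabulary; the mapping here is illustrative):

```python
import torch
import torch.nn as nn

# Toy vocabulary; the real indices (53, 7, 44, ...) come from the full vocab.
vocab = {"what": 0, "color": 1, "are": 2, "her": 3, "eyes": 4, "?": 5}

def preprocess(question):
    # Lowercase and split the question mark off as its own token.
    return question.lower().replace("?", " ?").split()

tokens = preprocess("What color are her eyes?")     # preprocessing
indices = torch.tensor([vocab[t] for t in tokens])  # indexing

# Lookup table: one learnable row w_i per vocabulary word.
lookup = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)
word_vectors = lookup(indices)  # 6 x 300; rows are updated by back-propagation
```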
16. 3. Question Part
Choice of RNN: Gated Recurrent Units (GRU)
Cho et al., EMNLP 2014
Chung et al., arXiv 2014
[Figure: GRU cell, showing the previous state st-1, the candidate h, and the
new state st.]
z = σ(xt Uz + st-1 Wz)
r = σ(xt Ur + st-1 Wr)
h = tanh(xt Uh + (st-1 ∘ r) Wh)
st = (1 - z) ∘ h + z ∘ st-1
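These update equations transcribe directly into code. A NumPy sketch with illustrative sizes (300-d inputs, 512-d state, neither taken from the paper):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, s_prev, U, W):
    """One GRU step following the slide's equations.
    U and W are dicts of weight matrices for the z, r, h gates."""
    z = sigmoid(x_t @ U["z"] + s_prev @ W["z"])        # update gate
    r = sigmoid(x_t @ U["r"] + s_prev @ W["r"])        # reset gate
    h = np.tanh(x_t @ U["h"] + (s_prev * r) @ W["h"])  # candidate state
    return (1.0 - z) * h + z * s_prev                  # new state s_t

# Illustrative sizes: 300-d word vectors, 512-d hidden state.
rng = np.random.default_rng(0)
U = {k: rng.normal(scale=0.1, size=(300, 512)) for k in "zrh"}
W = {k: rng.normal(scale=0.1, size=(512, 512)) for k in "zrh"}
s = np.zeros(512)
for x in rng.normal(size=(6, 300)):  # e.g., the six question tokens
    s = gru_step(x, s, U, W)
```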
17. 3. Question Part
Skip-Thought Vectors
Pre-trained model for word-embedding and question-embedding
Kiros et al., NIPS 2015
Given the middle sentence, the model tries to reconstruct the previous
sentence and the next sentence:
"I got back home." ← "I could see the cat on the steps." → "This was strange."
Trained on the BookCorpus dataset (Zhu et al., arXiv 2015).
18. 3. Question Part
Skip-Thought Vectors
Pre-trained model for word-embedding and question-embedding
The trained encoder is used as a sentence-to-vector (Sent2Vec) model
Kiros et al., NIPS 2015
[Figure: question words pass through the lookup table of word vectors into
a pre-trained GRU (Gated Recurrent Units) encoder.]
19. 3. Question Part
Skip-Thought Vectors
Pre-trained model (Theano) and porting code (Torch) are available!
https://github.com/ryankiros/skip-thoughts
https://github.com/HyeonwooNoh/DPPnet/tree/master/003_skipthoughts_porting
Noh et al., CVPR 2016
Kiros et al., NIPS 2015
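Going from the ryankiros/skip-thoughts README, encoding questions with the pre-trained model looks roughly like this (Python 2 / Theano era; verify the exact API against the repository):

```python
# Requires the pre-trained model files downloaded per the repo's README.
import skipthoughts

model = skipthoughts.load_model()
questions = ["What color are her eyes ?",
             "What kind of animals are these ?"]
vectors = skipthoughts.encode(model, questions)  # one vector per question
```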
21. 4. Multimodal Residual Learning
Idea 1: Deep Residual Learning
Extend the idea of deep residual learning for multimodal learning
He et al., CVPR 2016
[Figure 2: residual learning, a building block. The input x passes through
two weight layers with ReLU activations to produce F(x); an identity
shortcut adds x, giving F(x) + x before the final ReLU.]
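A minimal sketch of this building block, using fully-connected layers for brevity (the original blocks are convolutional):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), where F = weight layer -> ReLU -> weight layer."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.fc2(self.relu(self.fc1(x)))  # F(x)
        return self.relu(f + x)               # identity shortcut: F(x) + x
```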
22. 4. Multimodal Residual Learning
Idea 2: Hadamard product for Joint Residual Mapping
One modality is directly involved in the gradient with respect to the other
modality
https://github.com/VT-vision-lab/VQA_LSTM_CNN
[Figure: the baseline joint embedding. vQ and vI each pass through tanh and
are combined by an element-wise (Hadamard) product ◉, followed by softmax.]
∂(σ(x) ∘ σ(y)) / ∂x = diag(σ′(x) ∘ σ(y))
Scaling problem?
Wu et al., NIPS 2016
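The gradient identity is easy to confirm numerically. In this PyTorch check (illustrative values), y's activation shows up as a factor in x's gradient, which is how one modality directly shapes the other's updates:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.randn(4)

out = torch.sigmoid(x) * torch.sigmoid(y)  # Hadamard joint embedding
out.sum().backward()

# d/dx [sigma(x) o sigma(y)] = sigma'(x) o sigma(y)
sx = torch.sigmoid(x)
expected = sx * (1 - sx) * torch.sigmoid(y)
assert torch.allclose(x.grad, expected)  # sigma(y) appears in x's gradient
```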
23. 4. Multimodal Residual Learning
Multimodal Residual Networks
Kim et al., NIPS 2016
[Figure: Multimodal Residual Networks overview. The question Q ("What kind
of animals are these ?") is processed by word embedding and an RNN; the
image V is processed by a CNN; the two streams are fused by Hadamard
products with question shortcuts, and a softmax predicts the answer A
("sheep").]
word2vec
(Mikolov et al., 2013)
skip-thought vectors
(Kiros et al., 2015)
ResNet
(He et al., 2015)
24. 4. Multimodal Residual Learning
Multimodal Residual Networks
Kim et al., NIPS 2016
[Figure: MRN architecture with three learning blocks. In each block l, the
current question state H(l-1) (Q for the first block) passes through
Linear+Tanh, and the visual feature V passes through
Linear+Tanh+Linear+Tanh; the two are fused by an element-wise product ⊙
and added (⊕) to a linear shortcut mapping of H(l-1), producing Hl. A
final Linear and Softmax over H3 predict the answer A.]
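A sketch of one learning block as I read the figure; the layer shapes and the linear question shortcut are assumptions taken from the diagram, so check them against the paper:

```python
import torch.nn as nn

class MRNBlock(nn.Module):
    """H_l = Linear(H_{l-1}) + Tanh(Linear(H_{l-1})) (*) V-embedding."""
    def __init__(self, q_dim, v_dim, dim):
        super().__init__()
        self.q_embed = nn.Sequential(nn.Linear(q_dim, dim), nn.Tanh())
        self.v_embed = nn.Sequential(nn.Linear(v_dim, dim), nn.Tanh(),
                                     nn.Linear(dim, dim), nn.Tanh())
        self.shortcut = nn.Linear(q_dim, dim)  # question shortcut mapping

    def forward(self, h, v):
        joint = self.q_embed(h) * self.v_embed(v)  # Hadamard product (⊙)
        return self.shortcut(h) + joint            # residual addition (⊕)
```

Stacking three such blocks, feeding V into each, and finishing with a Linear layer and Softmax over H3 reproduces the figure.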
26. 5. Results
[Figure 3: alternative models (a)-(e). The variants differ in how Hl-1 and
V are embedded before the element-wise product ⊙ (e.g., whether V gets one
or two Linear+Tanh stages, and whether the fused term passes through an
extra Linear+Tanh), and in the shortcut mapping ⊕ (identity if l = 1 and
Linear otherwise, or Linear if l = 1 and none otherwise).]
Table 1: The results of alternative models (a)-(e) on the test-dev.
        Open-Ended
     All    Y/N    Num.   Other
(a)  60.17  81.83  38.32  46.61
(b)  60.53  82.53  38.34  46.78
(c)  60.19  81.91  37.87  46.70
(d)  59.69  81.67  37.23  46.00
(e)  60.20  81.98  38.25  46.57
Table 2: The effect of the visual features and # of target answers on the
test-dev results. Vgg stands for VGG-19 and Res for ResNet-152 features,
described in Section 4.
          Open-Ended
         All    Y/N    Num.   Other
Vgg, 1k  60.53  82.53  38.34  46.78
Vgg, 2k  60.79  82.13  38.87  47.52
Vgg, 3k  60.68  82.40  38.69  47.10
Res, 1k  61.45  82.36  38.40  48.81
Res, 2k  61.68  82.28  38.82  49.25
Res, 3k  61.47  82.28  39.09  48.76
5 Results
The VQA Challenge, which released the VQA dataset, provides evaluation
servers for the test-dev and test-standard splits. For test-dev, the
evaluation server permits unlimited submissions for validation, while
test-standard permits limited submissions for the competition. We report
accuracies in percentage.
Alternative Models The test-dev results of the alternative models for the
Open-Ended task are shown in Table 1. (a) shows a significant improvement
over SAN; however, (b) is only marginally better than (a). Compared to (b),
(c) deteriorates the performance: an extra embedding for a question vector
may easily cause overfitting, leading to overall degradation. Identity
shortcuts, as in (d), cause the degradation problem, too; the extra
parameters of the linear mappings may effectively support the task.
(e) shows reasonable performance; however, the extra shortcut is not
essential.
A Appendix
A.1 VQA test-dev Results
Table 1: The effects of various options for VQA test-dev. Here, the model of
Figure 3a is used, since these experiments were conducted preliminarily.
VGG-19 features and 1k target answers are used. "s" stands for the usage of
Skip-Thought Vectors [6] to initialize the question-embedding GRU, "b" for
the usage of Bayesian Dropout [3], and "c" for postprocessing with an image
captioning model [5].
           Open-Ended                    Multiple-Choice
          All    Y/N    Num.   Other    All    Y/N    Num.   Other
baseline  58.97  81.11  37.63  44.90    63.53  81.13  38.91  54.06
s         59.38  80.65  38.30  45.98    63.71  80.68  39.73  54.65
s,b       59.74  81.75  38.13  45.84    64.15  81.77  39.54  54.67
s,b,c     59.91  81.75  38.13  46.19    64.18  81.77  39.51  54.72
Table 2: The results for VQA test-dev. The precision of some accuracies
[11, 2, 10] is one less than that of the others, so they are zero-filled to
match.
27. 5. Results
Table 3: The effects of shortcut connections of MRN for VQA test-dev.
ResNet-152 features and 2k target answers are used. MN stands for Multimodal
Networks without residual learning, which have no shortcut connections.
Dim. stands for the common embedding vector's dimension. The number of
parameters for word embedding (9.3M) and question embedding (21.8M) is
subtracted from the total number of parameters in this table.
                           Open-Ended
      L   Dim.   #params   All    Y/N    Num.   Other
MN    1   4604   33.9M     60.33  82.50  36.04  46.89
MN    2   2350   33.9M     60.90  81.96  37.16  48.28
MN    3   1559   33.9M     59.87  80.55  37.53  47.25
MRN   1   3355   33.9M     60.09  81.78  37.09  46.78
MRN   2   1766   33.9M     61.05  81.81  38.43  48.43
MRN   3   1200   33.9M     61.68  82.28  38.82  49.25
MRN   4   851    33.9M     61.02  82.06  39.02  48.04
A.2 More Examples
28. 5. Results
Table 3: The VQA test-standard results. The precision of some accuracies
[30, 1] is one less than that of the others, so they are zero-filled to
match.
                   Open-Ended                    Multiple-Choice
               All    Y/N    Num.   Other    All    Y/N    Num.   Other
DPPnet [21]    57.36  80.28  36.92  42.24    62.69  80.35  38.79  52.79
D-NMN [1]      58.00  -      -      -        -      -      -      -
Deep Q+I [11]  58.16  80.56  36.53  43.73    63.09  80.59  37.70  53.64
SAN [30]       58.90  -      -      -        -      -      -      -
ACK [27]       59.44  81.07  37.12  45.83    -      -      -      -
FDA [9]        59.54  81.34  35.67  46.10    64.18  81.25  38.30  55.20
DMN+ [28]      60.36  80.43  36.82  48.33    -      -      -      -
MRN            61.84  82.39  38.23  49.41    66.33  82.41  39.57  58.40
Human [2]      83.30  95.77  83.39  72.67    -      -      -      -
5.1 Visualization
31. 6. Discussions
Examples:
[Figure: example images with questions and MRN's answers, panels (a)-(h).]
What kind of animals are these ? sheep    |  What animal is the picture ? elephant
What is this animal ? zebra               |  What game is this person playing ? tennis
How many cats are here ? 2                |  What color is the bird ? yellow
What sport is this ? surfing              |  Is the horse jumping ? yes
33. Acknowledgments
This work was supported by Naver Corp.
and partly by the Korea government (IITP-R0126-16-1072-
SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-
RISF, ADD-UD130070ID-BMRR).
36. Low-rank Bilinear Pooling
Kim et al., arXiv 2016
Bilinear model:
fi = Σj=1..N Σk=1..M wijk xj yk + bi = xᵀ Wi y + bi
Low-rank restriction: rank(Wi) ≤ min(N, M). Factorizing Wi = Ui Viᵀ, with
Ui of size N×D and Viᵀ of size D×M:
fi = xᵀ Wi y + bi = xᵀ Ui Viᵀ y + bi = 1ᵀ(Uiᵀ x ∘ Viᵀ y) + bi
Stacking all outputs with a pooling matrix P:
f = Pᵀ(Uᵀ x ∘ Vᵀ y) + b
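The chain of identities can be verified numerically. In this PyTorch check the sizes are arbitrary, and U, V, P, b are random stand-ins for learned parameters:

```python
import torch

N, M, D, C = 8, 8, 4, 5  # input dims, rank D, output size (all arbitrary)
x, y = torch.randn(N), torch.randn(M)
U, V = torch.randn(N, D), torch.randn(M, D)

# Full bilinear form with the low-rank factorization W_i = U_i V_i^T ...
W = U @ V.T
full = x @ W @ y
# ... equals the Hadamard form 1^T (U^T x o V^T y).
hadamard = torch.sum((U.T @ x) * (V.T @ y))
assert torch.allclose(full, hadamard, atol=1e-4)

# Vectorized over C outputs with a pooling matrix P: f = P^T (U^T x o V^T y) + b
P, b = torch.randn(D, C), torch.randn(C)
f = P.T @ ((U.T @ x) * (V.T @ y)) + b  # C-dimensional output
```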
37. Low-rank Bilinear Pooling
Kim et al., arXiv 2016
The same low-rank bilinear model, f = Pᵀ(Uᵀ x ∘ Vᵀ y) + b.
[Figure: the joint embedding from the earlier slide (vQ ◉ vI). The Hadamard
product of the two projected modalities is exactly a low-rank bilinear
pooling with rank(Wi) ≤ min(N, M).]
41. MLB Attention Networks
Kim et al., arXiv 2016
[Figure: MLB attention network. The question Q passes through Linear+Tanh
and is replicated over spatial positions; the visual feature map V passes
through Conv+Tanh; the two are fused, and a Conv followed by Softmax yields
spatial attention weights. The attended visual feature and the question
embedding are fused once more through Linear+Tanh on each side, and a final
Linear with Softmax predicts the answer A.]
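A rough sketch of the attention branch as I read this figure; the single attention glimpse, the layer sizes, and the softmax over spatial positions are my assumptions from the diagram:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLBAttention(nn.Module):
    """Low-rank bilinear attention over a spatial feature map."""
    def __init__(self, q_dim, v_dim, dim):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, dim)     # question side (replicated)
        self.v_proj = nn.Conv2d(v_dim, dim, 1)  # 1x1 conv on the feature map
        self.att = nn.Conv2d(dim, 1, 1)         # attention logits per position

    def forward(self, q, v):
        # q: B x q_dim, v: B x v_dim x H x W
        qs = torch.tanh(self.q_proj(q))[:, :, None, None]  # replicate spatially
        vs = torch.tanh(self.v_proj(v))
        logits = self.att(qs * vs)                     # fuse by Hadamard product
        alpha = F.softmax(logits.flatten(2), dim=2)    # softmax over H*W positions
        attended = (v.flatten(2) * alpha).sum(dim=2)   # B x v_dim attended feature
        return attended, alpha
```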
42. MLB Attention Networks (MLB)
Kim et al., arXiv 2016
Table 2: The VQA test-standard results compared with the state of the art.
Notice that these results are trained on the provided VQA train and
validation splits, without any data augmentation.
                                              Open-Ended                  MC
MODEL                                         ALL    Y/N    NUM    ETC    ALL
iBOWIMG (Zhou et al., 2015)                   55.89  76.76  34.98  42.62  61.97
DPPnet (Noh et al., 2015)                     57.36  80.28  36.92  42.24  62.69
Deeper LSTM+Normalized CNN (Lu et al., 2015)  58.16  80.56  36.53  43.73  63.09
SMem (Xu & Saenko, 2016)                      58.24  80.80  37.53  43.48  -
Ask Your Neuron (Malinowski et al., 2016)     58.43  78.24  36.27  46.32  -
SAN (Yang et al., 2015)                       58.85  79.11  36.41  46.42  -
D-NMN (Andreas et al., 2016)                  59.44  80.98  37.48  45.81  -
ACK (Wu et al., 2016b)                        59.44  81.07  37.12  45.83  -
FDA (Ilievski et al., 2016)                   59.54  81.34  35.67  46.10  64.18
HYBRID (Kafle & Kanan, 2016b)                 60.06  80.34  37.82  47.56  -
DMN+ (Xiong et al., 2016)                     60.36  80.43  36.82  48.33  -
MRN (Kim et al., 2016b)                       61.84  82.39  38.23  49.41  66.33
HieCoAtt (Lu et al., 2016)                    62.06  79.95  38.22  51.95  66.07
RAU (Noh & Han, 2016)                         63.2   81.7   38.2   52.8   67.3
MLB (ours)                                    65.07  84.02  37.90  54.77  68.89
The rate of divided answers is approximately 16.40%, and only 0.23% of
questions in the VQA dataset have more than two divided answers. We assume
that this eases the difficulty of convergence without severe degradation of
performance.
43. MLB Attention Networks (MLB)
Kim et al., arXiv 2016
On the Open-Ended task of VQA, the major improvements come from yes-or-no
(Y/N) and other (ETC)-type answers. In Table 3, we also report the accuracy
of our ensemble model against the other ensemble models on VQA
test-standard, which won 1st to 5th places in VQA Challenge 2016. We beat
the previous state of the art by a margin of 0.42%.
Table 3: The VQA test-standard results for ensemble models, compared with
the state of the art. For unpublished entries, team names are used instead
of model names. Some of their figures were updated after the challenge.
                              Open-Ended                  MC
MODEL                         ALL    Y/N    NUM    ETC    ALL
RAU (Noh & Han, 2016)         64.12  83.33  38.02  53.37  67.34
MRN (Kim et al., 2016b)       63.18  83.16  39.14  51.33  67.54
DLAIT (not published)         64.83  83.23  40.80  54.32  68.30
Naver Labs (not published)    64.79  83.31  38.70  54.79  69.26
MCB (Fukui et al., 2016)      66.47  83.24  39.47  58.00  70.10
MLB (ours)                    66.89  84.61  39.07  57.79  70.29
Human (Antol et al., 2015)    83.30  95.77  83.39  72.67  91.54
7 RELATED WORKS
7.1 COMPACT BILINEAR POOLING