Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures to encode and decode vision, text and audio, and later review those models that have successfully translated information across modalities. The contents of this tutorial are available at: https://telecombcn-dl.github.io/2019-mmm-tutorial/.
1. Multimodal Deep Learning
#MMM2019
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Intelligent Data Science and Artificial
Intelligence Center (IDEAI)
Universitat Politecnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
TUTORIAL
Thessaloniki, Greece
8 January 2019
22. bit.ly/MMM2019
@DocXavi
22
MLPs for Multimedia Data
Limitation #1:
Very large number of input values (x_i),
which requires a gigantic number of model
parameters.
Figure: Ranzato
Limitation #2:
Does not naturally handle input data of
variable dimension
(e.g. audio/video/word sequences).
#1
#2
23. bit.ly/MMM2019
@DocXavi
23
Limitation #1:
Very large number of input values (x_i), which requires a gigantic number of
model parameters.
For a 200x200 image, we have 4x10^4
neurons, each one with 4x10^4 inputs,
that is 16x10^8 parameters (*), for just
one layer!
Figure Credit: Ranzato
16x10^8
(*) biases not counted
MLPs for Multimedia Data #1
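As a sanity check, this parameter count can be reproduced with a few lines of Python (a minimal sketch; sizes follow the slide, biases not counted):

```python
# Fully-connected layer over a 200x200 image:
# every output neuron is connected to every input pixel.
height, width = 200, 200
n_inputs = height * width            # 4x10^4 input values
n_neurons = height * width           # 4x10^4 output neurons, as in the slide
n_weights = n_inputs * n_neurons     # weights only, biases not counted
print(f"{n_weights:.1e}")            # 1.6e+09, i.e. 16x10^8 parameters
```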
24. bit.ly/MMM2019
@DocXavi
24
MLPs for Multimedia Data
Locally connected network:
For a 200x200 image, we have 4x10^4
neurons, each one with 10x10 “local
connections” (also called receptive
field), that is 4x10^6 parameters.
What else can we do to reduce the
number of parameters?
Figure Credit: Ranzato
4x10^6
#1
25. bit.ly/MMM2019
@DocXavi
25
CNNs for Multimedia Data (2D)
#1
10^4
Ex: With 100 different filters (or
feature extractors) of size 10x10,
the number of parameters is 10^4.
Convolutional Neural Networks
(ConvNets, CNNs):
Translation invariance: we can use the same
parameters to capture a specific “feature” in
any area of the image. We can use different
sets of parameters to capture different
features.
These operations are equivalent to performing
convolutions with different filters.
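This count can be checked by instantiating the layer in PyTorch (a minimal sketch; the single-channel input is an assumption, biases excluded):

```python
import torch.nn as nn

# 100 convolutional filters of size 10x10, shared across all image locations.
# A single-channel (grayscale) input is assumed here.
conv = nn.Conv2d(in_channels=1, out_channels=100, kernel_size=10, bias=False)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)   # 100 filters x (10*10) weights = 10_000 = 10^4 parameters
```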
26. bit.ly/MMM2019
@DocXavi
26
CNNs for Multimedia Data (2D)
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE
86, no. 11 (1998): 2278-2324.
LeNet-5
27. bit.ly/MMM2019
@DocXavi
27
CNNs for Multimedia Data (1D)
hi there how are you doing ?
Let’s say we have a sequence of 100-dimensional vectors describing words (text),
with sequence length = 8.
We arrange them in a 2D matrix of shape (sequence length = 8, 100 dim).
Slide: Santiago Pascual (UPC DLAI 2018)
28. bit.ly/MMM2019
@DocXavi
28
CNNs for Multimedia Data (1D)
Slide: Santiago Pascual (UPC DLAI 2018)
We can apply a 1D convolution over the 2D matrix (sequence length = 8, 100 dim),
for an arbitrary kernel of width = 3 (K1).
Each 1D convolutional kernel is a
2D matrix of size (3, 100).
29. bit.ly/MMM2019
@DocXavi
29
CNNs for Multimedia Data (1D)
Slide: Santiago Pascual (UPC DLAI 2018)
(Keep in mind we are working with depth=100, although here we depict just depth=1 for simplicity.)
[Figure: a width-3 kernel (w1 w2 w3) slides over input positions 1–8, producing output positions 1–6]
The output length of the convolution is
known to be:
seq_length - filter_width + 1 = 8 - 3 + 1 = 6
So the output matrix will be (6, 100).
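A minimal PyTorch sketch of this 1D convolution over word vectors (the shapes follow the slide; the random tensor stands in for the 8 word embeddings):

```python
import torch
import torch.nn as nn

seq_len, dim = 8, 100
x = torch.randn(1, dim, seq_len)      # (batch, channels = 100 dim, length = 8)

# Each of the 100 kernels spans all 100 input channels over 3 time-steps,
# i.e. a (3, 100) matrix as in the slide.
conv = nn.Conv1d(in_channels=dim, out_channels=dim, kernel_size=3)

y = conv(x)
print(y.shape)                        # torch.Size([1, 100, 6]): 8 - 3 + 1 = 6
```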
30. bit.ly/MMM2019
@DocXavi
30
CNNs for Multimedia Data (1D)
Slide: Santiago Pascual (UPC DLAI 2018)
We can add zero padding on both sides of the sequence:
[Figure: the width-3 kernel (w1 w2 w3) slides over the zero-padded positions 0, 1–8, 0, producing output positions 1–8]
The output length of the convolution is
known to be:
padded_length - filter_width + 1 = 10 - 3 + 1 = 8
So the output matrix will be (8, 100).
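The same layer with symmetric padding preserves the sequence length (a minimal sketch under the same assumptions as above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 8)            # (batch, 100 dim, sequence length = 8)

# padding=1 adds one zero on each side (padded length 10),
# so the output length is 10 - 3 + 1 = 8, the same as the input.
conv_same = nn.Conv1d(in_channels=100, out_channels=100, kernel_size=3, padding=1)
print(conv_same(x).shape)             # torch.Size([1, 100, 8])
```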
31. bit.ly/MMM2019
@DocXavi
31
CNNs for Multimedia Data (1D)
Slide: Santiago Pascual (UPC DLAI 2018)
We can add zero padding on just one side of the sequence:
[Figure: the width-3 kernel (w1 w2 w3) slides over the sequence left-padded with two zeros (0, 0, 1–8), producing output positions 1–8]
The output length of the convolution is
known to be:
padded_length - filter_width + 1 = 10 - 3 + 1 = 8
So the output matrix will be (8, 100) because
we had padding.
HOWEVER: now every output at time-step t depends
only on the two previous inputs and the
current time-step → every output is causal.
Roughly: we make a causal convolution by
left-padding the sequence with
(filter_width - 1) zeros.
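A minimal sketch of such a causal 1D convolution in PyTorch (shapes as before; the explicit left-padding restates the slide’s recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 100, 8)                   # (batch, 100 dim, sequence length = 8)
filter_width = 3

conv = nn.Conv1d(in_channels=100, out_channels=100, kernel_size=filter_width)

# Left-pad with (filter_width - 1) zeros so output[t] only depends on inputs <= t.
x_padded = F.pad(x, (filter_width - 1, 0))   # (pad_left, pad_right) on the time axis
y = conv(x_padded)
print(y.shape)                               # torch.Size([1, 100, 8]), causal outputs
```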
32. bit.ly/MMM2019
@DocXavi
32
MLPs for Sequences
Limitation #2:
Dimensionality of input data is variable (e.g. audio/video/word sequences).
If we have a sequence of samples...
predict sample x[t+1] knowing previous values {x[t], x[t-1], x[t-2], …, x[t-τ]}
Slide: Santiago Pascual (UPC 2017)
#2
33. bit.ly/MMM2019
@DocXavi
33
MLPs for Sequences
Slide: Santiago Pascual (UPC 2017)
Feed Forward approach:
● static window of size L
● slide the window time-step wise
[Figure: an MLP takes the window x[t-L], …, x[t-1], x[t] as input and predicts x[t+1]]
#2
34. bit.ly/MMM2019
@DocXavi
34
MLPs for Sequences
Slide: Santiago Pascual (UPC 2017)
Feed Forward approach:
● static window of size L
● slide the window time-step wise
[Figure: the window slides one step, so x[t-L+1], …, x[t], x[t+1] now predicts x[t+2]]
#2
35. bit.ly/MMM2019
@DocXavi
35
MLPs for Sequences
Slide: Santiago Pascual (UPC 2017)
35
Feed Forward approach:
● static window of size L
● slide the window time-step wise
[Figure: sliding the window once more, x[t-L+2], …, x[t+1], x[t+2] now predicts x[t+3]]
#2
36. bit.ly/MMM2019
@DocXavi
36
Slide: Santiago Pascual (UPC 2017)
36
[Figure: MLPs applied window by window over the growing sequence x1, x2, …, xL, …, x2L, …, x3L]
Problems with the feed forward + static window approach:
● What is the matter with increasing L? → Fast growth of the number of parameters!
● Decisions are independent between time-steps!
○ The network doesn’t care about what happened at previous time-steps; only the present window
matters → doesn’t look good
MLPs for Sequences #2
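A minimal sketch of this feed-forward sliding-window predictor (the window size L, hidden size and signal are illustrative, not taken from the slides):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L = 16                                   # static window size (illustrative)
mlp = nn.Sequential(
    nn.Linear(L, 64),                    # maps the window x[t-L+1..t] ...
    nn.ReLU(),
    nn.Linear(64, 1),                    # ... to a prediction of x[t+1]
)

x = torch.randn(1000)                    # a 1D signal
windows = x.unfold(0, L, 1)[:-1]         # all length-L windows, shape (N, L)
targets = x[L:]                          # the sample right after each window
preds = mlp(windows).squeeze(-1)
loss = F.mse_loss(preds, targets)        # each window is treated independently
print(loss.item())
```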
44. bit.ly/MMM2019
@DocXavi
44
GRUs for Sequences
#2
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua
Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
GRUs obtain performance similar to LSTMs with one gate fewer.
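Both cells are available off the shelf in PyTorch; a minimal sketch comparing their parameter counts (layer sizes are illustrative):

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=128)   # 4 gated blocks
gru = nn.GRU(input_size=100, hidden_size=128)     # 3 gated blocks

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))   # the GRU has roughly 3/4 of the LSTM's parameters
```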
46. bit.ly/MMM2019
@DocXavi
46
Efficient RNN for Sequences
Used
Unused
Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State
Updates in Recurrent Neural Networks”, ICLR 2018. #SkipRNN
47. bit.ly/MMM2019
@DocXavi
47
Efficient RNN for Sequences
Victor Campos, Brendan Jou, Xavier Giro-i-Nieto, Jordi Torres, and Shih-Fu Chang. “Skip RNN: Learning to Skip State
Updates in Recurrent Neural Networks”, ICLR 2018. #SkipRNN
Used Unused
CNN CNN CNN...
RNN RNN RNN...
51. bit.ly/MMM2019
@DocXavi
51
Auto-regressive FF for Sequences
#2
van den Oord, Aaron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "WaveNet: A Generative Model for Raw Audio." arXiv preprint arXiv:1609.03499 (2016).
53. bit.ly/MMM2019
@DocXavi
53
Outline
1. Motivation
2. Deep Neural Topologies
3. Multimedia Encoding and Decoding
4. Multimodal Architectures
a. Cross-modal
b. Self-supervised Learning
c. Multimodal (input)
d. Multi-task (output)
57. bit.ly/MMM2019
@DocXavi
57
Video Encoding
Slide: Víctor Campos (UPC 2018)
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Drawback: pooling is not
aware of the temporal order!
Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
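A minimal sketch of this frame-level encoding with order-agnostic pooling (the ResNet-18 backbone, feature size and number of classes are illustrative choices; torchvision is assumed to be available):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

cnn = resnet18()                        # randomly initialized backbone, for illustration
cnn.fc = nn.Identity()                  # keep the 512-d frame features

frames = torch.randn(16, 3, 224, 224)   # 16 video frames
features = cnn(frames)                  # (16, 512): one vector per frame

pooled = features.mean(dim=0)           # average pooling ignores temporal order
classifier = nn.Linear(512, 10)         # small NN on top (10 classes, illustrative)
print(classifier(pooled).shape)         # torch.Size([10])
```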
58. bit.ly/MMM2019
@DocXavi
58
Video Encoding
Slide: Víctor Campos (UPC 2018)
Recurrent Neural Networks are
well suited for processing
sequences.
Drawback: RNNs are sequential
and cannot be parallelized.
Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
CNN CNN CNN...
RNN RNN RNN...
69. 69
One hot encoding
Number of words, |V| ?
B2: 5K
C2: 18K
LVSR: 50-100K
Wikipedia (1.6B): 400K
Crawl data (42B): 2M
cat: x^T = [1, 0, 0, ..., 0]
dog: x^T = [0, 1, 0, ..., 0]
...
house: x^T = [0, 0, 0, ..., 0, 1, 0, ..., 0]
...
70. cat: x^T = [1, 0, 0, ..., 0]
dog: x^T = [0, 1, 0, ..., 0]
...
house: x^T = [0, 0, 0, ..., 0, 1, 0, ..., 0]
...
70
One hot encoding
● Large dimensionality
● Sparse representation (mostly zeros)
● Blind representation
○ Only operators: ‘!=’ and ‘==’
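A minimal sketch of one-hot encoding a tiny vocabulary (the three words come from the slide; everything else is illustrative):

```python
import torch
import torch.nn.functional as F

vocab = ["cat", "dog", "house"]                  # |V| = 3 for illustration
word_to_idx = {w: i for i, w in enumerate(vocab)}

idx = torch.tensor([word_to_idx["cat"], word_to_idx["house"]])
one_hot = F.one_hot(idx, num_classes=len(vocab)).float()
print(one_hot)
# tensor([[1., 0., 0.],
#         [0., 0., 1.]])  -> large, sparse, and only supports == / != comparisons
```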
71. 71
Text projection to word embeddings
The one-hot vector is linearly projected to
an embedding space of lower
dimension with an MLP.
FC
Representation
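A minimal sketch of this projection (sizes are illustrative; an nn.Embedding lookup performs the same selection as multiplying the one-hot vector by the FC weight matrix):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 300     # illustrative sizes

# Option 1: explicit linear projection of the one-hot vector.
fc = nn.Linear(vocab_size, embed_dim, bias=False)
one_hot = torch.zeros(vocab_size)
one_hot[42] = 1.0                       # the word with index 42
emb_a = fc(one_hot)                     # equals column 42 of the FC weight matrix

# Option 2: an embedding table lookup (its own weights), far cheaper in practice.
table = nn.Embedding(vocab_size, embed_dim)
emb_b = table(torch.tensor(42))
print(emb_a.shape, emb_b.shape)         # torch.Size([300]) torch.Size([300])
```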
72. 72
Embed high dimensional data points
(i.e. feature codes) so that pairwise
distances are preserved in local
neighborhoods.
Maaten & Hinton. Visualizing High-Dimensional Data using t-SNE. Journal of Machine Learning Research (2008) #tsne.
t-SNE
Figure:
Christopher Olah, Visualizing Representations
Text projection to word embeddings
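A minimal sketch of projecting embeddings to 2D with t-SNE for this kind of visualization (scikit-learn and matplotlib are assumed; the random matrix stands in for real word embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.randn(500, 300)          # 500 word vectors of dimension 300
xy = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

plt.scatter(xy[:, 0], xy[:, 1], s=5)
plt.title("t-SNE projection of word embeddings")
plt.show()
```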
73. 73
Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global Vectors for Word Representation." EMNLP 2014.
Woman-Man
Text projection to word embeddings
74. 74
● Represent words using vectors of reduced dimension d (~100 - 500)
● Meaningful (semantic, syntactic) distances
● Good embeddings are useful for many other tasks
Text projection to word embeddings
75. 75
Training Word Embeddings
Figure:
TensorFlow tutorial
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language
model." Journal of machine learning research 3, no. Feb (2003): 1137-1155.
Self-supervised
learning
76. 76
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of
words and phrases and their compositionality." NIPS 2013 #word2vec #continuousbow
the cat climbed a tree
Given context:
a, cat, the, tree
Estimate prob. of
climbed
Self-supervised
learning
Training Word Embeddings
77. 77
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of
words and phrases and their compositionality." NIPS 2013 #word2vec #skipgram
Self-supervised
learning
the cat climbed a tree
Given word:
climbed
Estimate prob. of context words:
a, cat, the, tree
Training Word Embeddings
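A minimal sketch of how the two objectives see the same sentence (pure Python; here the context window simply covers the whole sentence, as in the slide):

```python
sentence = "the cat climbed a tree".split()

for i, center in enumerate(sentence):
    context = [w for j, w in enumerate(sentence) if j != i]
    # CBOW:      given `context`, estimate the probability of `center`
    # Skip-gram: given `center`, estimate the probability of each word in `context`
    print(f"center={center!r:>10}  context={context}")
```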
78. bit.ly/MMM2019
@DocXavi
78
Fig: Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua
Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Text Encoding
RNN
FC
Representation
85. 85
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Representation or
Embedding
Neural Machine Translation (NMT)
86. 86
Neural Machine Translation (NMT)
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
The Seq2Seq variation:
● trigger the output generation with an input <go> symbol.
● the predicted word at time-step t becomes the input at t+1.
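A minimal sketch of this decoding loop (the GRU cell, embedding table and special token ids are illustrative placeholders, not the paper’s actual model):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden = 10_000, 256, 512   # illustrative sizes
GO, EOS = 1, 2                                     # hypothetical special token ids

embed = nn.Embedding(vocab_size, embed_dim)
cell = nn.GRUCell(embed_dim, hidden)
out_proj = nn.Linear(hidden, vocab_size)

h = torch.zeros(1, hidden)           # would be the encoder's final state in Seq2Seq
token = torch.tensor([GO])           # generation is triggered by the <go> symbol
for _ in range(20):
    h = cell(embed(token), h)
    token = out_proj(h).argmax(dim=-1)   # predicted word at t becomes input at t+1
    if token.item() == EOS:
        break
```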
94. bit.ly/MMM2019
@DocXavi
94
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.
Speech Encoding
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
96. bit.ly/MMM2019
@DocXavi
96
Audio Decoding
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua
Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
101. bit.ly/MMM2019
@DocXavi
101
Outline
1. Motivation
2. Deep Neural Topologies
3. Multimedia Encoding and Decoding
4. Multimodal Architectures
a. Cross-modal
b. Self-supervised Learning
c. Multimodal (input)
d. Multi-task (output)
105. bit.ly/MMM2019
@DocXavi
105
Automatic Speech Recognition (ASR)
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent
Neural Networks. ICML 2006
The Connectionist Temporal Classification (CTC) loss allows training RNNs without
the need for an exact alignment.
Figure: Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8.
● Avoids the need for
alignment between the input and
output sequences by
predicting an additional “_”
blank token
● Before computing the loss,
repeated tokens and blank
tokens are removed
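A minimal sketch of the collapsing rule described above (pure Python; the “_” blank symbol follows the slide):

```python
from itertools import groupby

def ctc_collapse(path, blank="_"):
    """Merge repeated symbols, then drop blanks, as in CTC decoding."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]

# Example: the frame-level prediction "hh_e_ll_lo" collapses to "hello".
print("".join(ctc_collapse(list("hh_e_ll_lo"))))   # -> hello
```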
111. bit.ly/MMM2019
@DocXavi
111
Speech Synthesis
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
116. bit.ly/MMM2019
@DocXavi
116
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
117. bit.ly/MMM2019
@DocXavi
117
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
119. bit.ly/MMM2019
@DocXavi
119
Captioning (+ Detection): DenseCap
XAVI: “man has
short hair”, “man
with short hair”
AMAIA: “a woman
wearing a black
shirt”
BOTH: “two men
wearing black
glasses”
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
121. bit.ly/MMM2019
@DocXavi
121
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Captioning: Video
122. bit.ly/MMM2019
@DocXavi
122
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang,Fei Wu,Yueting Zhuang Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning, CVPR 2016.
[Figure: a 2nd-layer LSTM unit takes, over time (t = 1 … t = T), the hidden state at t = T of each first-layer chunk of image data]
Captioning: Video
127. bit.ly/MMM2019
@DocXavi
127
Lipreading: Watch, Listen, Attend & Spell
Audio
features
Image
features
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
128. bit.ly/MMM2019
@DocXavi
128
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output
states from audio and
video is computed at
each timestep
134. bit.ly/MMM2019
@DocXavi
134
Self-supervised Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
135. bit.ly/MMM2019
@DocXavi
135
Self-supervised Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
136. bit.ly/MMM2019
@DocXavi
136
Self-supervised Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Task: use the predicted audio stats to cluster images. Audio clusters are built with
k-means over the training set.
Cluster assignments at test time (one row = one cluster)
137. bit.ly/MMM2019
@DocXavi
137
Self-supervised Feature Learning
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
138. bit.ly/MMM2019
@DocXavi
138
Video Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Retrieve matching sounds for videos of people hitting objects with a drumstick.
147. bit.ly/MMM2019
@DocXavi
147
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” (work in progress)
148. bit.ly/MMM2019
@DocXavi
148
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” (work in progress)
Generated faces from known identities.
149. bit.ly/MMM2019
@DocXavi
149
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” (work in progress)
Faces generated from averaged speech.
150. bit.ly/MMM2019
@DocXavi
150
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” (work in progress)
Interpolated faces from interpolated speech.
151. bit.ly/MMM2019
@DocXavi
151
Outline
1. Motivation
2. Deep Neural Topologies
3. Multimedia Encoding and Decoding
4. Multimodal Architectures
a. Cross-modal
b. Joint Representations (embeddings)
c. Multimodal (input)
d. Multi-task (output)
155. bit.ly/MMM2019
@DocXavi
155
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images from “cat” in
the training set...
...but they can still be
recognised as “cats”
thanks to the
representations learned
from text.
160. bit.ly/MMM2019
@DocXavi
160
Multimodal Retrieval
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Best
match
Audio feature
164. bit.ly/MMM2019
@DocXavi
164
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Feature Learning by Label Transfer
165. bit.ly/MMM2019
@DocXavi
165
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Feature Learning by Label Transfer
166. bit.ly/MMM2019
@DocXavi
166
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Feature Learning by Label Transfer
167. bit.ly/MMM2019
@DocXavi
167
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Feature Learning by Label Transfer
168. bit.ly/MMM2019
@DocXavi
168
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualize the video frames that most activate a neuron in a late layer (conv7)
Feature Learning by Label Transfer
169. bit.ly/MMM2019
@DocXavi
169
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualize the video frames that most activate a neuron in a late layer (conv7)
Feature Learning by Label Transfer
170. bit.ly/MMM2019
@DocXavi
170
Cross-modal Label Transfer
S Albanie, A Nagrani, A Vedaldi, A Zisserman, “Emotion Recognition in Speech using Cross-Modal Transfer in the Wild”
ACM Multimedia 2018.
Teacher network: Facial Emotion Recognition (visual)
184. bit.ly/MMM2019
@DocXavi
184
Speech Grounding (temporal)
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS
2016. [talk]
Train visual & speech networks with pairs of (non-)corresponding images & speech.
185. bit.ly/MMM2019
@DocXavi
185
Harwath, David, Antonio Torralba, and James Glass. "Unsupervised learning of spoken language with visual context." NIPS
2016. [talk]
The similarity curve shows which regions of the spectrogram are relevant for the image.
Important: no text transcriptions are used during training!
Speech Grounding (temporal)
187. bit.ly/MMM2019
@DocXavi
187
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly Discovering Visual Objects
and Spoken Words from Raw Sensory Input." ECCV 2018.
Regions matching the spoken word “WOMAN”:
Speech Grounding (spatiotemporal)
188. bit.ly/MMM2019
@DocXavi
188
Harwath, David, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, and James Glass. "Jointly Discovering Visual Objects
and Spoken Words from Raw Sensory Input." ECCV 2018
197. bit.ly/MMM2019
@DocXavi
197
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with
dynamic parameter prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering (VQA)
198. bit.ly/MMM2019
@DocXavi
198
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic
Memory Networks for Visual and Textual Question Answering." ICML 2016
199. bit.ly/MMM2019
@DocXavi
199
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded
Question Answering in Images." CVPR 2016.
200. bit.ly/MMM2019
@DocXavi
200
Visual Reasoning
Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A
Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
201. bit.ly/MMM2019
@DocXavi
201
Visual Reasoning
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick , “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
Program Generator Execution Engine
202. bit.ly/MMM2019
@DocXavi
202
Visual Reasoning
Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate all possible pairs of objects with an encoding of the question, and then find the
answer with an MLP.
205. bit.ly/MMM2019
@DocXavi
205
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
214. bit.ly/MMM2019
@DocXavi
214
Visual Re-dubbing (lip keypoints)
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
215. bit.ly/DLCV2018
#DLUPC
215
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
216. bit.ly/MMM2019
@DocXavi
216
Visual Re-dubbing (3D meshes)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
217. bit.ly/DLCV2018
#DLUPC
217
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
224. bit.ly/MMM2019
@DocXavi
224
Our team at UPC-BSC Barcelona
Victor
Campos
Amaia
Salvador
Amanda
Duarte
Dèlia
Fernández
Eduard
Ramon
Andreu
Girbau
Dani
Fojo
Oscar
Mañas
Santi
Pascual
Xavi
Giró
Miriam
Bellver
Janna
Escur
Carles
Ventura
Miquel
Tubau
Paula
Gómez
Benet
Oriol
Mariona
Carós
Jordi
Torres