2. MULTIMODAL DEEP LEARNING
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng
Computer Science Department, Stanford University
Department of Music, Stanford University
Computer Science & Engineering Division, University of Michigan, Ann Arbor
3. MCGURK EFFECT
In speech recognition, people are known to integrate audio-visual information in order to understand speech.
This was first exemplified in the McGurk effect, where a visual /ga/ combined with an audio /ba/ is perceived as /da/ by most subjects.
6. REPRESENTING LIPS
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
13. FEATURE LEARNING WITH AUTOENCODERS
[Diagram: two separate autoencoders, one reconstructing the audio input and one reconstructing the video input.]
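The per-modality feature learning above can be sketched as a small tied-weight autoencoder. This is a minimal NumPy illustration, not the paper's implementation; the layer sizes, learning rate, and synthetic data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=32, lr=0.1, epochs=200):
    """Tied-weight autoencoder: H = sigmoid(X W + b), recon = H W^T + c."""
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, n_hidden))
    b = np.zeros(n_hidden)
    c = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                 # encode
        R = H @ W.T + c                        # decode (linear output)
        err = R - X                            # reconstruction error
        dH = (err @ W) * H * (1 - H)           # backprop through the sigmoid
        W -= lr * (X.T @ dH + err.T @ H) / n   # tied weights: both paths contribute
        b -= lr * dH.sum(axis=0) / n
        c -= lr * err.sum(axis=0) / n
    return W, b, c

X = rng.normal(size=(100, 20))                 # stand-in for one modality's features
mse_before = np.mean(X ** 2)                   # error of an all-zero reconstruction
W, b, c = train_autoencoder(X)
recon = sigmoid(X @ W + b) @ W.T + c
mse_after = np.mean((recon - X) ** 2)
print(mse_before, mse_after)
```

The hidden activations `sigmoid(X @ W + b)` would then serve as the learned features for that modality.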
14. BIMODAL AUTOENCODER
[Diagram: a single hidden representation over the concatenated audio and video inputs, reconstructing both modalities.]
15. SHALLOW LEARNING
[Diagram: one layer of hidden units over the concatenated video and audio inputs.]
• Mostly unimodal features learned
16. BIMODAL AUTOENCODER
[Diagram: the bimodal autoencoder with only the video input presented; the hidden representation still reconstructs both audio and video.]
Cross-modality learning: learn better video features by using audio as a cue.
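A rough sketch of this cross-modality setup: the network sees only video at its input, but its loss asks it to reconstruct both audio and video, so the audio acts as a training cue for the video features. The single hidden layer, the sizes, and the synthetic correlated "audio" are illustrative simplifications of the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n, d_vid, d_aud, h = 200, 30, 10, 16
V = rng.normal(size=(n, d_vid))                       # video inputs
A = V[:, :d_aud] @ rng.normal(size=(d_aud, d_aud))    # synthetic audio correlated with video

W_in = rng.normal(scale=0.1, size=(d_vid, h))         # video encoder
W_v = rng.normal(scale=0.1, size=(h, d_vid))          # video reconstruction head
W_a = rng.normal(scale=0.1, size=(h, d_aud))          # audio reconstruction head

lr = 0.05
for _ in range(300):
    H = sigmoid(V @ W_in)
    err_v = H @ W_v - V                               # video reconstruction error
    err_a = H @ W_a - A                               # audio reconstruction error
    dH = (err_v @ W_v.T + err_a @ W_a.T) * H * (1 - H)
    W_v -= lr * H.T @ err_v / n
    W_a -= lr * H.T @ err_a / n
    W_in -= lr * V.T @ dH / n                         # video features shaped by BOTH losses

mse_a = np.mean((sigmoid(V @ W_in) @ W_a - A) ** 2)   # audio reconstructed from video alone
print(mse_a)
```

The key point is the shared encoder `W_in`: because its gradient includes the audio error term, the learned video features are pushed to carry audio-relevant information.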
17. CROSS-MODALITY DEEP AUTOENCODER
[Diagram: deep autoencoder over the video input only; a learned representation several layers up reconstructs both audio and video.]
18. CROSS-MODALITY DEEP AUTOENCODER
[Diagram: the same deep autoencoder with the audio input only, again reconstructing both audio and video.]
19. BIMODAL DEEP AUTOENCODERS
[Diagram: deep autoencoder with separate first layers for audio ("phonemes") and video ("visemes", mouth shapes), joined in a shared representation that reconstructs both modalities.]
20. BIMODAL DEEP AUTOENCODERS
[Diagram: the same model with only the video pathway ("visemes", mouth shapes) active, still reconstructing both audio and video.]
21. BIMODAL DEEP AUTOENCODERS
[Diagram: the same model with only the audio pathway ("phonemes") active, still reconstructing both audio and video.]
22. TRAINING BIMODAL DEEP AUTOENCODER
[Diagram: three copies of the bimodal deep autoencoder, one per training task: (1) both audio and video inputs, (2) audio input only, (3) video input only; each copy reconstructs both modalities from the shared representation.]
• Train a single model to perform all 3 tasks
• Similar in spirit to denoising autoencoders
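The three-task scheme above can be sketched as a data-augmentation step: each clip appears three times, with both modalities, with video zeroed, and with audio zeroed, while the reconstruction target is always the full pair, akin to denoising autoencoders where the "noise" removes a whole modality. The helper below is a hypothetical simplification, not the paper's code.

```python
import numpy as np

def make_training_examples(audio, video):
    """audio: (n, da), video: (n, dv). Returns (inputs, targets) for the 3 tasks."""
    full = np.hstack([audio, video])
    audio_only = np.hstack([audio, np.zeros_like(video)])  # video modality removed
    video_only = np.hstack([np.zeros_like(audio), video])  # audio modality removed
    inputs = np.vstack([full, audio_only, video_only])
    targets = np.vstack([full, full, full])  # always reconstruct both modalities
    return inputs, targets

aud = np.ones((4, 3))       # toy audio features
vid = 2 * np.ones((4, 5))   # toy video features
X, Y = make_training_examples(aud, vid)
print(X.shape, Y.shape)     # (12, 8) (12, 8)
```

Training the autoencoder on `(X, Y)` pairs then forces the shared representation to be recoverable from either modality alone.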
24. VISUALIZATIONS OF LEARNED FEATURES
[Figure: audio (spectrogram) and video features learned over 100 ms windows, with frames shown at 0 ms, 33 ms, 67 ms, and 100 ms.]
25. LEARNING SETTINGS
We will consider the learning settings
shown in Figure 1.
26. LIP-READING WITH AVLETTERS
AVLetters:
• 26-way letter classification
• 10 speakers
• 60x80 pixel lip regions
[Diagram: cross-modality deep autoencoder over the video input.]
Cross-modality learning: feature learning on Audio + Video; supervised learning on Video; testing on Video.
29. LIP-READING WITH AVLETTERS
Feature Representation                                Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002)   44.6%
Local Binary Pattern (Zhao & Barnard, 2009)           58.5%
Video-Only Learning (Single Modality Learning)        54.2%
Our Features (Cross Modality Learning)                64.4%
30. LIP-READING WITH CUAVE
CUAVE:
• 10-way digit classification
• 36 speakers
[Diagram: cross-modality deep autoencoder over the video input.]
Cross-modality learning: feature learning on Audio + Video; supervised learning on Video; testing on Video.
33. LIP-READING WITH CUAVE
Feature Representation                                Classification Accuracy
Baseline Preprocessed Video                           58.5%
Video-Only Learning (Single Modality Learning)        65.4%
Our Features (Cross Modality Learning)                68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009)     64.0%
Visemic AAM (Papandreou et al., 2009)                 83.0%
34. MULTIMODAL RECOGNITION
CUAVE:
• 10-way digit classification
• 36 speakers
[Diagram: bimodal deep autoencoder with a shared representation over the audio and video inputs.]
Evaluate in clean and noisy audio scenarios. In the clean audio scenario, audio alone performs extremely well.
Multimodal setting: feature learning on Audio + Video; supervised learning on Audio + Video; testing on Audio + Video.
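For the noisy-audio evaluation, here is one hedged sketch of how a 0 dB SNR condition can be constructed: scale the noise so that it carries the same power as the clean signal before mixing. The helper name and the signals are illustrative, not from the paper.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) == snr_db, then add."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100, 16000))   # stand-in for a speech clip
noise = rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_db=0.0)

# verify: the achieved SNR should be ~0 dB (noise power equals speech power)
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(achieved)
```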
37. MULTIMODAL RECOGNITION
Feature Representation                              Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM)                                75.8%
Our Best Video Features                             68.7%
Bimodal Deep Autoencoder                            77.3%
Bimodal Deep Autoencoder + Audio Features (RBM)     82.2%
38. SHARED REPRESENTATION EVALUATION
Setting: feature learning on Audio + Video; supervised learning on Audio; testing on Video.
Train a linear classifier on the shared representation computed from one modality, then test it on the shared representation computed from the other modality.
[Diagram: at training time the shared representation is computed from audio; at testing time it is computed from video; a linear classifier sits on top.]
39. SHARED REPRESENTATION EVALUATION
Method: Learned Features + Canonical Correlation Analysis

Feature Learning   Supervised Learning   Testing   Accuracy
Audio + Video      Audio                 Video     57.3%
Audio + Video      Video                 Audio     91.7%

[Diagram: as above, a linear classifier trained on the shared representation from one modality and tested on the other.]
41. MCGURK EFFECT
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input   Video Input   Model predictions: /ga/   /ba/    /da/
/ga/          /ga/          82.6%                     2.2%    15.2%
/ba/          /ba/          4.4%                      89.1%   6.5%
/ba/          /ga/          28.3%                     13.0%   58.7%
42. CONCLUSION
Applied deep autoencoders to discover features in multimodal data.
[Diagrams: the cross-modality deep autoencoder (video input) and the bimodal deep autoencoder (shared representation over audio and video).]
Cross-modality learning: we obtained better video features (for lip-reading) by using audio as a cue.
Multimodal feature learning: learn representations that relate across the audio and video data.
In this work, I'm going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a short speech segment with video of a person saying letters, can we determine which letter was said from the images of his lips and the audio? How do we integrate these two sources of data? Multimodal learning involves relating information from multiple sources. For example, images and 3-D depth scans are correlated at first order, as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have non-linear correlations at a "mid-level", as phonemes or visemes; it is difficult to relate raw pixels to audio waveforms or spectrograms. In this paper, we are interested in modeling "mid-level" relationships, so we choose audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio which are coupled with videos of the lips.
So how do we solve this problem? A common machine learning pipeline goes like this: we take the inputs, extract some features, and feed them into our standard ML toolbox (e.g., a classifier). The hardest part is really the features: how we represent the audio and video data for use in our classifier. While for audio the speech community has developed many features, such as MFCCs, which work really well, it is not obvious what features we should use for lips.
So what do state-of-the-art features look like? Engineering these features took a long time. To this end, we address two questions in this work. [click] Furthermore, what is interesting in this problem is the deep question: audio and video features are only related at a deep level.
Concretely, our task is to convert sequences of lip images into a vector of numbers. Similarly for the audio.
Now that we have multimodal data, one easy approach is to simply concatenate the features; however, concatenating them like this fails to model the interactions between the modalities. This is a very limited view of multimodal features. Instead, what we would like to do [click] is to find better ways to relate the audio and visual inputs and obtain features that arise out of relating them together.
Next I'm going to describe a different feature learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, but at training time you have both audio and video. Can the audio at training time help you do better at test time, even though you don't have audio at test time? (Lip-reading alone is not well defined.) But there are more settings to consider! If our task is only lip reading, i.e., visual speech recognition, an interesting question to ask is: can we improve our lip-reading features if we had audio data?
Let's step back a bit and take a similar but related approach to the problem. What if we learn an autoencoder? But this still has the problem! But wait, now we can do something interesting.
There are different versions of these shallow models, and if you train a model of this form, this is what one usually gets. If you look at the hidden units, it turns out that many hidden units respond to one modality only; the figure shows this connectivity. So why doesn't this work? We think there are two possible reasons: (1) the model has no incentive to relate the modalities, and (2) we are trying to relate pixel values to values in the audio spectrogram, which is really difficult. For example, we do not expect a change in one pixel value to inform us about how the audio pitch is changing. Instead, what we expect is for mid-level video features, such as mouth motions, to inform us about the audio content. Thus the relations across the modalities are deep, and we really need a deep model to capture them.
But this still has the problem! But wait, now we can do something interesting. This model will be trained on clips with both audio and video.
However, the connections between audio and video are (arguably) deep rather than shallow, so ideally we want to extract mid-level features before trying to connect them together. Since audio is really good for speech recognition, the model is going to learn representations that can reconstruct audio and thus hopefully be good for speech recognition as well.
But what we would like is not to have to train many versions of this model. It turns out that you can unify the separate models together.
[pause] The second model we present is the bimodal deep autoencoder. What we want this bimodal deep AE to do is to learn representations that relate both the audio and video data. Concretely, we want it to learn representations that are robust to the input modality.
The features correspond to mouth motions and are also paired up with the audio spectrogram. The features are generic and not speaker-specific.