https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has made it possible to train neural networks for data analysis tasks that were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks, and deep Q-networks for reinforcement learning have reshaped the signal processing landscape. This course covers the basic principles and applications of deep learning to computer vision problems such as image classification, object detection, and image captioning.
4. Transfer Learning
Audio features help learn semantically meaningful visual representations.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016
5. Transfer Learning
Images are grouped into audio clusters, where each sound is summarized by the mean and standard deviation of its frequency channels.
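A minimal sketch of this clustering step, assuming librosa and scikit-learn; the spectrogram settings, file list, and cluster count below are illustrative assumptions, not the authors' exact pipeline:

```python
# Describe each clip's audio by the per-channel mean and std of its
# spectrogram, then group clips with k-means.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def ambient_sound_feature(path, sr=22050, n_fft=1024):
    y, _ = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft))                # (freq_bins, frames)
    return np.concatenate([S.mean(axis=1), S.std(axis=1)])  # stats per channel

# paths = ["clip0.wav", "clip1.wav", ...]   # hypothetical list of audio tracks
# feats = np.stack([ambient_sound_feature(p) for p in paths])
# clusters = KMeans(n_clusters=30, random_state=0).fit_predict(feats)
```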
6. Transfer Learning
Task: predict the audio cluster of a given image (a classification problem).
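A sketch of the resulting self-supervised task: a standard image CNN is trained to predict each frame's audio cluster id, which acts as a free pseudo-label. The AlexNet backbone, optimizer settings, and cluster count are assumptions, not the paper's exact configuration:

```python
# Train a CNN to classify frames into audio clusters; no human labels needed.
import torch
import torch.nn as nn
from torchvision import models

n_clusters = 30                                    # assumed number of audio clusters
model = models.alexnet(num_classes=n_clusters)     # paper-era backbone; any CNN works
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

def train_step(frames, cluster_ids):
    """frames: (B, 3, 224, 224) tensor; cluster_ids: (B,) long tensor."""
    optimizer.zero_grad()
    loss = criterion(model(frames), cluster_ids)
    loss.backward()
    optimizer.step()
    return loss.item()
```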
7. Transfer Learning
Although the model is never trained with class labels, units with semantic meaning emerge.
8. Transfer Learning: SoundNet
Pretrained visual ConvNets supervise the training of a model for sound representations.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016
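A simplified sketch of this teacher-student transfer: a pretrained image network produces a class distribution per video frame, and a 1D ConvNet on the raw waveform is trained to match it with a KL-divergence loss. The toy student below stands in for the real SoundNet architecture, and the ResNet teacher is an assumed choice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

teacher = models.resnet18(weights="IMAGENET1K_V1").eval()  # vision "teacher"

student = nn.Sequential(             # toy 1D ConvNet over the raw waveform
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=32, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 1000),             # same output space as the teacher
)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(frame, waveform):
    """frame: (B, 3, 224, 224); waveform: (B, 1, n_samples)."""
    with torch.no_grad():
        target = F.softmax(teacher(frame), dim=1)        # teacher's class posteriors
    log_pred = F.log_softmax(student(waveform), dim=1)   # student's log-probs
    loss = F.kl_div(log_pred, target, reduction="batchmean")
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```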
9. Transfer Learning: SoundNet
Training videos are unlabeled; the method relies on ConvNets pretrained on labeled images.
10. Transfer Learning: SoundNet
Recognizing objects and scenes from sound.
12. Transfer Learning: SoundNet
Hidden-layer activations of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
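A sketch of that transfer experiment, assuming scikit-learn: activations from a frozen hidden layer of the sound network become fixed features for a linear SVM on a labeled target dataset (e.g., acoustic scene classification). The feature extraction details are assumptions; the paper pools a specific SoundNet layer:

```python
# Evaluate frozen sound-network features with a linear SVM.
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate_features(X, y):
    """X: (n_clips, d) hidden-layer activations; y: (n_clips,) target labels."""
    svm = LinearSVC(C=1.0)
    return cross_val_score(svm, X, y, cv=5).mean()
```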
13. Transfer Learning: SoundNet
Visualization of the 1D filters over raw audio in conv1.
15. Transfer Learning: SoundNet
Visualization of the samples that most activate a neuron in a late layer (conv7).
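One way to produce such a visualization, sketched below: run clips through the network, record the chosen unit's peak response via a forward hook, and keep the top-k clips. `sound_net`, `layer`, and `clips` are placeholders standing in for SoundNet and its data:

```python
# Rank dataset clips by how strongly they activate one hidden unit.
import torch

def top_activating(sound_net, layer, unit, clips, k=5):
    scores, feats = [], {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    with torch.no_grad():
        for idx, wav in enumerate(clips):             # wav: (1, 1, n_samples)
            sound_net(wav)
            act = feats["out"][0, unit].max().item()  # unit's peak activation
            scores.append((act, idx))
    handle.remove()
    return sorted(scores, reverse=True)[:k]           # top-k (score, clip index)
```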
16. Transfer Learning: SoundNet
Visualization of the video frames associated with the sounds that most activate some of the last hidden units (conv7).
17. Transfer Learning: SoundNet
Hearing the sounds that most activate a neuron in the sound network (conv7).
18. Transfer Learning: SoundNet
Hearing the sounds that most activate a neuron in the sound network (conv5).
20. Sonorization
Learn to synthesize sounds from videos of people hitting objects with a drumstick. Are the features learned when training for sound prediction useful for identifying an object's material?
Owens et al. "Visually Indicated Sounds." CVPR 2016
23. Sonorization
● Map video sequences to sequences of sound features.
● Output sound features are converted into a waveform by example-based synthesis (i.e., nearest-neighbor retrieval).
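A sketch of the example-based synthesis step, assuming scikit-learn for the retrieval index: each predicted feature vector is matched to its nearest training example, whose real recorded waveform is reused. All names here are illustrative:

```python
# Replace waveform generation with nearest-neighbor retrieval of real sounds.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_synthesizer(train_features, train_waveforms):
    """train_features: (N, d) sound features; train_waveforms: list of N arrays."""
    index = NearestNeighbors(n_neighbors=1).fit(train_features)

    def synthesize(predicted_features):               # (T, d) predicted sequence
        _, idx = index.kneighbors(predicted_features)
        return np.concatenate([train_waveforms[i] for i in idx[:, 0]])

    return synthesize
```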
28. Speech Generation from Video
Pipeline: a frame from a silent video is fed to a CNN (VGG), which predicts an audio feature vector; the waveform is then recovered by post-hoc synthesis.
Ephrat et al. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017
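A hedged sketch of that mapping: a VGG-style CNN regresses a per-frame audio feature vector (here sized to the 9-dim LPC features of the next slide) from a silent-video frame; the MSE loss and head size are assumptions, and waveform recovery happens post hoc:

```python
# Regress audio features from video frames with a VGG body.
import torch.nn as nn
from torchvision import models

n_audio_feats = 9                                # per-chunk LPC features (next slide)
cnn = models.vgg16(num_classes=n_audio_feats)    # regression head instead of classes
criterion = nn.MSELoss()

def loss_step(frames, lpc_targets):
    """frames: (B, 3, 224, 224) silent-video frames; lpc_targets: (B, 9)."""
    return criterion(cnn(frames), lpc_targets)
```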
29. Speech Generation from Video
Waveform processing: downsample to 8 kHz, split into 40 ms chunks (50% overlap), and apply LPC analysis to obtain 9-dim LPC features per chunk.
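A sketch of this preprocessing with librosa (an assumption; the paper's exact analysis settings may differ). Note that librosa.lpc returns order+1 coefficients including the leading 1.0, so order=8 yields the 9-dim vectors:

```python
# 8 kHz resampling -> 40 ms frames (50% overlap) -> LPC per frame.
import numpy as np
import librosa

def lpc_features(path, order=8):
    y, sr = librosa.load(path, sr=8000)            # downsample to 8 kHz
    win, hop = int(0.040 * sr), int(0.020 * sr)    # 40 ms chunks, 50% overlap
    frames = librosa.util.frame(y, frame_length=win, hop_length=hop).T
    return np.stack([librosa.lpc(f, order=order) for f in frames])  # (n_frames, 9)
```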
30. Speech Generation from Video
Waveform processing (2 LSP features per frame).
31. Speech Generation from Video
GRID dataset.
Cooke et al. "An audio-visual corpus for speech perception and automatic speech recognition." Journal of the Acoustical Society of America, 2006
34. Lip Reading: LipNet
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." 2016
35. 35
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End
Sentence-level Lipreading." (2016).
Lip Reading: LipNet
Input (frames) and output (sentence) sequences are not aligned
36. Lip Reading: LipNet
CTC loss (connectionist temporal classification):
● Avoids the need for alignment between input and output sequences by letting the network predict an additional blank token "_".
● Before computing the loss, adjacent repeated tokens are collapsed and blanks are removed.
● "a _ a b _" == "_ a _ a b b" == "a a b" (see the sketch after this list)
Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006
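The collapsing rule written out as a short Python check; both encodings above decode to "a a b" once adjacent repeats are merged and blanks dropped:

```python
# CTC collapse: merge adjacent repeated tokens, then remove blanks.
def ctc_collapse(tokens, blank="_"):
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

assert ctc_collapse(list("a_ab_")) == list("aab")
assert ctc_collapse(list("_a_abb")) == list("aab")
```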
38. Lip Reading: Watch, Listen, Attend & Spell
Chung et al. "Lip Reading Sentences in the Wild." arXiv, Nov 2016
40. Lip Reading: Watch, Listen, Attend & Spell
Audio features and image features are extracted by separate encoders.
41. Lip Reading: Watch, Listen, Attend & Spell
Attention over the output states from audio and video is computed at each timestep.
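A sketch of this dual attention in PyTorch; the dot-product scoring below is an assumption standing in for the paper's learned attention mechanism. At each output step, the decoder state attends separately over the audio and video encoder states, and the two context vectors are concatenated:

```python
# Per-timestep attention over two encoder streams (audio and video).
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (B, d); encoder_states: (B, T, d) -> context (B, d)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (B, T, 1)
    weights = F.softmax(scores, dim=1)                              # over time
    return (weights * encoder_states).sum(dim=1)

def dual_context(decoder_state, audio_states, video_states):
    return torch.cat([attend(decoder_state, audio_states),
                      attend(decoder_state, video_states)], dim=1)
```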
43. Lip Reading: Watch, Listen, Attend & Spell
Comparison with LipNet.