https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state-of-the-art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
3. DL Modeling i-Vectors
O. Ghahabi, J. Hernando, “Deep Learning Backend for Single and Multi-Session i-Vector Speaker Recognition”, to appear in IEEE Transactions on Audio, Speech and Language Processing
4. Decoder
Problem:
● A large amount of impostor data (negative samples)
● Very few target data (positive samples)
Solutions:
● Global impostor selection
● Clustering using k-means
● Equally distributing positive and negative samples among minibatches
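The balancing tricks listed above can be sketched roughly as follows. This is a hedged illustration, not the paper's exact procedure: the helper names are hypothetical, k-means centroids stand in for the global impostor selection, and the scarce targets are resampled so each minibatch is half positive, half negative.

```python
# Sketch: reduce a large impostor set with k-means, then build
# minibatches with positives and negatives equally represented.
import numpy as np
from sklearn.cluster import KMeans

def select_global_impostors(impostor_vecs, n_clusters=100, seed=0):
    """Replace many impostor vectors by a few k-means centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(impostor_vecs)
    return km.cluster_centers_

def balanced_minibatches(pos, neg, batch_size=8, rng=None):
    """Yield minibatches that are half targets, half impostors,
    repeating (resampling) the scarce target vectors as needed."""
    rng = rng if rng is not None else np.random.default_rng(0)
    half = batch_size // 2
    neg_idx = rng.permutation(len(neg))
    n_batches = int(np.ceil(len(neg) / half))
    for b in range(n_batches):
        nb = neg[neg_idx[b * half:(b + 1) * half]]
        pb = pos[rng.integers(0, len(pos), size=len(nb))]  # resample targets
        x = np.vstack([pb, nb])
        y = np.concatenate([np.ones(len(pb)), np.zeros(len(nb))])
        yield x, y
```

With, say, 3 target vectors and 20 impostor vectors, every minibatch still contains as many positives as negatives, and all impostors are visited once per epoch.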
12. DL Feature Classification
P. Safari, O. Ghahabi, J. Hernando, “Restricted Boltzmann Machines for speaker vector extraction and feature classification”, Proc. URSI 2016
16. RBM vectors
P. Safari, O. Ghahabi, J. Hernando, “From Features to Speaker Vectors by means of Restricted Boltzmann Machine Adaptation”, Odyssey Speaker and Language Recognition Workshop, June 2016
20. CDBN vectors
H. Lee et al., “Unsupervised feature learning for audio classification using convolutional deep belief networks”, Advances in Neural Information Processing Systems, 22:1096–1104, 2009
26. Speaker Clustering: Speaker Comparison
Shallow Speaker Comparison
Harsha et al., “Artificial Neural Network Features for Speaker Diarization”, IEEE Spoken Language Technology Workshop, 2014, pp. 402–406
Speaker errors obtained on the AMI and ICSI datasets for matched and mismatched training conditions. MFCC corresponds to baseline clustering using BIC; ANN+MFCC refers to the ANN shown in the figure on the right.
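The BIC baseline mentioned above merges segments with the ΔBIC criterion. A minimal sketch, assuming full-covariance Gaussian segment models and the usual λ-weighted model-complexity penalty (the function name is illustrative, not from the paper):

```python
# Sketch of the ΔBIC merge criterion used by the MFCC/BIC baseline.
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """ΔBIC between modeling two segments jointly vs. separately.
    Positive -> segments differ (keep them split); negative -> merge."""
    n1, n2, d = len(x1), len(x2), x1.shape[1]
    ld = lambda x: np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    # Log-likelihood gain of splitting into two Gaussians.
    r = 0.5 * ((n1 + n2) * ld(np.vstack([x1, x2]))
               - n1 * ld(x1) - n2 * ld(x2))
    # Penalty for the extra mean + covariance parameters of a second model.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n1 + n2)
    return r - lam * penalty
```

Two segments drawn from well-separated distributions give a large positive ΔBIC (keep them as different speakers), while near-identical segments give a negative value (merge).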
28. Speaker Embeddings
Mickael Rouvier et al., “Speaker Diarization through Speaker Embeddings”, 23rd European Signal Processing Conference, 2015
29. Speaker Embeddings
Speaker embeddings of size 500 rearranged as 10×50; two utterances from each speaker are represented.
2D projection of four speaker embeddings using PCA.
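A rough sketch of how such visualizations are produced, with synthetic 500-dimensional vectors standing in for real speaker embeddings (the data here is fabricated purely for illustration):

```python
# Sketch: reshape a 500-dim embedding into a 10x50 "image" and
# project a small set of embeddings to 2-D with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two utterances from each of four speakers: 8 vectors of size 500,
# built as a per-speaker center plus small utterance-level noise.
centers = rng.normal(size=(4, 500))
embeddings = np.vstack([c + 0.1 * rng.normal(size=(2, 500)) for c in centers])

img = embeddings[0].reshape(10, 50)                     # 10x50 rearrangement
proj = PCA(n_components=2).fit_transform(embeddings)    # 2-D projection
```

Plotting `proj` with one color per speaker reproduces the kind of figure described above: the two utterances of each speaker land close together.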
31. CNN BN Features
Yanik Lukic et al. “Speaker Identification and Clustering using
Convolutional Neural Networks”. In 2016 IEEE International
workshop on machine learning for signal processing. (2016)
Representations of five speakers in two dimensions. The left figure shows the output vectors of the softmax layer L8; the right figure shows the corresponding output vectors of the L5 dense layer. Different colors are assigned to different speakers.
32. CNN BN Features
● The sizes of L5 and L7 scale proportionally with the number of speakers.
● L5 and L7 outperform the softmax layer L8, and L7 is better than L5.
● Training data (speaker amount) must be above 10 × (# speakers) for good performance.
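The bottleneck-feature idea above can be illustrated with a toy forward pass: use the activations of an inner dense layer as the speaker representation instead of the softmax output. This is only a NumPy sketch with random stand-in weights, not the trained CNN from the paper; the layer names mirror the L5/L8 labels used above.

```python
# Sketch: a dense "L5"-style layer feeding a softmax "L8"-style layer.
import numpy as np

rng = np.random.default_rng(0)
n_speakers, d_in, d_dense = 5, 64, 32

# Random stand-in weights (a real system would train these end to end).
W5, b5 = rng.normal(size=(d_in, d_dense)), np.zeros(d_dense)
W8, b8 = rng.normal(size=(d_dense, n_speakers)), np.zeros(n_speakers)

def forward(x):
    h5 = np.maximum(x @ W5 + b5, 0.0)                  # ReLU dense activations
    logits = h5 @ W8 + b8
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return h5, p / p.sum(axis=-1, keepdims=True)       # bottleneck, softmax

x = rng.normal(size=(3, d_in))                         # three input frames
bottleneck, posterior = forward(x)
# `bottleneck` (the dense-layer activations) serves as the speaker
# representation for clustering; `posterior` is only used for training.
```

This mirrors the observation on the slide: the dense layers carry a richer, lower-dimensional speaker representation than the softmax output, whose size and sharpness are tied to the training speaker labels.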