https://imatge.upc.edu/web/publications/exploring-automatic-speech-recognition-tensorflow
Speech recognition is the task of identifying words in spoken language and converting them into text. This bachelor's thesis focuses on using deep learning techniques to build an end-to-end speech recognition system. As a preliminary step, we review the most relevant methods of the last several years. Then, we study one of the latest proposals for this end-to-end approach, which uses a sequence-to-sequence model with an attention mechanism. Next, we successfully reproduce the model and test it on the TIMIT database. We analyze the similarities and differences between the current implementation and the original theoretical work. Finally, we experiment with different parameters (e.g. number of layer units, learning rate and batch size) and reduce the Phoneme Error Rate by almost 12% relative.
7. Introduction: difficulties
- Speaker variability: children, male and female voices…
- Circumstances: the same speaker produces a different speech signal each time
- Non-native speakers
- Noisy conditions
- More than one speaker
- Crossing conversations
- ...
13. Related work:
Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM)
Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends® in Signal Processing 1.3 (2008): 195-304.
14. Related work:
Deep Neural Networks - HMM
Recurrent Neural Networks - HMM
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436.
Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
15. Related work: Connectionist Temporal Classification
- Does not impose frame-by-frame synchronization between input and output
- The system can choose the position of the output characters
Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8. https://distill.pub/2017/ctc/
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006
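This freedom in positioning the output characters comes from CTC's collapse function, which maps a frame-level alignment to a label sequence by first merging repeated symbols and then removing blanks. A minimal sketch in plain Python (the names are illustrative, not from any specific CTC library):

```python
# Sketch of CTC's collapse function B: it maps a per-frame alignment to a
# label sequence by (1) merging repeated symbols, then (2) removing blanks.
BLANK = "-"

def ctc_collapse(alignment):
    """Collapse a per-frame alignment into the output label sequence."""
    out = []
    prev = None
    for symbol in alignment:
        # Step 1: skip a symbol that repeats the previous frame's symbol.
        if symbol != prev:
            # Step 2: keep everything except the blank token.
            if symbol != BLANK:
                out.append(symbol)
        prev = symbol
    return "".join(out)

# Several alignments map to the same output ("many-to-one"):
print(ctc_collapse("hh-eee-l-ll-oo"))  # hello
print(ctc_collapse("h-e-l-l-o"))       # hello
```

Note that a blank must separate genuinely repeated labels (the two l's in "hello"); without it, the repeats would be merged into one, which is precisely why CTC introduces the blank symbol.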
16. Related work: Connectionist Temporal Classification
The Recurrent Neural Network (RNN) estimates the per-time-step probabilities
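Given these per-time-step probabilities, the simplest decoder is greedy (best-path) decoding: pick the most likely symbol at each frame and then apply the CTC collapse. A toy sketch in plain Python; the alphabet and probability matrix are illustrative:

```python
# Greedy (best-path) CTC decoding from per-time-step probabilities.
ALPHABET = ["-", "a", "b"]  # index 0 is the CTC blank (illustrative alphabet)

def greedy_decode(probs):
    """probs: list of per-time-step distributions over ALPHABET.
    Pick the argmax symbol per frame, merge repeats, drop blanks."""
    best_path = [ALPHABET[max(range(len(p)), key=p.__getitem__)] for p in probs]
    decoded, prev = [], None
    for s in best_path:
        if s != prev and s != "-":
            decoded.append(s)
        prev = s
    return "".join(decoded)

# Toy example: 4 frames over {blank, a, b}
probs = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.7, 0.2],  # a (repeat, merged)
    [0.8, 0.1, 0.1],  # blank
    [0.2, 0.1, 0.7],  # b
]
print(greedy_decode(probs))  # ab
```

Best-path decoding ignores the fact that many alignments can share the same output; beam-search decoding, which sums probabilities over alignments, usually gives better results.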
17. Related work: end-to-end ASR
[Diagram: acoustic model, pronunciation dictionary and language model replaced by a single DNN]
Big models & lots of data!
19. Methodology: Listen, Attend and Spell (LAS)
LAS learns all the components of a speech recognizer jointly
Sequence-to-sequence + attention
Chan, William, et al. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
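In LAS, the listener (encoder) and the speller (decoder) are coupled by attention: at each output step, the decoder state is scored against every encoder state, the scores are normalized with a softmax, and the weighted sum of encoder states forms a context vector. A minimal dot-product sketch in plain Python (the LAS paper uses an MLP-based energy function; the dot product here is a simplification):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(decoder_state, encoder_states):
    """One attention step: score each encoder state against the decoder
    state, normalize the scores, and return the weighted sum (context)."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: 3 encoder states of dimension 2, one decoder state.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [2.0, 0.0]  # decoder state most similar to the first and third frames
context, weights = attention_context(s, H)
```

The attention weights always sum to one, and the largest weights fall on the encoder frames most similar to the current decoder state; this is what lets the speller focus on the relevant portion of the input at each output step.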
26. Experiments: TIMIT database
- Complete corpus: 6300 sentences, 10 sentences spoken by each of 630 speakers
- Train: 462 speakers
- Complete test set: 168 speakers, 8 sentences each.
- Core test set: 24 speakers, 8 sentences each.
- Validation set: 50 speakers, 8 sentences each.
28. Evaluation metric
Phoneme Error Rate (PER)
PER = (S + D + I) / N
where,
S: number of substitutions
D: number of deletions
I: number of insertions
N: number of phonemes in the reference sentence.
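In practice, S, D and I come from the Levenshtein edit distance between the hypothesis and the reference phoneme sequences, so PER is that distance normalized by the reference length. An illustrative implementation (the phoneme symbols in the example are hypothetical):

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = (S + D + I) / N via dynamic-programming edit distance.
    reference, hypothesis: lists of phoneme symbols."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m] / n

ref = ["sil", "hh", "ah", "l", "ow", "sil"]
hyp = ["sil", "hh", "ax", "l", "ow"]  # one substitution, one deletion
print(phoneme_error_rate(ref, hyp))  # 2/6 ≈ 0.333
```

Note that PER can exceed 100% when the hypothesis contains many insertions, since I is unbounded while N is fixed.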
29. Experiments: [1] NABU
Encoder
- Layers: 2 + 1 non-pyramidal while training
- Units/layer: 128 per direction
Decoder
- Layers: 1
- Units/layer: 128
PER = 31.94%
[Plots: training loss and validation loss vs. steps]
30. Experiments: [2] LAS
Encoder
- Layers: 3
- Units/layer: 256 per direction
Decoder
- Layers: 2
- Units/layer: 512
PER = 27.29%
[Plots: training loss and validation loss vs. steps]
32. Conclusions and future work
- We have reached a better understanding of the techniques used to process speech in the deep learning framework.
- We have dealt with an implementation that was still under development.
33. Conclusions and future work
- We have trained the first end-to-end model based on sequence-to-sequence learning with an attention mechanism in the TALP research group.
- We tried it with different configurations in order to obtain the lowest Phoneme Error Rate.
- The system is ready to be used as the first module of Speech2Signs or as a baseline in further research projects.