https://imatge.upc.edu/web/publications/exploring-automatic-speech-recognition-tensorflow
Speech recognition is the task of identifying words in spoken language and converting them into text. This bachelor's thesis focuses on using deep learning techniques to build an end-to-end speech recognition system. As a preliminary step, we review the most relevant methods of the last several years. Then, we study one of the latest proposals for this end-to-end approach, which uses a sequence-to-sequence model with an attention mechanism. Next, we successfully reproduce the model and test it on the TIMIT database. We analyze the similarities and differences between the current implementation and the original theoretical work. Finally, we experiment with different parameters (e.g. number of layer units, learning rate and batch size) and reduce the Phoneme Error Rate by almost 12% relative.
7. Introduction: difficulties
- Speaker variability: children, male and female voices…
- Circumstances: the same speaker produces a different speech signal each time
- Non-native speakers
- Noisy conditions
- More than one speaker
- Crossing conversations
- ...
13. Related work:
Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM)
Gales, Mark, and Steve Young. "The application of hidden Markov models in speech recognition." Foundations and Trends® in Signal Processing 1.3 (2008): 195-304.
14. Related work:
Deep Neural Networks - HMM
Recurrent Neural Networks - HMM
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436.
Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." IEEE Signal Processing Magazine 29.6 (2012): 82-97.
15. Related work: Connectionist Temporal Classification
- Does not impose frame-by-frame synchronization between input and output
- The system can choose the position of the output characters
Hannun, Awni. "Sequence Modeling with CTC." Distill 2.11 (2017): e8. https://distill.pub/2017/ctc/
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006
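This freedom in positioning the output characters comes from CTC's collapse function, which maps a frame-level alignment to a label sequence by first merging repeated symbols and then removing blanks. A minimal sketch in plain Python (the names are illustrative, not from any specific CTC library):

```python
# Sketch of CTC's collapse function B: it maps a per-frame alignment to a
# label sequence by (1) merging repeated symbols, then (2) removing blanks.
BLANK = "-"

def ctc_collapse(alignment):
    """Collapse a per-frame alignment into the output label sequence."""
    out = []
    prev = None
    for symbol in alignment:
        # Step 1: skip a symbol that repeats the previous frame's symbol.
        if symbol != prev:
            # Step 2: keep everything except the blank token.
            if symbol != BLANK:
                out.append(symbol)
        prev = symbol
    return "".join(out)

# Several alignments map to the same output ("many-to-one"):
print(ctc_collapse("hh-eee-l-ll-oo"))  # hello
print(ctc_collapse("h-e-l-l-o"))       # hello
```

Note that a blank must separate genuinely repeated labels (the two l's in "hello"); without it, the repeats would be merged into one, which is precisely why CTC introduces the blank symbol.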
16. Related work: Connectionist Temporal Classification
The Recurrent Neural Network (RNN) estimates the per-time-step probabilities
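Given these per-time-step probabilities, the simplest decoder is greedy (best-path) decoding: pick the most likely symbol at each frame and then apply the CTC collapse. A toy sketch in plain Python; the alphabet and probability matrix are illustrative:

```python
# Greedy (best-path) CTC decoding from per-time-step probabilities.
ALPHABET = ["-", "a", "b"]  # index 0 is the CTC blank (illustrative alphabet)

def greedy_decode(probs):
    """probs: list of per-time-step distributions over ALPHABET.
    Pick the argmax symbol per frame, merge repeats, drop blanks."""
    best_path = [ALPHABET[max(range(len(p)), key=p.__getitem__)] for p in probs]
    decoded, prev = [], None
    for s in best_path:
        if s != prev and s != "-":
            decoded.append(s)
        prev = s
    return "".join(decoded)

# Toy example: 4 frames over {blank, a, b}
probs = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.7, 0.2],  # a (repeat, merged)
    [0.8, 0.1, 0.1],  # blank
    [0.2, 0.1, 0.7],  # b
]
print(greedy_decode(probs))  # ab
```

Best-path decoding ignores the fact that many alignments can share the same output; beam-search decoding, which sums probabilities over alignments, usually gives better results.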
17. Related work: end-to-end ASR
[Diagram: acoustic model, pronunciation dictionary and language model replaced by a single DNN]
Big models & lots of data!
19. Methodology: Listen, Attend and Spell (LAS)
LAS learns all the components of a speech recognizer jointly
Sequence-to-sequence + attention
Chan, William, et al. "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
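In LAS, the listener (encoder) and the speller (decoder) are coupled by attention: at each output step, the decoder state is scored against every encoder state, the scores are normalized with a softmax, and the weighted sum of encoder states forms a context vector. A minimal dot-product sketch in plain Python (the LAS paper uses an MLP-based energy function; the dot product here is a simplification):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_context(decoder_state, encoder_states):
    """One attention step: score each encoder state against the decoder
    state, normalize the scores, and return the weighted sum (context)."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: 3 encoder states of dimension 2, one decoder state.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [2.0, 0.0]  # decoder state most similar to the first and third frames
context, weights = attention_context(s, H)
```

The attention weights always sum to one, and the largest weights fall on the encoder frames most similar to the current decoder state; this is what lets the speller focus on the relevant portion of the input at each output step.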
26. Experiments: TIMIT database
- Complete corpus: 6300 sentences, 10 sentences spoken by each of 630 speakers
- Train: 462 speakers
- Complete test set: 168 speakers, 8 sentences each.
- Core test set: 24 speakers, 8 sentences each.
- Validation set: 50 speakers, 8 sentences each.
28. Evaluation metric
Phoneme Error Rate (PER)
PER = (S + D + I) / N
where,
S: number of substitutions
D: number of deletions
I: number of insertions
N: number of phonemes in the reference sentence.
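In practice, S, D and I come from the Levenshtein edit distance between the hypothesis and the reference phoneme sequences, so PER is that distance normalized by the reference length. An illustrative implementation (the phoneme symbols in the example are hypothetical):

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = (S + D + I) / N via dynamic-programming edit distance.
    reference, hypothesis: lists of phoneme symbols."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m] / n

ref = ["sil", "hh", "ah", "l", "ow", "sil"]
hyp = ["sil", "hh", "ax", "l", "ow"]  # one substitution, one deletion
print(phoneme_error_rate(ref, hyp))  # 2/6 ≈ 0.333
```

Note that PER can exceed 100% when the hypothesis contains many insertions, since I is unbounded while N is fixed.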
29. Experiments: [1] NABU
Encoder
- Layers: 2 + 1 non-pyramidal while training
- Units/layer: 128 per direction
Decoder
- Layers: 1
- Units/layer: 128
PER = 31.94%
[Plots: training loss and validation loss vs. steps]
30. Experiments: [2] LAS
Encoder
- Layers: 3
- Units/layer: 256 per direction
Decoder
- Layers: 2
- Units/layer: 512
PER = 27.29%
[Plots: training loss and validation loss vs. steps]
32. Conclusions and future work
- We have reached a better understanding of the techniques used to process speech in the deep learning framework.
- We have dealt with an implementation that was still under development.
33. Conclusions and future work
- We have trained the first end-to-end model based on sequence-to-sequence learning with an attention mechanism in the TALP research group.
- We tried it with different configurations in order to obtain the lowest Phoneme Error Rate.
- The system is ready to be used as the first module of Speech2Signs or as a baseline in further research projects.