Presented at AI NEXTCon Seattle 1/17-20, 2018
http://aisea18.xnextcon.com
3. Speech Recognition
• Determines the most likely word sequence W = w_1, ..., w_n given an acoustic input sequence x = x_1, ..., x_T, where T is the number of frames in the utterance
• Acoustic model (AM): predicts the likelihood of the acoustic input utterance given a phoneme sequence
• Pronunciation model (PM): converts a word sequence into a phoneme sequence
• Language model (LM): predicts the likelihood of a word sequence
• The pronunciation model is unnecessary but helpful for some languages; systems without a pronunciation model are called grapheme based
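Spelled out, the factorization these three components implement (the standard decomposition, with L ranging over phoneme sequences; not shown on the slide itself):

```latex
\hat{W} = \arg\max_{W} P(W \mid x)
        = \arg\max_{W} \sum_{L} \underbrace{p(x \mid L)}_{\text{AM}}\,
                                 \underbrace{P(L \mid W)}_{\text{PM}}\,
                                 \underbrace{P(W)}_{\text{LM}}
```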
4. Can a Single Model Do All End-to-end?
• Speech recognition is essentially a sequence-to-sequence transformation problem: an audio sequence in, a word sequence out
• Why not perform the sequence-to-sequence transformation directly?
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T)
• Recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (Limited Size Attention)
• Key problems:
• How to address the variable length problem
• How to address the length difference and alignment between the input and
output
6. Connectionist Temporal Classification
• Blank symbol: no output is generated; the model is not yet confident enough
• Repeated symbols: consecutive occurrences are treated as one
• Frame number: corresponds to time
• Recognition unit: characters, words, or phonemes
• Alignments: many paths lead to the same recognition result
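The many-to-one mapping from alignment paths to recognition results is mechanical: merge repeated symbols, then drop blanks. A minimal sketch (the function name and blank token are illustrative):

```python
from itertools import groupby

BLANK = "_"  # assumed blank token; real systems reserve an output index

def collapse(path):
    """Map a CTC alignment path to its label sequence:
    merge consecutive repeats, then remove blanks."""
    merged = [sym for sym, _ in groupby(path)]   # "gg" -> "g"
    return [sym for sym in merged if sym != BLANK]

# Many different paths collapse to the same recognition result:
assert collapse(list("_gg_o_oo_d")) == list("good")
assert collapse(list("g_oo__o_dd")) == list("good")
```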
7. Connectionist Temporal Classification
• Inference
• Training (conditional likelihood; sensitive to initialization)
• Training maximizes the conditional likelihood of each label sequence in the training set (e.g., "good"), summed over all alignments that lead to the same label sequence (e.g., "_gg_o_oo_d"); this relies on a conditional independence assumption across frames
• Inference finds the single alignment with the highest score
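For concreteness, this objective is what off-the-shelf CTC losses compute via the forward-backward algorithm; a minimal PyTorch sketch with dummy data (all sizes illustrative):

```python
import torch
import torch.nn as nn

# T frames, batch of N utterances, C output units (index 0 = blank)
T, N, C = 50, 4, 28
logits = torch.randn(T, N, C, requires_grad=True)  # stand-in encoder outputs
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10))                   # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)    # frames per utterance
target_lengths = torch.full((N,), 10, dtype=torch.long)  # labels per utterance

# The loss sums over all alignments that collapse to the target.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the encoder
```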
8. Properties of CTC
• CTC (spiky): outputs blanks until confident enough to output the associated label. Framewise (flat): the output of each frame is the same label
• Simple: direct audio-sequence-to-label-sequence transformation; flexible modeling unit
• Fast: decoding is fast due to the spikes (confident outputs) and the small number of output steps
• Random timing: spikes can happen at any delayed time (latency) and may fall outside the label boundary
• Limitation: assumes that model outputs at a given frame are independent of previous output labels
9. Improve: Sequence Discriminative Training
• CTC training cannot exploit external text to improve the LM; use an LM trained with external text to improve performance
• The CTC training objective function is the likelihood of observing the label sequence given the audio sequence; use sequence discriminative training instead
10. Improve: Use Word Unit and Better Tricks
• The quality of the implicit LM depends on the modeling unit
• CTC-Word: directly models word, word-piece, or cross-word units
• For context: the state-of-the-art hybrid system on 300-hour SWB is around 10% WER
• To achieve good results, you need complicated engineering procedures similar to those used in hybrid systems
• Decoding is simple greedy search: extremely fast
11. Solve OOV in CTC-Word
• Spell and recognize (SAR): present training examples that contain both words and characters
b-t h e-e THE b-c a e-t CAT b-i e-s IS b-b l a c e-k BLACK
(b- marks a word's beginning character, e- its ending character, and the uppercase token is the word unit itself)
• The model is trained to first spell the word and then recognize it
• The SAR model has a single softmax over words + characters in the output layer
• Keeps greedy-search decoding viable: no beam or other graph-based search is needed
• Not the ideal solution
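Constructing SAR training targets from a word sequence is simple preprocessing; a hypothetical sketch matching the marker format above (single-character words would need extra handling):

```python
def sar_targets(words):
    """Build spell-and-recognize targets: spell each word with b-/e-
    markers on its boundary characters, then emit the word unit."""
    out = []
    for w in words:
        chars = list(w)
        chars[0] = "b-" + chars[0]      # mark the beginning character
        chars[-1] = "e-" + chars[-1]    # mark the ending character
        out.extend(chars)
        out.append(w.upper())           # word unit follows its spelling
    return out

print(" ".join(sar_targets(["the", "cat", "is", "black"])))
# b-t h e-e THE b-c a e-t CAT b-i e-s IS b-b l a c e-k BLACK
```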
13. RNN Transducer (RNN-T)
• A streaming, all-neural, sequence-to-sequence architecture that jointly learns the acoustic and language model components:
• Encoder: maps acoustic frames into a higher-level representation, conditioned on previous acoustic frames; can be initialized from a CTC model
• Prediction network: a language model that can be trained on text-only data; explicitly conditioned on the history of previous non-blank targets predicted by the model; can use grapheme, word, or word-piece units
• Joint network: combines the acoustic and language information
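A skeleton of the three components and how the joint network combines them (layer types, sizes, and the concatenating joint are illustrative, not the exact published configuration):

```python
import torch
import torch.nn as nn

class RNNT(nn.Module):
    """Skeleton RNN-T: encoder (AM-like), prediction network (LM-like),
    and a joint network that scores every (frame, label) pair."""
    def __init__(self, feat_dim, vocab_size, hidden=320):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.predictor = nn.LSTM(vocab_size, hidden, batch_first=True)  # input: embedded previous labels
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab_size + 1))   # +1 for blank

    def forward(self, feats, label_embeddings):
        f, _ = self.encoder(feats)               # (B, T, H) acoustic representation
        g, _ = self.predictor(label_embeddings)  # (B, U, H) label-history representation
        # Broadcast to a (B, T, U, 2H) lattice and score every (t, u) pair.
        z = torch.cat([f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
                       g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1)
        return self.joint(z)                     # logits over labels + blank
```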
14. Properties of RNN-T
• Solves the conditional independence assumption in CTC: decoding results now depend on the previous output symbols through the prediction network
• Solves CTC's inability to exploit large text-only corpora: the prediction network can exploit larger text-only data
• The prediction network is not conditioned on the encoder output, which allows pre-training the decoder as an RNN language model on text-only data
• Still uses the blank symbol and the same repeated-symbol handling technique as CTC
15. RNN-T Training Procedure
The training procedure required to achieve good results is very complicated; it is not significantly simpler than that of a hybrid system.
16. RNN Transducer: Inference
• Alternate between updating the encoder and the prediction network depending on whether the predicted label is blank or non-blank: the encoder consumes the next acoustic frame, the prediction network consumes the previously predicted label (it is updated only if the predicted label is non-blank), and the joint network produces the next output-label probabilities
• Inference terminates when blank is output at the last frame, T
• Use greedy search or beam search
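A sketch of this greedy inference loop, assuming the model exposes per-step helpers (joint_step, predictor_step, and predictor_start are hypothetical names):

```python
import torch

@torch.no_grad()
def rnnt_greedy_decode(model, feats, blank=0, max_symbols=10):
    """Greedy RNN-T inference: advance one frame on blank; otherwise
    feed the emitted label back into the prediction network."""
    enc, _ = model.encoder(feats)             # (1, T, H) acoustic representation
    hyp, pred_state = [], None
    g = model.predictor_start()               # assumed <sos> prediction output
    for t in range(enc.size(1)):              # walk through the frames
        emitted = 0
        while emitted < max_symbols:          # cap symbols per frame
            logits = model.joint_step(enc[:, t], g)
            k = int(logits.argmax(dim=-1))
            if k == blank:                    # blank: move to the next frame
                break
            hyp.append(k)                     # non-blank: update prediction net
            g, pred_state = model.predictor_step(k, pred_state)
            emitted += 1
    return hyp
```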
17. Recurrent Neural Aligner (RNA)
• Similar to RNN-T:
• Aims at solving the conditional independence assumption in CTC
• Uses the predicted label at time t−1 as an additional input to the recurrent model when predicting the label at time t
• Has an encoder network that encodes the raw input as the input sequence x
• Has a recurrent decoder network: the input to the decoder at time t for a given alignment z is [x_t, z_{t−1}]
• Different from RNN-T:
• RNN-T uses one RNN for the LM and another for the AM and then combines them; RNA uses a single RNN that learns the AM and LM jointly (not factorized)
• RNA requires an approximate forward-backward algorithm for training because of the joint RNN model
20. Alternative View of Attention Model
• Encoder: maps input acoustic vectors into a higher-level representation
• Attention: summarizes the output of the encoder based on the current state of the decoder
• Decoder: models an output distribution over the next target, conditioned on the sequence of previous predictions
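One decoder step in this view, as a minimal runnable sketch (module choices such as the GRU cell and the simple concatenation scorer are illustrative; the additive form is detailed later):

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One step of an attention decoder: summarize the encoder output
    with attention, then predict a distribution over the next target."""
    def __init__(self, enc_dim, dec_dim, vocab_size):
        super().__init__()
        self.score = nn.Linear(enc_dim + dec_dim, 1)     # simple scorer
        self.cell = nn.GRUCell(vocab_size + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim + enc_dim, vocab_size)

    def forward(self, enc_out, state, prev_onehot):
        # Attention: weights over frames depend on the current decoder state.
        s = state.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        w = torch.softmax(self.score(torch.cat([enc_out, s], -1)).squeeze(-1), dim=1)
        context = torch.bmm(w.unsqueeze(1), enc_out).squeeze(1)  # encoder summary
        # Decoder: condition on the previous prediction and the context.
        state = self.cell(torch.cat([prev_onehot, context], -1), state)
        return self.out(torch.cat([state, context], -1)), state, w
```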
21. Properties of Basic Attention Model
• Strengths (similar to RNN-T and better than CTC):
• No conditional independence assumption
• Prediction of the next unit depends on both LM and AM information
• Different from RNN-T:
• The attention weights depend on the current decoder state
• Weaknesses (worse than CTC):
• Exposure bias: conditioned on the true label during training but on the estimated label during decoding
• Too flexible: attention weights are not constrained to attend from left to right
• Very difficult to train well, especially when the input is long, even with a pyramid structure and/or other subsampling techniques
• High latency: cannot be streamed
22. Constrain Attention Model with CTC
• The left-to-right constraint in CTC can help regularize the attention model
• Joint training criterion, with regularization through the shared encoder (written out below)
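The joint training criterion is the standard multi-task interpolation from Kim et al. (2017), with interpolation weight λ:

```latex
\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda\,\mathcal{L}_{\mathrm{CTC}} \;+\; (1-\lambda)\,\mathcal{L}_{\mathrm{attention}}, \qquad 0 \le \lambda \le 1
```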
23. Constrain Attention Model with CTC
• The speed of learning alignments between characters (y-axis) and acoustic frames (x-axis) is significantly improved with multi-task learning
(Figure: attention-only alignments show incorrect ordering and mass aligned to the end of the utterance; attention + CTC alignments are clean.)
24. Decoding in Attention + CTC Model
• Basic idea: beam search to find the hypothesis that maximizes the combined CTC and attention scores (see below)
• CTC decodes at the frame rate; the attention decoder operates character by character
• Difficulty: mismatch between the CTC and attention model scoring
• Solution: compute the probability of each partial hypothesis h based on the CTC prefix probability, defined as the cumulative probability of all label sequences that have h as their prefix
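Concretely, a partial hypothesis h can be scored by interpolating the two log probabilities, with the CTC side using the prefix probability (a standard formulation; ν ranges over all continuations of h and λ is an interpolation weight):

```latex
\alpha(h) \;=\; \lambda\,\log p_{\mathrm{ctc}}(h,\dots \mid x) \;+\; (1-\lambda)\,\log p_{\mathrm{att}}(h \mid x),
\qquad
p_{\mathrm{ctc}}(h,\dots \mid x) \;=\; \sum_{\nu} p_{\mathrm{ctc}}(h \cdot \nu \mid x)
```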
25. Choice of Attention
• Additive attention is more stable than dot-product attention
• Multiple independent attention heads significantly improve model performance: they allow the model to simultaneously attend to multiple locations in the input utterance
• For each head i, frame t, and output unit u, an additive attention value is computed; the attention probabilities are normalized across frames, and the attention context summarizes over all frames
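A minimal sketch of multi-head additive attention under these definitions (projection sizes and the number of heads are illustrative):

```python
import torch
import torch.nn as nn

class AdditiveMultiHeadAttention(nn.Module):
    """Multiple independent additive-attention heads: each head scores
    every encoder frame against the decoder state, normalizes across
    frames, and summarizes the frames into one context vector."""
    def __init__(self, enc_dim, dec_dim, att_dim, n_heads=4):
        super().__init__()
        self.heads = nn.ModuleList()
        for _ in range(n_heads):
            self.heads.append(nn.ModuleDict({
                "W_enc": nn.Linear(enc_dim, att_dim, bias=False),
                "W_dec": nn.Linear(dec_dim, att_dim, bias=False),
                "v": nn.Linear(att_dim, 1, bias=False),
            }))

    def forward(self, enc_out, dec_state):
        # enc_out: (B, T, enc_dim); dec_state: (B, dec_dim)
        contexts = []
        for h in self.heads:
            # Additive score per frame: v^T tanh(W_enc h_t + W_dec s_u)
            e = h["v"](torch.tanh(h["W_enc"](enc_out) +
                                  h["W_dec"](dec_state).unsqueeze(1))).squeeze(-1)
            a = torch.softmax(e, dim=1)                        # normalize across frames
            contexts.append(torch.bmm(a.unsqueeze(1), enc_out).squeeze(1))
        return torch.cat(contexts, dim=-1)                     # concat head contexts
```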
26. Word Error Rate Training
• Word error rate training: minimize the expected number of word errors over the training set (4-7% relative WER reduction over cross-entropy training)
• The exact expectation is intractable since it involves a summation over all possible label sequences; approximate it on an N-best list, scoring each hypothesis by its number of word errors relative to the ground-truth sequence (written out below)
• Adding the CE criterion is important to stabilize training
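Written out, the N-best approximation of the MWER criterion (following Prabhavalkar et al., 2017, up to variance-reduction details; $\hat{P}$ renormalizes over the N-best list and $\mathcal{W}$ counts word errors against the ground truth $y^{*}$):

```latex
\mathcal{L}_{\mathrm{MWER}} \;\approx\; \sum_{y_i \in \mathrm{NBest}(x,N)} \hat{P}(y_i \mid x)\,\mathcal{W}(y_i, y^{*})
\;+\; \lambda\,\mathcal{L}_{\mathrm{CE}},
\qquad
\hat{P}(y_i \mid x) \;=\; \frac{P(y_i \mid x)}{\sum_{y_j \in \mathrm{NBest}(x,N)} P(y_j \mid x)}
```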
27. Inference in Attention Model
• Find the best hypothesis with beam search, scoring each candidate by the attention model's AM score plus an external LM score (shallow fusion) plus a coverage penalty (see the reconstruction below)
• Coverage penalty: measures, via the attention probability of the j-th output label on the i-th frame, the extent to which the input frames are "covered" by the attention weights; it penalizes incomplete transcripts and addresses the common seq2seq failure mode of assigning high probability to shorter output sequences
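A standard reconstruction of the decoding criterion these callouts annotate (the indicator form of the coverage term follows common practice and is an assumption here; λ, η, and τ are tuning parameters, p_{ij} the attention probability of the j-th output label on the i-th frame):

```latex
\hat{y} \;=\; \arg\max_{y}\;
\underbrace{\log p(y \mid x)}_{\text{attention model AM score}}
\;+\; \underbrace{\lambda \log p_{\mathrm{LM}}(y)}_{\text{shallow fusion}}
\;+\; \underbrace{\eta \sum_{i} \mathbb{1}\!\left[\textstyle\sum_{j} p_{ij} > \tau\right]}_{\text{coverage}}
```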
28. Deal With OOV
• End-to-end models perform better when word or sub-word units are used, probably due to the stronger constraints built into those units; however, this introduces the OOV problem
• Solution: combine the character and word LMs
• Trick: exploit the character LM (CLM) when the word LM (WLM) is not available, and use the WLM when it is
• Benefit: keeps promising candidates inside the beam
(Notation for the decoding rule below: S is the set of labels that indicate the end of a word; w_g is the last word of the character sequence; ψ_g is the word-level history excluding w_g; β is a factor that adjusts the probabilities for OOV words; the probability of w_g obtained by the CLM is used to cancel the CLM probabilities accumulated for w_g.)
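A reconstruction of the multi-level decoding rule these definitions annotate (following Hori et al., 2017; details may differ slightly from the published form). Dividing by the CLM probability cancels the character-level mass already accumulated for w_g, and β rescales the unknown-word probability for OOVs:

```latex
p(c \mid g) \;=\;
\begin{cases}
p_{\mathrm{wlm}}(w_g \mid \psi_g)\,/\,p_{\mathrm{clm}}(w_g) & \text{if } c \in S \text{ and } w_g \text{ is in the vocabulary}\\
\beta\, p_{\mathrm{wlm}}(\langle\mathrm{UNK}\rangle \mid \psi_g) & \text{if } c \in S \text{ and } w_g \text{ is OOV}\\
p_{\mathrm{clm}}(c \mid g) & \text{otherwise}
\end{cases}
```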
29. Combine With RNN-T
• Attention model: integrates acoustic and language information
• Joint decoder: combines the attention model output with additional acoustic information
31. Neural Transducer (NT)
• Drawback of seq2seq: the entire input sequence needs to be encoded before the output sequence can be decoded
• Neural Transducer (NT): limits attention to fixed-size blocks of the encoder space
32. Neural Transducer (NT)
• Examines each block in turn
• Attention is computed only over the frames in each block
• Within each block, produces a sequence of k (0 < k ≤ M) outputs
• Outputs an <epsilon> symbol to signify the end of block processing (see the sketch after this list)
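The block-wise control flow, sketched with an assumed per-step decoder callable (names and the per-block symbol cap are illustrative):

```python
def neural_transducer_decode(encoder_out, decode_step, block_size,
                             eps="<eps>", max_per_block=8):
    """Neural Transducer control flow: walk fixed-size blocks of the
    encoder output; within a block, emit outputs (attention restricted
    to that block) until the model emits <epsilon>."""
    hyp = []
    for start in range(0, len(encoder_out), block_size):
        block = encoder_out[start:start + block_size]  # attention sees only this block
        for _ in range(max_per_block):                 # k outputs, then <epsilon>
            sym = decode_step(block, hyp)              # assumed per-step decoder
            if sym == eps:                             # end of this block's output
                break
            hyp.append(sym)
    return hyp
```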
33. Neural Transducer (NT)
• Training:
• Requires knowing which sub-word/word units occur in each chunk, so an alignment is needed
• Finds the approximate best alignment with a dynamic-programming-like algorithm
• Batches the alignment inference steps and caches the alignments
• Inference:
• Uses a beam search heuristic
• At each output step m, extends each candidate by one symbol with all possible extensions and keeps only the best n extensions
34. Improve Neural Transducer
• The Neural Transducer performs much worse than the attention model but allows streamed recognition
• Improvements:
• Allow attention to look back over many previous chunks and look ahead by 5 frames
• Initialize NT from a pre-trained attention model
• Incorporate a stronger LM (e.g., sub-word and word LMs)
• Use an external LM via shallow fusion
• Use multi-head attention
38. Tencent AI Lab
• Shenzhen office established in April 2016; Seattle office established in May 2017
• Mission: improve existing scenarios and enable new scenarios through technology breakthroughs
39. Over 100 Publications Since 2016
• CVPR (computer vision): 2017 received 2,680 valid submissions and accepted 783 (29.22% acceptance rate); 6 papers from Tencent AI Lab were accepted
• ACL (computational linguistics): 2017 received 1,318 valid submissions and accepted 302 (22.91%); 3 papers from Tencent AI Lab were accepted
• ICML (machine learning): 2017 received 1,676 valid submissions and accepted 434 (25.89%); 4 papers from Tencent AI Lab were accepted
• NIPS (machine learning and computational neuroscience): 2017 received 3,240 valid submissions and accepted 678 (20.9%); 8 papers from Tencent AI Lab were accepted, including 1 oral (acceptance rate 1.2%)
40. Current Focus Areas at Seattle Lab
Speech Processing
• Mic-array processing
• Speech recognition
• Speaker recognition
• Text to speech (TTS)
Natural Language Processing
• Semantic parsing and representation
• Semantic reasoning
• Knowledge extraction and representation
• Natural language generation
Dialog System
• Dialog state tracking and management
• Dialog strategy inference and optimization
• Personalized adaptive dialog
Across all areas: optimization techniques, weakly supervised and reinforcement learning; multi-modal signal processing and semantic grounding
41. We Are Hiring Full Time Researchers
• In the areas of speech processing, natural language processing, and dialog systems
• Self-motivated, good at both theory and engineering
• Principal researcher
• Experienced researchers who have made significant innovative scientific
contributions
• Apply at https://app.jobvite.com/j?cj=oKUh5fwK&s=LinkedIn
• Senior researcher
• Researchers who have made innovative scientific contributions
• Apply at https://app.jobvite.com/j?cj=oSUh5fwS&s=LinkedIn
• Send CV to us-career@tencent.com (mention the job and location)
42. References and Credit of Pictures, Tables
• Survey and comparison
• Yu, D. and Li, J., 2017. Recent progresses in deep learning based acoustic models. IEEE/CAA Journal of Automatica Sinica, 4(3), pp. 396-409.
• Prabhavalkar, R., Rao, K., Sainath, T.N., Li, B., Johnson, L. and Jaitly, N., 2017. A comparison of sequence-to-sequence models for speech recognition. In Proc. Interspeech (pp. 939-943).
• Battenberg, E., Chen, J., Child, R., Coates, A., Gaur, Y., Li, Y., Liu, H., Satheesh, S., Seetapun, D., Sriram, A. and Zhu, Z., 2017. Exploring neural transducers for end-to-end speech recognition. arXiv preprint arXiv:1707.07413.
• Connectionist Temporal Classification (CTC)
• Graves, A., Fernández, S., Gomez, F. and Schmidhuber, J., 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML (pp. 369-376). ACM.
• Sak, H., Senior, A., Rao, K. and Beaufays, F., 2015. Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947.
• RNN Transducer (RNN-T)
• Graves, A., 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
• Rao, K., Prabhavalkar, R. and Sak, H., 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-Transducer. In Proc. ASRU.
• Recurrent Neural Aligner (RNA)
• Sak, H., Shannon, M., Rao, K. and Beaufays, F., 2017. Recurrent Neural Aligner: an encoder-decoder neural network model for sequence to sequence mapping. In Proc. Interspeech.
43. References and Credit of Pictures, Tables
• Attention Model
• Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P. and Bengio, Y., 2016. End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP (pp. 4945-4949). IEEE.
• Chan, W., Jaitly, N., Le, Q. and Vinyals, O., 2016. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP (pp. 4960-4964). IEEE.
• Kim, S., Hori, T. and Watanabe, S., 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. ICASSP (pp. 4835-4839). IEEE.
• Prabhavalkar, R., Sainath, T.N., Wu, Y., Nguyen, P., Chen, Z., Chiu, C.C. and Kannan, A., 2017. Minimum word error rate training for attention-based sequence-to-sequence models. arXiv preprint arXiv:1712.01818.
• Kannan, A., Wu, Y., Nguyen, P., Sainath, T.N., Chen, Z. and Prabhavalkar, R., 2017. An analysis of incorporating an external language model into a sequence-to-sequence model. arXiv preprint arXiv:1712.01996.
• Neural Transducer (NT)
• Jaitly, N., Le, Q.V., Vinyals, O., Sutskever, I., Sussillo, D. and Bengio, S., 2016. An online sequence-to-sequence model using partial conditioning. In Advances in Neural Information Processing Systems (pp. 5067-5075).
• Sainath, T.N., Chiu, C.C., Prabhavalkar, R., Kannan, A., Wu, Y., Nguyen, P. and Chen, Z., 2017. Improving the performance of online neural transducer models. arXiv preprint arXiv:1712.01807.
• Joint Char-LM and Word-LM
• Hori, T., Watanabe, S. and Hershey, J.R., 2017. Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition. In Proc. ASRU.
• Audhkhasi, K., Kingsbury, B., Ramabhadran, B., Saon, G. and Picheny, M., 2017. Building competitive direct acoustics-to-word models for English conversational speech recognition. arXiv preprint arXiv:1712.03133.