Presented at AI NEXTCon Seattle 1/17-20, 2018
http://aisea18.xnextcon.com
3. Speech Recognition
• Determines the most likely word sequence W = w_1, ..., w_n given an acoustic input sequence x = x_1, ..., x_T, where T is the number of frames in the utterance
• Acoustic model (AM): predicts the likelihood of the acoustic input utterance given a phoneme sequence
• Pronunciation model (PM): converts a word sequence into a phoneme sequence
• Language model (LM): predicts the likelihood of a word sequence
• The pronunciation model is unnecessary but helpful for some languages; systems without a pronunciation model are called grapheme based
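Spelled out, the factorization these three components implement (the standard decomposition, with L ranging over phoneme sequences; not shown on the slide itself):

```latex
\hat{W} = \arg\max_{W} P(W \mid x)
        = \arg\max_{W} \sum_{L} \underbrace{p(x \mid L)}_{\text{AM}}\,
                                 \underbrace{P(L \mid W)}_{\text{PM}}\,
                                 \underbrace{P(W)}_{\text{LM}}
```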
4. Can a Single Model Do All End-to-end?
• Speech recognition is essentially a sequence-to-sequence transformation problem: an audio sequence in, a word sequence out
• Why not perform the sequence-to-sequence transformation directly?
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T)
• Recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (Limited Size Attention)
• Key problems:
• How to address the variable length problem
• How to address the length difference and alignment between the input and
output
6. Connectionist Temporal Classification
• Blank symbol: no output is generated; the model is not yet confident enough
• Repeated symbols: consecutive occurrences are treated as one
• Frame number: corresponds to time
• Recognition unit: characters, words, or phonemes
• Alignments: many paths lead to the same recognition result
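The many-to-one mapping from alignment paths to recognition results is mechanical: merge repeated symbols, then drop blanks. A minimal sketch (the function name and blank token are illustrative):

```python
from itertools import groupby

BLANK = "_"  # assumed blank token; real systems reserve an output index

def collapse(path):
    """Map a CTC alignment path to its label sequence:
    merge consecutive repeats, then remove blanks."""
    merged = [sym for sym, _ in groupby(path)]   # "gg" -> "g"
    return [sym for sym in merged if sym != BLANK]

# Many different paths collapse to the same recognition result:
assert collapse(list("_gg_o_oo_d")) == list("good")
assert collapse(list("g_oo__o_dd")) == list("good")
```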
7. Connectionist Temporal Classification
• Inference
• Training (conditional likelihood; sensitive to initialization)
• Training maximizes the conditional likelihood of each label sequence in the training set (e.g., "good"), summed over all alignments that lead to the same label sequence (e.g., "_gg_o_oo_d"); this relies on a conditional independence assumption across frames
• Inference finds the single alignment with the highest score
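For concreteness, this objective is what off-the-shelf CTC losses compute via the forward-backward algorithm; a minimal PyTorch sketch with dummy data (all sizes illustrative):

```python
import torch
import torch.nn as nn

# T frames, batch of N utterances, C output units (index 0 = blank)
T, N, C = 50, 4, 28
logits = torch.randn(T, N, C, requires_grad=True)  # stand-in encoder outputs
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10))                   # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)    # frames per utterance
target_lengths = torch.full((N,), 10, dtype=torch.long)  # labels per utterance

# The loss sums over all alignments that collapse to the target.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the encoder
```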
8. Properties of CTC
• CTC (spiky): outputs blanks until confident enough to output the associated label. Framewise (flat): the output of each frame is the same label
• Simple: direct audio-sequence-to-label-sequence transformation; flexible modeling unit
• Fast: decoding is fast due to the spikes (confident outputs) and the small number of output steps
• Random timing: spikes can happen at any delayed time (latency) and may fall outside the label boundary
• Limitation: assumes that model outputs at a given frame are independent of previous output labels
9. Improve: Sequence Discriminative Training
• CTC training cannot exploit external text to improve the LM; use an LM trained with external text to improve performance
• The CTC training objective function is the likelihood of observing the label sequence given the audio sequence; use sequence discriminative training instead
10. Improve: Use Word Unit and Better Tricks
• The quality of the implicit LM depends on the modeling unit
• CTC-Word: directly models word, word-piece, or cross-word units
• For context: the state-of-the-art hybrid system on 300-hour SWB is around 10% WER
• To achieve good results, you need complicated engineering procedures similar to those used in hybrid systems
• Decoding is simple greedy search: extremely fast
11. Solve OOV in CTC-Word
• Spell and recognize (SAR): present training examples that contain both words and characters
b-t h e-e THE b-c a e-t CAT b-i e-s IS b-b l a c e-k BLACK
(b- marks a word's beginning character, e- its ending character, and the uppercase token is the word unit itself)
• The model is trained to first spell the word and then recognize it
• The SAR model has a single softmax over words + characters in the output layer
• Keeps greedy-search decoding viable: no beam or other graph-based search is needed
• Not the ideal solution
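Constructing SAR training targets from a word sequence is simple preprocessing; a hypothetical sketch matching the marker format above (single-character words would need extra handling):

```python
def sar_targets(words):
    """Build spell-and-recognize targets: spell each word with b-/e-
    markers on its boundary characters, then emit the word unit."""
    out = []
    for w in words:
        chars = list(w)
        chars[0] = "b-" + chars[0]      # mark the beginning character
        chars[-1] = "e-" + chars[-1]    # mark the ending character
        out.extend(chars)
        out.append(w.upper())           # word unit follows its spelling
    return out

print(" ".join(sar_targets(["the", "cat", "is", "black"])))
# b-t h e-e THE b-c a e-t CAT b-i e-s IS b-b l a c e-k BLACK
```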
13. RNN Transducer (RNN-T)
• A streaming, all-neural, sequence-to-sequence architecture that jointly learns the acoustic and language model components:
• Encoder: maps acoustic frames into a higher-level representation, conditioned on previous acoustic frames; can be initialized from a CTC model
• Prediction network: a language model that can be trained on text-only data; explicitly conditioned on the history of previous non-blank targets predicted by the model; can use grapheme, word, or word-piece units
• Joint network: combines the acoustic and language information
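A skeleton of the three components and how the joint network combines them (layer types, sizes, and the concatenating joint are illustrative, not the exact published configuration):

```python
import torch
import torch.nn as nn

class RNNT(nn.Module):
    """Skeleton RNN-T: encoder (AM-like), prediction network (LM-like),
    and a joint network that scores every (frame, label) pair."""
    def __init__(self, feat_dim, vocab_size, hidden=320):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.predictor = nn.LSTM(vocab_size, hidden, batch_first=True)  # input: embedded previous labels
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab_size + 1))   # +1 for blank

    def forward(self, feats, label_embeddings):
        f, _ = self.encoder(feats)               # (B, T, H) acoustic representation
        g, _ = self.predictor(label_embeddings)  # (B, U, H) label-history representation
        # Broadcast to a (B, T, U, 2H) lattice and score every (t, u) pair.
        z = torch.cat([f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
                       g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1)
        return self.joint(z)                     # logits over labels + blank
```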
14. Properties of RNN-T
• Solves the conditional independence assumption in CTC: decoding results now depend on the previous output symbols through the prediction network
• Solves CTC's inability to exploit large text-only corpora: the prediction network can exploit larger text-only data
• The prediction network is not conditioned on the encoder output, which allows pre-training the decoder as an RNN language model on text-only data
• Still uses the blank symbol and the same repeated-symbol handling technique as CTC
15. RNN-T Training Procedure
The training procedure required to achieve good results is very complicated; it is not significantly simpler than that of a hybrid system.
16. RNN Transducer: Inference
• Alternate between updating the encoder and the prediction network depending on whether the predicted label is blank or non-blank: the encoder consumes the next acoustic frame, the prediction network consumes the previously predicted label (it is updated only if the predicted label is non-blank), and the joint network produces the next output-label probabilities
• Inference terminates when blank is output at the last frame, T
• Use greedy search or beam search
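A sketch of this greedy inference loop, assuming the model exposes per-step helpers (joint_step, predictor_step, and predictor_start are hypothetical names):

```python
import torch

@torch.no_grad()
def rnnt_greedy_decode(model, feats, blank=0, max_symbols=10):
    """Greedy RNN-T inference: advance one frame on blank; otherwise
    feed the emitted label back into the prediction network."""
    enc, _ = model.encoder(feats)             # (1, T, H) acoustic representation
    hyp, pred_state = [], None
    g = model.predictor_start()               # assumed <sos> prediction output
    for t in range(enc.size(1)):              # walk through the frames
        emitted = 0
        while emitted < max_symbols:          # cap symbols per frame
            logits = model.joint_step(enc[:, t], g)
            k = int(logits.argmax(dim=-1))
            if k == blank:                    # blank: move to the next frame
                break
            hyp.append(k)                     # non-blank: update prediction net
            g, pred_state = model.predictor_step(k, pred_state)
            emitted += 1
    return hyp
```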
17. Recurrent Neural Aligner (RNA)
• Similar to RNN-T:
• Aims at solving the conditional independence assumption in CTC
• Uses the predicted label at time t−1 as an additional input to the recurrent model when predicting the label at time t
• Has an encoder network that encodes the raw input as the input sequence x
• Has a recurrent decoder network: the input to the decoder at time t for a given alignment z is [x_t, z_{t−1}]
• Different from RNN-T:
• RNN-T uses one RNN for the LM and another for the AM and then combines them; RNA uses a single RNN that learns the AM and LM jointly (not factorized)
• RNA requires an approximate forward-backward algorithm for training because of the joint RNN model
20. Alternative View of Attention Model
• Encoder: maps input acoustic vectors into a higher-level representation
• Attention: summarizes the output of the encoder based on the current state of the decoder
• Decoder: models an output distribution over the next target, conditioned on the sequence of previous predictions
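One decoder step in this view, as a minimal runnable sketch (module choices such as the GRU cell and the simple concatenation scorer are illustrative; the additive form is detailed later):

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One step of an attention decoder: summarize the encoder output
    with attention, then predict a distribution over the next target."""
    def __init__(self, enc_dim, dec_dim, vocab_size):
        super().__init__()
        self.score = nn.Linear(enc_dim + dec_dim, 1)     # simple scorer
        self.cell = nn.GRUCell(vocab_size + enc_dim, dec_dim)
        self.out = nn.Linear(dec_dim + enc_dim, vocab_size)

    def forward(self, enc_out, state, prev_onehot):
        # Attention: weights over frames depend on the current decoder state.
        s = state.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        w = torch.softmax(self.score(torch.cat([enc_out, s], -1)).squeeze(-1), dim=1)
        context = torch.bmm(w.unsqueeze(1), enc_out).squeeze(1)  # encoder summary
        # Decoder: condition on the previous prediction and the context.
        state = self.cell(torch.cat([prev_onehot, context], -1), state)
        return self.out(torch.cat([state, context], -1)), state, w
```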
21. Properties of Basic Attention Model
• Strengths (similar to RNN-T and better than CTC):
• No conditional independence assumption
• Prediction of the next unit depends on both LM and AM information
• Different from RNN-T:
• The attention weights depend on the current decoder state
• Weaknesses (worse than CTC):
• Exposure bias: conditioned on the true label during training but on the estimated label during decoding
• Too flexible: attention weights are not constrained to attend from left to right
• Very difficult to train well, especially when the input is long, even with a pyramid structure and/or other subsampling techniques
• High latency: cannot be streamed
22. Constrain Attention Model with CTC
• The left-to-right constraint in CTC can help regularize the attention model
• Joint training criterion, with regularization through the shared encoder (written out below)
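The joint training criterion is the standard multi-task interpolation from Kim et al. (2017), with interpolation weight λ:

```latex
\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda\,\mathcal{L}_{\mathrm{CTC}} \;+\; (1-\lambda)\,\mathcal{L}_{\mathrm{attention}}, \qquad 0 \le \lambda \le 1
```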
23. Constrain Attention Model with CTC
• The speed of learning alignments between characters (y-axis) and acoustic frames (x-axis) is significantly improved with multi-task learning
(Figure: attention-only alignments show incorrect ordering and mass aligned to the end of the utterance; attention + CTC alignments are clean.)
24. Decoding in Attention + CTC Model
• Basic idea: beam search to find the hypothesis that maximizes the combined CTC and attention scores (see below)
• CTC decodes at the frame rate; the attention decoder operates character by character
• Difficulty: mismatch between the CTC and attention model scoring
• Solution: compute the probability of each partial hypothesis h based on the CTC prefix probability, defined as the cumulative probability of all label sequences that have h as their prefix
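Concretely, a partial hypothesis h can be scored by interpolating the two log probabilities, with the CTC side using the prefix probability (a standard formulation; ν ranges over all continuations of h and λ is an interpolation weight):

```latex
\alpha(h) \;=\; \lambda\,\log p_{\mathrm{ctc}}(h,\dots \mid x) \;+\; (1-\lambda)\,\log p_{\mathrm{att}}(h \mid x),
\qquad
p_{\mathrm{ctc}}(h,\dots \mid x) \;=\; \sum_{\nu} p_{\mathrm{ctc}}(h \cdot \nu \mid x)
```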
25. Choice of Attention
• Additive attention is more stable than dot-product attention
• Multiple independent attention heads significantly improve model performance: they allow the model to simultaneously attend to multiple locations in the input utterance
• For each head i, frame t, and output unit u, an additive attention value is computed; the attention probabilities are normalized across frames, and the attention context summarizes over all frames
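A minimal sketch of multi-head additive attention under these definitions (projection sizes and the number of heads are illustrative):

```python
import torch
import torch.nn as nn

class AdditiveMultiHeadAttention(nn.Module):
    """Multiple independent additive-attention heads: each head scores
    every encoder frame against the decoder state, normalizes across
    frames, and summarizes the frames into one context vector."""
    def __init__(self, enc_dim, dec_dim, att_dim, n_heads=4):
        super().__init__()
        self.heads = nn.ModuleList()
        for _ in range(n_heads):
            self.heads.append(nn.ModuleDict({
                "W_enc": nn.Linear(enc_dim, att_dim, bias=False),
                "W_dec": nn.Linear(dec_dim, att_dim, bias=False),
                "v": nn.Linear(att_dim, 1, bias=False),
            }))

    def forward(self, enc_out, dec_state):
        # enc_out: (B, T, enc_dim); dec_state: (B, dec_dim)
        contexts = []
        for h in self.heads:
            # Additive score per frame: v^T tanh(W_enc h_t + W_dec s_u)
            e = h["v"](torch.tanh(h["W_enc"](enc_out) +
                                  h["W_dec"](dec_state).unsqueeze(1))).squeeze(-1)
            a = torch.softmax(e, dim=1)                        # normalize across frames
            contexts.append(torch.bmm(a.unsqueeze(1), enc_out).squeeze(1))
        return torch.cat(contexts, dim=-1)                     # concat head contexts
```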
26. Word Error Rate Training
• Word error rate training: minimize the expected number of word errors over the training set (4-7% relative WER reduction over cross-entropy training)
• The exact expectation is intractable since it involves a summation over all possible label sequences; approximate it on an N-best list, scoring each hypothesis by its number of word errors relative to the ground-truth sequence (written out below)
• Adding the CE criterion is important to stabilize training
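Written out, the N-best approximation of the MWER criterion (following Prabhavalkar et al., 2017, up to variance-reduction details; $\hat{P}$ renormalizes over the N-best list and $\mathcal{W}$ counts word errors against the ground truth $y^{*}$):

```latex
\mathcal{L}_{\mathrm{MWER}} \;\approx\; \sum_{y_i \in \mathrm{NBest}(x,N)} \hat{P}(y_i \mid x)\,\mathcal{W}(y_i, y^{*})
\;+\; \lambda\,\mathcal{L}_{\mathrm{CE}},
\qquad
\hat{P}(y_i \mid x) \;=\; \frac{P(y_i \mid x)}{\sum_{y_j \in \mathrm{NBest}(x,N)} P(y_j \mid x)}
```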
27. Inference in Attention Model
• Find the best hypothesis with beam search, scoring each candidate by the attention model's AM score plus an external LM score (shallow fusion) plus a coverage penalty (see the reconstruction below)
• Coverage penalty: measures, via the attention probability of the j-th output label on the i-th frame, the extent to which the input frames are "covered" by the attention weights; it penalizes incomplete transcripts and addresses the common seq2seq failure mode of assigning high probability to shorter output sequences
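A standard reconstruction of the decoding criterion these callouts annotate (the indicator form of the coverage term follows common practice and is an assumption here; λ, η, and τ are tuning parameters, p_{ij} the attention probability of the j-th output label on the i-th frame):

```latex
\hat{y} \;=\; \arg\max_{y}\;
\underbrace{\log p(y \mid x)}_{\text{attention model AM score}}
\;+\; \underbrace{\lambda \log p_{\mathrm{LM}}(y)}_{\text{shallow fusion}}
\;+\; \underbrace{\eta \sum_{i} \mathbb{1}\!\left[\textstyle\sum_{j} p_{ij} > \tau\right]}_{\text{coverage}}
```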
28. Deal With OOV
• End-to-end models perform better when word or sub-word units are used, probably due to the stronger constraints built into those units; however, this introduces the OOV problem
• Solution: combine the character and word LMs
• Trick: exploit the character LM (CLM) when the word LM (WLM) is not available, and use the WLM when it is
• Benefit: keeps promising candidates inside the beam
(Notation for the decoding rule below: S is the set of labels that indicate the end of a word; w_g is the last word of the character sequence; ψ_g is the word-level history excluding w_g; β is a factor that adjusts the probabilities for OOV words; the probability of w_g obtained by the CLM is used to cancel the CLM probabilities accumulated for w_g.)
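A reconstruction of the multi-level decoding rule these definitions annotate (following Hori et al., 2017; details may differ slightly from the published form). Dividing by the CLM probability cancels the character-level mass already accumulated for w_g, and β rescales the unknown-word probability for OOVs:

```latex
p(c \mid g) \;=\;
\begin{cases}
p_{\mathrm{wlm}}(w_g \mid \psi_g)\,/\,p_{\mathrm{clm}}(w_g) & \text{if } c \in S \text{ and } w_g \text{ is in the vocabulary}\\
\beta\, p_{\mathrm{wlm}}(\langle\mathrm{UNK}\rangle \mid \psi_g) & \text{if } c \in S \text{ and } w_g \text{ is OOV}\\
p_{\mathrm{clm}}(c \mid g) & \text{otherwise}
\end{cases}
```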
29. Combine With RNN-T
• Attention model: integrates acoustic and language information
• Joint decoder: combines the attention model output with additional acoustic information
31. Neural Transducer (NT)
• Drawback of seq2seq: the entire input sequence needs to be encoded before the output sequence can be decoded
• Neural Transducer (NT): limits attention to fixed-size blocks of the encoder space
32. Neural Transducer (NT)
• Examines each block in turn
• Attention is computed only over the frames in each block
• Within each block, produces a sequence of k (0 < k ≤ M) outputs
• Outputs an <epsilon> symbol to signify the end of block processing (see the sketch after this list)
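The block-wise control flow, sketched with an assumed per-step decoder callable (names and the per-block symbol cap are illustrative):

```python
def neural_transducer_decode(encoder_out, decode_step, block_size,
                             eps="<eps>", max_per_block=8):
    """Neural Transducer control flow: walk fixed-size blocks of the
    encoder output; within a block, emit outputs (attention restricted
    to that block) until the model emits <epsilon>."""
    hyp = []
    for start in range(0, len(encoder_out), block_size):
        block = encoder_out[start:start + block_size]  # attention sees only this block
        for _ in range(max_per_block):                 # k outputs, then <epsilon>
            sym = decode_step(block, hyp)              # assumed per-step decoder
            if sym == eps:                             # end of this block's output
                break
            hyp.append(sym)
    return hyp
```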
33. Neural Transducer (NT)
• Training:
• Requires knowing which sub-word/word units occur in each chunk, so an alignment is needed
• Finds the approximate best alignment with a dynamic-programming-like algorithm
• Batches the alignment inference steps and caches the alignments
• Inference:
• Uses a beam search heuristic
• At each output step m, extends each candidate by one symbol with all possible extensions and keeps only the best n extensions
34. Improve Neural Transducer
• The Neural Transducer performs much worse than the attention model but allows streamed recognition
• Improvements:
• Allow attention to look back over many previous chunks and look ahead by 5 frames
• Initialize NT from a pre-trained attention model
• Incorporate a stronger LM (e.g., sub-word and word LMs)
• Use an external LM via shallow fusion
• Use multi-head attention
38. Tencent AI Lab
• Shenzhen office established in April 2016; Seattle office established in May 2017
• Mission: improve existing scenarios and enable new scenarios through technology breakthroughs
39. Over 100 Publications Since 2016
• CVPR (computer vision): 2017 received 2,680 valid submissions and accepted 783 (29.22% acceptance rate); 6 papers from Tencent AI Lab were accepted
• ACL (computational linguistics): 2017 received 1,318 valid submissions and accepted 302 (22.91%); 3 papers from Tencent AI Lab were accepted
• ICML (machine learning): 2017 received 1,676 valid submissions and accepted 434 (25.89%); 4 papers from Tencent AI Lab were accepted
• NIPS (machine learning and computational neuroscience): 2017 received 3,240 valid submissions and accepted 678 (20.9%); 8 papers from Tencent AI Lab were accepted, including 1 oral (acceptance rate 1.2%)
40. Current Focus Areas at Seattle Lab
Speech Processing
• Mic-array processing
• Speech recognition
• Speaker recognition
• Text to speech (TTS)
Natural Language Processing
• Semantic parsing and representation
• Semantic reasoning
• Knowledge extraction and representation
• Natural language generation
Dialog System
• Dialog state tracking and management
• Dialog strategy inference and optimization
• Personalized adaptive dialog
Across all areas: optimization techniques, weakly supervised and reinforcement learning; multi-modal signal processing and semantic grounding
41. We Are Hiring Full Time Researchers
• In the areas of speech processing, natural language processing, and dialog systems
• Self-motivated, good at both theory and engineering
• Principal researcher
• Experienced researchers who have made significant innovative scientific
contributions
• Apply at https://app.jobvite.com/j?cj=oKUh5fwK&s=LinkedIn
• Senior researcher
• Researchers who have made innovative scientific contributions
• Apply at https://app.jobvite.com/j?cj=oSUh5fwS&s=LinkedIn
• Send CV to us-career@tencent.com (mention the job and location)
42. References and Credit of Pictures, Tables
• Survey and comparison
• Yu, D. and Li, J., 2017. Recent progresses in deep learning based acoustic models. IEEE/CAA Journal of Automatica Sinica, 4(3), pp. 396-409.
• Prabhavalkar, R., Rao, K., Sainath, T.N., Li, B., Johnson, L. and Jaitly, N., 2017. A comparison of sequence-to-sequence models for speech recognition. In Proc. Interspeech (pp. 939-943).
• Battenberg, E., Chen, J., Child, R., Coates, A., Gaur, Y., Li, Y., Liu, H., Satheesh, S., Seetapun, D., Sriram, A. and Zhu, Z., 2017. Exploring neural transducers for end-to-end speech recognition. arXiv preprint arXiv:1707.07413.
• Connectionist Temporal Classification (CTC)
• Graves, A., Fernández, S., Gomez, F. and Schmidhuber, J., 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML (pp. 369-376). ACM.
• Sak, H., Senior, A., Rao, K. and Beaufays, F., 2015. Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947.
• RNN Transducer (RNN-T)
• Graves, A., 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
• Rao, K., Prabhavalkar, R. and Sak, H., 2017. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-Transducer. In Proc. ASRU.
• Recurrent Neural Aligner (RNA)
• Sak, H., Shannon, M., Rao, K. and Beaufays, F., 2017. Recurrent Neural Aligner: an encoder-decoder neural network model for sequence to sequence mapping. In Proc. Interspeech.
43. References and Credit of Pictures, Tables
• Attention Model
• Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P. and Bengio, Y., 2016. End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP (pp. 4945-4949). IEEE.
• Chan, W., Jaitly, N., Le, Q. and Vinyals, O., 2016. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP (pp. 4960-4964). IEEE.
• Kim, S., Hori, T. and Watanabe, S., 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. ICASSP (pp. 4835-4839). IEEE.
• Prabhavalkar, R., Sainath, T.N., Wu, Y., Nguyen, P., Chen, Z., Chiu, C.C. and Kannan, A., 2017. Minimum word error rate training for attention-based sequence-to-sequence models. arXiv preprint arXiv:1712.01818.
• Kannan, A., Wu, Y., Nguyen, P., Sainath, T.N., Chen, Z. and Prabhavalkar, R., 2017. An analysis of incorporating an external language model into a sequence-to-sequence model. arXiv preprint arXiv:1712.01996.
• Neural Transducer (NT)
• Jaitly, N., Le, Q.V., Vinyals, O., Sutskever, I., Sussillo, D. and Bengio, S., 2016. An online sequence-to-sequence model using partial conditioning. In Advances in Neural Information Processing Systems (pp. 5067-5075).
• Sainath, T.N., Chiu, C.C., Prabhavalkar, R., Kannan, A., Wu, Y., Nguyen, P. and Chen, Z., 2017. Improving the performance of online neural transducer models. arXiv preprint arXiv:1712.01807.
• Joint Char-LM and Word-LM
• Hori, T., Watanabe, S. and Hershey, J.R., 2017. Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition. In Proc. ASRU.
• Audhkhasi, K., Kingsbury, B., Ramabhadran, B., Saon, G. and Picheny, M., 2017. Building competitive direct acoustics-to-word models for English conversational speech recognition. arXiv preprint arXiv:1712.03133.