AI NEXTCon Seattle '18
1/17-20th | Seattle
#ainextcon
http://aisea18.xnextcon.com
State-of-the-art of End-to-end Speech Recognition Systems
Dong Yu
Tencent AI Lab
Speech Recognition
• Determines the most likely word sequence, W = w1, ..., wn, given an acoustic input sequence, x = x1, ..., xT, where T is the number of frames in the utterance (the classical factorization is sketched below)
• Acoustic model (AM): predicts the likelihood of the acoustic input utterance given a phoneme sequence
• Pronunciation model (PM): converts a word sequence to a phoneme sequence
• Language model (LM): predicts the likelihood of a word sequence
• A pronunciation model is unnecessary but helpful for some languages; systems without one are called grapheme-based
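For reference, this is the classical factorization behind the three components; notation follows the slide, and the sum over phoneme sequences L is usually approximated by a max (Viterbi):

$$\hat{W} = \arg\max_{W} p(W \mid x) = \arg\max_{W} \sum_{L} \underbrace{p(x \mid L)}_{\text{AM}}\; \underbrace{p(L \mid W)}_{\text{PM}}\; \underbrace{p(W)}_{\text{LM}}$$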
Can a Single Model Do All End-to-end?
• Speech recognition is essentially a sequence (audio sequence) to sequence (word sequence) transformation problem
• Why not do the sequence-to-sequence transformation directly?
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T)
• Recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (limited-size attention)
• Key problems:
• How to handle the variable input length
• How to handle the length difference and alignment between the input and output
Outline
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T) & recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (limited-size attention)
• Summary
Connectionist Temporal Classification
• Recognition unit: characters, words, or phonemes
• Frame number: time
• Blank symbol: no output is generated; the model is not confident enough
• Repeated symbol: treated as one
• Alignments: many paths lead to the same recognition result (a decoding sketch follows)
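A minimal sketch of the collapsing rule described above, assuming per-frame posteriors from a trained CTC model with blank at index 0 (both assumptions): greedy decoding picks the best unit per frame, merges repeats, then drops blanks.

```python
import numpy as np

BLANK = 0  # assumed index of the blank symbol

def ctc_greedy_decode(posteriors: np.ndarray) -> list:
    """posteriors: (T, V) array of per-frame unit probabilities."""
    best_path = posteriors.argmax(axis=1)  # most likely unit at each frame
    collapsed, prev = [], None
    for unit in best_path:
        if unit != prev:                   # repeated symbol: treated as one
            collapsed.append(int(unit))
        prev = unit
    return [u for u in collapsed if u != BLANK]  # blank: no output generated
```

For example, the path _gg_o_oo_d collapses to g, o, o, d, i.e. "good".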
Connectionist Temporal Classification
• Training (conditional likelihood, sensitive to initialization): sum over all alignments that lead to the same label sequence in the training set (e.g., _gg_o_oo_d for the label sequence "good"), under a conditional independence assumption across frames
• Inference: find the single alignment with the highest score
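Concretely, with B denoting the collapsing function (merge repeats, then remove blanks), CTC training maximizes the conditional likelihood by summing over all alignments π that collapse to the label sequence y, with per-frame outputs assumed conditionally independent:

$$p(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} p(\pi_t \mid x), \qquad \mathcal{L}_{\text{CTC}} = -\ln p(y \mid x)$$

Inference then approximates $\arg\max_y p(y \mid x)$ with the single best alignment.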
Properties of CTC
• Simple: direct audio-sequence-to-label-sequence transformation with a flexible choice of modeling unit
• Fast: decoding is fast thanks to spikes (confident outputs) or fewer output units
• Random timing: spikes can occur at any delayed time (latency) and may fall outside the label boundary
• Limitation: assumes that model outputs at a given frame are independent of previous output labels
• CTC (spiky): outputs blanks until confident enough to emit the associated label, whereas a framewise (flat) model outputs the same label at every frame of a unit
Improve: Sequence Discriminative Training
• CTC training cannot exploit external text to improve the LM: use an LM trained with external text to improve performance
• The CTC training objective is the likelihood of observing the label sequence given the audio sequence: use sequence discriminative training instead (a sketch of one such criterion follows)
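As a sketch of what sequence discriminative training looks like, one common criterion is MMI-style, which normalizes the score of the reference word sequence W_u against all competing word sequences; the exact form and notation here are an assumption, not taken from the slides:

$$\mathcal{F}_{\text{MMI}} = \sum_{u} \log \frac{p(x_u \mid W_u)\, P(W_u)}{\sum_{W} p(x_u \mid W)\, P(W)}$$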
Improve: Use Word Units and Better Tricks
• The quality of the implicit LM depends on the modeling unit
• CTC-Word: directly models word, word-piece, or cross-word units
• Decoding is simple greedy search: extremely fast
• To achieve good results, you need complicated engineering procedures similar to those used in hybrid systems
• For reference, the state-of-the-art hybrid system on the 300-hour SWB task is around 10% WER
Solve OOV in CTC-Word
• Spell and recognize (SAR)
• Present training examples that contain both words and characters, where b- marks a word's beginning character and e- marks its ending character:
b-t h e-e THE   b-c a e-t CAT   b-i e-s IS   b-b l a c e-k BLACK
• The model is trained to first spell the word and then recognize it
• The SAR model has a single softmax over words + characters in the output layer
• Allows leveraging greedy-search decoding: no beam or other graph-based search is needed
• Not the ideal solution
Outline
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T) & recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (limited-size attention)
• Summary
RNN Transducer (RNN-T)
• A streaming, all-neural, sequence-to-sequence architecture
• Jointly learns acoustic and language model components:
• Encoder: maps acoustic frames into a higher-level representation; conditioned on previous acoustic frames; initialized from a CTC model
• Prediction network: a language model that can be trained on text-only data; explicitly conditioned on the history of previous non-blank targets predicted by the model; can use grapheme, word, or word-piece units
• Joint network: combines the acoustic and language information (see the sketch below)
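A minimal PyTorch-style sketch of the three components; the layer sizes, the 80-dimensional feature input, and the one-hot label input are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RNNT(nn.Module):
    def __init__(self, n_units, enc_dim=512, pred_dim=512, joint_dim=512):
        super().__init__()
        # encoder: maps acoustic frames into a higher-level representation
        self.encoder = nn.LSTM(80, enc_dim, num_layers=4, batch_first=True)
        # prediction network: an LM over previous non-blank labels
        self.prediction = nn.LSTM(n_units, pred_dim, num_layers=2, batch_first=True)
        # joint network: combines acoustic and language information
        self.joint = nn.Sequential(
            nn.Linear(enc_dim + pred_dim, joint_dim), nn.Tanh(),
            nn.Linear(joint_dim, n_units + 1))  # +1 for the blank symbol

    def forward(self, feats, labels_onehot):
        enc, _ = self.encoder(feats)              # (B, T, enc_dim)
        pred, _ = self.prediction(labels_onehot)  # (B, U, pred_dim)
        # broadcast both to a (B, T, U, .) lattice before combining
        t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
        u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
        return self.joint(torch.cat([t, u], dim=-1))  # logits over units + blank
```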
Properties of RNN-T
• Solves the conditional independence assumption in CTC: decoding results now depend on the previous output symbol through the prediction network
• Solves CTC's inability to exploit large text-only data: the prediction network can exploit larger text-only corpora
• The prediction network is not conditioned on the encoder output: this allows pre-training the decoder as an RNN language model on text-only data
• Still uses the blank symbol and the same repeated-symbol handling technique as CTC
RNN-T Training Procedure
• The training procedure is very complicated if you want good results; it is not significantly simpler than that of a hybrid system
RNN Transducer: Inference
• Alternate between updating the encoder and the prediction network depending on whether the predicted label is a blank or non-blank: the encoder consumes the next acoustic frame, while the prediction network consumes the previously predicted label and is updated only if that label is non-blank
• The joint network outputs the next output label probabilities
• Inference terminates when blank is output at the last frame, T
• Greedy search or beam search (a greedy sketch follows)
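A hedged sketch of the greedy variant of this loop; `prediction_step` and `joint_step` are assumed per-step interfaces (not a real library API), with None standing in for the start-of-sequence state:

```python
def rnnt_greedy_decode(model, feats, blank=0, max_symbols_per_frame=10):
    enc, _ = model.encoder(feats)                             # (1, T, enc_dim)
    hyp = []
    pred_out, pred_state = model.prediction_step(None, None)  # <sos> state
    for t in range(enc.size(1)):                  # next acoustic frame
        for _ in range(max_symbols_per_frame):    # cap emissions per frame
            logits = model.joint_step(enc[:, t], pred_out)
            k = int(logits.argmax(dim=-1))
            if k == blank:                        # blank: advance the encoder
                break
            hyp.append(k)                         # non-blank: update the prediction net only
            pred_out, pred_state = model.prediction_step(k, pred_state)
    return hyp                                    # ends after blank at the last frame T
```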
Recurrent Neural Aligner (RNA)
• Similar to RNN-T:
• Aims at solving the conditional independence assumption in CTC
• Uses the predicted label at time t−1 as an additional input to the recurrent model when predicting the label at time t
• Has an encoder network to encode the raw input as the input sequence x
• Has a recurrent decoder network; the input to the decoder network at time t for a given alignment z is [xt, zt−1]
• Different from RNN-T:
• RNN-T uses one RNN for the LM and another for the AM and then combines them; RNA uses a single RNN to train the AM/LM jointly (not factorized)
• RNA requires an approximate forward-backward algorithm for training because of the joint RNN model
Outline
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T) & recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (limited-size attention)
• Summary
Sequence-to-Sequence with Attention
• Listen, Attend and Spell (the LAS model)
Alternative View of Attention Model
• Encoder: maps input acoustic vectors into a higher-level representation
• Attention: summarizes the output of the encoder based on the current state of the decoder
• Decoder: models an output distribution over the next target conditioned on the sequence of previous predictions
Properties of Basic Attention Model
• Strengths (similar to RNN-T and better than CTC):
• No conditional independence assumption
• Prediction of the next unit depends on both LM and AM information
• Different from RNN-T:
• The attention weights depend on the current decoder state
• Weaknesses (worse than CTC):
• Exposure bias: conditioned on the true label during training but on the estimated label during decoding
• Too flexible: attention weights are not constrained to attend from left to right
• Very difficult to train well, especially when the input is long, even with a pyramid structure and/or other subsampling techniques
• High latency: cannot be streamed
Constrain Attention Model with CTC
• The left-to-right constraint in CTC can help regularize the attention model
• Joint training criterion (see below)
• Regularization through a shared encoder
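The multi-task objective from Kim et al. (2017) interpolates the two losses over the shared encoder with a weight $\lambda \in [0, 1]$:

$$\mathcal{L}_{\text{MTL}} = \lambda\, \mathcal{L}_{\text{CTC}} + (1 - \lambda)\, \mathcal{L}_{\text{attention}}$$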
Constrain Attention Model with CTC
• The speed of learning alignments between characters (y-axis) and acoustic frames (x-axis) is significantly improved with multi-task learning
• [Alignment plots: attention-only training can align to the end or in an incorrect order; attention + CTC learns clean alignments]
Decoding in Attention + CTC Model
• Basic idea: beam search to find the best hypothesis under the joint score
• Difficulty: mismatch between CTC and attention model scoring, since CTC decodes at the frame rate while the attention decoder operates character by character
• Solution: compute the probability of each partial hypothesis h based on the CTC prefix probability, defined as the cumulative probability of all label sequences that have h as their prefix (see below)
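In symbols (following Kim/Hori et al.; this notation is a reconstruction, since the slide's formula did not survive extraction), the CTC prefix probability of a partial hypothesis $h$ sums over all continuations $\nu$:

$$p_{\text{ctc}}(h, \ldots \mid x) = \sum_{\nu \in (\mathcal{U} \cup \{\langle\text{eos}\rangle\})^{+}} p_{\text{ctc}}(h \cdot \nu \mid x)$$

and the beam score interpolates it with the attention score, $\alpha(h) = \lambda \log p_{\text{ctc}}(h, \ldots \mid x) + (1-\lambda) \log p_{\text{att}}(h \mid x)$.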
Choice of Attention
• Additive attention is more stable than dot-product attention
• Multiple independent attention heads significantly improve model performance:
• Allows the model to simultaneously attend to multiple locations in the input utterance
• For each head i at frame t and output unit u, an additive attention value is computed, normalized across frames into an attention probability, and summarized over all frames into an attention context (see the sketch below)
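A minimal sketch of multi-head additive attention matching those callouts; the dimensions and module layout are assumptions:

```python
import torch
import torch.nn as nn

class AdditiveMultiHeadAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim, n_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "W_enc": nn.Linear(enc_dim, att_dim, bias=False),
                "W_dec": nn.Linear(dec_dim, att_dim, bias=False),
                "v": nn.Linear(att_dim, 1, bias=False),
            }) for _ in range(n_heads))

    def forward(self, enc_out, dec_state):
        # enc_out: (B, T, enc_dim); dec_state: (B, dec_dim)
        contexts = []
        for h in self.heads:
            # additive attention value for head i at frame t and output unit u
            e = h["v"](torch.tanh(h["W_enc"](enc_out) +
                                  h["W_dec"](dec_state).unsqueeze(1)))
            a = torch.softmax(e, dim=1)        # attention probability: normalized across frames
            contexts.append((a * enc_out).sum(dim=1))  # context: summarized over all frames
        return torch.cat(contexts, dim=-1)     # independent heads, concatenated
```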
Word Error Rate Training
• Cross-entropy training maximizes the log-likelihood of the ground-truth sequence
• Word error rate training minimizes the expected number of word errors over the training set (4-7% WERR); see the formula below
• The expectation is intractable since it involves a summation over all possible label sequences, so it is approximated on an N-best list
• The error term is the number of word errors in a hypothesis relative to the ground-truth sequence
• Adding the CE criterion is important to stabilize training
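Schematically, following Prabhavalkar et al. (2017) (the variance-reduction term and interpolation weight are taken from that paper, not from the slide):

$$\mathcal{L}_{\text{MWER}} \approx \sum_{y \in \text{NBest}(x)} \hat{P}(y \mid x)\,\big[\mathcal{W}(y, y^{*}) - \widehat{\mathcal{W}}\big] + \lambda\, \mathcal{L}_{\text{CE}}$$

where $\hat{P}$ renormalizes the model probabilities over the N-best list, $\mathcal{W}(y, y^{*})$ counts word errors against the ground truth, and $\widehat{\mathcal{W}}$ is their mean over the list.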
Inference in Attention Model
• Find the best hypothesis with beam search, combining the attention model AM score with an external LM score (shallow fusion) and a coverage penalty (see below)
• Coverage penalty: penalizes incomplete transcripts
• Measures the extent to which the input frames are "covered" by the attention weights, via the attention probability of the j-th output label on the i-th frame
• Addresses the common s2s failure mode of assigning high probability to shorter output sequences
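Putting the pieces together, one common form of the beam-search objective (the indicator-style coverage term is one instantiation, assumed here; $a_{ij}$ is the attention probability of output label $j$ on frame $i$):

$$\hat{y} = \arg\max_{y}\; \log p_{\text{att}}(y \mid x) + \lambda \log p_{\text{LM}}(y) + \eta \sum_{i=1}^{T} \mathbb{1}\Big[\textstyle\sum_{j} a_{ij} \ge \tau\Big]$$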
Deal With OOV
• End-to-end models perform better when word or sub-word units are used
• Probably due to the stronger constraints built into the units
• But this introduces the OOV problem
• Solution: combine the character and word LMs
• Trick: exploit the character LM (CLM) when the word LM (WLM) is not available, and use the WLM when it is
• Benefit: keeps promising candidates inside the beam
• Notation (from the scoring formula): S is the set of labels that indicate the end of a word; wg is the last word of the character sequence; ψg is the word-level history (excluding wg); a factor adjusts the probabilities for OOV words; the probability of wg obtained by the CLM is used to cancel the CLM probabilities accumulated for wg
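Roughly, in the notation of Hori et al. (2017) (this reconstruction is an assumption; the slide's formula did not survive extraction), the label probability at character c given character history g is:

$$p_{\text{lm}}(c \mid g) = \begin{cases} \dfrac{p_{\text{wlm}}(w_g \mid \psi_g)}{p_{\text{clm}}(w_g \mid \psi_g)} & c \in S,\; w_g \in \mathcal{V} \\[1ex] \beta\, p_{\text{wlm}}(\langle\text{UNK}\rangle \mid \psi_g) & c \in S,\; w_g \notin \mathcal{V} \\[1ex] p_{\text{clm}}(c \mid g) & \text{otherwise} \end{cases}$$

so the CLM scores characters inside a word, and at word boundaries its accumulated probability is cancelled and replaced by the WLM score (or by a scaled unknown-word probability for OOVs).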
Combine With RNN-T
• Attention model: integrates acoustic and language information
• Joint decoder: combines the attention model output with additional acoustic information
Outline
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T) & recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (limited-size attention)
• Summary
Neural Transducer (NT)
• Drawback of seq2seq: the entire input sequence needs to be encoded before the output sequence can be decoded
• Neural Transducer (NT): limits attention to fixed-size blocks of the encoder space
Neural Transducer (NT)
• Examines each block in turn
• Attention is only computed over the frames in each block
• Within each block, produces a sequence of k (0 < k ≤ M) outputs
• Outputs an <epsilon> symbol to signify the end of block processing (see the sketch below)
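A hedged sketch of this block-wise loop; `attend_decode_step` is an assumed one-step interface that attends within the given block and returns the next symbol:

```python
def neural_transducer_decode(model, enc_out, block_size, epsilon, max_per_block=8):
    hyp, state = [], None
    for start in range(0, enc_out.size(1), block_size):
        block = enc_out[:, start:start + block_size]  # attention limited to this block
        for _ in range(max_per_block):                # emit up to M symbols per block
            sym, state = model.attend_decode_step(block, hyp, state)
            if sym == epsilon:                        # <epsilon>: end of block processing
                break
            hyp.append(sym)
    return hyp
```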
Neural Transducer (NT)
• Training:
• Requires knowing which sub-word/word units occur in each chunk, so an alignment is needed
• Finds the approximate best alignment with a dynamic-programming-like algorithm
• Batches the alignment inference steps and caches the alignments
• Inference:
• Uses a beam search heuristic
• At each output step m, extend each candidate by one symbol with all possible extensions, and keep only the best n extensions
Improve Neural Transducer
• The performance of the Neural Transducer is much worse than that of the attention model, but NT allows streamed recognition
• Improvements:
• Allow attention to be computed looking back over many previous chunks and looking ahead by 5 frames
• Initialize NT from a pre-trained attention model
• Incorporate a stronger LM (e.g., sub-word and word LMs)
• Use an external LM via shallow fusion
• Use multi-head attention
Outline
• Connectionist Temporal Classification (CTC)
• Recurrent neural network transducer (RNN-T) & recurrent neural network aligner (RNN-A)
• Sequence to sequence with attention (seq2seq-attention)
• Neural Transducer (NT) (limited-size attention)
• Summary
Summary
[Comparison table of the approaches (figure not recoverable); callouts:]
• CTC: the only model with a conditional independence assumption; requires an external LM or large units, but decoding may be just greedy search
• RNN-T: exploits an external LM directly in the model
• Attention: the decoder hidden state is used to extract supportive information from the inputs
• NT: combines RNN-T and attention-model ideas
Summary
• All of these models still underperform the DNN-HMM hybrid system even with all the tricks, but the gap shrinks as the training set grows
• With all the tricks, the training process is no longer as simple as has been claimed
• Not using a lexicon is not new: such systems were called grapheme-based models in the past
• Several problems still need to be solved:
• Is RNN-T or the attention model more promising?
• Is there a better solution or theory for the problem of favoring short output sequences in attention and other segmental models?
• Is there a better model structure?
• Is there a theory or procedure to reduce the demand for training data?
Tencent AI Lab
• Shenzhen office established in April 2016; Seattle office established in May 2017
• Mission: improve existing scenarios and enable new scenarios through technology breakthroughs
Over 100 Publications Since 2016
• CVPR 2017 (computer vision) received 2,680 valid submissions and accepted 783 (29.22% acceptance rate); 6 papers from Tencent AI Lab were accepted
• ACL 2017 (computational linguistics) received 1,318 valid submissions and accepted 302 (22.91%); 3 papers from Tencent AI Lab were accepted
• ICML 2017 (machine learning) received 1,676 valid submissions and accepted 434 (25.89%); 4 papers from Tencent AI Lab were accepted
• NIPS 2017 (machine learning and computational neuroscience) received 3,240 valid submissions and accepted 678 (20.9%); 8 papers from Tencent AI Lab were accepted, including 1 oral (acceptance rate 1.2%)
Current Focus Areas at Seattle Lab
• Speech processing: mic-array processing, speech recognition, speaker recognition, text to speech (TTS)
• Natural language processing: semantic parsing and representation, semantic reasoning, knowledge extraction and representation, natural language generation
• Dialog systems: dialog state tracking and management, dialog strategy inference and optimization, personalized adaptive dialog
• Cross-cutting: optimization techniques, weakly supervised and reinforcement learning; multi-modal signal processing and semantic grounding
We Are Hiring Full-Time Researchers
• In the areas of speech processing, natural language processing, and dialog systems
• Self-motivated, good at both theory and engineering
• Principal researcher: experienced researchers who have made significant innovative scientific contributions; apply at https://app.jobvite.com/j?cj=oKUh5fwK&s=LinkedIn
• Senior researcher: researchers who have made innovative scientific contributions; apply at https://app.jobvite.com/j?cj=oSUh5fwS&s=LinkedIn
• Send your CV to us-career@tencent.com (mention the job and location)
References and Credit of Pictures, Tables
• Survey	and	comparison
• Yu,	D.	and	Li,	J.,	2017.	Recent	progresses	in	deep	learning	based	acoustic	models.	IEEE/CAA	Journal	of	Automatica Sinica,	
4(3),	pp.396-409.
• Prabhavalkar,	R.,	Rao,	K.,	Sainath,	T.N.,	Li,	B.,	Johnson,	L.	and	Jaitly,	N.,	2017.	A	comparison	of	sequence-to-sequence	models	
for	speech	recognition.	In	Proc.	Interspeech (pp.	939-943).
• Battenberg,	E.,	Chen,	J.,	Child,	R.,	Coates,	A.,	Gaur,	Y.,	Li,	Y.,	Liu,	H.,	Satheesh,	S.,	Seetapun,	D.,	Sriram,	A.	and	Zhu,	Z.,	2017.	
Exploring	Neural	Transducers	for	End-to-End	Speech	Recognition.	arXiv preprint	arXiv:1707.07413.
• Connectionist	Temporal	Classification	(CTC)
• Graves,	A.,	Fernández,	S.,	Gomez,	F.	and	Schmidhuber,	J.,	2006,	June.	Connectionist	temporal	classification:	labelling	
unsegmented	sequence	data	with	recurrent	neural	networks.	In	Proceedings	of	the	23rd	international	conference	on	
Machine	learning (pp.	369-376).	ACM.
• Sak,	H.,	Senior,	A.,	Rao,	K.	and	Beaufays,	F.,	2015.	Fast	and	accurate	recurrent	neural	network	acoustic	models	for	speech	
recognition.	arXiv preprint	arXiv:1507.06947.
• RNN	Transducer	(RNN-T)
• Graves,	A.,	2012.	Sequence	transduction	with	recurrent	neural	networks.	arXiv preprint	arXiv:1211.3711.
• Rao,	K.,	Prabhavalkar,	R.	and	Sak,	H.,	2017.	Exploring	Architectures,	Data	and	Units	for	Streaming	End-to-End	Speech	
Recognition	with	RNN-Transducer.	In	Proc.	ASRU.
• Recurrent	Neural	Aligner	(RNA)
• Sak,	H.,	Shannon,	M.,	Rao,	K.	and	Beaufays,	F.,	2017.	Recurrent	Neural	Aligner:	An	Encoder-Decoder	Neural	Network	Model	
for	Sequence	to	Sequence	Mapping.	In	Proc.	of	Interspeech.
References and Credit of Pictures, Tables
• Attention	Model
• Bahdanau,	D.,	Chorowski,	J.,	Serdyuk,	D.,	Brakel,	P.	and	Bengio,	Y.,	2016,	March.	End-to-end	attention-based	large	vocabulary	
speech	recognition.	In	Acoustics,	Speech	and	Signal	Processing	(ICASSP),	2016	IEEE	International	Conference	on (pp.	4945-
4949).	IEEE.
• Chan,	W.,	Jaitly,	N.,	Le,	Q.	and	Vinyals,	O.,	2016,	March.	Listen,	attend	and	spell:	A	neural	network	for	large	vocabulary	
conversational	speech	recognition.	In	Acoustics,	Speech	and	Signal	Processing	(ICASSP),	2016	IEEE	International	Conference	
on (pp.	4960-4964).	IEEE.
• Kim,	S.,	Hori,	T.	and	Watanabe,	S.,	2017,	March.	Joint	CTC-attention	based	end-to-end	speech	recognition	using	multi-task	
learning.	In	Acoustics,	Speech	and	Signal	Processing	(ICASSP),	2017	IEEE	International	Conference	on (pp.	4835-4839).	IEEE.
• Prabhavalkar,	R.,	Sainath,	T.N.,	Wu,	Y.,	Nguyen,	P.,	Chen,	Z.,	Chiu,	C.C.	and	Kannan,	A.,	2017.	Minimum	Word	Error	Rate	
Training	for	Attention-based	Sequence-to-Sequence	Models.	arXiv preprint	arXiv:1712.01818.
• Kannan,	A.,	Wu,	Y.,	Nguyen,	P.,	Sainath,	T.N.,	Chen,	Z.	and	Prabhavalkar,	R.,	2017.	An	analysis	of	incorporating	an	external	
language	model	into	a	sequence-to-sequence	model.	arXiv preprint	arXiv:1712.01996.
• Neural	Transducer	(NT)
• Jaitly,	N.,	Le,	Q.V.,	Vinyals,	O.,	Sutskever,	I.,	Sussillo,	D.	and	Bengio,	S.,	2016.	An	online	sequence-to-sequence	model	using	
partial	conditioning.	In	Advances	in	Neural	Information	Processing	Systems (pp.	5067-5075).
• Sainath,	T.N.,	Chiu,	C.C.,	Prabhavalkar,	R.,	Kannan,	A.,	Wu,	Y.,	Nguyen,	P.	and	Chen,	Z.,	2017.	Improving	the	Performance	of	
Online	Neural	Transducer	Models.	arXiv preprint	arXiv:1712.01807.
• Joint	Char-LM	and	Word-LM
• Hori,	T.,	Watanabe,	S.	and	Hershey,	J.R.,	2017.	Multi-level	Language	Modeling	and	Decoding	for	Open	Vocabulary	End-to-End	
Speech	Recognition.
• Audhkhasi,	K.,	Kingsbury,	B.,	Ramabhadran,	B.,	Saon,	G.	and	Picheny,	M.,	2017.	Building	competitive	direct	acoustics-to-word	
models	for	English	conversational	speech	recognition.	arXiv preprint	arXiv:1712.03133.
Thank You
us-career@tencent.com