Tutorial on end-to-end text-to-speech
synthesis
Part 2 – Tacotron and related end-to-end systems
Yusuke Yasuda (安田裕介)
National Institute of Informatics
NII software toward end-to-end speech synthesis, Part 2
Yusuke Yasuda
National Institute of Informatics
~ A tutorial on Tacotron and WaveNet ~
About me
- Name: Yusuke Yasuda (安田裕介)
- Status:
  - A PhD student at Sokendai / NII
  - A programmer at work
  - Former geologist (bachelor's and master's degrees)
- GitHub account: TanUkkii007
Table of contents
1. Text-to-speech architecture
2. End-to-end model
3. Example architectures
a. Char2Wav
b. Tacotron
c. Tacotron2
4. Our work: Japanese Tacotron
5. Implementation
TTS architecture: traditional pipeline
- Typical pipeline architecture for statistical parametric speech synthesis
- Consists of task-specific models:
  - linguistic model
  - alignment (duration) model
  - acoustic model
  - vocoder
TTS architecture: End-to-end model
- The end-to-end model directly converts text to a waveform
- The end-to-end model does not require intermediate feature extraction
- Pipeline models accumulate errors across predicted features
- The end-to-end model's internal blocks are jointly optimized
[1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018)
End-to-end model: Encoder-Decoder with Attention
- Building blocks
- Encoder
- Attention mechanism
- Decoder
- Neural vocoder
Conventional end-to-end models may not include a waveform generator, but some recent fully end-to-end models contain a neural waveform generator, e.g. ClariNet [1].
End-to-end model: Decoding with attention mechanism

Time step 1:
- Assign alignment probabilities to the encoded inputs. The alignment scores can be derived from the attention layer's output and the encoded inputs.
- Calculate the context vector: the sum of the encoded inputs weighted by the alignment probabilities.
- The decoder layer predicts an output from the context vector.

Time step 2:
- The previous output is fed back into the next time step.
- Assign alignment probabilities to the encoded inputs.
- Calculate the context vector.
- The decoder layer predicts an output from the context vector.
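The per-step procedure above can be sketched in a few lines. This is an illustrative sketch only: the bilinear score function and all array shapes are assumptions, not the exact formulation of any particular model.

```python
import numpy as np

def decode_step(attention_state, encoded, w_att):
    """One decoder time step with attention (illustrative sketch).

    encoded: (T, d) encoder outputs; attention_state: (d_s,) attention
    layer output (which has already consumed the fed-back previous output).
    """
    # 1. assign alignment probabilities to the encoded inputs
    scores = encoded @ (w_att @ attention_state)     # (T,)
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                              # alignment probabilities
    # 2. context vector: sum of encoded inputs weighted by alignment
    context = alpha @ encoded                        # (d,)
    # 3. a decoder layer would now predict the output frame from context
    return alpha, context
```

At the next time step the predicted output is fed back, and the same two steps (alignment, then context vector) repeat.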
End-to-end model: Alignment visualization
[Figure: attention alignment with ground-truth and predicted mel spectrograms]
Example: Char2Wav
Char2Wav is one of the earliest works on end-to-end TTS. Its simple sequence-to-sequence architecture with attention serves as a good proof of concept for end-to-end TTS. Char2Wav also uses advanced blocks such as a neural vocoder and multi-speaker embedding.

Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017
https://openreview.net/forum?id=B1VWyySKx
Char2Wav: Architecture
Encoder: bidirectional GRU RNN
Decoder: GRU RNN
Attention: GMM attention
Neural vocoder: SampleRNN
Figure is from Sotelo et al. (2017)
Char2Wav: Source and target choice
- Input: character or phoneme sequence
- Output: vocoder parameters
- Objective: mean square error or negative GMM log likelihood loss
Char2Wav: GMM attention
- Proposed by Graves (2013) [3]
- Location-based attention
  - Alignment is based on input location
  - Alignment does not depend on input content

[3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013)
Char2Wav: GMM attention
- Monotonic progress: the mean value μ always increases as time progresses
- Robust to long-sequence prediction

[Figure: GMM attention alignment from Tacotron predicting vocoder parameters (r=2), at epoch 0 and epoch 130]
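The location-based, monotonic behaviour above can be sketched as a simplified Graves-style mixture. The parameter shapes and the exponential parameterization are assumptions for illustration, not the exact Char2Wav configuration.

```python
import numpy as np

def gmm_attention_step(raw_params, kappa_prev, num_inputs):
    """One step of Graves-style GMM attention (simplified sketch).

    raw_params: (K, 3) unnormalized (weight, width, step) values per
    mixture component, predicted by the attention layer at this step.
    kappa_prev: (K,) mixture means from the previous decoder step.
    """
    omega = np.exp(raw_params[:, 0])               # mixture weights > 0
    beta = np.exp(raw_params[:, 1])                # inverse widths > 0
    kappa = kappa_prev + np.exp(raw_params[:, 2])  # means only ever increase
    u = np.arange(num_inputs)                      # input positions
    # alignment depends only on position u, never on input content
    phi = (omega[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)
    return phi, kappa
```

Because κ only ever increases, the attention window moves monotonically left to right, which is what makes this scheme robust for long sequences.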
Char2Wav: Limitations
- Target features: Char2Wav uses vocoder parameters as the target. Vocoder parameters are a challenging target for a sequence-to-sequence system.
  - e.g. vocoder parameters for 10 s of speech require 2,000 iterations to predict
- Architecture: As the Tacotron paper demonstrated, a vanilla sequence-to-sequence architecture gives poor alignment.

These limitations were solved by Tacotron.
Example: Tacotron
Tacotron is the most influential among modern architectures because it introduced several cutting-edge techniques.

Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010
https://arxiv.org/abs/1703.10135
Tacotron: Architecture
- CBHG encoder
- Encoder & decoder pre-net
- Reduction factor
- Post-net
- Additive attention
Figure is from Wang et al. (2017)
Tacotron: Source and target choice
- Input: character sequence
- Output: mel and linear spectrogram (50 ms frame length, 12.5 ms frame shift)
  - 2.5× shorter total sequence than that of vocoder parameters
- Objective: L1 loss
Tacotron: Decoder pre-net
- The most important architectural advancement over vanilla seq2seq
- 2 fully connected layers + ReLU + dropout
- Crucial for alignment learning
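A minimal sketch of such a pre-net follows; the layer sizes and weight shapes are assumptions, and the always-on dropout is written out explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

def prenet(x, w1, w2, dropout_rate=0.5):
    """Tacotron-style decoder pre-net sketch: 2 FC layers, ReLU, dropout.

    Dropout stays active even at inference time; the resulting noise and
    bottleneck are considered crucial for alignment learning.
    """
    for w in (w1, w2):
        x = np.maximum(x @ w, 0.0)                    # FC + ReLU
        keep = rng.random(x.shape) >= dropout_rate    # dropout mask
        x = np.where(keep, x / (1.0 - dropout_rate), 0.0)
    return x
```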
Tacotron: Reduction factor
- Predicts multiple frames at a time
- Reduction factor r: the number of frames to predict at one step
- Large r →
  - Fewer decoder iterations
  - Easier alignment learning
  - Less training time
  - Less inference time
  - Poorer quality
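The effect of the reduction factor on sequence length can be illustrated by how target frames are grouped. This is a sketch; a real implementation would also pad the last partial group.

```python
import numpy as np

def group_frames(mel, r):
    """Group mel frames so the decoder emits r frames per step.

    mel: (T, n_mels) -> (T // r, r * n_mels). The decoder then runs
    T // r iterations instead of T, which shortens training and
    inference and makes alignment easier to learn, at some cost in
    quality for large r.
    """
    T, n_mels = mel.shape
    T = (T // r) * r              # drop the remainder for simplicity
    return mel[:T].reshape(T // r, r * n_mels)
```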
Tacotron: Additive attention
- Proposed by Bahdanau et al. (2014) [5]
- Content-based attention
  - The distance between source and target is learned by an FFN
  - No structural constraint

[Figure: alignment at epochs 0, 1, 3, 7, 10, 50, and 100]

[5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
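Additive (Bahdanau) attention scores each encoder output with a small feed-forward network; a sketch with assumed weight shapes:

```python
import numpy as np

def additive_attention(query, keys, w_q, w_k, v):
    """Bahdanau-style additive attention sketch.

    query: (d_q,) decoder state; keys: (T, d_k) encoder outputs.
    score_i = v . tanh(W_q q + W_k k_i): purely content-based, with no
    structural constraint forcing the alignment to be monotonic.
    """
    scores = np.tanh(query @ w_q + keys @ w_k) @ v   # (T,)
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                              # alignment probabilities
    context = alpha @ keys                           # context vector
    return alpha, context
```

The lack of a monotonicity constraint is why alignment takes many epochs to emerge, as the visualization above suggests.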
Tacotron: Limitations
- Training time is too long: over 2M steps (mini-batches) to converge
- Waveform generation: Griffin-Lim is handy but hurts waveform quality
- Stop condition: Tacotron predicts a fixed-length spectrogram, which is inefficient both at training and inference time
Example: Tacotron2
Tacotron2 achieved human-level quality of synthesized speech. Its architecture is an extension of Tacotron, and it uses WaveNet for high-quality waveform generation.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783
https://arxiv.org/abs/1712.05884
Tacotron2: Architecture
- CNN layers instead of CBHG at
encoder and post-net
- Stop token prediction
- Post-net improves predicted mel
spectrogram
- Location sensitive attention
- No reduction factor
Figure is from Shen et al. (2018)
Tacotron2: Trends of architecture changes
1. Larger model size
  - character embedding: 256 → 512
  - encoder: 256 → 512
  - decoder pre-net: (256, 128) → (256, 256)
  - decoder: 256 → 1024
  - attention: 256 → 128
  - post-net: 256 → 512
2. More regularization
  - The encoder applies dropout at every layer; the original Tacotron applies dropout only in the encoder pre-net.
  - Zoneout regularization [7] for the encoder bidirectional LSTM and the decoder LSTM
  - The post-net applies dropout at every layer
  - L2 regularization

The CBHG module is replaced by CNN layers at the encoder and post-net, probably due to limited regularization applicability. A large model needs heavy regularization to prevent overfitting.

[7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016)
Tacotron2: Source and target choice
- Input: character sequence
- Output: mel spectrogram
- Objective: mean square error
Tacotron2: Waveform synthesis with WaveNet
- Upsampling rate: 200
- Mel vs. linear spectrogram: no difference
- Human-level quality was achieved by WaveNet trained with ground-truth-aligned predicted spectrograms
Tacotron2: Location sensitive attention
- Proposed by Chorowski et al. (2015) [8]
- Utilizes both input content and input location

[8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585
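Location-sensitive attention extends the additive score with features convolved from the previous alignment; a sketch in which the filter bank and weight shapes are assumptions:

```python
import numpy as np

def location_sensitive_attention(query, keys, prev_alpha, filters,
                                 w_q, w_k, w_f, v):
    """Location-sensitive attention sketch (after Chorowski et al., 2015).

    prev_alpha: (T,) previous alignment; filters: list of 1-D kernels.
    The score combines content terms (query, keys) with location features
    extracted from where the model attended at the previous step.
    """
    # location features from the previous alignment: (T, n_filters)
    loc = np.stack([np.convolve(prev_alpha, f, mode="same") for f in filters],
                   axis=1)
    scores = np.tanh(query @ w_q + keys @ w_k + loc @ w_f) @ v
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    context = alpha @ keys
    return alpha, context
```

Feeding the previous alignment back into the score encourages the attention to move forward consistently rather than jump around.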
Tacotron2: Limitations
- WaveNet training in Tacotron2:
  - WaveNet is trained with ground-truth-aligned **predicted** mel spectrograms.
  - Tacotron2's errors are corrected by WaveNet.
  - However, this is unproductive: one WaveNet must be trained for each different Tacotron2.
- WaveNet training in the normal condition:
  - WaveNet is trained with ground-truth spectrograms.
  - However, its MOS is still inferior to that of human speech.
Our work: Tacotron for the Japanese language
We extended Tacotron for the Japanese language.
- To model Japanese pitch accents, we introduced an accentual-type embedding in addition to the phoneme embedding
- To handle long-term dependencies in accents, we use additional components

Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018)
https://arxiv.org/abs/1810.11960
Japanese Tacotron: Architecture
- Accentual type embedding
- Forward attention
- Self-attention
- Dual source attention
- Vocoder parameter target
Japanese Tacotron: Source and target choice
- Source: phoneme and accentual type sequence
- Target:
- mel spectrogram
- vocoder parameters
- Objective:
- L1 loss (mel spectrogram, mel generalized cepstrum)
- softmax cross entropy loss (discretized F0)
Japanese Tacotron: Forward attention
- Proposed by Zhang et al. (2018) [10]
- Enforces left-to-right alignment progress
- Fast alignment learning

[Figure: alignment at epochs 0, 1, 3, 7, 10, and 50]

[10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
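The forward-attention recursion itself is tiny; a sketch of one step (variable names are mine, and the transition-agent variant of the paper is omitted):

```python
import numpy as np

def forward_attention_step(alpha_prev, y):
    """One step of forward attention (after Zhang et al., 2018), sketch.

    alpha_prev: (T,) forward variables from the previous decoder step;
    y: (T,) content-based attention probabilities for the current step.
    Attention mass can only stay at position u or advance from u-1, so
    the alignment progresses monotonically left to right.
    """
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))  # alpha_{t-1}(u-1)
    alpha = (alpha_prev + shifted) * y
    return alpha / alpha.sum()
```

Restricting the recursion to stay-or-advance transitions is what makes alignment learning fast, as the epoch snapshots above suggest.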
Japanese Tacotron: Limitations
- Speech quality is still worse than that of a traditional pipeline system
- Input feature limitation
- Configuration is based on Tacotron, not Tacotron2
Summary
- Char2Wav: network type RNN; input character/phoneme; seq2seq output vocoder; post-net output none; attention GMM; waveform synthesis SampleRNN
- Tacotron: network type RNN; input character; seq2seq output mel; post-net output linear; attention additive; waveform synthesis Griffin-Lim
- VoiceLoop [11]: network type memory buffer; input phoneme; seq2seq output vocoder; post-net output none; attention GMM; waveform synthesis WORLD
- DeepVoice3 [12]: network type CNN; input character/phoneme; seq2seq output mel; post-net output linear/vocoder; attention dot-product; waveform synthesis Griffin-Lim/WORLD/WaveNet
- Tacotron2: network type RNN; input character; seq2seq output mel; post-net output mel; attention location sensitive; waveform synthesis WaveNet
- Transformer [13]: network type self-attention; input phoneme; seq2seq output mel; post-net output mel; attention dot-product; waveform synthesis WaveNet
- Japanese Tacotron: network type RNN; input phoneme; seq2seq output mel/vocoder; post-net output none; attention forward; waveform synthesis WaveNet
Japanese Tacotron: Implementation
URL: https://github.com/nii-yamagishilab/self-attention-tacotron
Framework: TensorFlow, Python
Supported datasets: LJSpeech, VCTK (more coming in the future)
Features:
- Tacotron, Tacotron2, and Japanese Tacotron models
- Combination choices for encoder, decoder, and attention mechanism
- Forced alignment
- Mel spectrogram or vocoder parameters as the target
- Compatible with our WaveNet implementation
Japanese Tacotron: Experimental workflow
[Figure: workflow producing training & validation metrics, validation results, and test results]
Japanese Tacotron: Audio samples
[Audio samples: NAT, TACMNP, TACMAP, SATMAP, SATMNP, SATWAP]
Bibliography
- [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018)
- [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017
- [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013)
- [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010
- [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
- [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP 2018: 4779-4783
- [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016)
- [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585
- [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018)
- [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793
- [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. ICLR 2018
- [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. ICLR 2018
- [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality TTS with Transformer. CoRR abs/1809.08895 (2018)
Acknowledgement
This work was partially supported by JST CREST Grant Number JPMJCR18A6,
Japan and by MEXT KAKENHI Grant Numbers (16H06302, 17H04687,
18H04120, 18H04112, 18KT0051), Japan.

 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 

Último (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 

End-to-End TTS Systems

  • 1. Tutorial on end-to-end text-to-speech synthesis Part 2 – Tacotron and related end-to-end systems 安田裕介 (Yusuke YASUDA) National Institute of Informatics 1
  • 2. NII software toward end-to-end speech synthesis, Part 2 ~ A Tacotron and WaveNet tutorial ~ 安田裕介 (Yusuke Yasuda), National Institute of Informatics 2
  • 3. About me - Name: 安田裕介 - Status: - A PhD student at Sokendai / NII - A programmer at work - Former geologist - bachelor, master - Github account: TanUkkii007 3
  • 4. Table of contents 1. Text-to-speech architecture 2. End-to-end model 3. Example architectures a. Char2Wav b. Tacotron c. Tacotron2 4. Our work: Japanese Tacotron 5. Implementation 4
  • 5. TTS architecture: traditional pipeline - Typical pipeline architecture for statistical parametric speech synthesis - Consists of task-specific models - linguistic model - alignment (duration) model - acoustic model - vocoder 5
  • 6. TTS architecture: End-to-end model - End-to-end model directly converts text to waveform - End-to-end model does not require intermediate feature extraction - Pipeline models accumulate errors across predicted features - End-to-end model’s internal blocks are jointly optimized 6
  • 7. [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018) End-to-end model: Encoder-Decoder with Attention - Building blocks - Encoder - Attention mechanism - Decoder - Neural vocoder Conventional end-to-end models may not include a waveform generator, but some recent full end-to-end models contain a neural waveform generator, e.g. ClariNet[1]. 7
  • 8. End-to-end model: Decoding with attention mechanism 8 Time step 1. - Assign alignment probabilities to encoded inputs - Alignment scores can be derived from attention layer’s output and encoded inputs
  • 9. End-to-end model: Decoding with attention mechanism 9 Time step 1. - Calculate context vector - Context vector is the sum of encoded inputs weighted by alignment probabilities
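The two decoding steps on these slides (assign alignment probabilities, then form the context vector) can be sketched in NumPy. This is a minimal illustration, not any particular implementation: dot-product scoring is used here for brevity (Tacotron itself uses additive scoring), and all names and shapes are illustrative.

```python
import numpy as np

def attend(decoder_state, encoder_outputs):
    """One attention step: alignment probabilities, then the context vector.

    decoder_state:   (d,)   the attention layer's output at this decoder step
    encoder_outputs: (T, d) the encoded input sequence
    """
    # Score each encoded input against the decoder state (dot-product
    # scoring for brevity; Tacotron uses additive scoring instead).
    scores = encoder_outputs @ decoder_state          # (T,)
    # Softmax turns scores into alignment probabilities that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: sum of encoded inputs weighted by the alignment.
    context = weights @ encoder_outputs               # (d,)
    return weights, context

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))                         # T=5 inputs, d=8 each
weights, context = attend(rng.normal(size=8), enc)
```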
  • 10. End-to-end model: Decoding with attention mechanism 10 Time step 1. - Decoder layer predicts an output from the context vector
  • 11. End-to-end model: Decoding with attention mechanism 11 Time step 2. - The previous output is fed back to next time step - Assign alignment probabilities to encoded inputs
  • 12. End-to-end model: Decoding with attention mechanism 12 Time step 2. - Calculate context vector
  • 13. End-to-end model: Decoding with attention mechanism 13 Time step 2. - Decoder layer predicts an output from the context vector
  • 14. End-to-end model: Alignment visualization alignment ground truth mel spectrogram predicted mel spectrogram 14
  • 15. Example: Char2Wav Char2Wav is one of the earliest works on end-to-end TTS. Its simple sequence-to-sequence architecture with attention provides a good proof of concept and a starting point for end-to-end TTS. Char2Wav also uses advanced blocks such as a neural vocoder and multi-speaker embedding. Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017 https://openreview.net/forum?id=B1VWyySKx 15
  • 16. Char2Wav: Architecture Encoder: bidirectional GRU RNN Decoder: GRU RNN Attention: GMM attention Neural vocoder: SampleRNN Figure is from Sotelo et al. (2017) 16
  • 17. Char2Wav: Source and target choice - Input: character or phoneme sequence - Output: vocoder parameters - Objective: mean square error or negative GMM log likelihood loss 17
  • 18. Char2Wav: GMM attention - Proposed by Graves (2013) - Location based attention - Alignment is based on input location - Alignment does not depend on input content 18 [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013) [3]
  • 19. Char2Wav: GMM attention - Monotonic progress - The mean value μ always increases as time progresses - Robust to long sequence prediction epoch 0 epoch 130 Visualization of GMM attention alignment from Tacotron predicting vocoder parameters. r=2. 19
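The monotonic progress described on this slide follows from how the GMM attention means are updated: each mean moves by a strictly positive increment, exp(Δκ). A minimal NumPy sketch of one step, with illustrative names and a simplified unnormalized mixture as in Graves (2013):

```python
import numpy as np

def gmm_attention_step(kappa_prev, delta, beta, weight, T):
    """One step of GMM (location-based) attention, illustrative sketch.

    kappa_prev: (K,) previous component means over input positions
    delta:      (K,) predicted raw mean increments
    beta:       (K,) predicted component widths
    weight:     (K,) predicted mixture weights
    T:          number of encoder positions
    """
    # Means can only move forward: kappa_t = kappa_{t-1} + exp(delta),
    # which is what makes the alignment progress monotonically.
    kappa = kappa_prev + np.exp(delta)
    u = np.arange(T)[:, None]                          # (T, 1) input positions
    # Mixture of Gaussians over input positions (unnormalized).
    phi = (weight * np.exp(-beta * (kappa - u) ** 2)).sum(axis=1)  # (T,)
    return kappa, phi

kappa0 = np.zeros(3)                                   # K=3 components
kappa1, phi = gmm_attention_step(kappa0, np.zeros(3),
                                 np.ones(3), np.ones(3) / 3, T=10)
```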
  • 20. Char2Wav: Limitations - Target features: Char2Wav uses vocoder parameters as a target. Vocoder parameters are a challenging target for a sequence-to-sequence system - e.g. vocoder parameters for 10 s of speech require 2000 iterations to predict - Architecture: As the Tacotron paper demonstrated, a vanilla sequence-to-sequence architecture gives poor alignment. These limitations were solved by Tacotron. 20
  • 21. Example: Tacotron Tacotron is the most influential method among modern architectures because it introduced several cutting-edge techniques. Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010 https://arxiv.org/abs/1703.10135 21
  • 22. Tacotron: Architecture - CBHG encoder - Encoder & decoder pre-net - Reduction factor - Post-net - Additive attention Figure is from Wang et al. (2017) 22
  • 23. Tacotron: Source and target choice - Input: character sequence - Output: mel and linear spectrogram (50ms frame length, 12.5ms frame shift) - a 2.5x shorter total sequence than that of vocoder parameters - Objective: L1 loss 23
  • 24. Tacotron: Decoder pre-net - The most important architectural advancement over vanilla seq2seq - 2 FFN + ReLU + dropout - Crucial for alignment learning 24
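The "2 FFN + ReLU + dropout" structure on this slide can be sketched as follows; a minimal illustration, with layer sizes and names chosen for the example rather than taken from any implementation:

```python
import numpy as np

def prenet(x, w1, w2, rng, p_drop=0.5):
    """Tacotron-style decoder pre-net sketch: two fully connected layers,
    each followed by ReLU and dropout."""
    for w in (w1, w2):
        x = np.maximum(x @ w, 0.0)                    # FFN + ReLU
        keep = rng.random(x.shape) >= p_drop          # dropout mask
        x = x * keep / (1.0 - p_drop)                 # inverted dropout scaling
    return x

rng = np.random.default_rng(0)
w1 = rng.normal(size=(80, 256))                       # illustrative sizes
w2 = rng.normal(size=(256, 128))
out = prenet(rng.normal(size=80), w1, w2, rng)
```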
  • 25. Tacotron: Reduction factor - Predicts multiple frames at a time - Reduction factor r: the number of frames to predict at one step - Large r → - Small number of iteration - Easy alignment learning - Less training time - Less inference time - Poor quality 25
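The effect of the reduction factor on the iteration count can be seen by grouping target frames: with factor r, the decoder emits r frames per step, so a T-frame spectrogram needs only T/r decoder iterations. A minimal sketch (function name illustrative; assumes T is already padded to a multiple of r):

```python
import numpy as np

def group_frames(mel, r):
    """Group a (T, n_mels) spectrogram into (T // r, r * n_mels) so the
    decoder predicts r frames per step."""
    T, n = mel.shape
    assert T % r == 0, "pad T to a multiple of r first"
    return mel.reshape(T // r, r * n)

mel = np.zeros((100, 80))          # 100 frames of an 80-bin mel spectrogram
grouped = group_frames(mel, r=2)   # half as many decoder iterations
```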
  • 26. [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014) - Proposed by Bahdanau et al. (2014) - Content based attention - Distance between source and target is learned by FFN - No structure constraint Tacotron: Additive attention epoch 0 epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 epoch 100 26 [5]
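The "distance learned by FFN" point on this slide is the additive score e_j = vᵀ tanh(W q + V h_j), computed by a small feed-forward network with no structural constraint such as monotonicity. A minimal NumPy sketch with illustrative parameter names:

```python
import numpy as np

def additive_attention(query, keys, W, V, v):
    """Bahdanau-style additive attention scores (illustrative sketch).

    query: (d_q,)   decoder state
    keys:  (T, d_k) encoded inputs
    """
    e = np.tanh(query @ W + keys @ V) @ v             # (T,) learned scores
    a = np.exp(e - e.max())
    return a / a.sum()                                # alignment probabilities

rng = np.random.default_rng(0)
probs = additive_attention(
    rng.normal(size=4),                # decoder state, d_q = 4
    rng.normal(size=(6, 5)),           # T = 6 encoded inputs, d_k = 5
    rng.normal(size=(4, 16)),          # W: d_q -> attention dim
    rng.normal(size=(5, 16)),          # V: d_k -> attention dim
    rng.normal(size=16),               # v: attention dim -> scalar
)
```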
  • 27. Tacotron: Limitations - Training time is too long: > 2M steps (mini batches) to converge - Waveform generation: Griffin-Lim is handy but hurts the waveform quality. - Stop condition: Tacotron predicts fixed-length spectrogram, which is inefficient both at training and inference time. 27
  • 28. Example: Tacotron2 Tacotron2 is a surprising method that achieved human-level quality of synthesized speech. Its architecture is an extension of Tacotron. Tacotron2 uses WaveNet for high-quality waveform generation. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. ICASSP 2018: 4779-4783 https://arxiv.org/abs/1712.05884 28
  • 29. Tacotron2: Architecture - CNN layers instead of CBHG at encoder and post-net - Stop token prediction - Post-net improves predicted mel spectrogram - Location sensitive attention - No reduction factor Figure is from Shen et al. (2018) 29
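The stop-token prediction listed above addresses Tacotron's fixed-length decoding: each decoder step also predicts a scalar "stop" logit, and decoding ends once its sigmoid probability crosses a threshold. A minimal sketch (the 0.5 threshold here is illustrative):

```python
import numpy as np

def should_stop(stop_logit, threshold=0.5):
    """Stop-token check sketch: end decoding once the predicted stop
    probability (sigmoid of the logit) exceeds the threshold."""
    return 1.0 / (1.0 + np.exp(-stop_logit)) > threshold

keep_going = should_stop(-5.0)   # low stop probability: keep decoding
stop_now = should_stop(5.0)      # high stop probability: stop
```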
  • 30. [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016) Tacotron2: Trends of architecture changes 1. Larger model size - character embedding: 256 → 512 - encoder: 256 → 512 - decoder pre-net: (256, 128) → (256, 256) - decoder: 256 → 1024 - attention: 256 → 128 - post-net: 256 → 512 2. More regularizations: - Encoder applies dropout at every layer. The original Tacotron applies dropout only in the encoder pre-net. - Zoneout regularization for the encoder bidirectional LSTM and decoder LSTM - Post-net applies dropout at every layer - L2 regularization The CBHG module is replaced by CNN layers at the encoder and post-net, probably due to limited regularization applicability. A large model needs heavy regularization to prevent overfitting. 30 [7]
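Among the regularizations above, zoneout is the least standard: instead of zeroing activations like dropout, each hidden unit keeps its previous value with probability p. A minimal sketch of the training-time update (names illustrative; at test time Krueger et al. use the expected mixture p·h_prev + (1-p)·h_new instead):

```python
import numpy as np

def zoneout(h_prev, h_new, rng, p=0.1):
    """Zoneout sketch: with probability p, each hidden unit preserves its
    previous value instead of taking the new one."""
    mask = rng.random(h_new.shape) < p     # units that are "zoned out"
    return np.where(mask, h_prev, h_new)

rng = np.random.default_rng(0)
h_prev, h_new = np.zeros(8), np.ones(8)
no_zoneout = zoneout(h_prev, h_new, rng, p=0.0)   # all units update
all_zoneout = zoneout(h_prev, h_new, rng, p=1.0)  # all units preserved
```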
  • 31. Tacotron2: Source and target choice - Input: character sequence - Output: mel spectrogram - Objective: mean square error 31
  • 32. Tacotron2: Waveform synthesis with WaveNet - Upsampling rate: 200 - Mel vs linear: no difference - Human-level quality was achieved by a WaveNet trained on ground truth-aligned predicted spectrograms 32
  • 33. [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585 Tacotron2: Location sensitive attention - Proposed by Chorowski et al. (2015) - Utilizes both input content and input location 33 [8]
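"Both input content and input location" means the additive score gains a location term: the previous alignment is convolved with a bank of 1-D filters and fed into the scoring network. A minimal NumPy sketch with illustrative names and sizes (not the exact Tacotron2 hyperparameters):

```python
import numpy as np

def location_sensitive_attention(query, keys, prev_align, W, V, U, filters, v):
    """Location-sensitive attention sketch: additive attention extended
    with features from the previous alignment.

    query:      (d_q,)   decoder state
    keys:       (T, d_k) encoded inputs
    prev_align: (T,)     alignment probabilities from the previous step
    filters:    (n_f, L) bank of 1-D location filters
    """
    # Location features: convolve the previous alignment, -> (T, n_f)
    loc = np.stack([np.convolve(prev_align, f, mode='same') for f in filters],
                   axis=1)
    e = np.tanh(query @ W + keys @ V + loc @ U) @ v   # (T,) scores
    a = np.exp(e - e.max())
    return a / a.sum()

rng = np.random.default_rng(0)
T, dq, dk, da, nf = 6, 4, 5, 16, 3
probs = location_sensitive_attention(
    rng.normal(size=dq), rng.normal(size=(T, dk)),
    np.full(T, 1.0 / T),                 # uniform previous alignment
    rng.normal(size=(dq, da)), rng.normal(size=(dk, da)),
    rng.normal(size=(nf, da)), rng.normal(size=(nf, 5)), rng.normal(size=da),
)
```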
  • 34. Tacotron2: Limitations - WaveNet training in Tacotron2: ● WaveNet is trained with ground truth-aligned **predicted** mel spectrograms. ● Tacotron2's errors are corrected by WaveNet. ● However, this is unproductive: a separate WaveNet must be trained for each different Tacotron2. - WaveNet training in the normal condition: ● WaveNet is trained with the ground truth spectrogram. ● However, its MOS score is still inferior to that of human speech. 34
  • 35. Our work: Tacotron for the Japanese language We extended Tacotron to the Japanese language. ● To model Japanese pitch accents, we introduced accentual type embedding as well as phoneme embedding ● To handle long-term dependencies in accents, we use additional components Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018) https://arxiv.org/abs/1810.11960 35
  • 36. Japanese Tacotron: Architecture - Accentual type embedding - Forward attention - Self-attention - Dual source attention - Vocoder parameter target 36
  • 37. Japanese Tacotron: Source and target choice - Source: phoneme and accentual type sequence - Target: - mel spectrogram - vocoder parameters - Objective: - L1 loss (mel spectrogram, mel generalized cepstrum) - softmax cross entropy loss (discretized F0) 37
  • 38. [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793 Japanese Tacotron: Forward attention - Proposed by Zhang et al. (2018) - Enforcing left-to-right progress - Fast alignment learning epoch 0 epoch 1 epoch 3 epoch 7 epoch 10 epoch 50 38 [10]
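The left-to-right constraint comes from the forward-attention recursion: the new forward variable at position n mixes only the previous variables at n and n-1, so attention can stay put or move one step right, never jump ahead. A minimal sketch of one step (names illustrative; y is the content-based attention probability at this step):

```python
import numpy as np

def forward_attention_step(alpha_prev, y):
    """One forward-attention step: alpha_t(n) proportional to
    (alpha_{t-1}(n) + alpha_{t-1}(n-1)) * y_t(n)."""
    shifted = np.concatenate(([0.0], alpha_prev[:-1]))   # alpha_{t-1}(n-1)
    alpha = (alpha_prev + shifted) * y
    return alpha / alpha.sum()

alpha0 = np.array([1.0, 0.0, 0.0, 0.0])   # start aligned to the first input
y = np.full(4, 0.25)                       # uniform content-based attention
alpha1 = forward_attention_step(alpha0, y)
```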
  • 39. Japanese Tacotron: Limitations - Speech quality is still worse than that of a traditional pipeline system - Input feature limitation - Configuration is based on Tacotron, not Tacotron2 39
  • 40. Summary 40

    |                    | Char2Wav          | Tacotron    | VoiceLoop [11] | DeepVoice3 [12]           | Tacotron2          | Transformer [13] | Japanese Tacotron |
    |--------------------|-------------------|-------------|----------------|---------------------------|--------------------|------------------|-------------------|
    | network type       | RNN               | RNN         | memory buffer  | CNN                       | RNN                | self-attention   | RNN               |
    | input              | character/phoneme | character   | phoneme        | character/phoneme         | character          | phoneme          | phoneme           |
    | seq2seq output     | vocoder           | mel         | vocoder        | mel                       | mel                | mel              | mel/vocoder       |
    | post-net output    | -                 | linear      | -              | linear/vocoder            | mel                | mel              | -                 |
    | attention mechanism| GMM               | additive    | GMM            | dot-product               | location sensitive | dot-product      | forward           |
    | waveform synthesis | SampleRNN         | Griffin-Lim | WORLD          | Griffin-Lim/WORLD/WaveNet | WaveNet            | WaveNet          | WaveNet           |
  • 41. Japanese Tacotron: Implementation URL: https://github.com/nii-yamagishilab/self-attention-tacotron Framework: TensorFlow, Python Supported datasets: LJSpeech, VCTK (more coming in the future) Features: 41 - Tacotron, Tacotron2, and Japanese Tacotron models - Combination choices for encoder, decoder, and attention mechanism - Forced alignment - Mel spectrogram or vocoder parameters as target - Compatible with our WaveNet implementation
  • 42. Japanese Tacotron: Experimental workflow 42 training & validation metrics validation result test result
  • 43. Japanese Tacotron: Audio samples 43 NAT TACMNP TACMAP SATMAP SATMNP SATWAP
  • 44. Bibliography - [1] Wei Ping, Kainan Peng, Jitong Chen: ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. CoRR abs/1807.07281 (2018) - [2] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio: Char2Wav: End-to-End Speech Synthesis. ICLR 2017 - [3] Alex Graves: Generating Sequences With Recurrent Neural Networks. CoRR abs/1308.0850 (2013) - [4] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, Rif A. Saurous: Tacotron: Towards End-to-End Speech Synthesis. INTERSPEECH 2017: 4006-4010 - [5] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014) 44
  • 45. Bibliography - [6] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu: Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions. ICASSP 2018: 4779-4783 - [7] David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron C. Courville, Chris Pal: Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. CoRR abs/1606.01305 (2016) - [8] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio: Attention-Based Models for Speech Recognition. NIPS 2015: 577-585 - [9] Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi: Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language. CoRR abs/1810.11960 (2018) 45
  • 46. Bibliography - [10] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai: Forward Attention in Sequence-To-Sequence Acoustic Modeling for Speech Synthesis. ICASSP 2018: 4789-4793 - [11] Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani: VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. ICLR 2018 - [12] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. ICLR 2018 - [13] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou: Close to Human Quality TTS with Transformer. CoRR abs/1809.08895 (2018) 46
  • 47. Acknowledgement This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan. 47