©Yuki Saito, 13/11/2018
GENERATIVE APPROACH USING THE NOISE
GENERATION MODELS FOR DNN-BASED SPEECH
SYNTHESIS TRAINED FROM NOISY SPEECH
Masakazu Une1, Yuki Saito2, Shinnosuke Takamichi2,
Daichi Kitamura3, Ryoichi Miyazaki1, and Hiroshi Saruwatari2
1NIT, Tokuyama College, Japan, 2The Univ. of Tokyo, Japan,
3NIT, Kagawa College, Japan
APSIPA-ASC 2018 TU-P1-5.1
Text-To-Speech (TTS) synthesis
using Deep Neural Networks (DNNs)
 Text-To-Speech (TTS) synthesis
 TTS using Deep Neural Networks (DNNs) [Zen et al., 2013]
[Diagram: text → text analysis → linguistic features → DNN-based acoustic models → speech parameters → speech synthesis → speech]
To realize high-quality TTS,
studio-quality clean speech data is required for training the DNNs.
 Goal: realizing high-quality TTS using NOISY speech data
 Common approach: noise reduction before training DNNs
– Error caused by the noise reduction is propagated to TTS training...
 Proposed: training DNNs considering the additive-noise process
– GAN*-based noise generation models are introduced to TTS training.
 Results: improving synthetic speech quality
Outline of this talk
*Generative Adversarial Network [Goodfellow et al., 2014]
[Diagram: conventional approach — noisy speech (observed) → noise reduction → clean speech (estimated) → TTS training; proposed approach — clean speech (unobserved) plus noise generation models → noise addition → noisy speech (observed), used directly for TTS training]
Noise reduction using Spectral Subtraction (SS)*
 The amplitude spectrum after noise reduction, $y_\mathrm{s}^{(\mathrm{SS})}(t, f)$, is calculated as:
– $y_\mathrm{s}^{(\mathrm{SS})}(t, f) = \begin{cases} y_\mathrm{ns}^{2}(t, f) - \beta\,\bar{y}_\mathrm{n}^{2}(f) & \text{if } y_\mathrm{ns}^{2}(t, f) - \beta\,\bar{y}_\mathrm{n}^{2}(f) > 0 \\ 0 & \text{otherwise} \end{cases}$
 The estimated average noise power $\bar{y}_\mathrm{n}^{2}$ is defined as:
– $\bar{y}_\mathrm{n}^{2}(f) = \frac{1}{T_\mathrm{n}} \sum_{t=1}^{T_\mathrm{n}} y_\mathrm{n}^{2}(t, f)$  ($T_\mathrm{n}$: total frame length of the noise)
 Limitations
– Approximating the noise distribution with its expectation value $\bar{y}_\mathrm{n}^{2}$
– Causing a trade-off between noise reduction and speech distortion, controlled by the hyper-parameter $\beta$ (noise suppression ratio)
*[Boll, 1979]
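For reference, a minimal NumPy sketch of the power-domain spectral subtraction defined above; the STFT computation and the selection of noise-only frames are assumed to be done elsewhere, and the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power_frames, beta=1.0):
    """Power-domain spectral subtraction [Boll, 1979].

    noisy_power:        (T, F) array of |y_ns(t, f)|^2 for the noisy speech
    noise_power_frames: (T_n, F) array of |y_n(t, f)|^2 from noise-only frames
    beta:               noise suppression ratio (hyper-parameter)
    """
    # Estimated average noise power per frequency bin: (1 / T_n) * sum_t y_n^2(t, f)
    noise_power_mean = noise_power_frames.mean(axis=0)   # shape (F,)

    # Subtract the scaled noise estimate and floor negative values at zero
    subtracted = noisy_power - beta * noise_power_mean   # broadcasts over frames
    return np.maximum(subtracted, 0.0)
```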
Training TTS from noisy speech using SS
[Diagram: noisy amplitude spectra y_ns → noise reduction using SS → estimated clean amplitude spectra y_s^(SS); linguistic features → TTS → predicted clean amplitude spectra ŷ_s^(SS); training minimizes the mean squared error L_MSE(y_s^(SS), ŷ_s^(SS))]
→ Minimize $L_\mathrm{MSE}(\boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}, \hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})}) = \frac{1}{T} \bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr)^{\top} \bigl(\hat{\boldsymbol{y}}_\mathrm{s}^{(\mathrm{SS})} - \boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}\bigr)$  ($T$: total frame length of the features)
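A minimal PyTorch-style sketch of this baseline training step, assuming a feed-forward acoustic model `tts_model` that maps linguistic features to (log) amplitude spectra; all names and shapes are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def baseline_training_step(tts_model, optimizer, linguistic, y_s_ss):
    """One SS+MSE baseline step: regress the predicted clean spectra
    onto the SS-denoised targets y_s^(SS).

    linguistic: (T, 442) linguistic features; y_s_ss: (T, 257) SS-denoised spectra.
    """
    optimizer.zero_grad()
    y_s_hat = tts_model(linguistic)        # predicted clean spectra, (T, 257)
    loss = F.mse_loss(y_s_hat, y_s_ss)     # L_MSE(y_s^(SS), y_hat_s^(SS))
    loss.backward()
    optimizer.step()
    return loss.item()
```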
 1. Speech distortion caused by the error of SS
 2. Propagation of that distortion by using $\boldsymbol{y}_\mathrm{s}^{(\mathrm{SS})}$ as the target vector
Issues in training TTS using SS
[Diagram: the same SS-based training pipeline as above (y_ns → SS → y_s^(SS), regressed by the TTS model with L_MSE), annotated with where the SS error arises (1) and propagates (2)]
These issues significantly degrade synthetic speech quality...
Proposed algorithm:
Training TTS using
noise generation models
based on GANs
Overview of the proposed algorithm
[Diagram: linguistic features → TTS → predicted clean spectra ŷ_s; prior noise ỹ_n → pre-trained noise generation models G_n(·) → generated noise n̂; noise addition combines ŷ_s and n̂ into the estimated noisy spectra ŷ_ns, which are matched to the observed noisy spectra y_ns by minimizing L_MSE(y_ns, ŷ_ns)]
We want $G_\mathrm{n}(\cdot)$ to model the distribution of the observed noise.
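A minimal sketch of the corresponding training step, under the assumption that the noise is additive in the spectral domain used for the loss and that the pre-trained generator `G_n` is kept frozen while the TTS model is updated; names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def proposed_training_step(tts_model, G_n, optimizer, linguistic, y_ns, prior_noise):
    """One step of the proposed algorithm: the TTS model predicts clean spectra,
    the frozen generator G_n turns prior noise into a noise spectrum, and their
    sum is matched to the observed noisy spectra y_ns."""
    optimizer.zero_grad()
    y_s_hat = tts_model(linguistic)        # predicted clean spectra
    with torch.no_grad():                  # G_n is pre-trained; only the TTS model is updated
        n_hat = G_n(prior_noise)           # generated noise spectra
    y_ns_hat = y_s_hat + n_hat             # estimated noisy spectra (additive-noise assumption)
    loss = F.mse_loss(y_ns_hat, y_ns)      # L_MSE(y_ns, y_hat_ns)
    loss.backward()
    optimizer.step()
    return loss.item()
```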
Pre-training of noise generation models based on GANs
[Diagram: prior noise ỹ_n → noise generation models G_n(·) → generated noise n̂; observed noise y_n is extracted from the non-speech periods of the noisy speech y_ns; the discriminative models D(·) take observed or generated noise and output 1 for observed and 0 for generated]
– $\min_{G_\mathrm{n}} \max_{D} V(G_\mathrm{n}, D), \quad V(G_\mathrm{n}, D) = E\bigl[\log D(\boldsymbol{y}_\mathrm{n})\bigr] + E\bigl[\log\bigl(1 - D(\hat{\boldsymbol{n}})\bigr)\bigr]$  (1: observed, 0: generated)
*Jensen–Shannon
This minimizes the approximated JS* divergence between the distributions of the observed noise $\boldsymbol{y}_\mathrm{n}$ and the generated noise $\hat{\boldsymbol{n}}$.
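A minimal sketch of this adversarial pre-training with the standard binary cross-entropy formulation; the generator update uses the common non-saturating variant rather than the literal minimax term, `D` is assumed to end in a sigmoid, and the extraction of noise-only frames that yields `observed_noise` is assumed to be done beforehand. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_pretrain_step(G_n, D, opt_g, opt_d, observed_noise, prior_noise):
    """One alternating update of the noise-generation GAN.
    D labels observed noise 1 and generated noise 0;
    G_n is updated so that its outputs are labeled 1."""
    # Discriminator update: max_D E[log D(y_n)] + E[log(1 - D(G_n(prior)))]
    opt_d.zero_grad()
    d_real = D(observed_noise)
    d_fake = D(G_n(prior_noise).detach())          # detach: do not update G_n here
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward()
    opt_d.step()

    # Generator update (non-saturating form of min_G E[log(1 - D(G_n(prior)))])
    opt_g.zero_grad()
    d_fake = D(G_n(prior_noise))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```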
Comparison of observed/generated noise
(generating Gaussian noise from uniform noise)
[Figure: spectrograms (frequency [kHz] vs. time [s]) and amplitude histograms of the observed and the generated noise]
Our noise generation models effectively reproduce
characteristics of the observed noise!
 Modeling the distribution of stationary noise by using GANs
– Musical noise [Miyazaki et al., 2012] (an unpleasant artifact) can be reduced.
– By using recurrent networks, the distribution of non-stationary noise can also be modeled by our algorithm.
 Extending the proposed algorithm
– Distribution of context-dependent noise (e.g., pop-noise) can be
captured by using conditional GANs [Mirza et al., 2015].
– By using WaveNet [Oord et al., 2016], noise distribution can be modeled
in the waveform domain.
 Adapting TTS or noise generation models
– Pre-recorded clean speech data can be used to build initial models
used in our algorithm.
Discussion of the proposed algorithm
Experimental evaluations
Experimental conditions
Dataset: Japanese female speaker (subset of the JSUT corpus [Sonobe et al., 2017])
Training / evaluation data: 3,000 / 53 sentences (16 kHz sampling)
Linguistic feats.: 442-dimensional vector (phoneme, accent type, F0, UV, duration, etc.)
Speech params.: 257-dimensional log amplitude spectrum
Waveform synthesis: Griffin & Lim's method [Griffin et al., 1986]
Prior / observed noise: uniform / Gaussian (artificially added)
DNN architectures: feed-forward (details in our manuscript)
Noise suppression ratio of SS ($\beta$): 0.5, 1.0, 2.0, and 5.0 (larger values mean stronger noise reduction)
Input SNR: 0, 5, and 10 dB
Evaluation method: preference AB test on speech quality (25 participants per evaluation)
Results of subjective evaluation of speech quality
(input SNR = 0 [dB])
In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
[Bar chart: preference scores of the SS-based baselines (SS + MSE training, β = 0.5, 1.0, 2.0, 5.0) vs. the proposed method; baseline / proposed score pairs: 0.368 / 0.632, 0.312 / 0.688, 0.312 / 0.688, 0.253 / 0.747]
Our algorithm significantly improves speech quality
compared with TTS using SS!
Results of subjective evaluation of speech quality
(input SNR = 5 [dB])
In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
[Bar chart: preference scores of the SS-based baselines (SS + MSE training, β = 0.5, 1.0, 2.0, 5.0) vs. the proposed method; baseline / proposed score pairs: 0.292 / 0.708, 0.320 / 0.680, 0.323 / 0.677, 0.216 / 0.784]
Our algorithm significantly improves speech quality
compared with TTS using SS!
Results of subjective evaluation of speech quality
(input SNR = 10 [dB])
In all cases, the $p$-values between the methods were smaller than $10^{-6}$.
[Bar chart: preference scores of the SS-based baselines (SS + MSE training, β = 0.5, 1.0, 2.0, 5.0) vs. the proposed method; baseline / proposed score pairs: 0.268 / 0.732, 0.292 / 0.707, 0.256 / 0.744, 0.288 / 0.712]
Our algorithm significantly improves speech quality
compared with TTS using SS!
Conclusion
 Purpose
– Training high-quality TTS using noisy speech data
 Proposed
– Training algorithm considering the additive-noise process
• Our noise generation models learn the distribution of the observed noise
through GAN-based training.
 Results
– Improving synthetic speech quality compared with TTS using SS
 Future work
– Modeling non-stationary noise with the proposed algorithm
• Using richer DNN architectures (e.g., long short-term memory)
– Comparing our algorithm with state-of-the-art noise suppression methods
Thank you for your attention!