This document discusses recent progress in interactive voice conversion techniques for augmenting speech production. It begins by explaining the physical limitations of normal speech production and how voice conversion can augment speech by controlling more information. It then discusses how interactive voice conversion allows for quick response times, better controllability through real-time feedback, and understanding user intent from multimodal behavior signals. Recent advances discussed include low-latency voice conversion networks, controllable waveform generation respecting the source-filter model of speech, and expression control using signals like arm movements. The goal is to develop cooperatively augmented speech that can help users with lost speech abilities.
1. Interactive Voice Conversion for Augmented Speech Production
Tomoki Toda (Nagoya University, Japan)
July 2, 2021
[Figure: interactive VC combines physical functions, machine learning, and interaction toward cooperatively augmented speech production]
2. Physical Mechanism of Speech Production
• Produce speech signals by physically controlling speech organs
  • Sound source generation by vocal fold vibration: quasi-periodic excitation signal
  • Modulation by articulation: resonance characteristics
• Nonlinguistic information is not controlled...
• These physical functions are hard to replace and cause limitations...
3. Can We Produce Speech Beyond Constraints?
• Possibly use voice conversion to augment our speech production by
intentionally controlling more information [Toda; 2014]
[Figure: normal speech production (sound source generation + articulation → speech) is extended by voice conversion into augmented sound source generation + augmented articulation → converted speech, augmenting speech production beyond physical constraints; even if some speech organs were lost, normal speech organs would be virtually implanted!]
5. Basic Process of Voice Conversion
• Combining signal processing for speech analysis-synthesis and
machine learning for statistical feature conversion
[Figure: input speech → analysis → extracted speech parameters → feature conversion (a highly nonlinear function trained on source and target speech data, e.g., parallel data consisting of utterance pairs) → converted speech parameters → synthesis → converted speech]
[Abe; 1990]
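The analysis / feature-conversion / synthesis pipeline above can be sketched as a composition of three functions. This is only an illustrative skeleton: real systems use a vocoder-style analyzer and a trained statistical model for the highly nonlinear conversion function, whereas the per-frame F0 ratio and spectral gain used here are purely hypothetical placeholders.

```python
# Sketch of the basic VC pipeline: analysis -> feature conversion -> synthesis.
# All three stages are placeholders standing in for a real vocoder analyzer,
# a trained conversion model, and a waveform synthesizer.
def analyze(speech):
    """Analysis: extract per-frame speech parameters (placeholder)."""
    return [{"f0": f0, "spectrum": sp} for f0, sp in speech]

def convert(params, model):
    """Feature conversion: map source parameters to target parameters
    with a trained (here: trivial linear) function."""
    return [{"f0": model["f0_ratio"] * p["f0"],
             "spectrum": [model["gain"] * s for s in p["spectrum"]]}
            for p in params]

def synthesize(params):
    """Synthesis: reconstruct a waveform from converted parameters
    (placeholder: passes the parameters through unchanged)."""
    return params

model = {"f0_ratio": 1.5, "gain": 0.8}              # stands in for training
source = [(100.0, [1.0, 2.0]), (110.0, [1.5, 2.5])]  # (f0, spectrum) frames
converted = synthesize(convert(analyze(source), model))
assert converted[0]["f0"] == 150.0
```

The point of the decomposition is that the machine-learning part (the conversion function) operates on compact speech parameters, while signal processing handles the mapping between waveforms and parameters at both ends.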
6. Demo of VC: Vocal Effector
• Convert my singing voice into a specific character's singing voice! [Toda; 2012][Kobayashi; 2018a]
[Figure: real-time VC software [Dr. Kobayashi, Nagoya Univ.] converts the input singing voice into a famous virtual singer's voice]
7. Recent Progress of VC Techniques
• Progress through the Voice Conversion Challenge (VCC) [Toda; 2016]
  http://www.vc-challenge.org/
• 1st VCC (VCC2016): parallel training
• 2nd VCC (VCC2018): parallel training; nonparallel training
• 3rd VCC (VCC2020): semi-parallel training; nonparallel training across different languages
[Figure: source-to-target speaker conversion results of freely available baseline systems and top systems: [Kobayashi; 2016][Liu; 2018][Toda; 2007][Kobayashi; 2018b][Zhang; 2020][Liu; 2020][Tobing; 2020][Huang; 2020]]
8. Recent Trend of VC Techniques
• Analysis: from simplified parametric decomposition (resonance & excitation parameters) to no decomposition (power spectrogram)
• Synthesis: from high-quality vocoders based on signal processing to data-driven deep waveform generation with neural vocoders
• Feature conversion: from frame-to-frame parametric probabilistic models (resonance modeling) to more complex sequence-to-sequence encoder-decoders w/ attention (joint resonance & excitation modeling)
• Training: from supervised parallel training (regression using time-aligned source & target features) to more flexible unsupervised nonparallel training (reconstruction through speaker-independent features) and pretrained models
9. NOTE: Risk of VC
• Need to consider the possibility that VC is misused for spoofing
  • VC makes it possible for someone to speak with your voice!
• But... we should NOT stop VC research, because there are many useful applications (e.g., speaking aids)!
• What can we do?
  • Collaborate with anti-spoofing research [Wu; 2015][Kinnunen; 2017][Todisco; 2019]
  • Need to widely tell people how to use VC correctly!
VC needs to be socially recognized in the same way as a kitchen knife: a useful tool that can also be misused.
11. Interactive VC
• Leverage interaction between user and system to develop cooperatively working functions for augmenting speech production
  • Achieve low-latency real-time (LLRT) processing: instantaneous feedback of system output to understand system behavior through interaction
  • Incorporate the physical mechanism and multimodal behavior signals
[Figure: speech produced by physical functions → interactive VC w/ LLRT processing → desired speech free from physical constraints; the user intentionally controls the system output via multimodal behavior signals, the physical mechanism provides involuntary control to avoid physically impossible output, and the user may acquire unconscious control skills]
[JST CREST, CoAugmentation Project (PI: Toda), 2019-]
12. Recent Progress of Interactive VC Techniques
1. LLRT VC with computationally efficient network architecture
2. Controllable waveform generation considering physical mechanism
3. Speech expression control with multimodal behavior signals
[Figure: produced speech passes through LLRT conversion processing (excitation conversion, resonance conversion, waveform generation) toward the desired speech; multimodal behavior signals provide controllability]
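A rough feel for what "low-latency real-time" demands can be given with a back-of-the-envelope latency budget. The specific numbers below (frame shift, lookahead, inference time) are hypothetical; the point is that algorithmic latency comes from the frame shift plus any future context the model waits for, on top of per-frame compute time.

```python
# Back-of-the-envelope latency budget for LLRT streaming conversion.
# All numbers are illustrative assumptions, not measurements of any system.
def llrt_latency_ms(frame_shift_ms, lookahead_frames, inference_ms):
    """Total input-to-output delay for one frame of streaming conversion:
    buffering delay (frame shift plus lookahead) plus compute time."""
    algorithmic = frame_shift_ms * (1 + lookahead_frames)
    return algorithmic + inference_ms

# e.g., 10 ms frame shift, 2 lookahead frames, 5 ms inference per frame
assert llrt_latency_ms(10.0, 2, 5.0) == 35.0
```

For real-time operation the per-frame inference time must also stay below the frame shift, which is why computationally efficient network architectures matter.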
14. LLRT VC w/ Computationally Efficient Network
• Feature conversion network based on VAE w/ sparse RNN [Tobing; 2021b]
[Figure: input speech → short-time frame analysis → mel-spectrogram → RNN encoders → speaker-independent latent features → speaker-aware RNN decoders conditioned on the speaker code of the target voice → converted mel-spectrogram & excitation parameters]
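The encode-to-speaker-independent-latent, decode-with-target-speaker-code idea can be sketched in a few lines. This is a toy illustration only: the actual network uses sparse RNN encoders/decoders per [Tobing; 2021b], while the plain linear layers and all dimensions here are hypothetical placeholders.

```python
# Toy sketch of speaker-conditioned VAE-style feature conversion:
# encoder maps features to a speaker-independent latent, decoder
# reconstructs features conditioned on a target speaker code.
import math, random

random.seed(0)

def linear(x, w, b):
    """y = W x + b for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def init(n_out, n_in):
    """Random weight matrix and zero bias (placeholder initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[random.uniform(-scale, scale) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

D_FEAT, D_LAT, D_SPK = 8, 4, 2       # feature, latent, speaker-code dims
enc = init(D_LAT, D_FEAT)            # encoder: features -> latent (mean only)
dec = init(D_FEAT, D_LAT + D_SPK)    # decoder: latent + speaker code -> features

def convert_frame(feat, target_spk_code):
    """Encode one frame to a speaker-independent latent, then decode
    with the target speaker code to obtain converted features."""
    z = linear(feat, *enc)
    return linear(z + target_spk_code, *dec)   # list concat = conditioning

out = convert_frame([0.1] * D_FEAT, [1.0, 0.0])  # one-hot target speaker
assert len(out) == D_FEAT
```

Training such a model with a reconstruction objective is what allows nonparallel data to be used: the latent is pushed to be speaker-independent, and the speaker code re-injects identity at decoding time.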
21. Behavior of Dilated Convolution Networks [Wu; 2021a]
[Figure: two network stacks driven by a noise signal, combining fixed dilated convolutions with F0-dependent dilated convolutions; waveforms generated from the 5th, 10th, 15th, and 20th layers illustrate the excitation generation and resonance filtering stages]
• The network is well factorized into excitation and resonance parts
• Significantly improves F0 controllability, including extrapolation performance
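The core mechanism behind such quasi-periodic networks is that the dilation of some layers is derived from the pitch period (sampling rate divided by F0), so the receptive field adapts to the current pitch. A minimal sketch of that indexing follows; the tap count, the dense factor, and the simple averaging "kernel" are illustrative assumptions, not the architecture of [Wu; 2021a].

```python
# Sketch of pitch-dependent (F0-dependent) dilation: the dilation of a
# layer is set from the pitch period so taps align with pitch cycles.
FS = 16000          # sampling rate in Hz (assumed)
DENSE_FACTOR = 4    # taps per pitch period (hypothetical value)

def pitch_dependent_dilation(f0_hz):
    """Dilation in samples derived from the pitch period fs / F0."""
    period = FS / f0_hz
    return max(1, round(period / DENSE_FACTOR))

def qp_layer(signal, f0_track, n_taps=2):
    """Toy 'convolution' (average of pitch-spaced taps) whose dilation
    follows the frame-wise F0 track."""
    out = []
    for t in range(len(signal)):
        d = pitch_dependent_dilation(f0_track[t])
        taps = [signal[t - k * d] for k in range(n_taps) if t - k * d >= 0]
        out.append(sum(taps) / len(taps))
    return out

# Higher F0 -> shorter pitch period -> smaller dilation
assert pitch_dependent_dilation(200.0) < pitch_dependent_dilation(100.0)
```

Because the dilation tracks F0 explicitly, the network keeps pitch structure separate from resonance structure, which is consistent with the factorization observed in the layer-wise waveforms above.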
26. Recent Progress of Interactive VC Techniques
1. LLRT VC with computationally efficient network architecture
2. Controllable waveform generation considering physical mechanism
3. Speech expression control with multimodal behavior signals
27. Augmented Speech Production: Speaking Aid
• Laryngectomees
  • Removal of the larynx: the trachea is separated from the vocal tract, so the sound source cannot be produced in the usual manner with vocal fold vibration...
• Alternative speaking methods
  • Electrolaryngeal (EL) speech with an electrolarynx, esophageal speech, ...
  • Suffer from unnatural speech quality and reduced expressiveness
[Figure: anatomy before and after laryngectomy (vocal folds, esophagus, trachea)]
Develop an augmented speech production system to recover lost voices!
28. Singing-Aid System with Interactive VC
• Interactive VC to convert EL speech into a singing voice [Morikawa; 2017][Li; 2019]
  • Real-time melody control by playing a MIDI keyboard
  • Freely sing an arbitrary song
[Figure: EL speech → resonance conversion (resonance features of EL speech → resonance features of singing voice); MIDI keyboard performance → MIDI melody pattern → F0 pattern conversion → F0 pattern of singing voice; both feed waveform generation → singing voice]
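The MIDI-to-F0 step implied by this pipeline is simple to sketch: note numbers played on the keyboard map to a target F0 pattern for the converted singing voice. Equal temperament with A4 = 440 Hz is assumed; the frame rate and the (note, duration) input format are hypothetical choices for illustration.

```python
# Sketch of mapping a MIDI melody to a frame-wise F0 pattern.
def midi_to_f0(note):
    """MIDI note number -> fundamental frequency in Hz (A4 = note 69),
    assuming equal temperament with A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def melody_to_f0_pattern(notes, frame_rate=200):
    """Expand (note, duration_sec) pairs into a frame-wise F0 pattern
    at an assumed frame rate of 200 frames/s."""
    f0 = []
    for note, dur in notes:
        f0.extend([midi_to_f0(note)] * round(dur * frame_rate))
    return f0

pattern = melody_to_f0_pattern([(69, 0.5), (71, 0.5)])  # A4 then B4
assert abs(pattern[0] - 440.0) < 1e-9
```

In the real system this target F0 pattern replaces the (largely flat) F0 of EL speech before waveform generation, which is what lets the user sing an arbitrary melody.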
30. Expression Control w/ Multimodal Signals
• "Karaoke"-type singing-aid system with interactive VC [Okawa; 2021]
  • Sing a song to background music without playing a musical instrument
  • Control vibrato by moving an arm
[Figure: EL speech → resonance conversion; background music provides the MIDI melody pattern for F0 pattern conversion; arm movements → arm position detection → vibrato control parameters; all feed waveform generation → singing voice]
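Arm-controlled vibrato can be sketched as a sinusoidal modulation of the F0 pattern whose depth is driven by arm position. Both the mapping from arm height to vibrato depth and the sinusoidal vibrato model are assumptions made for illustration, not the system's actual control law.

```python
# Sketch of expression control: arm position modulates vibrato depth,
# which sinusoidally modulates a frame-wise F0 pattern.
import math

def apply_vibrato(f0_pattern, depth_cents, rate_hz, frame_rate=200):
    """Modulate an F0 pattern with sinusoidal vibrato.
    depth_cents: peak deviation in cents (100 cents = 1 semitone)."""
    out = []
    for t, f0 in enumerate(f0_pattern):
        dev = depth_cents * math.sin(2.0 * math.pi * rate_hz * t / frame_rate)
        out.append(f0 * 2.0 ** (dev / 1200.0))  # cents -> frequency ratio
    return out

def arm_to_vibrato_depth(arm_height):
    """Hypothetical mapping: raising the arm (0..1) deepens the vibrato
    up to 100 cents."""
    return 100.0 * max(0.0, min(1.0, arm_height))

flat = [440.0] * 400                       # 2 s of A4 at 200 frames/s
sung = apply_vibrato(flat, arm_to_vibrato_depth(0.5), rate_hz=6.0)
assert max(sung) > 440.0 > min(sung)
```

Driving an expressive parameter from a natural body movement, rather than an instrument, is what makes the "karaoke"-style interaction possible for EL speakers.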
31. Summary
• Voice conversion (VC)
  • Technique to convert non-/para-linguistic information
  • Significant progress through the recent Voice Conversion Challenges (VCCs)
  • Needs to be recognized as a "kitchen knife": a useful tool that can also be misused
• From VC to interactive VC towards augmented speech production
  • Low-latency real-time conversion to achieve quick response
  • Incorporate the physical mechanism into the network, and additionally use multimodal behavior signals, to achieve better controllability
  • Immediate goal: achieve high-quality instantaneous feedback to help users understand system behavior through interaction
33. [Abe; 1990] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara. Voice conversion through vector quantization.
J. Acoust. Soc. Jpn (E), Vol. 11, No. 2, pp. 71‒76, 1990.
[Huang; 2020] W.-C. Huang, T. Hayashi, S. Watanabe, T. Toda. The sequence-to-sequence baseline for the
Voice Conversion Challenge 2020: cascading ASR and TTS. Proc. Joint workshop for the Blizzard Challenge
and Voice Conversion Challenge 2020, pp. 160‒164, 2020.
[Kinnunen; 2017] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, K.A. Lee.
The ASVspoof 2017 Challenge: assessing the limits of replay spoofing attack detection. Proc.
INTERSPEECH, pp. 2‒6, 2017.
[Kobayashi; 2016] K. Kobayashi, S. Takamichi, S. Nakamura, T. Toda. The NU-NAIST voice conversion
system for the Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1667‒1671, 2016.
[Kobayashi; 2018a] K. Kobayashi, T. Toda, S. Nakamura. Intra-gender statistical singing voice conversion
with direct waveform modification using log-spectral differential. Speech Commun., Vol. 99, pp. 211‒220,
2018.
[Kobayashi; 2018b] K. Kobayashi, T. Toda. sprocket: open-source voice conversion software. Proc.
Odyssey, pp. 203‒210, 2018.
[Li; 2019] L. Li, T. Toda, K. Morikawa, K. Kobayashi, S. Makino. Improving singing aid system for
laryngectomees with statistical voice conversion and VAE-SPACE. Proc. ISMIR, pp. 784‒790, 2019.
[Liu; 2018] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, L.-R. Dai. WaveNet Vocoder with Limited Training Data
for Voice Conversion. Proc. INTERSPEECH, pp. 1983‒1987, 2018.
[Liu; 2020] L.-J. Liu, Y.-N. Chen, J.-X. Zhang, Y. Jiang, Y.-J. Hu, Z.-H. Ling, L.-R. Dai. Non-parallel voice
conversion with autoregressive conversion model and duration adjustment. Proc. Joint workshop for the
Blizzard Challenge and Voice Conversion Challenge 2020, pp. 126‒130, 2020.
[Morikawa; 2017] K. Morikawa, T. Toda. Electrolaryngeal speech modification towards singing aid system
for laryngectomees. Proc. APSIPA ASC, 4 pages, 2017.
[Okawa; 2021] S. Okawa, Y. Ishiguro, K. Otani, T. Nishino, K. Kobayashi, T. Toda, K. Takeda. A proposal for
adding singing expression using natural body movements in a singing system based on an electrolarynx. Proc.
25th IPSJ Symposium INTERACTION 2021, 6 pages, Mar. 2021 (in Japanese).
[Tobing; 2019] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion with
cyclic variational autoencoder. Proc. INTERSPEECH, pp. 674‒678, 2019.
34. [Tobing; 2020] P.L. Tobing, Y. Wu, T. Toda. Baseline system of Voice Conversion Challenge 2020 with
cyclic variational autoencoder and parallel WaveGAN. Proc. Joint workshop for the Blizzard Challenge and
Voice Conversion Challenge 2020, pp. 155‒159, 2020.
[Tobing; 2021a] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion
with cyclic variational autoencoder. Proc. INTERSPEECH, 5 pages, 2021 (to appear).
[Tobing; 2021b] P.L. Tobing, T. Toda. Non-parallel voice conversion with cyclic variational autoencoder.
Proc. 11th ISCA Speech Synthesis Workshop (SSW11), 6 pages, 2021 (to appear).
[Toda; 2007] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of
spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222‒2235,
2007.
[Toda; 2012] T. Toda, T. Muramatsu, H. Banno. Implementation of computationally efficient real-time voice
conversion. Proc. INTERSPEECH, 4 pages, 2012.
[Toda; 2014] T. Toda. Augmented speech production based on real-time statistical voice conversion. Proc.
GlobalSIP, pp. 755‒759, 2014.
[Toda; 2016] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[Todisco; 2019] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N.
Evans, T.H. Kinnunen, K.A. Lee. ASVspoof 2019: future horizons in spoofed and fake audio detection. Proc.
INTERSPEECH, pp. 1008‒1012, 2019.
[van den Oord; 2016] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[Wu; 2015] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, H. Li. Spoofing and countermeasures for
speaker verification: A survey. Speech Commun. Vol. 66, pp. 130‒153, 2015.
[Wu; 2021a] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic parallel WaveGAN: a
non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural
network. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 792‒806, 2021.
[Wu; 2021b] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic WaveNet: an
autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network.
IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 1134‒1148, 2021.
35. [Zhang; 2020] J.-X. Zhang, L.-J. Liu, Y.-N. Chen, Y.-J. Hu, Y. Jiang, Z.-H. Ling, L.-R. Dai. Voice conversion by
cascading automatic speech recognition and text-to-speech synthesis with prosody transfer. Proc. Joint
workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 121‒125, 2020.
* VCC series
[VCC2016 Summary] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The
Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[VCC2016 Analysis] M. Wester, Z. Wu, J. Yamagishi. Analysis of the Voice Conversion Challenge 2016
evaluation results. Proc. INTERSPEECH, pp. 1637‒1641, 2016.
[VCC2018 Summary] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z.
Ling. The voice conversion challenge 2018: promoting development of parallel and nonparallel methods.
Proc. Odyssey, pp. 195‒202, 2018.
[VCC2018 Analysis] T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, Z. Ling.
A spoofing benchmark for the 2018 voice conversion challenge: leveraging from spoofing countermeasures
for speech artifact assessment. Proc. Odyssey, pp. 187‒194, 2018.
[VCC2020 Summary] Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R.K. Das, T. Kinnunen, Z. Ling, T. Toda.
Voice Conversion Challenge 2020 ‒ intra-lingual semi-parallel and cross-lingual voice conversion ‒. Proc.
Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 80‒98, 2020.
[VCC2020 Analysis] R.K. Das, T. Kinnunen, W.-C. Huang, Z. Ling, J. Yamagishi, Z. Yi, X. Tian, T. Toda.
Predictions of subjective ratings and spoofing assessments of Voice Conversion Challenge 2020
submissions. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 99‒
120, 2020.
* Nonparallel VC w/ speaker-independent representations
[PPG] L. Sun, K. Li, H. Wang, S. Kang, H.M. Meng. Phonetic posteriorgrams for many-to-one voice
conversion without parallel data training. Proc. IEEE ICME, 6 pages, 2016.
[VAE] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, H.-M. Wang. Voice conversion from non-parallel corpora
using variational auto-encoder. Proc. APSIPA ASC, 6 pages, 2016.
[VQVAE] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning. arXiv
preprint, arXiv:1711.00937, 11 pages, 2017.
36. * Vocoder
[STRAIGHT] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne. Restructuring speech representations using
a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible
role of a repetitive structure in sounds. Speech Commun., Vol. 27, No. 3‒4, pp. 187‒207, 1999.
[WORLD] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis system
for real-time applications. IEICE Trans. Inf. & Syst., Vol. E99-D, No. 7, pp. 1877‒1884, 2016.
[LPCNet] J.-M. Valin, J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. Proc.
IEEE ICASSP, pp. 5891‒5895, 2019.
[GlotGAN] L. Juvela, B. Bollepalli, V. Tsiaras, P. Alku. GlotNet: a raw waveform model for the glottal
excitation in statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol.
27, No. 6, pp. 1019‒1030, 2019.
[GELP] L. Juvela, B. Bollepalli, J. Yamagishi, P. Alku. GELP: GAN-excited linear prediction for speech
synthesis from mel-spectrogram. Proc. INTERSPEECH, pp. 694‒698, 2019.
[NSF] X. Wang, S. Takaki, J. Yamagishi. Neural source-filter waveform models for statistical parametric
speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 28, pp. 402‒415, 2019.
[WaveNet] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W.
Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint, arXiv:1609.03499, 15
pages, 2016.
[Parallel WaveNet] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den
Driessche, E. Lockhart, L.C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N.
Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, D. Hassabis. Parallel WaveNet: fast high-
fidelity speech synthesis. arXiv preprint, arXiv:1711.10433, 11 pages, 2017.
[WaveRNN] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van
den Oord, S. Dieleman, K. Kavukcuoglu. Efficient neural audio synthesis. Proc. ICML, pp. 2410‒2419, 2018.
[PWG] R. Yamamoto, E. Song, J.-M. Kim. Parallel WaveGAN: A fast waveform generation model based on
generative adversarial networks with multi-resolution spectrogram. Proc. IEEE ICASSP, pp. 6199‒6203, 2020.
[aHM] G. Degottex, Y. Stylianou. Analysis and synthesis of speech using an adaptive full-band harmonic
model. IEEE Trans. Audio, Speech & Lang. Process., Vol. 21, No. 10, pp. 2085‒2095, 2013.
[DDSP] J. Engel, L. Hantrakul, C. Gu, A. Roberts. DDSP: differentiable digital signal processing. Proc. ICLR,
16 pages, 2020.