This document discusses recent progress in interactive voice conversion techniques for augmenting speech production. It begins by explaining the physical limitations of normal speech production and how voice conversion can augment speech by controlling more information. It then discusses how interactive voice conversion allows for quick response times, better controllability through real-time feedback, and understanding user intent from multimodal behavior signals. Recent advances discussed include low-latency voice conversion networks, controllable waveform generation respecting the source-filter model of speech, and expression control using signals like arm movements. The goal is to develop cooperatively augmented speech that can help users with lost speech abilities.
1. Interactive Voice Conversion for Augmented Speech Production
Tomoki Toda (Nagoya University, Japan)
July 2, 2021
[Figure: interactive VC combines physical functions, machine learning, and interaction toward cooperatively augmented speech production]
2. Physical Mechanism of Speech Production
• Produce speech signals by physically controlling speech organs
  • Sound source generation by vocal fold vibration: quasi-periodic excitation signal
  • Modulation by articulation: resonance characteristics
• Nonlinguistic information is not controlled...
• These physical functions are hard to replace and cause limitations...
3. Can We Produce Speech Beyond Constraints?
• Possibly use voice conversion to augment our speech production by
intentionally controlling more information [Toda; 2014]
[Figure: normal speech production (sound source generation + articulation → speech) is extended by voice conversion into augmented sound source generation + augmented articulation → converted speech, augmenting speech production beyond physical constraints; even if some speech organs were lost, normal speech organs would be virtually implanted!]
5. Basic Process of Voice Conversion
• Combining signal processing for speech analysis-synthesis and
machine learning for statistical feature conversion
[Figure: input speech → analysis → extracted speech parameters → feature conversion (a highly nonlinear function trained on source and target speech data, e.g., parallel data consisting of utterance pairs) → converted speech parameters → synthesis → converted speech]
[Abe; 1990]
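The analysis / feature-conversion / synthesis pipeline above can be sketched as a composition of three functions. This is only an illustrative skeleton: real systems use a vocoder-style analyzer and a trained statistical model for the highly nonlinear conversion function, whereas the per-frame F0 ratio and spectral gain used here are purely hypothetical placeholders.

```python
# Sketch of the basic VC pipeline: analysis -> feature conversion -> synthesis.
# All three stages are placeholders standing in for a real vocoder analyzer,
# a trained conversion model, and a waveform synthesizer.
def analyze(speech):
    """Analysis: extract per-frame speech parameters (placeholder)."""
    return [{"f0": f0, "spectrum": sp} for f0, sp in speech]

def convert(params, model):
    """Feature conversion: map source parameters to target parameters
    with a trained (here: trivial linear) function."""
    return [{"f0": model["f0_ratio"] * p["f0"],
             "spectrum": [model["gain"] * s for s in p["spectrum"]]}
            for p in params]

def synthesize(params):
    """Synthesis: reconstruct a waveform from converted parameters
    (placeholder: passes the parameters through unchanged)."""
    return params

model = {"f0_ratio": 1.5, "gain": 0.8}              # stands in for training
source = [(100.0, [1.0, 2.0]), (110.0, [1.5, 2.5])]  # (f0, spectrum) frames
converted = synthesize(convert(analyze(source), model))
assert converted[0]["f0"] == 150.0
```

The point of the decomposition is that the machine-learning part (the conversion function) operates on compact speech parameters, while signal processing handles the mapping between waveforms and parameters at both ends.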
6. Demo of VC: Vocal Effector
• Convert my singing voice into a specific character's singing voice! [Toda; 2012][Kobayashi; 2018a]
[Figure: real-time VC software [Dr. Kobayashi, Nagoya Univ.] converts the input singing voice into a famous virtual singer's voice]
7. Recent Progress of VC Techniques
• Progress through the Voice Conversion Challenge (VCC) [Toda; 2016]
  http://www.vc-challenge.org/
• 1st VCC (VCC2016): parallel training
• 2nd VCC (VCC2018): parallel training; nonparallel training
• 3rd VCC (VCC2020): semi-parallel training; nonparallel training across different languages
[Figure: source-to-target speaker conversion results of freely available baseline systems and top systems: [Kobayashi; 2016][Liu; 2018][Toda; 2007][Kobayashi; 2018b][Zhang; 2020][Liu; 2020][Tobing; 2020][Huang; 2020]]
8. Recent Trend of VC Techniques
• Analysis: from simplified parametric decomposition (resonance & excitation parameters) to no decomposition (power spectrogram)
• Synthesis: from high-quality vocoders based on signal processing to data-driven deep waveform generation with neural vocoders
• Feature conversion: from frame-to-frame parametric probabilistic models (resonance modeling) to more complex sequence-to-sequence encoder-decoders w/ attention (joint resonance & excitation modeling)
• Training: from supervised parallel training (regression using time-aligned source & target features) to more flexible unsupervised nonparallel training (reconstruction through speaker-independent features) and pretrained models
9. NOTE: Risk of VC
• Need to consider the possibility that VC is misused for spoofing
  • VC makes it possible for someone to speak with your voice!
• But... we should NOT stop VC research, because there are many useful applications (e.g., speaking aids)!
• What can we do?
  • Collaborate with anti-spoofing research [Wu; 2015][Kinnunen; 2017][Todisco; 2019]
  • Need to widely tell people how to use VC correctly!
VC needs to be socially recognized in the same way as a kitchen knife: a useful tool that can also be misused.
11. Interactive VC
• Leverage interaction between user and system to develop cooperatively working functions for augmenting speech production
  • Achieve low-latency real-time (LLRT) processing: instantaneous feedback of system output to understand system behavior through interaction
  • Incorporate the physical mechanism and multimodal behavior signals
[Figure: speech produced by physical functions → interactive VC w/ LLRT processing → desired speech free from physical constraints; the user intentionally controls the system output via multimodal behavior signals, the physical mechanism provides involuntary control to avoid physically impossible output, and the user may acquire unconscious control skills]
[JST CREST, CoAugmentation Project (PI: Toda), 2019-]
12. Recent Progress of Interactive VC Techniques
1. LLRT VC with computationally efficient network architecture
2. Controllable waveform generation considering physical mechanism
3. Speech expression control with multimodal behavior signals
[Figure: produced speech passes through LLRT conversion processing (excitation conversion, resonance conversion, waveform generation) toward the desired speech; multimodal behavior signals provide controllability]
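A rough feel for what "low-latency real-time" demands can be given with a back-of-the-envelope latency budget. The specific numbers below (frame shift, lookahead, inference time) are hypothetical; the point is that algorithmic latency comes from the frame shift plus any future context the model waits for, on top of per-frame compute time.

```python
# Back-of-the-envelope latency budget for LLRT streaming conversion.
# All numbers are illustrative assumptions, not measurements of any system.
def llrt_latency_ms(frame_shift_ms, lookahead_frames, inference_ms):
    """Total input-to-output delay for one frame of streaming conversion:
    buffering delay (frame shift plus lookahead) plus compute time."""
    algorithmic = frame_shift_ms * (1 + lookahead_frames)
    return algorithmic + inference_ms

# e.g., 10 ms frame shift, 2 lookahead frames, 5 ms inference per frame
assert llrt_latency_ms(10.0, 2, 5.0) == 35.0
```

For real-time operation the per-frame inference time must also stay below the frame shift, which is why computationally efficient network architectures matter.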
14. LLRT VC w/ Computationally Efficient Network
• Feature conversion network based on VAE w/ sparse RNN [Tobing; 2021b]
[Figure: input speech → short-time frame analysis → mel-spectrogram → RNN encoders → speaker-independent latent features → speaker-aware RNN decoders conditioned on the speaker code of the target voice → converted mel-spectrogram & excitation parameters]
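The encode-to-speaker-independent-latent, decode-with-target-speaker-code idea can be sketched in a few lines. This is a toy illustration only: the actual network uses sparse RNN encoders/decoders per [Tobing; 2021b], while the plain linear layers and all dimensions here are hypothetical placeholders.

```python
# Toy sketch of speaker-conditioned VAE-style feature conversion:
# encoder maps features to a speaker-independent latent, decoder
# reconstructs features conditioned on a target speaker code.
import math, random

random.seed(0)

def linear(x, w, b):
    """y = W x + b for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def init(n_out, n_in):
    """Random weight matrix and zero bias (placeholder initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[random.uniform(-scale, scale) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

D_FEAT, D_LAT, D_SPK = 8, 4, 2       # feature, latent, speaker-code dims
enc = init(D_LAT, D_FEAT)            # encoder: features -> latent (mean only)
dec = init(D_FEAT, D_LAT + D_SPK)    # decoder: latent + speaker code -> features

def convert_frame(feat, target_spk_code):
    """Encode one frame to a speaker-independent latent, then decode
    with the target speaker code to obtain converted features."""
    z = linear(feat, *enc)
    return linear(z + target_spk_code, *dec)   # list concat = conditioning

out = convert_frame([0.1] * D_FEAT, [1.0, 0.0])  # one-hot target speaker
assert len(out) == D_FEAT
```

Training such a model with a reconstruction objective is what allows nonparallel data to be used: the latent is pushed to be speaker-independent, and the speaker code re-injects identity at decoding time.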
21. Behavior of Dilated Convolution Networks [Wu; 2021a]
[Figure: two network stacks driven by a noise signal, combining fixed dilated convolutions with F0-dependent dilated convolutions; waveforms generated from the 5th, 10th, 15th, and 20th layers illustrate the excitation generation and resonance filtering stages]
• The network is well factorized into excitation and resonance parts
• Significantly improves F0 controllability, including extrapolation performance
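The core mechanism behind such quasi-periodic networks is that the dilation of some layers is derived from the pitch period (sampling rate divided by F0), so the receptive field adapts to the current pitch. A minimal sketch of that indexing follows; the tap count, the dense factor, and the simple averaging "kernel" are illustrative assumptions, not the architecture of [Wu; 2021a].

```python
# Sketch of pitch-dependent (F0-dependent) dilation: the dilation of a
# layer is set from the pitch period so taps align with pitch cycles.
FS = 16000          # sampling rate in Hz (assumed)
DENSE_FACTOR = 4    # taps per pitch period (hypothetical value)

def pitch_dependent_dilation(f0_hz):
    """Dilation in samples derived from the pitch period fs / F0."""
    period = FS / f0_hz
    return max(1, round(period / DENSE_FACTOR))

def qp_layer(signal, f0_track, n_taps=2):
    """Toy 'convolution' (average of pitch-spaced taps) whose dilation
    follows the frame-wise F0 track."""
    out = []
    for t in range(len(signal)):
        d = pitch_dependent_dilation(f0_track[t])
        taps = [signal[t - k * d] for k in range(n_taps) if t - k * d >= 0]
        out.append(sum(taps) / len(taps))
    return out

# Higher F0 -> shorter pitch period -> smaller dilation
assert pitch_dependent_dilation(200.0) < pitch_dependent_dilation(100.0)
```

Because the dilation tracks F0 explicitly, the network keeps pitch structure separate from resonance structure, which is consistent with the factorization observed in the layer-wise waveforms above.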
26. Recent Progress of Interactive VC Techniques
1. LLRT VC with computationally efficient network architecture
2. Controllable waveform generation considering physical mechanism
3. Speech expression control with multimodal behavior signals
27. Augmented Speech Production: Speaking Aid
• Laryngectomees
  • Removal of the larynx: the trachea is separated from the vocal tract, so the sound source cannot be produced in the usual manner with vocal fold vibration...
• Alternative speaking methods
  • Electrolaryngeal (EL) speech with an electrolarynx, esophageal speech, ...
  • Suffer from unnatural speech quality and reduced expressiveness
[Figure: anatomy before and after laryngectomy (vocal folds, esophagus, trachea)]
Develop an augmented speech production system to recover lost voices!
28. Singing-Aid System with Interactive VC
• Interactive VC to convert EL speech into a singing voice [Morikawa; 2017][Li; 2019]
  • Real-time melody control by playing a MIDI keyboard
  • Freely sing an arbitrary song
[Figure: EL speech → resonance conversion (resonance features of EL speech → resonance features of singing voice); MIDI keyboard performance → MIDI melody pattern → F0 pattern conversion → F0 pattern of singing voice; both feed waveform generation → singing voice]
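The MIDI-to-F0 step implied by this pipeline is simple to sketch: note numbers played on the keyboard map to a target F0 pattern for the converted singing voice. Equal temperament with A4 = 440 Hz is assumed; the frame rate and the (note, duration) input format are hypothetical choices for illustration.

```python
# Sketch of mapping a MIDI melody to a frame-wise F0 pattern.
def midi_to_f0(note):
    """MIDI note number -> fundamental frequency in Hz (A4 = note 69),
    assuming equal temperament with A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def melody_to_f0_pattern(notes, frame_rate=200):
    """Expand (note, duration_sec) pairs into a frame-wise F0 pattern
    at an assumed frame rate of 200 frames/s."""
    f0 = []
    for note, dur in notes:
        f0.extend([midi_to_f0(note)] * round(dur * frame_rate))
    return f0

pattern = melody_to_f0_pattern([(69, 0.5), (71, 0.5)])  # A4 then B4
assert abs(pattern[0] - 440.0) < 1e-9
```

In the real system this target F0 pattern replaces the (largely flat) F0 of EL speech before waveform generation, which is what lets the user sing an arbitrary melody.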
30. Expression Control w/ Multimodal Signals
• "Karaoke"-type singing-aid system with interactive VC [Okawa; 2021]
  • Sing a song to background music without playing a musical instrument
  • Control vibrato by moving an arm
[Figure: EL speech → resonance conversion; background music provides the MIDI melody pattern for F0 pattern conversion; arm movements → arm position detection → vibrato control parameters; all feed waveform generation → singing voice]
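Arm-controlled vibrato can be sketched as a sinusoidal modulation of the F0 pattern whose depth is driven by arm position. Both the mapping from arm height to vibrato depth and the sinusoidal vibrato model are assumptions made for illustration, not the system's actual control law.

```python
# Sketch of expression control: arm position modulates vibrato depth,
# which sinusoidally modulates a frame-wise F0 pattern.
import math

def apply_vibrato(f0_pattern, depth_cents, rate_hz, frame_rate=200):
    """Modulate an F0 pattern with sinusoidal vibrato.
    depth_cents: peak deviation in cents (100 cents = 1 semitone)."""
    out = []
    for t, f0 in enumerate(f0_pattern):
        dev = depth_cents * math.sin(2.0 * math.pi * rate_hz * t / frame_rate)
        out.append(f0 * 2.0 ** (dev / 1200.0))  # cents -> frequency ratio
    return out

def arm_to_vibrato_depth(arm_height):
    """Hypothetical mapping: raising the arm (0..1) deepens the vibrato
    up to 100 cents."""
    return 100.0 * max(0.0, min(1.0, arm_height))

flat = [440.0] * 400                       # 2 s of A4 at 200 frames/s
sung = apply_vibrato(flat, arm_to_vibrato_depth(0.5), rate_hz=6.0)
assert max(sung) > 440.0 > min(sung)
```

Driving an expressive parameter from a natural body movement, rather than an instrument, is what makes the "karaoke"-style interaction possible for EL speakers.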
31. Summary
• Voice conversion (VC)
  • Technique to convert non-/para-linguistic information
  • Significant progress through the recent Voice Conversion Challenges (VCCs)
  • Needs to be recognized as a "kitchen knife": a useful tool that can also be misused
• From VC to interactive VC towards augmented speech production
  • Low-latency real-time conversion to achieve quick response
  • Incorporate the physical mechanism into the network, and additionally use multimodal behavior signals, to achieve better controllability
  • Immediate goal: achieve high-quality instantaneous feedback to help users understand system behavior through interaction
33. [Abe; 1990] M. Abe, S. Nakamura, K. Shikano, H. Kuwabara. Voice conversion through vector quantization.
J. Acoust. Soc. Jpn (E), Vol. 11, No. 2, pp. 71‒76, 1990.
[Huang; 2020] W.-C. Huang, T. Hayashi, S. Watanabe, T. Toda. The sequence-to-sequence baseline for the
Voice Conversion Challenge 2020: cascading ASR and TTS. Proc. Joint workshop for the Blizzard Challenge
and Voice Conversion Challenge 2020, pp. 160‒164, 2020.
[Kinnunen; 2017] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, K.A. Lee.
The ASVspoof 2017 Challenge: assessing the limits of replay spoofing attack detection. Proc.
INTERSPEECH, pp. 2‒6, 2017.
[Kobayashi; 2016] K. Kobayashi, S. Takamichi, S. Nakamura, T. Toda. The NU-NAIST voice conversion
system for the Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1667‒1671, 2016.
[Kobayashi; 2018a] K. Kobayashi, T. Toda, S. Nakamura. Intra-gender statistical singing voice conversion
with direct waveform modification using log-spectral differential. Speech Commun., Vol. 99, pp. 211‒220,
2018.
[Kobayashi; 2018b] K. Kobayashi, T. Toda. sprocket: open-source voice conversion software. Proc.
Odyssey, pp. 203‒210, 2018.
[Li; 2019] L. Li, T. Toda, K. Morikawa, K. Kobayashi, S. Makino. Improving singing aid system for
laryngectomees with statistical voice conversion and VAE-SPACE. Proc. ISMIR, pp. 784‒790, 2019.
[Liu; 2018] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, L.-R. Dai. WaveNet Vocoder with Limited Training Data
for Voice Conversion. Proc. INTERSPEECH, pp. 1983‒1987, 2018.
[Liu; 2020] L.-J. Liu, Y.-N. Chen, J.-X. Zhang, Y. Jiang, Y.-J. Hu, Z.-H. Ling, L.-R. Dai. Non-parallel voice
conversion with autoregressive conversion model and duration adjustment. Proc. Joint workshop for the
Blizzard Challenge and Voice Conversion Challenge 2020, pp. 126‒130, 2020.
[Morikawa; 2017] K. Morikawa, T. Toda. Electrolaryngeal speech modification towards singing aid system
for laryngectomees. Proc. APSIPA ASC, 4 pages, 2017.
[Okawa; 2021] S. Okawa, Y. Ishiguro, K. Otani, T. Nishino, K. Kobayashi, T. Toda, K. Takeda. A proposal for
adding singing expression using natural body movements in a singing system based on an electrolarynx. Proc.
25th IPSJ Symposium INTERACTION 2021, 6 pages, Mar. 2021 (in Japanese).
[Tobing; 2019] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion with
cyclic variational autoencoder. Proc. INTERSPEECH, pp. 674‒678, 2019.
34. [Tobing; 2020] P.L. Tobing, Y. Wu, T. Toda. Baseline system of Voice Conversion Challenge 2020 with
cyclic variational autoencoder and parallel WaveGAN. Proc. Joint workshop for the Blizzard Challenge and
Voice Conversion Challenge 2020, pp. 155‒159, 2020.
[Tobing; 2021a] P.L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, T. Toda. Non-parallel voice conversion
with cyclic variational autoencoder. Proc. INTERSPEECH, 5 pages, 2021 (to appear).
[Tobing; 2021b] P.L. Tobing, T. Toda. Non-parallel voice conversion with cyclic variational autoencoder.
Proc. 11th ISCA Speech Synthesis Workshop (SSW11), 6 pages, 2021 (to appear).
[Toda; 2007] T. Toda, A.W. Black, K. Tokuda. Voice conversion based on maximum likelihood estimation of
spectral parameter trajectory. IEEE Trans. Audio, Speech & Lang. Process., Vol. 15, No. 8, pp. 2222‒2235,
2007.
[Toda; 2012] T. Toda, T. Muramatsu, H. Banno. Implementation of computationally efficient real-time voice
conversion. Proc. INTERSPEECH, 4 pages, 2012.
[Toda; 2014] T. Toda. Augmented speech production based on real-time statistical voice conversion. Proc.
GlobalSIP, pp. 755‒759, 2014.
[Toda; 2016] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[Todisco; 2019] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N.
Evans, T.H. Kinnunen, K.A. Lee. ASVspoof 2019: future horizons in spoofed and fake audio detection. Proc.
INTERSPEECH, pp. 1008‒1012, 2019.
[van den Oord; 2016] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[Wu; 2015] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, H. Li. Spoofing and countermeasures for
speaker verification: A survey. Speech Commun. Vol. 66, pp. 130‒153, 2015.
[Wu; 2021a] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic parallel WaveGAN: a
non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural
network. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 792‒806, 2021.
[Wu; 2021b] Y.-C. Wu, T. Hayashi, T. Okamoto, H. Kawai, T. Toda. Quasi-periodic WaveNet: an
autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network.
IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 29, pp. 1134‒1148, 2021.
35. [Zhang; 2020] J.-X. Zhang, L.-J. Liu, Y.-N. Chen, Y.-J. Hu, Y. Jiang, Z.-H. Ling, L.-R. Dai. Voice conversion by
cascading automatic speech recognition and text-to-speech synthesis with prosody transfer. Proc. Joint
workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 121‒125, 2020.
* VCC series
[VCC2016 Summary] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The
Voice Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632‒1636, 2016.
[VCC2016 Analysis] M. Wester, Z. Wu, J. Yamagishi. Analysis of the Voice Conversion Challenge 2016
evaluation results. Proc. INTERSPEECH, pp. 1637‒1641, 2016.
[VCC2018 Summary] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z.
Ling. The voice conversion challenge 2018: promoting development of parallel and nonparallel methods.
Proc. Odyssey, pp. 195‒202, 2018.
[VCC2018 Analysis] T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, Z. Ling.
A spoofing benchmark for the 2018 voice conversion challenge: leveraging from spoofing countermeasures
for speech artifact assessment. Proc. Odyssey, pp. 187‒194, 2018.
[VCC2020 Summary] Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R.K. Das, T. Kinnunen, Z. Ling, T. Toda.
Voice Conversion Challenge 2020 ‒ intra-lingual semi-parallel and cross-lingual voice conversion ‒. Proc.
Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 80‒98, 2020.
[VCC2020 Analysis] R.K. Das, T. Kinnunen, W.-C. Huang, Z. Ling, J. Yamagishi, Z. Yi, X. Tian, T. Toda.
Predictions of subjective ratings and spoofing assessments of Voice Conversion Challenge 2020
submissions. Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 99‒
120, 2020.
* Nonparallel VC w/ speaker-independent representations
[PPG] L. Sun, K. Li, H. Wang, S. Kang, H.M. Meng. Phonetic posteriorgrams for many-to-one voice
conversion without parallel data training. Proc. IEEE ICME, 6 pages, 2016.
[VAE] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, H.-M. Wang. Voice conversion from non-parallel corpora
using variational auto-encoder. Proc. APSIPA ASC, 6 pages, 2016.
[VQVAE] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning. arXiv
preprint, arXiv:1711.00937, 11 pages, 2017.
36. * Vocoder
[STRAIGHT] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigne. Restructuring speech representations using
a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible
role of a repetitive structure in sounds. Speech Commun., Vol. 27, No. 3‒4, pp. 187‒207, 1999.
[WORLD] M. Morise, F. Yokomori, K. Ozawa. WORLD: a vocoder-based high-quality speech synthesis system
for real-time applications. IEICE Trans. Inf. & Syst., Vol. E99-D, No. 7, pp. 1877‒1884, 2016.
[LPCNet] J.-M. Valin, J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. Proc.
IEEE ICASSP, pp. 5891‒5895, 2019.
[GlotGAN] L. Juvela, B. Bollepalli, V. Tsiaras, P. Alku. GlotNet: a raw waveform model for the glottal
excitation in statistical parametric speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol.
27, No. 6, pp. 1019‒1030, 2019.
[GELP] L. Juvela, B. Bollepalli, J. Yamagishi, P. Alku. GELP: GAN-excited linear prediction for speech
synthesis from mel-spectrogram. Proc. INTERSPEECH, pp. 694‒698, 2019.
[NSF] X. Wang, S. Takaki, J. Yamagishi. Neural source-filter waveform models for statistical parametric
speech synthesis. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 28, pp. 402‒415, 2019.
[WaveNet] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W.
Senior, K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint, arXiv:1609.03499, 15
pages, 2016.
[Parallel WaveNet] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den
Driessche, E. Lockhart, L.C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N.
Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, D. Hassabis. Parallel WaveNet: fast high-
fidelity speech synthesis. arXiv preprint, arXiv:1711.10433, 11 pages, 2017.
[WaveRNN] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van
den Oord, S. Dieleman, K. Kavukcuoglu. Efficient neural audio synthesis. Proc. ICML, pp. 2410‒2419, 2018.
[PWG] R. Yamamoto, E. Song, J.-M. Kim. Parallel WaveGAN: A fast waveform generation model based on
generative adversarial networks with multi-resolution spectrogram. Proc. IEEE ICASSP, pp. 6199‒6203, 2020.
[aHM] G. Degottex, Y. Stylianou. Analysis and synthesis of speech using an adaptive full-band harmonic
model. IEEE Trans. Audio, Speech & Lang. Process., Vol. 21, No. 10, pp. 2085‒2095, 2013.
[DDSP] J. Engel, L. Hantrakul, C. Gu, A. Roberts. DDSP: differentiable digital signal processing. Proc. ICLR,
16 pages, 2020.