2018 Speech Processing Courses in Crete (SPCC2018)
"Toawrds flexible and intelligible end-to-end speech synthesis systems"
Lecture slides
Tomoki Toda: Advanced Voice Conversion, July 26, 2018
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
3. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress!
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
VC is a technique to generate
speech sounds conveying desired
para‐/non‐linguistic information!
Outline: What’s VC?
9. Basic Framework of Statistical VC
• Described as a regression problem
• Supervised training using utterance pairs of source & target speech
1. Training with parallel data (around 50 utterance pairs)
2. Conversion of any utterance while keeping linguistic contents unchanged
[Figure: the source speaker and the target speaker each utter the same sentences ("Please say the same thing.") to train the conversion model from source speech to target speech; the trained model then converts a new source utterance ("Let's convert my voice.") into the target speaker's voice.]
Example: speaker conversion [Abe; '90]
What's VC?: 6
14. Difficulty in Handling Speech Waveform
Sentence: 「あらゆる現実を全て自分の方へ・・・」 ("Every reality, all toward myself…")
[Figure: the speech waveform, the phoneme sequence "a r a y u r u (sil) g e N j i ts u", and the word sequence 「あらゆる」 ("every") + silence + 「現実」 ("reality") for the beginning of the sentence.]
• Need to properly model characteristics of speech waveform
• How to model long-term dependency over a sequence?
• How to model fluctuation components?
* Sorry for Japanese example
What's VC?: 11
15. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress!
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
There are several research topics.
Let’s look at them one by one.
Outline: VC progress
21. From Discontinuous to Continuous Conversion
• Model feature correlation more accurately
VQ-based conversion (codebook mapping [Abe; '90]):
• Discrete function w/ hard clustering
• Ignores feature correlation w/ discrete mapping
GMM-based conversion [Stylianou; '98]:
• Continuous function w/ soft clustering
• Directly models feature correlation w/ linear regression
[Figure: scatter plots of input vs. target features (original feature x on the horizontal axis, target feature y on the vertical axis) with joint-p.d.f. contours; codebook mapping yields a discontinuous mapping function, while the GMM yields a continuous one.]
1. VC progress on conversion model: 4
22. GMM-based Conversion
Training of joint p.d.f. (modeled by a GMM) [Kain; '98]:
$$
p(\boldsymbol{x}_t, \boldsymbol{y}_t \mid \lambda)
= \sum_{m=1}^{M} \alpha_m \,
\mathcal{N}\!\left(
\begin{bmatrix} \boldsymbol{x}_t \\ \boldsymbol{y}_t \end{bmatrix};
\begin{bmatrix} \boldsymbol{\mu}_m^{(x)} \\ \boldsymbol{\mu}_m^{(y)} \end{bmatrix},
\begin{bmatrix} \boldsymbol{\Sigma}_m^{(xx)} & \boldsymbol{\Sigma}_m^{(xy)} \\ \boldsymbol{\Sigma}_m^{(yx)} & \boldsymbol{\Sigma}_m^{(yy)} \end{bmatrix}
\right)
$$
where the parameter set λ consists of the component weights α_m, the mean vectors μ_m^(x), μ_m^(y), and the covariance matrix blocks Σ_m^(xx), Σ_m^(xy), Σ_m^(yx), Σ_m^(yy) of the joint feature vector [x_t; y_t].
Conversion w/ conditional p.d.f. (also modeled by a GMM) [Stylianou; '98]:
$$
p(\boldsymbol{y}_t \mid \boldsymbol{x}_t, \lambda)
= \frac{p(\boldsymbol{x}_t, \boldsymbol{y}_t \mid \lambda)}
       {\int p(\boldsymbol{x}_t, \boldsymbol{y}_t \mid \lambda)\, d\boldsymbol{y}_t}
= \sum_{m=1}^{M} P(m \mid \boldsymbol{x}_t, \lambda)\,
\mathcal{N}\!\left(\boldsymbol{y}_t; \boldsymbol{\mu}_{m,t}^{(xy)}, \boldsymbol{\Sigma}_m^{(xy)}\right)
$$
MMSE estimate (sketched in code below):
$$
\hat{\boldsymbol{y}}_t
= \int \boldsymbol{y}_t \, p(\boldsymbol{y}_t \mid \boldsymbol{x}_t, \lambda)\, d\boldsymbol{y}_t
= \sum_{m=1}^{M} P(m \mid \boldsymbol{x}_t, \lambda)\, \boldsymbol{\mu}_{m,t}^{(xy)}
$$
where μ_{m,t}^(xy) and Σ_m^(xy) denote the conditional mean vector and covariance matrix of y_t given x_t for the m-th component.
1. VC progress on conversion model: 5
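To make the two steps concrete, here is a minimal Python sketch of joint-GMM training and frame-wise MMSE conversion, assuming the source and target features are already time-aligned (e.g., by DTW); the function names and the use of scikit-learn are my own illustrative choices, not the original implementation.

```python
# Minimal sketch of GMM-based MMSE conversion [Stylianou; '98][Kain; '98].
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(x, y, n_components=32):
    """Fit a GMM to joint feature vectors z_t = [x_t; y_t]."""
    z = np.hstack([x, y])                          # (T, Dx + Dy)
    return GaussianMixture(n_components, covariance_type='full').fit(z)

def convert_mmse(gmm, x):
    """Frame-wise MMSE estimate: y_hat_t = sum_m P(m|x_t) mu_{m,t}^(xy)."""
    dx = x.shape[1]
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    S_xx = gmm.covariances_[:, :dx, :dx]
    S_yx = gmm.covariances_[:, dx:, :dx]
    # Posterior P(m | x_t, lambda) from the marginal GMM over x
    marg = GaussianMixture(gmm.n_components, covariance_type='full')
    marg.weights_, marg.means_, marg.covariances_ = gmm.weights_, mu_x, S_xx
    marg.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(S_xx))
    post = marg.predict_proba(x)                   # (T, M)
    # Conditional means mu_{m,t}^(xy) = mu_y + S_yx S_xx^{-1} (x_t - mu_x)
    y_hat = np.zeros((x.shape[0], gmm.means_.shape[1] - dx))
    for m in range(gmm.n_components):
        A = S_yx[m] @ np.linalg.inv(S_xx[m])       # per-component regression matrix
        y_hat += post[:, m:m + 1] * (mu_y[m] + (x - mu_x[m]) @ A.T)
    return y_hat
```

Frame-independent MMSE conversion like this tends to produce over-smoothed, discontinuous trajectories, which motivates the trajectory-level MLPG formulation on the next slide.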
25. Conversion w/ MLPG (Maximum Likelihood Parameter Generation [Tokuda; '00]) [Toda; '07]
• Simultaneously convert all frames over a time sequence (e.g., utterance); a numpy sketch follows below
$$
[\hat{\boldsymbol{y}}_1, \ldots, \hat{\boldsymbol{y}}_T]
= \mathop{\mathrm{argmax}}_{\boldsymbol{y}_1, \ldots, \boldsymbol{y}_T}
\prod_{t=1}^{T} P(\boldsymbol{y}_t \mid \boldsymbol{X}_t, \lambda)\,
P(\Delta\boldsymbol{y}_t \mid \boldsymbol{X}_t, \lambda)
$$
where X_1, X_2, …, X_T is the source feature sequence and the GMM gives the conditional p.d.f. for the static features y_t and the conditional p.d.f. for the dynamic features Δy_t (= linearly transformed). Since the dynamic features are a function of the static features, the converted static feature sequence ŷ_1, ŷ_2, …, ŷ_T is solved for jointly over all frames.
1. VC progress on conversion model: 8
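A minimal numpy/scipy sketch of the MLPG solution for a single feature dimension, assuming the per-frame conditional means and variances of the static and dynamic features (mu_s, var_s, mu_d, var_d) have already been computed from the GMM; real implementations handle all dimensions and mixture-dependent covariances.

```python
# Minimal per-dimension MLPG [Tokuda; '00] with delta window [-0.5, 0, +0.5].
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def mlpg(mu_s, var_s, mu_d, var_d):
    """Solve argmax_y N([y; Delta y]; mu, var), with Delta y = D y linear in y."""
    T = len(mu_s)
    I = sp.identity(T)
    # Delta operator: Delta y_t = 0.5 (y_{t+1} - y_{t-1}); edge terms are dropped
    D = sp.diags([-0.5 * np.ones(T - 1), 0.5 * np.ones(T - 1)], [-1, 1])
    W = sp.vstack([I, D]).tocsr()                      # [y; Delta y] = W y
    P = sp.diags(np.concatenate([1.0 / var_s, 1.0 / var_d]))  # diagonal precision
    mu = np.concatenate([mu_s, mu_d])
    # ML / weighted-least-squares solution: (W' P W) y = W' P mu
    return spsolve((W.T @ P @ W).tocsc(), W.T @ P @ mu)
```

The banded system (W' P W) is tridiagonal here, so the whole utterance is converted in closed form at negligible cost.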
31. 2. Improve an Objective Function
Key ideas are
 how to keep or reproduce natural speech fluctuation!
 how to handle errors of time alignment!
[Diagram: map of objective functions]
• Minimization of regression error (distance-based): MMSE [Stylianou; '98]
• Maximization of likelihood: MLE [Toda; '07]
• Regularization w/ a feature to capture oversmoothing effects: GV [Toda; '07]
• More generalized features: MS [Takamichi; '16]
• Data-driven regularization (divergence-based): GAN [Saito; '18]
• From sensitive to robust against misalignment: hidden alignment w/ MLE w/ DP-GMM [Nankaku; '07]; sequence mapping w/ GAN w/ Gated CNN [Kaneko; '17]
2. VC progress on objective function: 1
33. Regularization w/ GV
• Use GV likelihood as a regularization term in conversion [Toda; '07]
• Also possible to use it in training [Zen; '12][Hwang; '13]
• Simpler way: design postfilter to enhance GV [Toda; '12b] (sketched below)
$$
[\hat{\boldsymbol{y}}_1, \ldots, \hat{\boldsymbol{y}}_T]
= \mathop{\mathrm{argmax}}_{\boldsymbol{y}_1, \ldots, \boldsymbol{y}_T}
\prod_{t=1}^{T} P(\boldsymbol{y}_t \mid \boldsymbol{X}_t, \lambda)\,
P(\Delta\boldsymbol{y}_t \mid \boldsymbol{X}_t, \lambda)
\cdot P(\boldsymbol{v}(\boldsymbol{y}) \mid \lambda^{(v)})
$$
where the GMM gives the conditional p.d.f.s for the static and dynamic features as in MLPG, and the GV p.d.f. P(v(y) | λ^(v)) regularizes the GV v(y), a nonlinearly transformed function of the static features.
Postfilter (a simple linear transformation w/ mean & variance): the converted feature ŷ_t w/o GV has too small a GV; the enhanced feature ŷ_t^(GV) has a GV close to the natural one!
2. VC progress on objective function: 3
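The postfilter variant is simple enough to show in full. This is a minimal sketch assuming gv_target, the target speaker's natural GV, has been precomputed from training data; variable names are my own.

```python
# Minimal GV postfilter sketch: per-dimension linear transform w/ mean & variance.
import numpy as np

def gv_postfilter(y_converted, gv_target):
    """y_converted: (T, D) converted static features; gv_target: (D,) natural GV."""
    mean = y_converted.mean(axis=0)              # per-dimension mean
    gv_converted = y_converted.var(axis=0)       # GV of the converted sequence (too small)
    scale = np.sqrt(gv_target / gv_converted)    # variance-matching gain
    return mean + scale * (y_converted - mean)   # GV now close to the natural one
```

Scaling each dimension about its own mean raises the overly small variance of the converted trajectory to the natural level, which is exactly the simple mean-and-variance transformation noted above.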
34. From GV to Modulation Spectrum [Takamichi; '16]
• Decompose a parameter sequence into individual modulation frequency components (0 Hz, 0.25 Hz, 0.5 Hz, …); a sketch of the computation follows below
[Figure: a parameter sequence y_d over a few seconds equals the sum of its modulation frequency components, whose power values are v_{d,0}^(y), v_{d,1}^(y), v_{d,2}^(y), …, v_{d,f}^(y); the GV v_d^(y) corresponds to their total power.]
• p.d.f. modeling of their power values (i.e., their GVs)
• Incorporate them into the objective function [Takamichi; '15] or design postfilter [Takamichi; '16]
2. VC progress on objective function: 4
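A minimal sketch of how the modulation spectrum (MS) of a parameter sequence can be computed; a single whole-utterance DFT per dimension is assumed here, while the analysis settings in [Takamichi; '15][Takamichi; '16] should be taken from the papers.

```python
# Minimal modulation-spectrum sketch: power per modulation frequency component.
import numpy as np

def modulation_spectrum(y, fft_len=512):
    """y: (T, D) parameter sequence -> (fft_len//2 + 1, D) log power values,
    one per modulation frequency bin (0 Hz up to half the frame rate)."""
    spec = np.fft.rfft(y - y.mean(axis=0), n=fft_len, axis=0)  # per-dimension DFT over time
    return np.log(np.abs(spec) ** 2 + 1e-12)                   # power of each component
```

By Parseval's theorem the total power across these bins roughly recovers the sequence variance, i.e., the GV, which is the sense in which the MS generalizes it.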
38. 3. Improve Waveform Generation
Key ideas are
 how to leverage source waveform!
 how to avoid assumptions in source-filter model!
• Leverage source waveform
  • Waveform modification: PSOLA [Valbret; '92]
  • Direct waveform filtering [Kobayashi; '18a]: time-variant log-spectral differential filter estimation
• Directly improve vocoder
  • High-quality vocoder: HNM [Stylianou; '96], STRAIGHT [Kawahara; '99], AHOCODER [Erro; '14], WORLD [Morise; '16]
  • Excitation modeling: mixed excitation [Ohtani; '06], phase modeling [Kain; '01][Ye; '06], residual selection [Suendermann; '05a]
  • Excitation pulse generation w/ GAN [Juvela; '18]
  • Neural vocoder: WaveNet vocoder [Kobayashi; '17]
3. VC progress on waveform generation: 1
39. Direct Waveform Modification [Kobayashi; '18a]
• Apply time-variant filtering to input speech waveform to convert its spectral envelope only (a rough sketch follows below):
$$
\hat{s}_t^{(y)}[n] = \hat{h}_t^{(y/x)}[n] * s^{(x)}[n]
= \hat{h}_t^{(y)}[n] * h_t^{(x)^{-1}}[n] * s^{(x)}[n],
\qquad
\hat{H}_t^{(y/x)}(z) = \frac{\hat{H}_t^{(y)}(z)}{H_t^{(x)}(z)}
$$
where s^(x)[n] is the input speech waveform and the time-variant filter Ĥ_t^(y/x)(z) is given by a sequence of log-spectral differentials d̂_1, d̂_2, …, d̂_T (e.g., mel-cepstrum differentials).
• The converted parameters are generated w/ a DIFFGMM p(x_t, d_t | λ), obtained from the joint GMM p(x_t, y_t | λ) by the variable transformation d_t = y_t − x_t.
• Keep natural phase components!
• Alleviate the over-smoothing effects!
• But hard to convert excitation parameters (e.g., F0)
3. VC progress on waveform generation: 2
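A rough STFT-domain sketch of the filtering step, assuming the log-spectral differentials have already been estimated per frame. Note that [Kobayashi; '18a] applies a time-variant MLSA filter driven by mel-cepstrum differentials, so the frame-wise magnitude multiplication below is only an approximation of that filter.

```python
# Approximate sketch of direct waveform modification: amplitude-only,
# time-variant filtering that keeps the natural phase of the input.
import numpy as np
from scipy.signal import stft, istft

def filter_with_log_spectral_diff(x, log_diff, fs, n_fft=1024, hop=256):
    """x: input waveform s^(x)[n]; log_diff: (n_frames, n_fft//2 + 1)
    log-amplitude differentials aligned to the STFT frames."""
    _, _, X = stft(x, fs, nperseg=n_fft, noverlap=n_fft - hop)
    n = min(X.shape[1], log_diff.shape[0])
    X[:, :n] *= np.exp(log_diff[:n].T)   # apply H_t^(y/x) to the magnitude only
    _, y = istft(X, fs, nperseg=n_fft, noverlap=n_fft - hop)
    return y                             # natural phase components are preserved
```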
40. Excitation Modeling
• Hard to generate natural excitation waveforms using the traditional excitation model of the source-filter framework (a sketch of this pulse/noise model follows below):
$$ x[n] = h[n] * e[n] $$
where the excitation e[n] (a pulse train for voiced frames, Gaussian noise for unvoiced frames) is passed through the synthesis filter H(z) to produce synthetic speech.
• Two important components need to be modeled…
• Stochastic component: parameterized as frequency-dependent aperiodicity and statistically converted in a mixed-excitation framework [Ohtani; '06]
• Phase component: modeled with templates [Kain; '01] or a waveform reshaping filter [Ye; '06], or by developing a one-pitch residual waveform dataset and selecting the best entry using other speech parameters (e.g., F0 & spectral parameters) [Suendermann; '05a]
3. VC progress on waveform generation: 3
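For reference, a minimal sketch of the traditional pulse/noise excitation shown above, with frame-wise F0 values (0 = unvoiced) assumed as input; mixed excitation [Ohtani; '06] instead blends pulse and noise per frequency band according to the aperiodicity.

```python
# Minimal pulse/noise excitation sketch for the source-filter model.
import numpy as np

def generate_excitation(f0, fs=16000, hop=80):
    """f0: per-frame fundamental frequency in Hz (0 = unvoiced frame)."""
    e = np.zeros(len(f0) * hop)
    phase = 0.0
    for i, f in enumerate(f0):
        if f > 0:                              # voiced: impulse train at F0
            for n in range(hop):
                phase += f / fs
                if phase >= 1.0:               # one pitch period elapsed
                    phase -= 1.0
                    e[i * hop + n] = np.sqrt(fs / f)  # roughly energy-matched pulse
        else:                                  # unvoiced: Gaussian noise
            e[i * hop:(i + 1) * hop] = np.random.randn(hop)
            phase = 0.0
    return e                                   # synthetic speech: x[n] = h[n] * e[n]
```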
42. VC w/ WaveNet [van den Oord; '16b]
• Implementation of WaveNet vocoder for VC
• A target speaker-dependent WaveNet vocoder [Tamamori; '17] can generate speech waveforms almost indistinguishable from natural ones [Hayashi; '17]!
• Use the target speaker-dependent WaveNet vocoder to generate a speech waveform from converted speech parameters [Kobayashi; '17]
 Can significantly improve conversion accuracy on speaker identity!
 Could also reduce adverse effects of some errors on converted speech, e.g., by training the WaveNet vocoder w/ the converted speech parameters.
• Possible to directly use WaveNet for VC [Niwa; '18]
[Diagram: input speech → analysis → speech parameters → statistical conversion → converted speech parameters → synthesis w/ WaveNet vocoder → converted speech; feature extraction and conversion errors arise along the pipeline, but the WaveNet vocoder makes the converted speech less affected by these errors.]
3. VC progress on waveform generation: 5
46. Nonparallel Training w/ CycleGAN [Zhu; '17] [Fang; '18][Kaneko; '18]
• Simultaneously train two conversion networks between two speakers (losses sketched below).
[Diagram: a conversion network from x to y maps source data x to converted data x ⇒ y, and a conversion network from y to x maps target data y to converted data y ⇒ x. Passing converted data back through the opposite network yields x ⇒ y ⇒ x and y ⇒ x ⇒ y, scored by the cycle losses L(x, x̂) and L(y, ŷ). Discriminator networks for x and for y (0: converted, 1: natural) give the adversarial losses L(x) and L(y). All networks are trained by minimizing L(x) + L(y) + L(x, x̂) + L(y, ŷ).]
4. VC progress on flexible framework: 3
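A minimal PyTorch sketch of this training objective, assuming two generator networks G_xy, G_yx and two discriminators D_x, D_y are defined elsewhere; the least-squares GAN form and the cycle-loss weight lam_cyc = 10 are common choices, not necessarily those of [Fang; '18] or [Kaneko; '18].

```python
# Minimal CycleGAN-style loss sketch for nonparallel VC.
import torch
import torch.nn.functional as F

def cyclegan_losses(G_xy, G_yx, D_x, D_y, x, y, lam_cyc=10.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    dy_fake, dx_fake = D_y(fake_y), D_x(fake_x)
    # Adversarial losses: make converted data look natural (label 1) to D
    adv = (F.mse_loss(dy_fake, torch.ones_like(dy_fake))
         + F.mse_loss(dx_fake, torch.ones_like(dx_fake)))
    # Cycle-consistency: x -> y -> x and y -> x -> y must reconstruct the input
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    gen_loss = adv + lam_cyc * cyc
    # Discriminators: natural data -> 1, converted data -> 0
    dy_real, dx_real = D_y(y), D_x(x)
    dy_f, dx_f = D_y(fake_y.detach()), D_x(fake_x.detach())
    disc_loss = (F.mse_loss(dy_real, torch.ones_like(dy_real))
               + F.mse_loss(dy_f, torch.zeros_like(dy_f))
               + F.mse_loss(dx_real, torch.ones_like(dx_real))
               + F.mse_loss(dx_f, torch.zeros_like(dx_f)))
    return gen_loss, disc_loss
```

The cycle losses supply the correspondence constraint that parallel data would otherwise provide, which is why no aligned utterance pairs are needed.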
47. One-to-Many (or Many-to-One) VC [Toda; '06]
• Convert reference speaker's voice into arbitrary speaker's voice
• Training datasets: use multiple parallel datasets between the reference speaker & individuals of S prestored speakers (reference speech X_t paired with the 1st speaker's Y^(1)_{1:T_1}, the 2nd speaker's Y^(2)_{1:T_2}, …, the S-th speaker's Y^(S)_{1:T_S})
• Model training: model the frame-dependent contextual factor z_t (context info, t = 1:T_s) and the utterance-dependent speaker factor w^(s) (speaker info, s = 1:S) w/ different latent variables in the mapping from X_t to Y^(s)_t
Factorize speaker and context using the reference speaker as an anchor point!
4. VC progress on flexible framework: 4
48. Eigenvoice Conversion (EVC) [Toda; '06]
• Factorize GMM mean vectors into context- and speaker-dependent components using the eigenvoice technique [Kuhn; '00] (a sketch follows below)
$$
\underbrace{\begin{bmatrix} \boldsymbol{\mu}_1^{(s)} \\ \boldsymbol{\mu}_2^{(s)} \\ \vdots \\ \boldsymbol{\mu}_M^{(s)} \end{bmatrix}}_{\text{super vector}}
=
\underbrace{\begin{bmatrix} \boldsymbol{b}_1^{(0)} \\ \boldsymbol{b}_2^{(0)} \\ \vdots \\ \boldsymbol{b}_M^{(0)} \end{bmatrix}}_{\text{bias vector}}
+
\underbrace{\begin{bmatrix} \boldsymbol{b}_1^{(1)} & \cdots & \boldsymbol{b}_1^{(J)} \\ \boldsymbol{b}_2^{(1)} & \cdots & \boldsymbol{b}_2^{(J)} \\ \vdots & & \vdots \\ \boldsymbol{b}_M^{(1)} & \cdots & \boldsymbol{b}_M^{(J)} \end{bmatrix}}_{\text{basis vectors}}
\underbrace{\begin{bmatrix} w_1^{(s)} \\ \vdots \\ w_J^{(s)} \end{bmatrix}}_{\text{factors}}
$$
• Super vector = concatenated means (context & speaker dependent)
• Bias vector = average speaker (context dependent)
• Basis vectors = typical speaker variations (context dependent)
• Factors = speaker dependent, used as speaker-adaptive parameters
4. VC progress on flexible framework: 5
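A minimal numpy sketch of this supervector factorization, assuming the S prestored speakers' adapted GMM means have already been stacked into supervectors; PCA via SVD stands in for the eigenvoice basis estimation, and the ML estimation of w for a new speaker (the adaptation step on the next slide) is omitted.

```python
# Minimal eigenvoice factorization sketch: bias + bases x speaker factors.
import numpy as np

def build_eigenvoice(supervectors, n_bases):
    """supervectors: (S, M*D) concatenated mean vectors of prestored speakers."""
    bias = supervectors.mean(axis=0)                 # b^(0): average-speaker means
    _, _, Vt = np.linalg.svd(supervectors - bias, full_matrices=False)
    bases = Vt[:n_bases]                             # (J, M*D) typical speaker variations
    return bias, bases

def speaker_means(bias, bases, w):
    """mu^(s) = b^(0) + B w^(s): reconstruct a speaker's concatenated means."""
    return bias + bases.T @ w                        # w: (J,) speaker-adaptive parameters
```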
49. Demo: Many-to-One EVC
• Convert arbitrary speaker's voice into pretrained target speaker's voice
• Adaptation: unsupervised estimation of the speaker-adaptive parameter ŵ from given input speech x_{1:T} (w/ latent context z_t, t = 1:T)
• Conversion: use of the model adapted w/ the estimated speaker-adaptive parameter ŵ to convert input X_t into target Y_t
Sorry for the old demo system, developed more than 10 years ago…
4. VC progress on flexible framework: 6
51. Speaker-Independent Feature Extraction [Sun; '16]
• Extract phoneme posteriorgram (PPG) as speaker-independent contextual features and use it as input of the conversion network (a sketch follows below)
[Diagram: a phone recognizer maps the source feature sequence x_1, x_2, x_3, …, x_T to the PPG p_1, p_2, …, p_T (removing speaker dependencies); a target-dependent conversion network maps the PPG to the target feature sequence y_1, …, y_T (adding speaker dependencies). The conversion network is trained from target speech data and its PPG data produced by the phone recognizer.]
No longer need to use parallel data!
4. VC progress on flexible framework: 8
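A minimal PyTorch sketch of the target-dependent conversion network, assuming a pretrained phone recognizer is available as a function `recognizer` returning per-frame phoneme posteriors; the BLSTM regression model is a typical but purely illustrative choice.

```python
# Minimal PPG-to-feature regression network sketch.
import torch
import torch.nn as nn

class PPG2Feature(nn.Module):
    def __init__(self, n_phones, feat_dim, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_phones, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, feat_dim)

    def forward(self, ppg):              # ppg: (B, T, n_phones) posteriorgram
        h, _ = self.rnn(ppg)
        return self.out(h)               # (B, T, feat_dim) target features

# Training uses only target speech: ppg = recognizer(target_feats);
# loss = MSE(model(ppg), target_feats). At conversion time the source
# utterance's PPG goes through the same model, so no parallel data is needed.
```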
52. Unsupervised Factorization
• Use multiple nonparallel datasets to develop a factorized conversion model (e.g., CVAE [Hsu; '16] or ARBM [Nakashika; '16]) without any other models (a CVAE sketch follows below)
Conditional Variational Autoencoder (CVAE):
• Training step: the encoder network removes speaker dependencies, mapping target frames Y^(s)_t (t = 1:T, s = 1:S) to a latent context z_t under a Gaussian prior N(0, I); the decoder network adds speaker dependencies back via the speaker code w^(s). A speaker-independent encoder network is trained! A speaker-adaptive decoder is trained!
• Conversion step: encode input X_t (t = 1:T) into z_t, then decode w/ the target speaker's code w^(s).
• GAN can be used in VAW-GAN [Hsu; '17].
4. VC progress on flexible framework: 9
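A minimal PyTorch sketch of the CVAE training objective, with fully connected encoder/decoder layers as placeholders for the networks in [Hsu; '16]; spk_code stands for a one-hot speaker code w^(s).

```python
# Minimal CVAE sketch for VC: encoder removes speaker, decoder adds it back.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, feat_dim, n_speakers, z_dim=64, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + n_speakers, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, y, spk_code):
        h = self.enc(y)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        y_hat = self.dec(torch.cat([z, spk_code], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())  # vs N(0, I) prior
        return F.mse_loss(y_hat, y) + kl    # negative ELBO (up to constants)

# Conversion: z = encoder(source frames); output = decoder(z, target speaker code).
```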
58. Overall Result of VCC2016 Listening Tests
• 22 submitted systems + 1 baseline system were evaluated.
[Figure: scatter plot of mean opinion score (MOS) on naturalness (1-5) vs. correct rate [%] on speaker similarity (0-100) for the target and source speech, the baseline, and submitted systems A-Q; better systems lie toward the upper right. One highlighted system reaches a correct rate of 75% at an MOS of 3.5.]
5. VC progress on comparison: 3
60. Voice Conversion Challenge 2018 (VCC2018) [Lorenzo-Trueba; '18]
• Two tasks
  • HUB task (main): parallel training
  • SPOKE task (optional): nonparallel training
• Evaluation
  • Naturalness and speaker similarity by listening tests
  • Word error rate and spoofing results
• Design of VCC2018 dataset (using DAPS [Mysore; '15]), down-sampled to 22.05 kHz:
  • Source speakers (HUB task): 2 females & 2 males, 81 sentences for training & 35 for evaluation
  • Target speakers (both tasks): 2 females & 2 males, 81 sentences for training
  • Other source speakers (SPOKE task): 2 females & 2 males, another 81 sentences for training & 35 for evaluation
5. VC progress on comparison: 5
61. Overall Results of VCC2018 Listening Tests
• 23 submitted systems + 1 baseline system were evaluated in the HUB task.
• 11 submitted systems + 1 baseline system were evaluated in the SPOKE task.
[Figure: two scatter plots (results of the HUB task and of the SPOKE task) of MOS on naturalness (1-5) vs. similarity score [%] (0-100), with the baseline system and the N10 and N17 systems highlighted in each task.]
5. VC progress on comparison: 6
66. 6. Develop Various Applications
Key ideas are
 how to apply VC techniques to various mapping tasks!
 development of real-time VC (RT-VC) applications!
[Diagram: map of applications]
• Telecommunication: bandwidth extension [Jax; '03]
• TTS: cross-lingual VC [Abe; '91], speech translation [Hattori; '11], accent conversion [Felps; '09]
• Entertainment: singing VC [Villavicencio; '10][Doi; '12], voice changer & vocal effector [Kobayashi; '14] (RT-VC)
• Speaking-aid: intelligibility enhancement of disordered speech [Kain; '07][Aihara; '14], alaryngeal speech enhancement [Nakamura; '12][Doi; '14] (RT-VC)
• Articulatory modification: inversion & production mapping [Richmond; '03][Toda; '08], articulatory controllable waveform modification [Tobing; '17]
• Augmented speech production [Toda; '14]: silent speech communication [Toda; '12a] (RT-VC), F0-controlled electrolarynx [Tanaka; '17]
* NOTE: More applications have been studied
6. VC progress on application: 1
67. Real-Time Statistical Voice Conversion [Toda; '12b]
• Approximate sequence-based conversion with low-delay conversion by propagating all past info and also looking at near-future info (a streaming sketch follows below)
Batch-type (sequence-based) conversion, over the whole source feature sequence:
$$ [\hat{\boldsymbol{y}}_1, \hat{\boldsymbol{y}}_2, \ldots, \hat{\boldsymbol{y}}_T] = \mathrm{f}(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_T; \lambda) $$
Low-delay frame-wise conversion, using past frames plus L look-ahead frames:
$$ \hat{\boldsymbol{y}}_t = \mathrm{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_t, \boldsymbol{x}_{t+1}, \ldots, \boldsymbol{x}_{t+L}; \lambda) $$
6. VC progress on application: 2
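A minimal sketch of the low-delay scheduling only, assuming a hypothetical `convert_frame` callback that carries its own recursive state over past frames (e.g., a limited-look-ahead approximation of MLPG); the conversion math itself is not shown here.

```python
# Minimal streaming scheduler sketch for low-delay frame-wise conversion.
def streaming_convert(frame_source, convert_frame, lookahead=2):
    """Emit converted frame t once x_{t+L} has arrived (algorithmic delay: L frames)."""
    buf, state = [], None
    for x in frame_source:                 # source frames arrive one by one
        buf.append(x)
        if len(buf) > lookahead:           # current frame + L look-ahead frames ready
            y, state = convert_frame(buf, state)   # state propagates all past info
            yield y
            buf.pop(0)
    # (remaining look-ahead frames at the end of the stream are flushed in practice)
```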
68. 1. Speaking-Aid: Alaryngeal Speech Enhancement [Doi; '14]
• Real-time conversion from alaryngeal speech into normal speech
[Figure: spectrograms (0-8000 Hz, 1-3 s) of esophageal speech and of the enhanced speech. VC maps the laryngectomee's spectral segments to the F0 pattern, aperiodicity, and spectral envelope of normal speech, from which the enhanced waveform is synthesized.]
Augmented speech production beyond physical constraints!
6. VC progress on application: 3
69. 2. Silent Speech Communication [Toda; '12a]
• Real-time conversion from non-audible murmur (very soft whispered voice) [Nakajima; '06], detected w/ a body-conductive microphone, to natural voices
[Diagram: on the speaking side, the user speaks something private (e.g., "My account number is …") in non-audible murmur or soft voice into a non-audible murmur microphone; on the listening side, more naturally sounding speech is presented to only a specific listener. Non-audible murmur is converted to air-conducted normal or whispered voice, and body-conducted soft voice to normal voice.]
Augmented speech production to develop telepathy-like communication!
6. VC progress on application: 4
71. Outline
• What is voice conversion (VC)?
• Why is VC needed?
• How to do VC?
• Tell us VC research history and recent progress!
• How to improve a conversion model?
• How to improve an objective function?
• How to generate a converted waveform?
• How to make training more flexible?
• How to compare different techniques?
• How to develop applications?
• Summary
Let me tell you about one more important thing.
Outline: Summary
75. [Erro; ’14] D. Erro, I. Sainz, E. Navas, I. Hernaez. Harmonics plus noise model based vocoder for statistical
parametric speech synthesis. IEEE J. Sel. Topics in Signal Process., Vol. 8, No. 2, pp. 184–194, 2014.
[Fang; ’18] F. Fang, J. Yamagishi, I. Echizen, J. Lorenzo‐Trueba. High‐quality nonparallel voice conversion
based on cycle‐consistent adversarial network. Proc. IEEE ICASSP, pp. 5279–5283, 2018.
[Felps; ’09] D. Felps, H. Bortfeld, R. Gutierrez‐Osuna. Foreign accent conversion in computer assisted
pronunciation training. Speech Commun., Vol. 51, No. 10, pp. 920–932, 2009.
[Goodfellow; ’14] I. Goodfellow, J. Pouget‐Abadie, M. Mirza, B. Xu, D. Warde‐Farley, S. Ozair, A. Courville, Y.
Bengio. Generative adversarial nets. Proc. NIPS, pp. 2672–2680, 2014.
[Hattori; '11] N. Hattori, T. Toda, H. Kawai, H. Saruwatari, K. Shikano. Speaker-adaptive speech
synthesis based on eigenvoice conversion and language-dependent prosodic conversion in speech-to-
speech translation. Proc. INTERSPEECH, pp. 2769–2772, 2011.
[Hayashi; ’17] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, T. Toda. An investigation of multi‐speaker
training for WaveNet vocoder. Proc. IEEE ASRU, pp. 712–718, 2017.
[Hsu; ’16] C.‐C. Hsu, H.‐T. Hwang, Y.‐C. Wu, Y. Tsao, H.‐M. Wang. Voice conversion from non‐parallel corpora
using variational auto‐encoder. Proc. APSIPA ASC, 6 pages, 2016.
[Hsu; ’17] C.‐C. Hsu, H.‐T. Hwang, Y.‐C. Wu, Y. Tsao, H.‐M. Wang. Voice conversion from unaligned corpora
using variational autoencoding Wasserstein generative adversarial networks. Proc. INTERSPEECH, pp.
3364–3368, 2017.
[Hwang; ’13] H. Hwang, Y. Tsao, H. Wang, Y. Wang, S. Chen. Incorporating global variance in the training
phase of GMM‐based voice conversion. Proc. APSIPA ASC, 6 pages, 2013.
[Jax; ’03] P. Jax, P. Vary. On artificial bandwidth extension of telephone speech. Signal Processing, Vol. 83,
pp. 1707–1719, 2003.
[Jin; ’16] Z. Jin, A. Finkelstein, S. DiVerdi, J. Lu, G.J. Mysore. CUTE: a concatenative method for voice
conversion using exemplar‐based unit selection. Proc. IEEE ICASSP, pp. 5660–5664, 2016.
References: 2
81. [Toda; ’16] T. Toda, L.‐H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi. The Voice
Conversion Challenge 2016. Proc. INTERSPEECH, pp. 1632–1636, 2016.
[Tokuda; ’00] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura. Speech parameter generation
algorithms for HMM‐based speech synthesis. Proc. IEEE ICASSP, pp. 1315–1318, 2000.
[Valbret; ’92] H. Valbret, E. Moulines and J. P. Tubach. Voice transformation using PSOLA technique. Speech
Commun., Vol. 11, No. 2–3, pp. 175–187, 1992.
[van den Oord; ’16a] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, K. Kavukcuoglu.
Conditional image generation with PixelCNN decoders. arXiv preprint, arXiv:1606.05328, 13 pages, 2016.
[van den Oord; ’16b] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N.
Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. WaveNet: a generative model for raw audio. arXiv preprint,
arXiv:1609.03499, 15 pages, 2016.
[van den Oord; ’17] A. van den Oord, O. Vinyals, K. Kavukcuoglu. Neural discrete representation learning.
arXiv preprint, arXiv:1711.00937, 11 pages, 2017.
[Villavicencio; ’10] F. Villavicencio, J. Bonada. Applying voice conversion to concatenative singing‐voice
synthesis. Proc. INTERSPEECH, pp. 2162–2165, 2010.
[Wu; ’14] Z. Wu, T. Virtanen, E. Chng, H. Li. Exemplar‐based sparse representation with residual
compensation for voice conversion. IEEE/ACM Trans. Audio, Speech & Lang. Process., Vol. 22, No. 10, pp.
1506–1521, 2014.
[Wu; ’15] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, H. Li. Spoofing and countermeasures for
speaker verification: A survey. Speech Commun. Vol. 66, pp. 130–153, 2015.
[Wu; ’17] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, H.
Delgado. ASVspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE J.
Sel. Topics in Signal Process., Vol. 11, No. 4, pp. 588–604, 2017.
[Xu; ’14] N. Xu, Y. Tang, J. Bao, A. Jiang, X. Liu, Z. Yang. Voice conversion based on Gaussian processes by
coherent and asymmetric training with limited training data. Speech Commun., Vol. 58, pp. 124–138, 2014.
References: 8