This document summarizes a study on voice conversion using sequence-to-sequence learning. The researchers propose converting context posterior probabilities from the source speaker to the target speaker with sequence-to-sequence learning, which allows variable-length conversion. They also propose jointly training the recognition and synthesis models so that improvements in recognition accuracy translate into improvements in synthesis accuracy. Experiments showed that sequence-to-sequence learning enabled variable-length conversion and that joint training improved the speaker similarity and quality of the converted speech over conventional methods.
Sequence-to-Sequence Voice Conversion Using Context Posterior Probabilities

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities
Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari (The University of Tokyo)
INTERSPEECH Tue-O-4-10-1, Stockholm, Sweden, Aug. 22, 2017
Outline of This Talk
Issue:
Voice conversion needs parallel data of source and target speakers.
Conventional method:
Voice conversion using context posterior probabilities (CPPs) [Sun et al., 2016]
1. Recognition: source speech feats. → source CPPs
2. Synthesis: copied source CPPs → target speech feats.
Pros: non-parallel voice conversion
Cons: difficulty of converting the speaker individuality included in CPPs
Proposed:
Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs
Joint training of recognition and synthesis to increase conversion performance
Results:
Seq2Seq learning achieved variable-length voice conversion.
Joint training improved speaker similarity and quality of converted speech.
Conventional Voice Conversion Algorithm:
Copied Context Posterior Probability
[Sun et al., 2016]

[Figure: training diagram. Recognition: source speech feats. $\boldsymbol{x}$ are fed to an LSTM $\boldsymbol{R}(\cdot)$ that outputs the source CPP $\hat{\boldsymbol{p}}_x$ (posteriors over contexts such as /a/, /i/, /u/ per frame). Synthesis: the target CPP $\hat{\boldsymbol{p}}_y$ is fed to an LSTM $\boldsymbol{G}(\cdot)$ that outputs target speech feats. $\boldsymbol{G}(\hat{\boldsymbol{p}}_y)$.]

1. Recognition error (softmax cross entropy): $L_C(\hat{\boldsymbol{p}}_x, \boldsymbol{l}_x)$, where $\boldsymbol{l}_x$ is the context label.
2. Synthesis error (mean squared error): $L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_y), \boldsymbol{y})$, where $\boldsymbol{y}$ is the target speech feats.

The recognition and synthesis models are trained separately.
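To make the data flow concrete, below is a minimal PyTorch-style sketch of the two models and their separate losses. Dimensions follow the experimental setup slide later in the talk; all class and variable names are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class Recognizer(nn.Module):
    """R(.): speech feature frames -> context posterior probabilities (CPPs)."""
    def __init__(self, feat_dim=48, num_contexts=224, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_contexts)

    def forward(self, x):            # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)
        return self.out(h)           # logits; softmax over the last dim is the CPP

class Synthesizer(nn.Module):
    """G(.): CPPs -> speech feature frames."""
    def __init__(self, num_contexts=224, feat_dim=48, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(num_contexts, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, feat_dim)

    def forward(self, p):            # p: (batch, frames, num_contexts)
        h, _ = self.blstm(p)
        return self.out(h)

# Separated training:
#   the recognizer minimizes L_C(p_hat_x, l_x)   (softmax cross entropy)
#   the synthesizer minimizes L_G(G(p_hat_y), y) (mean squared error)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
```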
Issues of Conventional Voice Conversion
1. CPPs’ shapes and lengths are significantly different between speakers.
Shapes are different.
Lengths of each phoneme are different.
2. Improving recognition accuracy ≠ improving synthesis accuracy
Conventional method separately trains speech recognition/synthesis.
Proposed Algorithms
1. Sequence-to-Sequence Conversion from Source CPP to Target CPP
2. Joint Training of Recognition and Synthesis
(like auto-encoding)
Sequence-to-Sequence Learning [Sutskever et al., 2014]
Sequence-to-Sequence Learning: variable-length conversion
[Figure: Japanese-to-English translation using Seq2Seq learning. Encoder input sequence: 雨 が 降る; decoder output sequence: "It rains".]
Constraints:
- Phoneme duration is given.
- Conversion is done phoneme by phoneme.
Problems of Seq2Seq conversion of CPPs [Weng et al., 2016]:
- Determining duration is difficult.
- Conversion failures propagate if the number of frames to be generated is large.
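As a rough illustration of how a Seq2Seq model can read one number of frames and emit another, here is a hypothetical encoder-decoder sketch for a single phoneme segment, with the output duration given as the constraint above requires. This is a PyTorch-style sketch with names of my own choosing, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CPPSeq2Seq(nn.Module):
    """Converts a source-CPP segment into a target-CPP segment of a given length."""
    def __init__(self, num_contexts=224, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(num_contexts, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(num_contexts, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_contexts)

    def forward(self, src_cpp, out_len):
        # Encode the source segment; its final hidden state seeds the decoder.
        _, (h, c) = self.encoder(src_cpp)                    # each: (2, batch, hidden)
        h = h.transpose(0, 1).reshape(src_cpp.size(0), -1).unsqueeze(0)
        c = c.transpose(0, 1).reshape(src_cpp.size(0), -1).unsqueeze(0)
        state = (h.contiguous(), c.contiguous())
        # Decode exactly out_len frames: the duration is assumed given,
        # so the model never has to decide when to stop.
        frame = src_cpp.new_zeros(src_cpp.size(0), 1, src_cpp.size(2))
        frames = []
        for _ in range(out_len):
            o, state = self.decoder(frame, state)
            frame = torch.softmax(self.out(o), dim=-1)       # feed prediction back
            frames.append(frame)
        return torch.cat(frames, dim=1)                      # (batch, out_len, contexts)
```

Feeding each predicted CPP back as the next decoder input is also why conversion failures propagate when many frames must be generated, which motivates the phoneme-by-phoneme constraint.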
Effect of the Proposed Algorithm
Variable-length voice conversion
[Figure: CPP values (0 to 1) plotted over frames/time for the source CPP, the target CPP, and the CPP after Seq2Seq conversion.]
Variable-length conversion of CPPs is achieved!
Joint Training of Speech Recognition and Synthesis
[Figure: joint training diagram. Source speech feats. $\boldsymbol{x}$ are fed to the LSTM recognizer $\boldsymbol{R}(\cdot)$, which outputs the source CPP $\hat{\boldsymbol{p}}_x$; the LSTM synthesizer $\boldsymbol{G}(\cdot)$ then reconstructs the source speech feats. as $\boldsymbol{G}(\hat{\boldsymbol{p}}_x)$, like an auto-encoder.]

Joint training loss:
$L_C(\boldsymbol{R}(\boldsymbol{x}), \boldsymbol{l}_x) + L_G(\boldsymbol{G}(\hat{\boldsymbol{p}}_x), \boldsymbol{x})$
= (conventional recognition error) + (synthesis error using the predicted CPP)
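Below is a sketch of one joint-training step under this combined loss, reusing the hypothetical Recognizer/Synthesizer sketch from earlier. The key point is that the synthesis error backpropagates through the predicted CPP into the recognizer, tying recognition accuracy to synthesis accuracy.

```python
# One joint-training step; recognizer, synthesizer, ce, mse as sketched earlier.
optimizer = torch.optim.Adagrad(
    list(recognizer.parameters()) + list(synthesizer.parameters()), lr=0.01)

def joint_step(x, l_x):
    """x: source speech feats. (batch, frames, dim); l_x: context labels (batch, frames)."""
    logits = recognizer(x)                           # R(x)
    p_hat_x = torch.softmax(logits, dim=-1)          # predicted source CPP
    x_hat = synthesizer(p_hat_x)                     # G(p_hat_x), auto-encoding x
    loss = ce(logits.reshape(-1, logits.size(-1)), l_x.reshape(-1)) \
           + mse(x_hat, x)                           # L_C + L_G
    optimizer.zero_grad()
    loss.backward()                                  # gradients reach R(.) via p_hat_x
    optimizer.step()
    return loss.item()
```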
Experimental Setup
Dataset: ATR Japanese speech database (503 phonetically balanced sentences)
Training/test split: 450 sentences / 53 sentences (16 kHz sampling)
Linguistic feats.: 224-dimensional vectors (phonemes)
Speech feats.: mel-cepstrum (1st through 24th) + delta
Optimization algorithm: AdaGrad (learning rate = 0.01) [Duchi et al., 2011]
Recognition/synthesis models: bidirectional LSTM (256 units)
Encoder/decoder: bidirectional LSTM / LSTM (256 units each)
Number of speakers: 8, including the source and target speakers
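"Mel-cepstrum + delta" means each frame's 24 coefficients are augmented with their frame-to-frame dynamics. A minimal NumPy sketch, assuming a simple central-difference delta window (the slide does not specify the window used):

```python
import numpy as np

def append_delta(mcep):
    """mcep: (frames, 24) mel-cepstra -> (frames, 48) with delta features appended.
    The central-difference window here is an assumption; the slide only
    says "mel-cepstrum + delta"."""
    padded = np.pad(mcep, ((1, 1), (0, 0)), mode="edge")   # repeat edge frames
    delta = 0.5 * (padded[2:] - padded[:-2])               # (mc[t+1] - mc[t-1]) / 2
    return np.concatenate([mcep, delta], axis=1)
```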
Objective and Subjective Evaluations of
Seq2Seq Learning
[Figures: objective and subjective evaluation results comparing Seq2Seq conversion against the source and target speech; error bars denote 95% confidence intervals.]
Voice samples are available online.
Objective Evaluation of Joint Training
Joint training achieved a better (lower) mel-cepstral distortion score.
Auto-encoding case: calculates the reconstruction error after recognition and synthesis.
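For reference, mel-cepstral distortion is the standard objective measure of spectral difference between converted and target speech. A minimal NumPy sketch, assuming time-aligned frames and the 0th (energy) coefficient already excluded:

```python
import numpy as np

def mel_cepstral_distortion(mc_conv, mc_tgt):
    """MCD in dB between aligned mel-cepstrum sequences of shape (frames, dim).
    Assumes frames are already time-aligned and the 0th coefficient is excluded."""
    diff = mc_conv - mc_tgt
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)      # lower is better
```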
Subjective Evaluation of Joint Training
In the subjective evaluation, joint training improved both speaker similarity and speech quality!
Conclusion
Issue:
Difficulty of converting speaker individuality included in CPPs.
Improving recognition accuracy ≠ improving synthesis accuracy.
Proposed:
Sequence-to-sequence (Seq2Seq) conversion from source CPPs to target CPPs.
Joint training of recognition and synthesis.
Results:
Seq2Seq learning achieved variable-length voice conversion.
Joint training improved speaker similarity and quality of converted speech.