The document proposes a phrase break prediction model that combines BERT and BiLSTM models. It achieves an absolute improvement of 3.2 points in F1 score over previous models and a MOS of 4.39 for prosody naturalness, competitive with speech synthesized using ground-truth phrase breaks. The model extracts implicit features from BERT and explicit features from a BiLSTM with linguistic inputs to better predict phrase break positions for Japanese text-to-speech synthesis.
Phrase break prediction with bidirectional encoder representations in Japanese TTS
1. Phrase break prediction with bidirectional encoder representations in Japanese TTS
Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
2. Summary
[Architecture diagram: the raw text is pre-processed into a word sequence, a feature sequence, and a subword sequence; the word and feature sequences are embedded and fed to a BiLSTM (explicit features), the subword sequence is embedded and fed to BERT (implicit features); the two feature sets are concatenated and passed to an MLP that outputs the phrase break information.]
• Proposed a phrase break prediction model that combines BERT and BiLSTM
• Achieves an absolute improvement of 3.2 points in F1 score over previous models [1][2]
• Achieves a MOS of 4.39 in prosody naturalness, which is competitive with speech synthesized using ground-truth phrase breaks
[1]: A. Vadapalli and S. V. Gangashetty, "An investigation of recurrent neural network architectures using word embeddings for phrase break prediction," in Proc. INTERSPEECH, 2016.
[2]: V. Klimkov, A. Nadolski, A. Moinet, B. Putrycz, R. Barra-Chicote, T. Merritt, and T. Drugman, "Phrase break prediction for long-form reading TTS: Exploiting text structure information," in Proc. INTERSPEECH, 2017.
3. Phrase break prediction in Japanese
A phrase break is a phonetic pause inserted between phrases
• It occurs due to breathing and/or an intonational reset
• Inserting phrase breaks at suitable positions increases the naturalness of generated speech [3][4]
Without phrase breaks:
知らぬ間に自分名義で契約され届いたスマホを開封せず詐欺グループに転送させられる消費者被害が全国の国民生活センターに寄せられている
With phrase breaks:
知らぬ間に/自分名義で契約され、/届いたスマホを開封せず/詐欺グループに転送させられる/消費者被害が、/全国の国民生活センターに寄せられている
(Roughly: "Reports of consumer harm, in which a smartphone contracted in one's name without one's knowledge is delivered and then forwarded unopened to a fraud group, are being filed with consumer affairs centers across the country.")
[3] T. Fujimoto, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis," in Proc. SSW, 2019, pp. 166–171.
[4] K. Kurihara, N. Seiyama, and T. Kumano, "Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS," IEICE Transactions on Information and Systems, vol. E104.D, no. 2, pp. 302–311, 2021.
4. Goal: Predict suitable positions of phrase breaks
Predict the positions of both superficial and latent phrase breaks from the input text
• Phrase breaks appear not only after punctuation, so a rule-based approach does not work well [5]
The task can be formulated as sequence labeling (a labeled example follows reference [5] below)
• Each token is labeled as a phrase break <BR> or a non-phrase break <NB>
[Diagram: a phrase break predictor with parameters θ maps an input token sequence X (知ら / ぬ / 間 / に / 自分 / …) to an output label sequence Y of <NB>/<BR> tags.]
[5] P. Taylor, Text-to-Speech Synthesis. New York: Cambridge University Press, 2009.
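To make the formulation concrete, a minimal sketch of one labeled training pair, built from the example sentence on the previous slide (the break falls after 知らぬ間に); the exact tokenization and label assignment here are illustrative, not necessarily the ones used in the paper:

```python
# One (X, Y) training pair for the sequence labeling task.
# X: token sequence, Y: one label per token (<BR> = phrase break, <NB> = no break).
X = ["知ら", "ぬ", "間", "に", "自分", "名義", "で"]
Y = ["<NB>", "<NB>", "<NB>", "<BR>", "<NB>", "<NB>", "<NB>"]

assert len(X) == len(Y)  # the predictor emits exactly one label per input token
```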
5. Previous methods
Bidirectional LSTMs have been used with additional linguistic features to improve performance even with small data:
• Part-of-speech (POS) tags
• Dependency tree features
• Character embeddings
• Unsupervised pre-trained representations: word and/or sentence embeddings [6, 7]
However, such pre-trained representations provide only weak context:
• A pre-trained word embedding contains semantic information for a single word only
• A language model lacks sufficient contextual information for much longer sequences
BERT, in contrast, has been used to capture complex semantic structure in many NLP tasks
[6] A. Vadapalli and S. V. Gangashetty, "An investigation of recurrent neural network architectures using word embeddings for phrase break prediction," in Proc. INTERSPEECH, 2016, pp. 2308–2312.
[7] Y. Zheng, J. Tao, Z. Wen, and Y. Li, "BLSTM-CRF based end-to-end prosodic boundary prediction with context sensitive embeddings in a text-to-speech front-end," in Proc. INTERSPEECH, 2018, pp. 47–51.
6. Overview: Proposed method
[Architecture diagram, as in the Summary slide: pre-processing → word/feature/subword sequences → embeddings → BiLSTM (explicit features) and BERT (implicit features) → concat → MLP → output phrase break information.]
Use both BiLSTM and BERT [8] models
• Explicit features from the BiLSTM
• Implicit features from BERT
Use both labeled and unlabeled data
• Labeled data to model the relations of phrase breaks
• Unlabeled data to pre-train the embeddings and BERT
Combining explicit and implicit features yields much richer representations (a sketch of this fusion follows reference [8] below)
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL-HLT, 2019, pp. 4171–4186.
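A minimal PyTorch sketch of this fusion. The hidden sizes, the subword-to-word alignment (here simply the first subword of each word), and the BERT interface (a Hugging Face-style module returning `last_hidden_state`) are assumptions not specified in the slides:

```python
import torch
import torch.nn as nn

class PhraseBreakPredictor(nn.Module):
    """Sketch: fuse explicit (BiLSTM) and implicit (BERT) features, then classify
    each word as <NB> (0) or <BR> (1). All dimensions are illustrative."""

    def __init__(self, bert, vocab_size, feat_size, emb_dim=128, feat_dim=32,
                 lstm_dim=256, bert_dim=768, num_labels=2):
        super().__init__()
        self.bert = bert                                   # pre-trained Japanese BERT (placeholder)
        self.word_emb = nn.Embedding(vocab_size, emb_dim)  # word sequence
        self.feat_emb = nn.Embedding(feat_size, feat_dim)  # linguistic features (POS/DEP)
        self.bilstm = nn.LSTM(emb_dim + feat_dim, lstm_dim,
                              batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * lstm_dim + bert_dim, 256), nn.ReLU(),
            nn.Linear(256, num_labels))

    def forward(self, word_ids, feat_ids, subword_ids, subword_to_word):
        # Explicit features: embed words + linguistic features, run the BiLSTM.
        x = torch.cat([self.word_emb(word_ids), self.feat_emb(feat_ids)], dim=-1)
        explicit, _ = self.bilstm(x)                          # (B, T_word, 2*lstm_dim)

        # Implicit features: BERT over subwords, then gather one vector per word
        # (here: the first subword of each word; the paper's alignment may differ).
        bert_out = self.bert(subword_ids).last_hidden_state  # (B, T_subword, bert_dim)
        idx = subword_to_word.unsqueeze(-1).expand(-1, -1, bert_out.size(-1))
        implicit = bert_out.gather(1, idx)                    # (B, T_word, bert_dim)

        # Concatenate both feature sets and predict per-word <NB>/<BR> logits.
        return self.mlp(torch.cat([explicit, implicit], dim=-1))
```

Note that in the proposed method the BERT representation is a trainable weighted average over all layers (slide 9), not just the last hidden state used in this simplified sketch.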
7. Pre-processing
[Architecture diagram, as in the Summary slide.]
The input Japanese text is first pre-processed:
• Text normalization
• Tokenization
• Grapheme-to-phoneme (G2P) conversion
• Part-of-speech (POS) tagging
• Dependency (DEP) parsing
The POS tags and DEP features are fed into the BiLSTM
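A minimal sketch of this pre-processing as a data flow; the slides do not name the tools used, so the normalizer, tokenizer, G2P, taggers, and subword tokenizer are passed in as placeholders rather than tied to any specific library:

```python
from typing import Callable, Dict, List, Sequence

def preprocess(raw_text: str,
               normalize: Callable[[str], str],
               tokenize: Callable[[str], List[str]],
               g2p: Callable[[Sequence[str]], List[str]],
               pos_tag: Callable[[Sequence[str]], List[str]],
               dep_parse: Callable[[Sequence[str]], List[str]],
               subword_tokenize: Callable[[str], List[str]]) -> Dict[str, list]:
    """Run the pre-processing steps listed above; the concrete components are
    injected, since the slides do not specify which tools were used."""
    text = normalize(raw_text)
    words = tokenize(text)                                   # word sequence (BiLSTM branch)
    phonemes = g2p(words)                                    # phoneme sequence (later used as TTS input)
    features = list(zip(pos_tag(words), dep_parse(words)))   # POS + DEP feature per word
    subwords = subword_tokenize(text)                        # subword sequence (BERT branch)
    return {"words": words, "features": features,
            "subwords": subwords, "phonemes": phonemes}
```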
8. Explicit features
Explicit features are extracted from the BiLSTM
• They capture long-term dependencies in the source sequence directly
• They help the implicit features form much richer representations
Linguistic features are used so that the model works even with small data:
• Part-of-speech (POS) tags
• Dependency (DEP) parsing structure
• Character embeddings
• Word embeddings
[Architecture diagram, as in the Summary slide.]
9. Implicit features
Implicit features are extracted from BERT
• They provide a better representation than simple word or sentence embeddings such as W2V or LM features
• They learn rich syntactic and semantic information that cannot be captured by adding the BiLSTM alone
A weighted average of all BERT layers is used (a sketch follows reference [9] below)
• The weights are trainable parameters
• Different layers hold different types of information [9]
• Preliminary experiments showed this is much more beneficial than using only the last layer of BERT
[Architecture diagram, as in the Summary slide.]
[9] A. Rogers, O. Kovaleva, and A. Rumshisky, "A primer in BERTology: What we know about how BERT works," Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020.
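A minimal sketch of such trainable layer weighting (an ELMo-style scalar mix). The softmax normalization of the weights is an assumption; the slides only state that the weights are trainable:

```python
import torch
import torch.nn as nn

class LayerWeightedBERT(nn.Module):
    """Combine all BERT hidden layers with trainable softmax weights
    instead of using only the last layer."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one trainable weight per layer

    def forward(self, hidden_states):
        # hidden_states: sequence of (B, T, H) tensors, one per BERT layer
        # (e.g. outputs.hidden_states when output_hidden_states=True).
        stacked = torch.stack(list(hidden_states), dim=0)         # (L, B, T, H)
        weights = torch.softmax(self.layer_logits, dim=0)         # (L,)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, T, H)
```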
10. Objective evaluation
Compared the performance of the different systems in terms of F1 score
Dataset: 99,907 transcribed utterances of a single Japanese speaker
• 98,907/500/500 utterances for train/val/test
• Word transitions with a silence of more than 200 ms were marked as phrase breaks (a labeling sketch follows below)
Word embeddings and BERT were pre-trained on Japanese Wikipedia articles
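A minimal sketch of how such reference labels could be derived; the (word, following-silence) input format is an assumption, only the 200 ms threshold comes from the slide:

```python
# Derive <BR>/<NB> reference labels from aligned (word, silence_after_sec) pairs.
# The alignment format is hypothetical; the 200 ms threshold is from the slide.
def label_breaks(aligned_words, threshold_sec=0.2):
    return [(word, "<BR>" if silence_after > threshold_sec else "<NB>")
            for word, silence_after in aligned_words]

# Example usage with made-up silence durations:
print(label_breaks([("に", 0.35), ("自分", 0.02)]))
# -> [('に', '<BR>'), ('自分', '<NB>')]
```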
BiLSTM (Tokens): Conventional BiLSTM-based method that uses only the source token sequence [10]
BiLSTM (Features): BiLSTM (Tokens) with added hand-designed linguistic features [11], including POS tags, dependency tree features (DEP), and pre-trained word embeddings
BERT: Method that uses only the BERT model
BiLSTM (Tokens) + BERT: Method that combines BiLSTM (Tokens) and BERT
BiLSTM (Features) + BERT: The proposed method, which combines BiLSTM (Features) and BERT
[10] A. Vadapalli and S. V. Gangashetty, "An investigation of recurrent neural network architectures using word embeddings for phrase break prediction," in Proc. INTERSPEECH, 2016, pp. 2308–2312.
[11] V. Klimkov, A. Nadolski, A. Moinet, B. Putrycz, R. Barra-Chicote, T. Merritt, and T. Drugman, "Phrase break prediction for long-form reading TTS: Exploiting text structure information," in Proc. INTERSPEECH, 2017, pp. 1064–1068.
11. Experimental results of objective evaluation
The proposed method obtains absolute improvements of 4.7 points and 3.2 points over BiLSTM (Tokens) and BiLSTM (Features), respectively
• Implicit features extracted from BERT are much more effective than explicit features
Combining explicit and implicit features is also effective
• BiLSTM (Tokens) + BERT = BERT < BiLSTM (Features) + BERT
Model | Input features | True positives | False negatives | False positives | F1 | Precision | Recall
BiLSTM (Tokens) | Tokens | 653 | 114 | 53 | 88.7 | 92.5 | 85.1
BiLSTM (Features) | Tokens+POS+DEP+W2V | 676 | 91 | 56 | 90.2 | 92.3 | 88.1
BERT | Tokens | 688 | 79 | 39 | 92.1 | 94.6 | 89.7
BiLSTM (Tokens) + BERT | Tokens | 690 | 77 | 42 | 92.1 | 94.3 | 90.0
BiLSTM (Features) + BERT | Tokens+POS+DEP+W2V | 709 | 58 | 43 | 93.4 | 94.3 | 92.4
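The reported precision, recall, and F1 values follow directly from the counts in the table; as a quick check for the proposed system:

```python
# Recompute precision, recall, and F1 for BiLSTM (Features) + BERT from the table.
tp, fn, fp = 709, 58, 43
precision = tp / (tp + fp)                          # 709 / 752 ≈ 0.943
recall = tp / (tp + fn)                             # 709 / 767 ≈ 0.924
f1 = 2 * precision * recall / (precision + recall)
print(round(100 * precision, 1), round(100 * recall, 1), round(100 * f1, 1))
# -> 94.3 92.4 93.4  (matches the table)
```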
12. Subjective evaluation
Evaluated the prosody naturalness of the different systems with perceptual listening tests
• MOS test: rate quality on a scale from 1 (Bad) to 5 (Excellent)
• AB preference test: choose whichever of audio samples A and B sounds more natural
Reference (Natural): Recorded speech from the test set
Reference (TTS): Synthesized speech from the test set (phrase break positions identical to the recorded speech)
Rule-based: A rule-based method that inserts phrase breaks only after punctuation
BiLSTM (Tokens): Conventional BiLSTM-based method that uses only the source token sequence
BiLSTM (Features): BiLSTM (Tokens) with added hand-designed linguistic features, including POS tags, dependency tree features (DEP), and pre-trained word embeddings
BERT: Method that uses only the BERT model
BiLSTM (Features) + BERT: The proposed method, which combines BiLSTM (Features) and BERT
13. Experimental conditions
Training data
• Same corpus as used in the objective evaluation
• 118.9 hours, 99,907 utterances
• Single professional Japanese speaker
TTS conditions
• Acoustic model: FastSpeech 2 [12]
• Neural vocoder: Parallel WaveGAN [13]
TTS inputs
• Predicted phrase breaks along with the phoneme sequence of the input text
Evaluation procedure
• 29 native Japanese speakers
• MOS test
§ Randomly selected 30 utterances for each system
§ 30 utterances × 7 systems = 210 in total
• AB preference test
§ The same utterances as used for the MOS test
§ 30 utterances × 2 systems × 6 pairs = 360 in total
[12] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," in Proc. ICLR, 2021.
[13] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. ICASSP, 2020, pp. 6199–6203.
14. MOS results on prosody naturalness
The proposed method achieved almost the same quality as Reference (TTS)
• Significant improvement compared with the conventional methods
• Using both explicit and implicit features is effective
• BERT < BiLSTM (Features) + BERT
System | Model | MOS
1 | Reference (Natural) | 4.70 ± 0.05
2 | Reference (TTS) | 4.37 ± 0.05
3 | Rule-based | 3.62 ± 0.07
4 | BiLSTM (Tokens) | 3.80 ± 0.07
5 | BiLSTM (Features) | 4.05 ± 0.06
6 | BERT | 4.26 ± 0.05
7 | BiLSTM (Features) + BERT | 4.39 ± 0.05
15. AB preference results on prosody naturalness
Model A | Model B | Preference A | Preference B | Neutral | Binomial test
Rule-based | BiLSTM (Tokens) | 125 | 335 | 410 | ***
BiLSTM (Tokens) | BiLSTM (Features) | 172 | 279 | 419 | ***
BiLSTM (Features) | BERT | 138 | 329 | 403 | ***
BERT | BiLSTM (Features) + BERT | 144 | 238 | 488 | ***
BERT | Reference (TTS) | 156 | 269 | 445 | ***
BiLSTM (Features) + BERT | Reference (TTS) | 151 | 156 | 563 | n.s.
There is a significant difference between the proposed method and BERT
• Significant differences for five of the six pairs
• No significant difference between BiLSTM (Features) + BERT and Reference (TTS)
The proposed method therefore achieves the same quality as Reference (TTS)
***: statistically significant difference at the 1% level (one-tailed binomial test; a sketch of such a test follows below)
n.s.: no significant difference
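A minimal sketch of how such a one-tailed binomial test could be run on the preference counts; excluding the neutral responses is an assumption, since the slides do not state how they were handled:

```python
from scipy.stats import binomtest

# Example: Rule-based (125) vs. BiLSTM (Tokens) (335), neutral responses excluded.
# Test whether BiLSTM (Tokens) is preferred more often than chance (p = 0.5).
result = binomtest(k=335, n=125 + 335, p=0.5, alternative="greater")
print(result.pvalue)   # far below 0.01 -> significant at the 1% level
```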
17. Conclusion
Proposed a phrase break prediction method that combines BERT and BiLSTM
• Improves F1 score by 3.2 points over previous models
• Achieves a MOS of 4.39, which is competitive with speech synthesized using ground-truth phrase breaks
The naturalness of prosody is improved by
1. Implicit features from BERT
2. Explicit features from the BiLSTM with additional linguistic inputs
Future work
• Modeling the duration of phrase breaks to further improve the naturalness of the prosody