Phrase break prediction with bidirectional encoder representations in Japanese TTS
Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana
Summary
[Architecture diagram: raw text is pre-processed into a word sequence and a feature sequence, which are embedded and fed to a BiLSTM (explicit features), and a subword sequence, which is fed to BERT (implicit features); the concatenated features go through an MLP that outputs the phrase break information.]
• Proposed a phrase break prediction model that combines BERT and BiLSTM
• Achieves an absolute improvement of 3.2 points in F1 score compared with previous models [1][2]
• Achieves a MOS of 4.39 in prosody naturalness, which is competitive with speech synthesized from ground-truth phrase breaks
[1] A. Vadapalli and S. V. Gangashetty, “An investigation of recurrent neural network architectures using word embeddings for phrase break prediction,” in Proc. INTERSPEECH, 2016.
[2] V. Klimkov, A. Nadolski, A. Moinet, B. Putrycz, R. Barra-Chicote, T. Merritt, and T. Drugman, “Phrase break prediction for long-form reading TTS: Exploiting text structure information,” in Proc. INTERSPEECH, 2017.
Phrase break prediction in Japanese
A phrase break is a phonetic pause inserted between phrases
• It occurs due to breathing and/or an intonational reset
• Inserting phrase breaks at suitable positions increases the naturalness of generated speech [3, 4]
Without phrase breaks:
知らぬ間に自分名義で契約され届いたスマホを開封せず詐欺グループに転送させられる消費者被害が全国の国民生活センターに寄せられている
With phrase breaks:
知らぬ間に/自分名義で契約され、/届いたスマホを開封せず/詐欺グループに転送させられる/消費者被害が、/全国の国民生活センターに寄せられている
(Roughly: consumer complaints about people who, without realizing it, have contracts made in their name and are made to forward the delivered smartphones, unopened, to fraud groups, are being reported to consumer affairs centers nationwide.)
[3] T. Fujimoto, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis,” in Proc. SSW, 2019, pp. 166–171.
[4] K. Kurihara, N. Seiyama, and T. Kumano, “Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS,” IEICE Transactions on Information and Systems, vol. E104.D, no. 2, pp. 302–311, 2021.
Goal: Predict suitable positions of phrase breaks
Predict the positions of both superficial and latent phrase breaks from the input text
• Phrase breaks appear not only after punctuation, so a rule-based approach does not work well [5]
The task can be formulated as sequence labeling
• Each token is labeled as a phrase break <BR> or a non-phrase break <NB>
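One compact way to state this formulation is given below (a sketch: the slide itself only names the label set; the notation X, Y, θ follows the figure below).

```latex
\hat{Y} \;=\; \operatorname*{arg\,max}_{Y \in \{\mathrm{BR},\, \mathrm{NB}\}^{T}} P\!\left(Y \mid X; \theta\right),
\qquad X = (x_1, \dots, x_T)
```

Here X is the input token sequence of length T, Y assigns one label per token, and θ denotes the parameters of the phrase break predictor.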
[Figure: a phrase break predictor with parameters θ maps the input token sequence X = 知ら, ぬ, 間, に, 自分, … to the output label sequence Y of <BR>/<NB> tags.]
[5] P. Taylor, Text-to-Speech Synthesis. New York: Cambridge University Press, 2009.
Previous methods
A bidirectional LSTM has been used with additional linguistic features to improve performance even with small training data:
• Part-of-speech (POS) tags
• Dependency tree features
• Character embeddings
• Unsupervised pre-trained representations, i.e., word and/or sentence embeddings [6, 7]
However, such pre-trained representations capture context only weakly:
• A pre-trained word embedding contains semantic information for only a single word
• A language model lacks sufficient contextual information for much longer sequences
In contrast, the BERT model has been used to capture complex semantic structure in many NLP tasks
[6] A. Vadapalli and S. V. Gangashetty, “An investigation of recurrent neural network architectures using word embeddings for phrase break prediction,” in Proc. INTERSPEECH, 2016, pp. 2308–2312.
[7] Y. Zheng, J. Tao, Z. Wen, and Y. Li, “BLSTM-CRF based end-to-end prosodic boundary prediction with context sensitive embeddings in a text-to-speech front-end,” in Proc. INTERSPEECH, 2018, pp. 47–51.
Overview: Proposed method
[Architecture diagram repeated from the Summary slide: pre-processing → BiLSTM (explicit features) and BERT (implicit features) → concatenation → MLP → phrase break information.]
Use both BiLSTM and BERT [8] models
• Explicit features from the BiLSTM
• Implicit features from BERT
Use both labeled data and unlabeled data
• Labeled data to model the positions of phrase breaks
• Unlabeled data to pre-train the embeddings and BERT
Combining explicit and implicit features yields much richer representations; a sketch of this design follows.
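As a rough illustration, the PyTorch sketch below wires an explicit BiLSTM branch over token/POS/DEP embeddings and precomputed implicit BERT features into a shared MLP classifier. All layer sizes, the feature set, and the omission of subword-to-word alignment are simplifying assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the combined model: explicit (BiLSTM) + implicit (BERT) features
# concatenated per token and classified into <NB>/<BR> by an MLP.
import torch
import torch.nn as nn


class PhraseBreakPredictor(nn.Module):
    def __init__(self, vocab_size, pos_size, dep_size, bert_dim=768,
                 emb_dim=64, lstm_dim=128, num_labels=2):
        super().__init__()
        # Explicit branch: embeddings for tokens and linguistic features -> BiLSTM.
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(pos_size, emb_dim)
        self.dep_emb = nn.Embedding(dep_size, emb_dim)
        self.bilstm = nn.LSTM(3 * emb_dim, lstm_dim,
                              batch_first=True, bidirectional=True)
        # MLP over concatenated explicit (2 * lstm_dim) and implicit (bert_dim) features.
        self.mlp = nn.Sequential(
            nn.Linear(2 * lstm_dim + bert_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, tokens, pos, dep, bert_features):
        # tokens/pos/dep: (batch, seq_len) ids; bert_features: (batch, seq_len, bert_dim),
        # assumed to be already aligned from subwords to words.
        x = torch.cat([self.tok_emb(tokens), self.pos_emb(pos), self.dep_emb(dep)], dim=-1)
        explicit, _ = self.bilstm(x)                         # (batch, seq_len, 2 * lstm_dim)
        combined = torch.cat([explicit, bert_features], dim=-1)
        return self.mlp(combined)                            # (batch, seq_len, num_labels)


# Smoke test with random inputs.
model = PhraseBreakPredictor(vocab_size=1000, pos_size=20, dep_size=40)
logits = model(torch.randint(0, 1000, (2, 10)),
               torch.randint(0, 20, (2, 10)),
               torch.randint(0, 40, (2, 10)),
               torch.randn(2, 10, 768))
print(logits.shape)  # torch.Size([2, 10, 2])
```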
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
Pre-processing
[Architecture diagram repeated from the Summary slide.]
The input Japanese text is first pre-processed:
• Text normalization
• Tokenization
• Grapheme-to-phoneme (G2P) conversion
• Part-of-speech (POS) tagging
• Dependency (DEP) parsing
The POS tags and DEP features are fed into the BiLSTM; a sketch of the pipeline follows.
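A minimal sketch of this pipeline is shown below. The paper does not name its tools, so fugashi (a MeCab wrapper, with unidic-lite) and pyopenjtalk are used purely as stand-ins for tokenization/POS tagging and G2P; normalization is reduced to NFKC and dependency parsing is only indicated by a comment.

```python
# Illustrative pre-processing pipeline for raw Japanese text (tool choices are assumptions).
import unicodedata

import fugashi        # pip install fugashi unidic-lite
import pyopenjtalk    # pip install pyopenjtalk

tagger = fugashi.Tagger()

def preprocess(raw_text: str):
    # 1. Text normalization (e.g., unify full-width/half-width characters).
    text = unicodedata.normalize("NFKC", raw_text)
    # 2. Tokenization and POS tagging -> word sequence and feature sequence.
    words = [(w.surface, w.feature.pos1) for w in tagger(text)]
    # 3. Grapheme-to-phoneme (G2P) conversion for the TTS front-end.
    phonemes = pyopenjtalk.g2p(text)
    # 4. Dependency (DEP) parsing would add one head/relation feature per word (omitted here).
    return words, phonemes

words, phonemes = preprocess("知らぬ間に自分名義で契約された")
print(words[:3])
print(phonemes)
```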
Explicit features
Explicit features are extracted by the BiLSTM
• They capture long-term dependencies in the source sequence directly
• They help the implicit features form much richer representations
Designed linguistic features are used so that the model works even with small data:
• Part-of-speech (POS) tags
• Dependency (DEP) parsing structure
• Character embeddings
• Word embeddings
[Architecture diagram repeated from the Summary slide.]
Implicit features
Implicit features are extracted by BERT
• They provide a better representation than a simple word or sentence embedding such as W2V or an LM
• They learn rich syntactic and semantic information that cannot be captured by only adding a BiLSTM
A weighted average of all of the BERT layers is used
• The weights are trainable parameters
• Different layers carry different types of information [9]
• Preliminary experiments showed that this is much more beneficial than using only the last layer of BERT (a sketch of the layer combination follows)
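The sketch below shows one way to implement such a trainable layer combination. Normalizing the layer weights with a softmax (an ELMo-style scalar mix) is an assumption; the slide only states that the weights are trainable.

```python
# Trainable weighted average over all BERT layers (dimensions are illustrative).
import torch
import torch.nn as nn


class LayerWeightedSum(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One trainable scalar weight per BERT layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: (num_layers, batch, seq_len, hidden), e.g., obtained with
        # output_hidden_states=True in HuggingFace transformers.
        w = torch.softmax(self.layer_weights, dim=0)          # (num_layers,)
        return torch.einsum("l,lbsh->bsh", w, hidden_states)  # (batch, seq_len, hidden)


layers = torch.randn(13, 2, 10, 768)  # 12 transformer layers + the embedding layer
implicit = LayerWeightedSum(13)(layers)
print(implicit.shape)                 # torch.Size([2, 10, 768])
```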
[Architecture diagram repeated from the Summary slide.]
[9] A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020.
Objective evaluation
Compared the performance of different systems in terms of F1 score
Dataset: 99,907 transcribed utterances of a single Japanese speaker
• 98,807/500/500 utterances for train/val/test
• Word transitions with a silence of more than 200 ms were marked as phrase breaks (illustrated below)
Word embeddings and BERT were pre-trained on Japanese Wikipedia articles
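A toy illustration of the 200 ms labeling criterion is shown below; the per-word silence durations are a hypothetical stand-in for whatever alignment the authors used.

```python
# Derive <BR>/<NB> labels from the silence following each word (threshold: 200 ms).
def label_breaks(words, silence_after_ms, threshold_ms=200):
    return [(w, "<BR>" if sil > threshold_ms else "<NB>")
            for w, sil in zip(words, silence_after_ms)]

print(label_breaks(["知ら", "ぬ", "間", "に", "自分"], [0, 0, 0, 350, 0]))
# [('知ら', '<NB>'), ('ぬ', '<NB>'), ('間', '<NB>'), ('に', '<BR>'), ('自分', '<NB>')]
```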
BiLSTM (Tokens): Conventional BiLSTM-based method that uses only the source sequence [10]
BiLSTM (Features): BiLSTM (Tokens)-based method that adds designed linguistic features [11], including POS tags, dependency tree features (DEP), and pre-trained word embeddings
BERT: Method that uses the BERT model
BiLSTM (Tokens) + BERT: Method that combines BiLSTM (Tokens) and BERT
BiLSTM (Features) + BERT: The proposed method that combines BiLSTM (Features) and BERT
[10] A. Vadapalli and S. V. Gangashetty, “An investigation of recurrent neural network architectures using word embeddings for phrase break prediction,” in Proc. INTERSPEECH, 2016, pp. 2308–2312.
[11] V. Klimkov, A. Nadolski, A. Moinet, B. Putrycz, R. Barra-Chicote, T. Merritt, and T. Drugman, “Phrase break prediction for long-form reading TTS: Exploiting text structure information,” in Proc. INTERSPEECH, 2017, pp. 1064–1068.
Experimental results of objective evaluation
The proposed method obtains absolute improvements of 4.7 points and 3.2 points over BiLSTM (Tokens) and BiLSTM (Features), respectively
• Implicit features extracted from BERT are much more effective than explicit features
Combining explicit and implicit features is also effective
• BiLSTM (Tokens) + BERT = BERT < BiLSTM (Features) + BERT
Model | Input features | True positives | False negatives | False positives | F1 | Precision | Recall
BiLSTM (Tokens) | Tokens | 653 | 114 | 53 | 88.7 | 92.5 | 85.1
BiLSTM (Features) | Tokens+POS+DEP+W2V | 676 | 91 | 56 | 90.2 | 92.3 | 88.1
BERT | Tokens | 688 | 79 | 39 | 92.1 | 94.6 | 89.7
BiLSTM (Tokens) + BERT | Tokens | 690 | 77 | 42 | 92.1 | 94.3 | 90.0
BiLSTM (Features) + BERT | Tokens+POS+DEP+W2V | 709 | 58 | 43 | 93.4 | 94.3 | 92.4
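As a sanity check, the F1, precision, and recall values in the table follow directly from the reported counts (false negatives are missed breaks, false positives are spuriously inserted breaks):

```python
# Reproduce F1/precision/recall (in %) from true positives, false negatives, false positives.
def prf(tp, fn, fp):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(f1 * 100, 1), round(precision * 100, 1), round(recall * 100, 1)

print(prf(653, 114, 53))  # (88.7, 92.5, 85.1) -> BiLSTM (Tokens)
print(prf(709, 58, 43))   # (93.4, 94.3, 92.4) -> BiLSTM (Features) + BERT
```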
Subjective evaluation
Evaluated the prosody naturalness of the different systems with perceptual listening tests
• MOS test: quality judgments on a scale from 1 (Bad) to 5 (Excellent)
• AB preference test: choose whichever of the audio samples A or B sounds more natural
Reference (Natural): Recorded speech in the test set
Reference (TTS): Synthesized speech from the test set (the positions of phrase breaks are the same as in the recorded speech)
Rule-based: A rule-based method that inserts phrase breaks only after punctuation
BiLSTM (Tokens): Conventional BiLSTM-based method that uses only the source sequence
BiLSTM (Features): BiLSTM (Tokens)-based method that adds designed linguistic features, including POS tags, dependency tree features (DEP), and pre-trained word embeddings
BERT: Method that uses the BERT model
BiLSTM (Features) + BERT: The proposed method that combines BiLSTM (Features) and BERT
Experimental conditions
Training data
• Same corpus used in the objective evaluation
• 118.9 hours, 99,907 utterances
• Single professional Japanese speaker
TTS conditions
• Acoustic model: FastSpeech 2 [12]
• Neural vocoder: Parallel WaveGAN [13]
TTS inputs
• Predicted phrase breaks along with the phoneme sequence of the input text
Evaluation procedure
• 29 native Japanese speakers
• MOS test
  - Randomly selected 30 utterances for each system
  - 30 utterances × 7 systems = 210 in total
• AB preference test
  - The same utterances as used for the MOS test
  - 30 utterances × 12 systems = 360 in total
[12] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proc. ICLR, 2021.
[13] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. ICASSP, 2020, pp. 6199–6203.
MOS results on prosody naturalness
The proposed method achieved almost the same quality as Reference (TTS)
• Significant improvement compared with the conventional methods
• Using both explicit and implicit features is effective
• BERT < BiLSTM (Features) + BERT
System | Model | MOS
1 | Reference (Natural) | 4.70 ± 0.05
2 | Reference (TTS) | 4.37 ± 0.05
3 | Rule-based | 3.62 ± 0.07
4 | BiLSTM (Tokens) | 3.80 ± 0.07
5 | BiLSTM (Features) | 4.05 ± 0.06
6 | BERT | 4.26 ± 0.05
7 | BiLSTM (Features) + BERT | 4.39 ± 0.05
AB preference results on prosody naturalness
Model A | Model B | Preference A | Preference B | Neutral | Binomial test
Rule-based | BiLSTM (Tokens) | 125 | 335 | 410 | ***
BiLSTM (Tokens) | BiLSTM (Features) | 172 | 279 | 419 | ***
BiLSTM (Features) | BERT | 138 | 329 | 403 | ***
BERT | BiLSTM (Features) + BERT | 144 | 238 | 488 | ***
BERT | Reference (TTS) | 156 | 269 | 445 | ***
BiLSTM (Features) + BERT | Reference (TTS) | 151 | 156 | 563 | n.s.
There is a significant difference between the proposed method and BERT
• Significant differences for five of the six pairs
• No significant difference between BiLSTM (Features) + BERT and Reference (TTS)
The proposed method achieves the same quality as Reference (TTS)
*** : statistically significant difference at the 1% level as a result of a one-tailed binomial test
n.s. : no significant difference
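The significance column can be reproduced with a one-tailed binomial test on the preference counts. Excluding the neutral responses, as done below, is an assumption; the slide does not state how neutrals were handled.

```python
# One-tailed binomial test on non-neutral AB preferences (SciPy >= 1.7).
from scipy.stats import binomtest

# BiLSTM (Features) + BERT vs. Reference (TTS): 151 vs. 156 preferences.
print(binomtest(156, n=151 + 156, p=0.5, alternative="greater").pvalue)  # well above 0.01 -> n.s.

# Rule-based vs. BiLSTM (Tokens): 125 vs. 335 preferences.
print(binomtest(335, n=125 + 335, p=0.5, alternative="greater").pvalue)  # far below 0.01 -> ***
```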
Audio samples used for subjective evaluation
Reference (Natural)
Reference (TTS)
Rule-based
BiLSTM (Tokens)
BiLSTM (Features)
BERT
BiLSTM (Features) + BERT
Conclusion
Proposed a phrase break prediction method that combines BERT and BiLSTM
• Improves the F1 score by 3.2 points (absolute) compared with previous models
• Achieves a MOS of 4.39, which is competitive with speech synthesized from ground-truth phrase breaks
The naturalness of prosody is improved by
1. Implicit features from BERT
2. Explicit features plus linguistic features from the BiLSTM
Future work
• Modeling the duration of phrase breaks to further improve the naturalness of the prosody