ICASSP 2020 Speech & Acoustics Reading Group: Mellotron

Copyright (C) 2020 DeNA Co.,Ltd. All Rights Reserved.
Jun. 19, 2020
Kentaro Tachibana
DeNA Co., Ltd.
MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY
CONDITIONING ON RHYTHM, PITCH AND GLOBAL STYLE TOKENS

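Self-introduction
 Name
⁃ Kentaro Tachibana
 Career
⁃ 2008: M.S., Nara Institute of Science and Technology (NAIST)
⁃ 2008 to 2017: Toshiba Research and Development Center
⁃ 2014 to Sep. 2017: Seconded to the National Institute of Information and Communications Technology (NICT)
⁃ Oct. 2017 to present: DeNA
 Specialty
⁃ Voice conversion and speech synthesis
https://twitter.com/KentaroTachiba
https://www.slideshare.net/KentaroTachibana1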

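Featured paper
 MELLOTRON: MULTISPEAKER EXPRESSIVE VOICE SYNTHESIS BY CONDITIONING ON
RHYTHM, PITCH AND GLOBAL STYLE TOKENS
⁃ Rafael Valle, Jason Li, Ryan Prenger, Bryan Catanzaro (NVIDIA Corporation)
 Overview
⁃ A variant of end-to-end speech synthesis; not pure TTS
⁃ Derived from Tacotron2, an attention-based encoder-decoder algorithm
⁃ Takes reference speech as input in addition to text, which improves expressiveness
⁃ Can synthesize emotional speech and singing voice even when the training data contains only read speech
⁃ Achieves all of this for multiple speakers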

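Overview of the featured paper
 Conventional speech synthesis
⁃ Read-speech synthesis
• Input: text
• Training data: read speech
⁃ Emotional speech synthesis
• Input: text
• Training data: emotional speech
⁃ Singing voice synthesis
• Input: MusicXML or MIDI
• Training data: singing voice
 Mellotron
⁃ Read-speech, emotional, and singing voice synthesis
• Input: text or MusicXML + reference speech
• Training data: read speech
A system that subsumes conventional speech synthesis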

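What is speech synthesis (text-to-speech; TTS)?
 A technology that converts text into speech
 Can generate speech for arbitrary text in a desired voice
Text "今日もいい天気だね" ("Nice weather again today") → TTS → Speech waveform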

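Modules that make up a speech synthesis system
Text → Text analyzer → Linguistic features → Acoustic model → Acoustic features → Vocoder → Speech waveform
 Text analyzer: extracts linguistic features such as reading and accent from Japanese text
⁃ Example: 朝の空気は爽やかです。 ("The morning air is refreshing.") → アサノ/クーキハ/サワヤカデス’.
 Acoustic model: converts linguistic features into acoustic features
⁃ Pitch (F0)
⁃ Voice quality (mel-cepstrum)
⁃ Hoarseness (band aperiodicity; bap)
 Vocoder: converts acoustic features into a speech waveform
 These features are designed by hand so that the models are easy to train
 Removing hand-designed features and generating the speech waveform directly from text = End-to-End TTS
⁃ Tacotron [Wang+, 2017][Shen+, 2018]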

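Tacotron1, 2
 Encoder-decoder end-to-end speech synthesis
⁃ A learning algorithm that maps a variable-length input sequence to a variable-length output sequence (also called seq2seq)
⁃ No explicit alignment is required
 Tacotron-related papers
⁃ Tacotron: Towards End-to-End Speech Synthesis [Wang+, 2017]
⁃ Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [Shen+, 2018]
⁃ Derived methods such as Tacotron2 GST [Wang+, 2018] and Mellotron have been proposed
 Tacotron2 in particular is very high quality, reaching a level comparable to human speech
 This talk covers Tacotron2 and Tacotron2 GST, which form the basis of Mellotron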

Overview of Tacotron2
 The text-to-mel-spectrogram conversion and the mel-spectrogram-to-waveform conversion are trained separately
[Figure: "Hello, everyone!" → Embedding → Encoder → Attention layer → Decoder → Mel-spectrogram → WaveNet vocoder]

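Tacotron2
 During training
[Figure: Tacotron2 architecture]
⁃ Embedding + Encoder: Input Text ("Hello, everyone!") → Character Embedding → 3 Conv Layers → Bidirectional LSTM → text embedding (length = number of characters)
⁃ Attention: Location Sensitive Attention
⁃ Decoder: 2 Layer Pre-Net (embedding of the previous step's mel-spec.) → 2 LSTM Layers → Linear Projection → mel-spec., followed by a 5 Conv Layer Post-Net
⁃ A second Linear Projection predicts the Stop Token, which marks the end position
⁃ Losses: MSE loss on the predicted mel-spec., gate loss on the stop token
⁃ The mel-spec. is generated one frame at a time
⁃ Vocoder: WaveNet MoL → Waveform Samples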
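The two training losses named on this slide can be illustrated with a minimal sketch. It assumes the gate loss is a binary cross-entropy whose target is 1 only on the final frame, and it applies the MSE to a single mel-spec. prediction; the function names are hypothetical and this is not the authors' implementation.

```python
import math

def mse_loss(pred, target):
    """Mean squared error over all frames and mel bins."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def gate_loss(stop_logits, n_frames):
    """Binary cross-entropy on the stop token (assumed target: 1 on the last frame only)."""
    total = 0.0
    for i, logit in enumerate(stop_logits):
        target = 1.0 if i == n_frames - 1 else 0.0
        prob = 1.0 / (1.0 + math.exp(-logit))
        total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return total / len(stop_logits)

def tacotron2_loss(mel_pred, mel_target, stop_logits):
    """Sum of the mel-spec. MSE loss and the stop-token gate loss."""
    return mse_loss(mel_pred, mel_target) + gate_loss(stop_logits, len(mel_target))

mel_target = [[0.0, 1.0], [1.0, 0.0]]   # 2 frames x 2 mel bins (toy values)
mel_pred = [[0.1, 0.9], [0.8, 0.2]]
print(tacotron2_loss(mel_pred, mel_target, stop_logits=[-2.0, 2.0]))
```

In this toy setup the stop token is the only signal that tells the decoder when to stop emitting frames, which is why it gets its own loss term.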

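Decoder + attention
 Predicts the next mel-spec. frame from the previous step's mel-spec. and the embedding of the character sequence
 The attention shows which part of the character sequence each generated mel-spec. frame attends to
 The duration of each character can be read off the attention map
[Figure: attention map, generated one time step at a time. Axes: output mel-spec. sequence [frames] and input character sequence [characters]]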
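As a rough illustration of how character durations follow from an attention map, the sketch below assigns each output frame to the character with the largest attention weight and counts frames per character. The hard-argmax rule and the toy matrix are assumptions made for illustration.

```python
def char_durations(attention, n_chars):
    """attention[t][c] is the weight output frame t places on input character c."""
    durations = [0] * n_chars
    for frame_weights in attention:
        # Assign the frame to the character it attends to most strongly.
        focused = max(range(n_chars), key=lambda c: frame_weights[c])
        durations[focused] += 1
    return durations

# Toy attention map: 5 output frames x 3 input characters.
attention_map = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
print(char_durations(attention_map, 3))  # [2, 1, 2]
```

The durations sum to the number of output frames, so stretching or compressing them is one way to think about rhythm control.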

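Tacotron 2 GST
 Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech
Synthesis [Wang+, 2018]
 Introduces style tokens to control speaking rate and style
 Aimed at expressive applications such as audiobook narration
 Learns the expressive space from speech automatically, in a data-driven way
 At inference, a style is obtained either from reference speech or by sampling from the style space
[Figure label: Mel-spec.]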
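The idea of a style embedding built from style tokens can be sketched as an attention-weighted sum of token vectors. This is a simplified single-head, dot-product version with toy numbers; the function names and the scoring rule are assumptions, not the exact formulation in [Wang+, 2018].

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def style_embedding(reference_embedding, style_tokens):
    """Weight each style token by its similarity to the reference embedding."""
    scores = [sum(r * t for r, t in zip(reference_embedding, token))
              for token in style_tokens]
    weights = softmax(scores)
    dim = len(style_tokens[0])
    embedding = [sum(w * token[d] for w, token in zip(weights, style_tokens))
                 for d in range(dim)]
    return embedding, weights

tokens = [[1.0, 0.0], [0.0, 1.0]]            # two toy style tokens
embedding, weights = style_embedding([2.0, 0.0], tokens)
print(weights)    # the first token dominates
print(embedding)
```

Choosing the weights by hand instead of computing them from reference speech corresponds to picking a point in the style space directly.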

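Mellotron
 Issues with Tacotron 2 GST
⁃ The style space is learned automatically, so control is coarse-grained
⁃ Style cannot be specified arbitrarily
 Goal of Mellotron
⁃ Improve expressiveness through fine-grained control
 Method
⁃ Adds prosody information to Tacotron 2 GST for finer control of rhythm and pitch
⁃ Emotional speech and singing voice can be synthesized even from read speech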

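Mellotron: inputs
 Reproduces prosody precisely, based on the prosody of the reference speech
[Figure: Mellotron inputs and output]
⁃ Inputs: Text ("Hello, everyone!"), Speaker id (1), Pitch contour, Mel-spectrogram
⁃ Output speech
• Utterance: Hello, everyone!
• Voice quality: Speaker id2
• Prosody: input prosody
• Style: input mel-spec.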

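Mellotron: overall architecture
 Speaker id, F0, and mel-spec. are given as inputs for explicit control
[Figure: Mellotron architecture. Ref: https://tosaka-mn.hatenablog.com/entry/2020/04/29/160050]
⁃ Shared with Tacotron2: Input Text → Character Embedding → 3 Conv Layers → Bidirectional LSTM → text embedding (length = number of characters); Location Sensitive Attention; 2 Layer Pre-Net; 2 LSTM Layers; Linear Projections to mel-spec. and Stop Token; 5 Conv Layer Post-Net
⁃ Added in Tacotron 2 GST: Mel-spectrogram → Reference Encoder → Attention → style embedding
⁃ Added in Mellotron: F0 → 1 Layer F0 Pre-Net; Speaker id → Speaker Embedding → speaker embedding
⁃ The figure shows the style embedding, speaker embedding, and text embedding together along the character axis
⁃ Vocoder: WaveGlow → Waveform Samples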
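One plausible way to combine the three embeddings shown in the figure is to append the same speaker and style vectors to every character's text embedding. The sketch below assumes simple concatenation and uses a hypothetical function name and toy dimensions.

```python
def condition_encoder_outputs(text_embedding, speaker_embedding, style_embedding):
    """Append the utterance-level speaker and style vectors to each character vector."""
    return [list(char_vec) + list(speaker_embedding) + list(style_embedding)
            for char_vec in text_embedding]

text_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 characters x 2 dims (toy)
speaker_embedding = [1.0]                              # looked up from the speaker id
style_embedding = [0.7, 0.3]                           # from the reference encoder

conditioned = condition_encoder_outputs(text_embedding, speaker_embedding, style_embedding)
print(len(conditioned), len(conditioned[0]))  # 3 5
```

The sequence length stays equal to the number of characters, so the attention mechanism can be used unchanged while every position carries speaker and style information.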

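Mellotron: inference (1/2)
 Speech can be generated from reference speech alone
[Figure: reference speech → ASR → text ("Hello, everyone!"); reference speech → F0 extractor → pitch contour; Speaker id (1); all fed to Mellotron]
Feeding in a precomputed attention map also controls rhythm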

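Mellotron: inference (2/2)
 MusicXML input is also supported
⁃ An XML-based sheet music format (similar to MIDI)
⁃ Contains information such as lyrics, pitch, duration, and dynamics
[Figure: MusicXML → text ("Hello, everyone!"), Pitch contour, Attention map; together with Speaker id (1) → Mellotron]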
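To make the MusicXML-to-pitch-contour step concrete, here is a minimal sketch that reads `<note>` elements and emits one F0 value per frame. It assumes equal temperament with A4 = 440 Hz, treats rests as unvoiced (0 Hz), and uses a hypothetical `frames_per_division` parameter in place of real tempo handling; it is not how the Mellotron code base does it.

```python
import xml.etree.ElementTree as ET

SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def pitch_to_hz(step, octave, alter=0):
    """Convert a MusicXML pitch (step, octave, alter) to Hz; A4 is MIDI note 69."""
    midi = 12 * (octave + 1) + SEMITONES[step] + alter
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)

def pitch_contour(musicxml, frames_per_division):
    """Return one F0 value per frame; rests become 0.0 (unvoiced)."""
    contour = []
    for note in ET.fromstring(musicxml).iter("note"):
        n_frames = int(note.findtext("duration")) * frames_per_division
        pitch = note.find("pitch")
        if pitch is None:  # a rest has no <pitch> element
            f0 = 0.0
        else:
            f0 = pitch_to_hz(pitch.findtext("step"),
                             int(pitch.findtext("octave")),
                             float(pitch.findtext("alter") or 0))
        contour.extend([f0] * n_frames)
    return contour

fragment = """
<measure>
  <note><pitch><step>A</step><octave>4</octave></pitch><duration>2</duration></note>
  <note><rest/><duration>1</duration></note>
</measure>
"""
print(pitch_contour(fragment, frames_per_division=2))
```

The note durations obtained the same way could also be used to build the attention map that fixes the rhythm.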

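Experimental evaluation
 Training data
⁃ LibriTTS (train-clean-100): 53.78 h
⁃ LJSpeech: 24 h
⁃ Sally (NVIDIA internal?): 20 h
 Quantitative evaluation
⁃ Measures F0 error
⁃ Substantially outperforms the existing method E2E-Prosody [Skerry-Ryan+, 2018]
• Gross Pitch Error (GPE): error of the predicted F0 on voiced frames
• Voicing Decision Error (VDE): rate of voiced/unvoiced decision mismatches
• F0 Frame Error (FFE): combined error covering both GPE and VDE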
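The three metrics can be computed frame by frame as in the sketch below. It follows their commonly used definitions, with a 20% deviation threshold for a gross pitch error and F0 = 0 marking unvoiced frames; the threshold, the function name, and the toy contours are assumptions, since the slide does not spell them out.

```python
def f0_errors(ref_f0, pred_f0, tolerance=0.2):
    """Return (GPE, VDE, FFE) as fractions; an F0 of 0 marks an unvoiced frame."""
    n_frames = len(ref_f0)
    both_voiced = gross = voicing = 0
    for ref, pred in zip(ref_f0, pred_f0):
        ref_voiced, pred_voiced = ref > 0, pred > 0
        if ref_voiced != pred_voiced:
            voicing += 1                      # voiced/unvoiced decision mismatch
        elif ref_voiced:
            both_voiced += 1
            if abs(pred - ref) > tolerance * ref:
                gross += 1                    # pitch deviates by more than the tolerance
    gpe = gross / both_voiced if both_voiced else 0.0
    vde = voicing / n_frames
    ffe = (gross + voicing) / n_frames
    return gpe, vde, ffe

ref = [100.0, 100.0, 0.0, 200.0, 0.0]
pred = [105.0, 150.0, 0.0, 0.0, 120.0]
print(f0_errors(ref, pred))  # (0.5, 0.4, 0.6)
```

All three are error rates, so lower values mean the generated F0 follows the reference more closely.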

Demonstration
 Demo site (https://nv-adlr.github.io/Mellotron)

Tried it in Japanese as well
 Training dataset
⁃ JSUT corpus, BASIC5000 sentences (read speech)
 MelGAN is used as the vocoder
 Samples
[Audio sample table: reference speech and generated speech]
⁃ Rhythm Transfer: x0.75, x1.0, x1.25
⁃ Pitch Transfer: -1.5cents, +1.5cents
⁃ Expressive speech: angry, happy
⁃ Original recording
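The slides do not say how the rhythm and pitch variants were produced. As a purely illustrative sketch, a pitch contour can be shifted by a number of cents and a frame sequence stretched by a duration factor as follows; both function names and the nearest-frame resampling are assumptions.

```python
def shift_pitch(f0_contour, cents):
    """Scale every F0 value by 2^(cents/1200); unvoiced frames (0) stay 0."""
    ratio = 2.0 ** (cents / 1200.0)
    return [f0 * ratio for f0 in f0_contour]

def stretch(frames, factor):
    """Resample a frame sequence to factor times its length (nearest earlier frame)."""
    n_out = max(1, round(len(frames) * factor))
    return [frames[min(len(frames) - 1, int(i / factor))] for i in range(n_out)]

contour = [220.0, 220.0, 0.0, 330.0]
print(shift_pitch(contour, 1200))   # one octave up: [440.0, 440.0, 0.0, 660.0]
print(stretch(contour, 2.0))        # each frame repeated twice
```

Applying `stretch` to the rows of an attention map and to the pitch contour together would keep the two aligned in time.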

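Summary
 Introduced Mellotron
 Using reference speech, it generated rhythm and pitch very close to the reference, and both could be controlled directly
 It generated singing voice and emotional speech without any such speech in the training data
⁃ It may not handle inputs that fall far outside the training data
 Future work: investigate what happens when anime-style or storyteller-style speech is used as training data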

References
[Wang+, 2017] “Tacotron: Towards End-to-End Speech Synthesis,” Interspeech, 2017.
[Shen+, 2018] “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions,” ICASSP, 2018.
[Wang+, 2018] “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis,” arXiv preprint arXiv:1803.09017, 2018.
[Skerry-Ryan+, 2018] “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” arXiv preprint arXiv:1803.09047, 2018.
