Bilgin Aksoy 18 Dec 2021
Intro to Text to Speech Synthesis Using Deep Learning
whoami
Bilgin Aksoy
• B.Sc. KHO 2003
• M.Sc. METU 2018
• 2003-2018 TAF (Officer)
• 2018-2020 DataBoss (Head of Data Science Department)
• 2020- ARINLABS (Data Scientist)
• Linkedin: https://www.linkedin.com/in/bilgin-aksoy-a61a90110/
• Twitter: @blgnksy
Speech Synthesis / Text to Speech
Definition
• Synthesizing intelligible and natural speech from text.
• A research topic in natural language and speech processing.
• Requires knowledge about languages and human speech production.
• Involves multiple disciplines including linguistics, acoustics, digital signal
processing, and machine learning.
Speech Synthesis / Text to Speech
A Brief History
• In the late 18th century, Wolfgang von Kempelen constructed a mechanical speaking machine.
• Early methods: articulatory synthesis,
formant synthesis, and concatenative
synthesis.
• Later methods: statistical parametric
(spectrum, fundamental frequency, and
duration) speech synthesis (SPSS).
• From 2010s: neural network-based
speech synthesis.
Speech Synthesis / Text to Speech
Glossary
• Prosody: Intonation, stress, and rhythm.
• Phonemes: Units of sounds.
• Part-of-Speech: nouns, pronouns, verbs, adjectives, adverbs, prepositions,
conjunctions, articles/determiners, interjections.
• Vocoder: Decodes acoustic features into audio signals.
• Pitch/Fundamental Frequency (F0): The lowest frequency of a periodic waveform.
• Alignment: Associating characters/graphemes with phonemes.
• Duration: How long each speech sound lasts.
Speech Synthesis / Text to Speech
Glossary
• Mean Opinion Score (MOS): The most frequently used method for evaluating the quality of generated speech. MOS ranges from 0 to 5; real human speech typically scores between 4.5 and 4.8. A toy computation is sketched below.
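As a concrete illustration, here is a minimal Python sketch of how a MOS is computed from raw listener ratings. The scores in it are made-up placeholders, and the normal-approximation confidence interval is one common reporting convention, not something prescribed by any of the papers covered here.

```python
# Minimal sketch: Mean Opinion Score (MOS) with a 95% confidence
# interval from listener ratings on a 1-5 scale. The ratings below
# are hypothetical placeholder values, not from a real study.
import math

ratings = [4, 5, 4, 4, 3, 5, 4, 4]  # hypothetical listener scores

n = len(ratings)
mos = sum(ratings) / n
# Sample standard deviation and a normal-approximation 95% CI.
std = math.sqrt(sum((r - mos) ** 2 for r in ratings) / (n - 1))
ci95 = 1.96 * std / math.sqrt(n)

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```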
Speech Synthesis / Text to Speech
Sound Signal / Waveform
• Sampling rate: Sampling is the reduction of a continuous-time signal to a discrete-time signal; the sampling rate is the number of samples per second (commonly 16 or 22.05 kHz in TTS). See the loading sketch below.
• Sample depth: The number of bits used to represent each sample's value.
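A minimal sketch of both notions in code, assuming the librosa and soundfile packages and a local speech.wav file (the file name is a placeholder): loading resamples to a chosen rate, and the write subtype fixes the sample depth.

```python
# Minimal sketch (assumes librosa and soundfile are installed, and
# that speech.wav exists): loading a waveform at a fixed sampling
# rate, a common preprocessing step for TTS corpora.
import librosa
import soundfile as sf

wav, sr = librosa.load("speech.wav", sr=22050)  # resample to 22.05 kHz
print(f"{len(wav)} samples at {sr} Hz = {len(wav) / sr:.2f} s")

# Writing with 16-bit PCM fixes the sample depth at 16 bits per sample.
sf.write("speech_16bit.wav", wav, sr, subtype="PCM_16")
```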
Speech Synthesis / Text to Speech
Spectrum of Sound Signal
[Figure: spectrum of a speech signal, with the harmonics and the pitch (F0) labeled.]
• The human voice ranges from about 125 Hz to 8 kHz.
• Typical F0: male ≈ 125 Hz, female ≈ 200 Hz, child ≈ 300 Hz.
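For illustration, a short sketch of F0 estimation using librosa's pyin implementation (librosa >= 0.8 assumed; speech.wav is a placeholder file). The search range roughly brackets the male-to-child F0 values above.

```python
# Minimal sketch: estimating the fundamental frequency (F0) track
# with the pYIN algorithm, searching over a typical human F0 range.
import numpy as np
import librosa

wav, sr = librosa.load("speech.wav", sr=22050)
f0, voiced_flag, voiced_prob = librosa.pyin(
    wav, fmin=75.0, fmax=400.0, sr=sr
)
# f0 is NaN on unvoiced frames, so use a NaN-aware statistic.
print(f"median F0 of voiced frames: {np.nanmedian(f0):.1f} Hz")
```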
Speech Synthesis / Text to Speech
Mel Spectrum
• Mel spectrum: The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Neural TTS systems usually use 80 mel bands (see the sketch below).
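A minimal sketch of computing an 80-band log-mel spectrogram with librosa. The n_fft/hop_length values are common choices in open TTS implementations at 22.05 kHz, not the only option, and speech.wav is again a placeholder.

```python
# Minimal sketch: an 80-band log-mel spectrogram, the acoustic
# feature most neural acoustic models predict and most neural
# vocoders consume.
import numpy as np
import librosa

wav, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, 1e-5, None))  # floor before the log
print(log_mel.shape)  # (80, number_of_frames)
```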
Speech Synthesis / Text to Speech
Key Components
* Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
Speech Synthesis / Text to Speech
Text Analysis
• Text normalization,
• Word segmentation,
• Part-of-speech (POS) tagging,
• Prosody prediction,
• Character/grapheme-to-phoneme conversion (alignment). A toy normalization sketch follows.
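To make the first step concrete, here is a deliberately toy normalization sketch. Real front ends use much richer rule sets or learned models, and the digit-by-digit reading below is only one possible verbalization (a full system would produce, e.g., "nineteen eighty nine" for a year).

```python
# Toy sketch of text normalization: expanding a few non-standard
# word classes into spoken form. Only hints at the idea.
import re

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_digits(match: re.Match) -> str:
    # Read a digit string out digit by digit ("1989" -> "one nine eight nine").
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text: str) -> str:
    text = text.replace("%", " percent")
    text = re.sub(r"\d+", spell_digits, text)
    return text.lower()

print(normalize("Inflation hit 7% in 1989"))
# -> "inflation hit seven percent in one nine eight nine"
```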
Speech Synthesis / Text to Speech
Acoustic Model
• Inputs: Linguistic features, or phonemes/characters directly.
• Outputs: Acoustic features.
• RNN-based, CNN-based, or Transformer-based (an interface sketch follows).
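The following PyTorch sketch only illustrates the input/output contract of an acoustic model (phoneme IDs in, mel frames out). The tiny Transformer encoder and all its sizes are arbitrary placeholders, not any published architecture, and the text-to-frame length expansion that real models perform is omitted.

```python
# Toy acoustic-model interface: phoneme-ID sequence -> mel frames.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    def __init__(self, n_phonemes: int = 100, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # (batch, time) int64 -> (batch, time, n_mels) float32.
        # Real models also expand time via durations/attention; omitted here.
        return self.to_mel(self.encoder(self.embed(phoneme_ids)))

model = ToyAcousticModel()
mel = model(torch.randint(0, 100, (1, 12)))
print(mel.shape)  # torch.Size([1, 12, 80])
```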
Speech Synthesis / Text to Speech
Vocoder
• The part of the system that decodes acoustic features into audio signals/waveforms. A classical, learning-free baseline is sketched below.
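As a learning-free point of reference, the sketch below inverts a mel spectrogram with librosa's Griffin-Lim-based mel_to_audio. Neural vocoders (WaveNet, WaveGlow, HiFi-GAN) replace exactly this step with far higher fidelity; the file names are placeholders.

```python
# Minimal sketch: Griffin-Lim inversion of a mel spectrogram as a
# classical baseline vocoder (librosa and soundfile assumed).
import librosa
import soundfile as sf

wav, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
recon = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=32
)
sf.write("reconstructed.wav", recon, sr)
```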
Speech Synthesis / Text to Speech
Different Structures
* Tan, Xu, et al. "A survey on neural speech synthesis." arXiv preprint arXiv:2106.15561 (2021).
Speech Synthesis / Text to Speech
Different Choices
• Single- or multi-speaker,
• Single- or multi-language,
• Single- or multi-gender.
Speech Synthesis / Text to Speech
WaveNet
Speech Synthesis / Text to Speech
DeepVoice 1/2/3
(DeepVoice 2 added speaker embeddings.)
Speech Synthesis / Text to Speech
Tacotron 1/2
Speech Synthesis / Text to Speech
FastSpeech 1/2/2s
Speech Synthesis / Text to Speech
WaveGlow
Speech Synthesis / Text to Speech
HiFi-GAN
• GAN architecture
• Generator: fully convolutional
• Discriminators:
• Multi-Period Discriminator (see the reshaping sketch below)
• Multi-Scale Discriminator
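The reshaping idea behind the Multi-Period Discriminator fits in a few lines of PyTorch: for a period p, the waveform is viewed as a (p, T/p) array whose rows hold equally spaced samples. This is only the data view; the actual sub-discriminators apply stacks of 2-D convolutions on top of it.

```python
# Minimal sketch of the Multi-Period Discriminator's input view:
# row k of the folded array holds samples k, k+p, k+2p, ...
import torch

def periodic_view(wav: torch.Tensor, period: int) -> torch.Tensor:
    # wav: (batch, T). Pad so T is divisible by the period, then fold.
    b, t = wav.shape
    pad = (-t) % period
    wav = torch.nn.functional.pad(wav, (0, pad))
    return wav.view(b, -1, period).transpose(1, 2)  # (batch, period, T/period)

x = torch.randn(1, 22050)
for p in (2, 3, 5, 7, 11):  # the periods used in the HiFi-GAN paper
    print(p, periodic_view(x, p).shape)
```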
Speech Synthesis / Text to Speech
Other Models
• End-to-End Adversarial Text-to-Speech (EATS)
• WaveGAN
• MelGAN
• GAN-TTS
• Char2Wav
• ClariNet
• FastPitch
Speech Synthesis / Text to Speech
Datasets
• ARCTIC, VCTK, Blizzard-2011, Blizzard-2013, LJSpeech, LibriSpeech,
LibriTTS, VCC, HiFi-TTS, TED-LIUM, CALLHOME, RyanSpeech (English)
• CSMSC, HKUST, AISHELL-1, AISHELL-2, AISHELL-3, DiDiSpeech-1,
DiDiSpeech-2 (Mandarin)
• India Corpus, M-AILABS, MLS, CSS10, CommonVoice (Multilingual)
Speech Synthesis / Text to Speech
CommonVoice
Speech Synthesis / Text to Speech
CommonVoice
Speech Synthesis / Text to Speech
Resources
• DeepMind
• Google
• Microsoft
• Nvidia
• Coqui AI
• Mozilla TTS
• Nuance
Questions?
Editor's Notes
  1. Text to speech (TTS), also known as speech synthesis, aims to synthesize intelligible and natural speech from text [346]. It has broad applications in human communication [1] and has long been a research topic in artificial intelligence, natural language and speech processing. Developing a TTS system requires knowledge about languages and human speech production, and involves multiple disciplines including linguistics [63], acoustics [170], digital signal processing [320], and machine learning.
  2. In the second half of the 18th century, the Hungarian scientist Wolfgang von Kempelen constructed a speaking machine with a series of bellows, springs, bagpipes and resonance boxes to produce some simple words and short sentences. The first computer-based speech synthesis systems came out in the latter half of the 20th century. The early computer-based methods include articulatory synthesis, formant synthesis, and concatenative synthesis. Articulatory synthesis produces speech by simulating the behavior of human articulators such as the lips, tongue, glottis and moving vocal tract. Formant synthesis produces speech based on a set of rules that control a simplified source-filter model; these rules are usually developed by linguists to mimic the formant structure and other spectral properties of speech as closely as possible, and the speech is synthesized by an additive synthesis module and an acoustic model with varying parameters like fundamental frequency, voicing, and noise levels. Concatenative synthesis relies on the concatenation of pieces of speech stored in a database, usually speech units ranging from whole sentences down to syllables, recorded by voice actors. Later, with the development of statistical machine learning, statistical parametric speech synthesis (SPSS) was proposed [416, 356, 415, 425, 357], which predicts parameters such as spectrum, fundamental frequency and duration: instead of directly generating waveform through concatenation, it first generates the acoustic parameters [82, 355, 156] necessary to produce speech and then recovers the speech from those parameters with some algorithm. From the 2010s, neural network-based speech synthesis has gradually become the dominant method and achieved much better voice quality, adopting (deep) neural networks as the model backbone. Some early neural models were adopted in SPSS to replace HMMs for acoustic modeling. Later, WaveNet was proposed to directly generate waveform from linguistic features, which can be regarded as the first modern neural TTS model. Other models like DeepVoice 1/2 still follow the three components of statistical parametric synthesis, but upgrade each with a corresponding neural network-based model. Furthermore, end-to-end models (e.g., Tacotron 1/2, DeepVoice 3, and FastSpeech 1/2) simplify the text analysis module, directly take character/phoneme sequences as input, and simplify acoustic features to mel-spectrograms. Later still, fully end-to-end TTS systems were developed to directly generate waveform from text, such as ClariNet, FastSpeech 2s, and EATS (DeepMind's End-to-End Adversarial Text-to-Speech). Compared to concatenative and statistical parametric synthesis, neural speech synthesis offers high voice quality in terms of both intelligibility and naturalness, and requires less human preprocessing and feature development.
  3. Prosody: intonation, stress, and rhythm. Phonemes: units of sound (Turkish example: kahır vs. ahır). Part-of-speech: the grammatical category of a word (noun, pronoun, verb, adjective, adverb, preposition, conjunction, article/determiner, interjection). Vocoder: decodes acoustic features into audio signals. Pitch/fundamental frequency (F0): the lowest frequency of a periodic waveform. Alignment: associating characters/graphemes with phonemes. Duration: how long each speech sound lasts.
  4. Mean Opinion Score (MOS): the most frequently used method for evaluating the quality of generated speech. MOS ranges from 0 to 5; real human speech typically scores between 4.5 and 4.8.
  5. Text normalization: the raw written text (non-standard words) should be converted into spoken-form words through text normalization, which makes the words easy to pronounce for TTS models. For example, the year “1989” is normalized into “nineteen eighty nine”, and “Jan. 24” is normalized into “January twenty-fourth”. Word segmentation: for character-based languages such as Chinese, word segmentation is necessary to detect word boundaries in raw text. Part-of-speech tagging: the part-of-speech (POS) of each word, such as noun, verb, or preposition, is also important for grapheme-to-phoneme conversion and prosody prediction in TTS. Prosody prediction: prosody information such as rhythm, stress, and intonation corresponds to variations in syllable duration, loudness and pitch, and plays an important perceptual role in human speech communication.
  6. Acoustic models generate acoustic features from linguistic features, or directly from phonemes or characters; the features are then converted into waveform using vocoders. Over the development of TTS, different kinds of acoustic models have been adopted: the early HMM- and DNN-based models in statistical parametric speech synthesis (SPSS), then sequence-to-sequence models based on the encoder-attention-decoder framework (including LSTM, CNN and self-attention), and most recently feed-forward networks (CNN or self-attention) for parallel generation. RNN-based models: e.g., the Tacotron series. CNN-based models: e.g., the DeepVoice series; DeepVoice [8] is actually an SPSS system enhanced with convolutional neural networks, which, after obtaining linguistic features through neural networks, leverages a WaveNet [254] based vocoder to generate waveform. Transformer-based models: e.g., the FastSpeech series.
  7. Early neural vocoders such as WaveNet, Char2Wav, and WaveRNN directly take linguistic features as input and generate waveform. Later, Prenger et al., Kim et al., Kumar et al., and Yamamoto et al. take mel-spectrograms as input and generate waveform.
  8. Fully end-to-end TTS models can generate speech waveform directly from a character or phoneme sequence, with the following advantages: 1) they require less human annotation and feature development (e.g., alignment information between text and speech); 2) joint, end-to-end optimization avoids the error propagation of cascaded models (Text Analysis + Acoustic Model + Vocoder); 3) they reduce training, development and deployment cost. The progression toward this goal: 1) simplify the text analysis module and linguistic features; 2) simplify the acoustic features, replacing complicated features with mel-spectrograms; 3) replace two or three modules with a single end-to-end model. However, training TTS models end to end is challenging, mainly because text and speech waveform are different modalities and because of the huge length mismatch between character/phoneme sequences and waveform sequences. For example, for a 5-second utterance of about 20 words, the phoneme sequence is only about 100 symbols long, while the waveform sequence is about 110k samples (at a 22 kHz sample rate).
  9. Some TTS systems explicitly model speaker representations through a speaker lookup table or a speaker encoder.
  10. WaveNet: a deep generative model of raw audio waveforms. The authors show that WaveNets can generate speech which mimics any human voice and which sounds more natural than the best pre-existing text-to-speech systems, reducing the gap with human performance by over 50%. Key ingredients: autoregressive generation, causal convolution, and dilated convolution; the model is really slow for real-life applications. WaveNet was inspired by PixelCNN and PixelRNN, which are able to generate very complex natural images. Follow-ups: Fast WaveNet, Parallel WaveNet. (A dilated-causal-convolution sketch follows.)
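A minimal PyTorch sketch of the dilated causal convolution described above. Channel counts and dilations are illustrative rather than WaveNet's actual configuration, and the gated activations and skip connections of the real model are omitted.

```python
# Minimal sketch: dilated causal 1-D convolution. Left-padding by
# dilation * (kernel_size - 1) ensures each output sample depends
# only on current and past inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        self.left_pad = dilation * (kernel_size - 1)
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad only on the left (the past).
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Stacking with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth while keeping the sequence length fixed.
x = torch.randn(1, 16, 1000)
for d in (1, 2, 4, 8):
    x = torch.relu(CausalConv1d(16, dilation=d)(x))
print(x.shape)  # torch.Size([1, 16, 1000])
```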
  11. Deep Voice, by Baidu, consists of four different neural networks that together form an end-to-end pipeline: a segmentation model that locates boundaries between phonemes (a hybrid CNN/RNN network trained to predict the alignment between vocal sounds and target phonemes); a model that converts graphemes to phonemes; a model that predicts phoneme durations and fundamental frequencies (the same phoneme may have different durations in different words, so duration must be predicted, along with the fundamental frequency for the pitch of each phoneme); and a model to synthesize the final audio, for which the authors implemented a modified WaveNet. As you can see, it still follows the three components of statistical parametric synthesis, but upgrades them with corresponding neural network-based models. Deep Voice 2 added speaker embeddings. Deep Voice 3 uses a single model instead of four: a fully-convolutional character-to-spectrogram architecture that is ideal for parallel computation, as opposed to RNN-based models. The authors also experimented with different waveform synthesis methods, with WaveNet achieving the best results once again, and scaled to about 2,000 speakers.
  12. Tacotron was released by Google in 2017 as an end-to-end system. It is basically a sequence-to-sequence model following the familiar encoder-decoder architecture, with an attention mechanism. End-to-end and faster than WaveNet: character sequence => audio spectrogram => synthesized audio. The encoder's goal is to extract robust sequential representations of text: it receives a character sequence represented as one-hot encodings and, through a stack of pre-nets and CBHG modules, outputs the final representation (the pre-net is the non-linear transformation applied to each embedding). Content-based attention passes the representation to the decoder, where a recurrent layer produces the attention query at each time step; the query is concatenated with the context vector and passed to a stack of GRU cells with residual connections. The decoder output is converted to the final waveform by a separate post-processing network containing a CBHG module. No support for multi-speaker. Tacotron 2 improves and simplifies the original architecture; while there are no major differences, its key points are: the encoder now consists of 3 convolutional layers and a bidirectional LSTM, replacing the pre-nets and CBHG modules; location-sensitive attention improves on the original additive attention mechanism; the decoder is now an autoregressive RNN formed by a pre-net, 2 unidirectional LSTMs, and a 5-layer convolutional post-net; mel spectrograms are generated and passed to the vocoder instead of linear-scale spectrograms; and a modified WaveNet following PixelCNN++ and Parallel WaveNet serves as the vocoder, replacing the Griffin-Lim algorithm used in Tacotron 1.
  13. Through parallel mel-spectrogram generation, FastSpeech greatly speeds up synthesis. A phoneme duration predictor ensures hard alignment between each phoneme and its mel-spectrogram frames, very different from the soft, automatic attention alignments in autoregressive models. The length regulator (Figure 1c) solves the length mismatch between the phoneme and spectrogram sequences, and can easily adjust voice speed (speed or prosody control); a sketch follows. FastSpeech's drawbacks: 1) the teacher-student distillation pipeline is complicated and time-consuming; 2) the durations extracted from the teacher model are not accurate enough. FastSpeech 2/2s keep the same feed-forward Transformer (FFT) encoder but, first, remove the teacher-student distillation pipeline and directly use ground-truth mel-spectrograms as the training target, avoiding the information loss of distilled mel-spectrograms and raising the upper bound on voice quality; and, second, extend the variance adaptor to include not only a duration predictor but also pitch and energy predictors.
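A minimal PyTorch sketch of the length regulator idea: repeat each phoneme's hidden state by its predicted duration, with a speed factor for rate control. The shapes and the rounding policy are simplifications of the paper's description.

```python
# Minimal sketch: FastSpeech-style length regulation via
# torch.repeat_interleave. Scaling durations changes speaking rate.
import torch

def length_regulator(h: torch.Tensor, durations: torch.Tensor, speed: float = 1.0):
    # h: (phonemes, hidden); durations: (phonemes,) in mel frames.
    d = torch.clamp((durations / speed).round().long(), min=0)
    return torch.repeat_interleave(h, d, dim=0)  # (total_frames, hidden)

h = torch.randn(4, 256)                  # 4 phoneme hidden states
durations = torch.tensor([3, 5, 2, 6])   # predicted frame counts
print(length_regulator(h, durations).shape)            # torch.Size([16, 256])
print(length_regulator(h, durations, speed=2.0).shape) # faster speech: ~half the frames
```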
  14. WaveGlow, by Nvidia, is one of the most popular flow-based TTS models. It essentially combines insights from Glow and WaveNet to achieve fast and efficient audio synthesis without auto-regression. Note that WaveGlow is used strictly to generate speech from mel spectrograms, replacing WaveNet as the vocoder.
  15. HiFi-GAN. The generator is a fully convolutional neural network: it takes a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. The Multi-Period Discriminator (MPD) is a mixture of sub-discriminators, each of which only accepts equally spaced samples of the input audio. Because each sub-discriminator in the MPD only sees disjoint samples, a Multi-Scale Discriminator (MSD) is added to consecutively evaluate the audio sequence; its architecture is drawn from MelGAN (Kumar et al., 2019), and it is a mixture of three sub-discriminators operating on different input scales: raw audio, ×2 average-pooled audio, and ×4 average-pooled audio. Training combines a GAN loss, a mel-spectrogram loss, and a feature matching loss. HiFi-GAN V1 reaches a MOS of 4.3, approaching human quality, while generating 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU.