SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Junior,
Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti
1. Introduction
1.1 Motivation
– Recently, normalizing flows have been successfully applied in the TTS field: the flow-based models FlowTron (Valle et
al., 2020) and Glow-TTS (Kim et al., 2020) achieved state-of-the-art results. Despite this, current zero-shot multi-speaker
TTS models are still heavily based on the Tacotron 2 model.
1.2 Highlights
– As far as we know, this is the first work to explore flow-based models in a zero-shot multi-speaker TTS scenario.
– We show that fine-tuning a GAN-based vocoder with the Mel-spectrograms predicted by the TTS model for the training
speakers can significantly improve speech similarity and quality for new speakers.
– Our approach achieves promising results using only 11 speakers for training.
2. Methodology: Proposed Method and Dataset
2.1 Speaker Encoder
– Stack of 3 LSTM layers with a linear output layer.
– Trained using the Angular Prototypical loss function with approximately 25k speakers.
– Training datasets: LibriSpeech, VoxCeleb V1 and V2, the English version of Common Voice, and VCTK.
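The Angular Prototypical loss named above can be sketched in plain Python as follows. This is a minimal illustration, assuming the usual formulation: one query utterance per speaker, a centroid from that speaker's remaining utterances, and a scaled cosine similarity fed into a softmax cross-entropy. The function names and the values of `w` and `b` (which are learnable in practice) are illustrative assumptions, not taken from the poster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_prototypical_loss(batch, w=10.0, b=-5.0):
    """batch[j] holds the utterance embeddings of speaker j (at least two each).

    The last utterance of each speaker is the query; the mean of the others
    is that speaker's centroid. w and b are illustrative constants here.
    """
    queries = [utts[-1] for utts in batch]
    centroids = []
    for utts in batch:
        support = utts[:-1]
        dim = len(support[0])
        centroids.append([sum(u[d] for u in support) / len(support) for d in range(dim)])

    total = 0.0
    for j, query in enumerate(queries):
        logits = [w * cosine(query, c) + b for c in centroids]
        # Numerically stable log-sum-exp, then cross-entropy for speaker j.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[j]
    return total / len(queries)

# Two toy speakers with 2-dimensional embeddings.
separated = [[[1.0, 0.0], [1.0, 0.1], [1.0, 0.0]],
             [[0.0, 1.0], [0.1, 1.0], [0.0, 1.0]]]
confused = [[[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]],
            [[0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]]
loss_separated = angular_prototypical_loss(separated)
loss_confused = angular_prototypical_loss(confused)
```

The loss is small when each query is closest to its own speaker's centroid and grows when queries sit nearer to another speaker's centroid.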
2.2 Vocoder: HiFi-GAN V2
− VCTK dataset for training and validation.
− Fine-tuning with Mel-spectrograms predicted by TTS models
(HiFi-GAN-FT).
2.3 SC-GlowTTS Model: Glow-TTS based
− Phonemes instead of graphemes as input.
− Explore 3 different encoders:
  • The original transformer-based encoder;
  • A residual convolutional encoder;
  • A gated convolutional encoder.
− External speaker embeddings used as conditioning in:
  • Affine coupling layers in all decoder blocks;
  • Duration predictor input.
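To illustrate how a speaker embedding can condition an affine coupling layer while keeping it invertible, here is a minimal sketch in plain Python. The `toy_conditioner` function is a hypothetical stand-in for the learned network that predicts the log-scale and shift. It is not the SC-GlowTTS implementation, and the input length is assumed to be even.

```python
import math

def toy_conditioner(x_a, speaker_emb):
    """Hypothetical stand-in for the learned network.

    Any function of (x_a, speaker_emb) keeps the layer invertible, because
    x_a passes through the layer unchanged.
    """
    g = sum(speaker_emb) / len(speaker_emb)
    log_s = [math.tanh(a + g) for a in x_a]
    t = [0.5 * a - g for a in x_a]
    return log_s, t

def coupling_forward(x, speaker_emb):
    """Transform the second half of x, conditioned on the first half and the speaker."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_conditioner(x_a, speaker_emb)
    y_b = [b * math.exp(s) + shift for b, s, shift in zip(x_b, log_s, t)]
    log_det = sum(log_s)  # log-determinant of the Jacobian
    return x_a + y_b, log_det

def coupling_inverse(y, speaker_emb):
    """Exactly undo coupling_forward for the same speaker embedding."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_conditioner(y_a, speaker_emb)
    x_b = [(b - shift) * math.exp(-s) for b, s, shift in zip(y_b, log_s, t)]
    return y_a + x_b

x = [0.3, -1.2, 0.7, 2.0]
speaker_a = [0.1, -0.4, 0.25]
speaker_b = [0.9, 0.9, 0.9]
y_a, log_det_a = coupling_forward(x, speaker_a)
y_b, _ = coupling_forward(x, speaker_b)
x_recovered = coupling_inverse(y_a, speaker_a)
```

The same input maps to different outputs for different speaker embeddings, and inverting with the matching embedding recovers the input.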
2.4 Dataset: VCTK
− Training: composed of 97 speakers.
− Development: composed of samples from the 97 training speakers.
− Test: composed of 11 speakers not present in the training set.
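A speaker-disjoint split of this kind can be sketched as below. This is only an illustration: the helper name, the random selection of held-out speakers, the `dev_fraction` value and the seed are assumptions, since the poster does not say how the speakers or development samples were chosen.

```python
import random

def split_by_speaker(utterances, num_test_speakers, dev_fraction=0.05, seed=0):
    """utterances: list of (speaker_id, utterance_id) pairs.

    Returns (train, dev, test). Test speakers never appear in train or dev;
    dev is sampled from the utterances of the training speakers.
    """
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _ in utterances})
    test_speakers = set(rng.sample(speakers, num_test_speakers))

    test = [u for u in utterances if u[0] in test_speakers]
    seen = [u for u in utterances if u[0] not in test_speakers]
    rng.shuffle(seen)
    num_dev = max(1, int(len(seen) * dev_fraction))
    return seen[num_dev:], seen[:num_dev], test

# Toy corpus: 20 speakers with 10 utterances each.
corpus = [(f"p{s:03d}", i) for s in range(20) for i in range(10)]
train, dev, test = split_by_speaker(corpus, num_test_speakers=3)
```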
[Figure: SC-GlowTTS architecture. Blocks shown: Input Text, Phonemizer, Encoder, Duration Predictor, Conv Projection, Speaker Embedding, Alignment Generation (Ceil), Flow-Based Decoder (UnSqueeze, Affine Coupling Layer, Invertible 1x1 Conv, ActNorm, Squeeze; x 12), Predicted Mel spectrogram, HiFi-GAN, Waveform.]
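The Duration Predictor, Ceil and Alignment Generation blocks in the figure can be read as the following expansion step, sketched in plain Python. It assumes the predicted durations are already non-negative values in frames (Glow-TTS-style models typically predict log-durations and exponentiate them first). The `length_scale` parameter is a hypothetical speaking-rate knob.

```python
import math

def expand_by_duration(encoder_outputs, durations, length_scale=1.0):
    """Repeat each phoneme-level item for ceil(duration * length_scale) frames."""
    frames = []
    for item, duration in zip(encoder_outputs, durations):
        num_frames = math.ceil(duration * length_scale)
        frames.extend([item] * num_frames)
    return frames

# Three phonemes with fractional predicted durations (in frames).
phonemes = ["a", "b", "c"]
durations = [1.2, 0.4, 2.0]
aligned = expand_by_duration(phonemes, durations)
slower = expand_by_duration(phonemes, durations, length_scale=2.0)
```

Rounding up ensures every phoneme with a positive predicted duration receives at least one output frame.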
3. Experiments: Setup and Results
3.1 Proposed Experiments
1. Tacotron 2 baseline following Jia et al. (2018) and Cooper et al. (2020);
2. SC-GlowTTS with the transformer-based encoder;
3. SC-GlowTTS with the residual convolutional encoder;
4. SC-GlowTTS with the gated convolutional encoder.
3.2 Experiments Setup
– All experiments were implemented in Coqui TTS:
github.com/coqui-ai/TTS
– Coqui TTS is an open-source TTS framework. Contributions are welcome.
– Audio samples and checkpoints for all experiments are available at:
github.com/Edresson/SC-GlowTTS
3.3 Results
Table 1. Real Time Factor, MOS and Sim-MOS with 95% confidence intervals and the SECS for all our experiments.
Experiment - Model               | Vocoder     | RTF (CPU - GPU) | SECS    | MOS           | Sim-MOS
Ground Truth                     | –           | –               | 0.9236  | 4.12 ± 0.06   | 4.127 ± 0.06
Attentron ZS (Choi et al., 2020) | WaveRNN     | –               | (0.731) | (3.86 ± 0.05) | (3.30 ± 0.06)
1 - Tacotron 2                   | HiFi-GAN    | 0.5782 - 0.2485 | 0.7589  | 3.57 ± 0.08   | 3.867 ± 0.08
                                 | HiFi-GAN-FT | -               | 0.7791  | 3.74 ± 0.08   | 3.951 ± 0.07
2 - SC-GlowTTS-Trans             | HiFi-GAN    | 0.3612 - 0.1557 | 0.7641  | 3.65 ± 0.07   | 3.905 ± 0.07
                                 | HiFi-GAN-FT | -               | 0.8046  | 3.78 ± 0.07   | 3.999 ± 0.07
3 - SC-GlowTTS-Res               | HiFi-GAN    | 0.3597 - 0.1545 | 0.7440  | 3.45 ± 0.09   | 3.828 ± 0.08
                                 | HiFi-GAN-FT | -               | 0.7969  | 3.70 ± 0.07   | 3.916 ± 0.07
4 - SC-GlowTTS-Gated             | HiFi-GAN    | 0.3474 - 0.1437 | 0.7432  | 3.55 ± 0.08   | 3.852 ± 0.08
                                 | HiFi-GAN-FT | -               | 0.7849  | 3.82 ± 0.07   | 3.952 ± 0.07
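For readers unfamiliar with two of the table's columns, the sketch below shows how they are commonly computed. It assumes SECS is the cosine similarity between speaker-encoder embeddings of a reference and a synthesized utterance, and that RTF is synthesis time divided by the duration of the generated audio (lower is faster). The helper names and toy embeddings are illustrative and not taken from the poster.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two speaker embeddings (ranges from -1 to 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Seconds of compute needed per second of generated audio."""
    return synthesis_seconds / audio_seconds

# Toy embeddings standing in for speaker-encoder outputs.
reference = [0.2, 0.9, -0.1]
synthesized = [0.25, 0.85, -0.05]
secs = cosine_similarity(reference, synthesized)

# CPU RTF ratio between Tacotron 2 and SC-GlowTTS-Trans, using the table values.
cpu_speedup = 0.5782 / 0.3612
```

With the table's CPU figures, the ratio comes out at about 1.6, meaning SC-GlowTTS-Trans needs roughly 1.6 times less compute time per second of audio than the Tacotron 2 baseline on CPU.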
4. SC-GlowTTS performance with few speakers
– To emulate a scenario with few speakers we selected 11 speakers from the training subset of the VCTK dataset.
– We first trained the SC-GlowTTS-Trans model on the single-speaker LJ Speech dataset, then continued training on this
11-speaker dataset and calculated the metrics on the test set.
– The model achieved a similarity MOS of 3.93±0.08 and a MOS of 3.71±0.07. These results are comparable to those achieved
by the Tacotron 2 baseline trained with 98 speakers, which achieved a similarity MOS of 3.95±0.07 and a MOS of 3.74±0.08.
– We believe that this is an important step forward, especially for zero-shot multi-speaker TTS in
low-resource languages.
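As an informal reading of "comparable", one can check whether the reported 95% confidence intervals overlap. The sketch below does this with the figures quoted in this section. Overlapping intervals are only a rough indication and not a formal significance test, and the helper names are illustrative.

```python
def interval(mean, half_width):
    """Return (low, high) for a value reported as mean ± half_width."""
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Figures quoted above: 11-speaker SC-GlowTTS-Trans vs. the Tacotron 2 baseline.
few_speaker_sim_mos = interval(3.93, 0.08)
baseline_sim_mos = interval(3.95, 0.07)
few_speaker_mos = interval(3.71, 0.07)
baseline_mos = interval(3.74, 0.08)

sim_mos_comparable = intervals_overlap(few_speaker_sim_mos, baseline_sim_mos)
mos_comparable = intervals_overlap(few_speaker_mos, baseline_mos)
```

Both pairs of intervals overlap, which is consistent with the claim that the 11-speaker model is comparable to the baseline.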

Poster SCGlowTTS Interspeech 2021
