HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
A Study on Improving
Speaker Diarization System
TUNG LAM NGUYEN
lamfm95@gmail.com
Dept. of Control Engineering and Automation
Supervisor: Dr. T. Anh Xuan Tran
School: School of Electrical Engineering
Hanoi, March 1, 2022
Declaration of Authorship
I, Tung Lam NGUYEN, declare that this thesis titled, “A Study on Improving Speaker
Diarization System” and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree at
this University.
• Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
• Where I have consulted the published work of others, this is always clearly at-
tributed.
• Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
Signed:
Date:
“I’m not much but I’m all I have.”
- Philip K Dick, Martian Time-Slip
Abstract
Speaker diarization is the method of dividing a conversation into segments spoken by the same speaker, usually referred to as “who spoke when”. At Viettel, this task is especially important to the IP contact center (IPCC) automatic quality assurance system, by which hundreds of thousands of calls are processed every day. Integrated within a speaker recognition system, speaker diarization helps distinguish between agents and customers within each support call and gives further useful insights (e.g., agent attitude and customer satisfaction). The key to performing this task accurately is to learn discriminative speaker representations. X-Vectors, bottleneck features of a time-delayed neural network (TDNN), have emerged as the speaker representations of choice for many speaker diarization systems. On the other hand, ECAPA-TDNN, a recent development over X-Vectors' neural network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. Therefore, the aim of this work is to explore the capability of ECAPA-TDNN versus X-Vectors in the current Vietnamese speaker diarization system. Both baseline and proposed systems are evaluated in two tasks: speaker verification, to evaluate the discriminative characteristics of the speaker representations; and speaker diarization, to evaluate how these speaker representations affect the whole complex system. The data used include private data sets (IPCC_110000, VTR_1350) and a public data set (ZALO_400). In general, the conducted experiments show that the proposed system outperforms the baseline system on all tasks and on all data sets.
Acknowledgements
First and foremost, I would like to express my deep gratitude to my main supervisor, Dr. T. Anh Xuan Tran. Without her outstanding guidance and patience, I would never have finished this thesis.
I would like to thank Dr. Van Hai Do, Mr. Nhat Minh Le and colleagues at Viettel Cyberspace Center, whose kindness and tremendous technical assistance made my days working on this thesis much easier.
Finally, huge thanks to my friends for giving me stress relief on weekends, and to my family, who did most of the cooking so I would have more time to work on this thesis.
Hanoi, March 1, 2022
Contents
Declaration of Authorship
Abstract
Acknowledgements
1 Introduction
1.1 Research Interest
1.2 Thesis Outline
2 Speaker Diarization System
2.1 Front-end Processing
2.1.1 Features Extraction
2.1.2 Front-end Post-processing
2.1.2.1 Speech Enhancement
2.1.2.2 De-reverberation
2.1.2.3 Speech Separation
2.2 Voice Activity Detection
2.3 Segmentation
2.4 Speaker Representations
2.4.1 X-Vector Embeddings
2.4.1.1 Frame Level
2.4.1.2 Segment Level
2.4.2 ECAPA-TDNN Embeddings
2.4.2.1 Frame-level
2.4.2.1.1 1D Convolutional Layer
2.4.2.1.2 1D Squeeze-and-Excitation Block
2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block
2.4.2.2 Segment-level
2.4.2.2.1 Attentive Statistical Pooling
2.5 Clustering
2.5.1 PLDA Scoring
2.5.2 Agglomerative Hierarchical Clustering
3 Experiments
3.1 Evaluation Metrics
3.1.1 Equal Error Rate and Minimum Decision Cost Function
3.1.2 Diarization Error Rate
3.2 Frameworks
3.2.1 Kaldi
3.2.2 SpeechBrain
3.2.3 Kal-Star
3.3 Data Sets
3.3.1 IPCC_110000
3.3.1.1 IPCC_110000 Verification Test Set
3.3.1.2 IPCC_110000 Diarization Test Set
3.3.2 VTR_1350
3.3.2.1 VTR_1350 Verification Test Set
3.3.2.2 VTR_1350 Diarization Test Set
3.3.3 ZALO_400
3.3.3.1 ZALO_400 Verification Test Set
3.3.3.2 ZALO_400 Diarization Test Set
3.4 Baseline System
3.4.1 Speaker Diarization System
3.4.2 Speaker Verification System
3.5 Proposed System
4 Results
4.1 Speaker Verification Task
4.2 Speaker Diarization Task
5 Conclusions and Future Works
List of Figures
1.1 A traditional speaker diarization system diagram.
1.2 An example speaker diarization result.
1.3 An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.
1.4 Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
1.5 Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in system 1.4.
2.1 Diagram of an F-banks / MFCCs extraction process (adapted from [11]).
2.2 N=10 Mel filters for signal samples sampled at 16000 Hz.
2.3 Example output of a VAD system visualized in Audacity (audio editor) [36].
2.4 Diagram of X-Vectors DNN (adapted from [58]).
2.5 Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]).
2.6 Diagram of X-Vectors' segment-level DNN (as configured in [59]).
2.7 Complete network architecture of ECAPA-TDNN (adapted from [62]).
2.8 Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {-4,0,4}.
2.9 A 1D-Squeeze-and-Excitation block. Different colors represent different scales for channels.
2.10 A Res2Net-with-Squeeze-Excitation block.
2.11 Attentive Statistics Pooling (on both time frames and channels).
2.12 An example of LDA transformation from 2D to 1D (taken from [76]).
2.13 Fitting the parameters of the PLDA model (taken from [77]).
2.14 Agglomerative hierarchical clustering flowchart.
2.15 An example iterative process of agglomerative hierarchical clustering (taken from [80]).
2.16 Visualization of the result of hierarchical clustering (taken from [80]).
3.1 An EER plot.
3.2 Kaldi logo.
3.3 Kaldi general architecture diagram.
3.4 Filtering VTR_1350 data set by utterances' durations and number of utterances per speaker.
3.5 Generating 200 5-way conversations from VTR_1350 data set. The min. and max. numbers of utterances picked for each conversation are 2 and 30 respectively.
3.6 IPCC_110000 data distributions.
3.7 VTR_1350 data distributions.
3.8 ZALO_400 data distributions.
3.9 Baseline speaker diarization system diagram.
3.10 Baseline speaker verification system diagram.
3.11 Proposed speaker diarization system diagram.
3.12 Proposed speaker verification system diagram.
4.1 A speaker diarization output of a 3-way conversation in VTR_1350 test set.
List of Tables
3.1 List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
3.2 IPCC_110000 data set overview.
3.3 IPCC_110000 data subsets.
3.4 VTR_1350 data set overview.
3.5 ZALO_400 data set overview.
3.6 EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
3.7 Diarization Error Rates (DERs) on AMI dataset using the beamformed array signal on baseline and proposed systems (taken from [88]).
4.1 EER and MinDCF performance.
4.2 DER performance.
List of Abbreviations
IPCC IP Contact Center
DNN Deep Neural Network
CNN Convolutional Neural Network
TDNN Time-Delayed Neural Network
RTTM Rich Transcription Time Marked
RNN Recurrent Neural Network
LPC Linear Prediction Coding
PLP Perceptual Linear Prediction
DWT Discrete Wavelet Transform
MFBC Mel Filterbank Coefficients
MFCC Mel Frequency Cepstral Coefficients
STFT Short-time Discrete Fourier Transform
DCT Discrete Cosine Transform
WPE Weighted Prediction Error
MLE Maximum Likelihood Estimation
PIT Permutation Invariant Training
VAD Voice Activity Detection
SAD Speech Activity Detection
HMM Hidden Markov Model
GMM Gaussian Mixture Model
GLR Generalized Likelihood Ratio
BIC Bayesian Information Criterion
UBM Universal Background Model
LDA Linear Discriminant Analysis
PLDA Probabilistic Linear Discriminant Analysis
LSTM Long Short-Term Memory
SE-Res2Net Res2Net-with-Squeeze-Excitation
ReLU Rectified Linear Unit
AAM Additive Angular Margin
AHC Agglomerative Hierarchical Clustering
EER Equal Error Rate
CER Crossover Error Rate
FAR False Acceptance Rate
FRR False Rejection Rate
TPR True Positive Rate
FPR False Positive Rate
FNR False Negative Rate
MinDCF Minimum Decision Cost Function
DER Diarization Error Rate
PCM Pulse-Code Modulation
SNR Signal-to-Noise Ratio
Chapter 1
Introduction
1.1 Research Interest
Speaker diarization, usually referred to as "who spoke when", is the method of dividing a conversation that often includes a number of speakers into segments spoken by the same speaker. This task is especially important to the Viettel IP contact center (IPCC) automatic quality assurance system, where hundreds of thousands of calls are processed every day and the human resources are limited and costly. In scenarios where only single-channel recordings are provided, speaker diarization, integrated within a speaker recognition system, helps distinguish between agents and customers within each support call and gives further useful insights (e.g., agent attitude and customer satisfaction). Nevertheless, speaker diarization can also be applied to analyzing other forms of recorded conversations such as meetings, medical therapy sessions, court sessions, and talk shows.
FIGURE 1.1: A traditional speaker diarization system diagram (audio input → front-end processing → voice activity detection → segmentation → speaker representation → clustering → post-processing → diarization output).
A traditional speaker diarization system (figure 1.1) is built from six modules: front-end processing, voice activity detection, segmentation, speaker representation, clustering, and post-processing. All output information, including the number of speakers and the beginning time and duration of each of their speech segments, is encapsulated in the form of a Rich Transcription Time Marked (RTTM) file [1] (figure 1.2).
FIGURE 1.2: An example speaker diarization result.
An important factor that affects the speaker diarization accuracy is the number of par-
ticipating speakers in the conversation. This number could be revealed or hidden from
the system before the diarization process, depending on the nature of the conversations.
An example of the case where it’s revealed is a check-up call between a doctor and a pa-
tient, or a support call between a customer and an agent, which is usually a conversation
between only two people (i.e: a 2-way conversation), assuming there’s no new speaker
interrupting or joining the conversation.
By acknowledging that the conversation has only a defined number of speakers, the speaker diarization system can simply slice the recorded conversation into smaller speech partitions and classify them into a known number of clusters (e.g., using k-means [2]). However, in the case that the number of speakers is unknown to the system, the system must guess it first. The guessing ends when a stopping criterion (or a decision threshold) is met. For example, in a company meeting with shareholders, although the number of people participating in the meeting is on record, the number of people who actually speak in that meeting is unknown: while most people might remain silent throughout the meeting, only board members would take the mic. In this case, the diarization system can employ an unsupervised clustering method (e.g., X-means [3]) or a supervised clustering method (e.g., UIS-RNN [4]).
In fact, multiple attempts to build an end-to-end speaker diarization system without a clustering module have been made in [5], [6], and [7]. However, this thesis only focuses on a traditional speaker diarization system, which employs a speaker clustering module. Figure 1.3 demonstrates an example clustering result.
FIGURE 1.3: An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.
In this case, extracting speaker representations with discriminative characteristics is extremely important, since these representations, as input data, have a huge influence on the accuracy of the clustering stage. The discriminative characteristics of a speaker representation method can be tested indirectly via the speaker diarization system, which is the main focus of this thesis. Alternatively, they can be tested directly via a simple speaker verification system. A speaker verification system verifies the identity of a questioned speaker by comparing the voice data that supposedly belongs to him with his enrolled voice data. If the similarity between the enrolled data and the input is lower than a determined threshold, the impostor gets rejected.
In summary, the speaker diarization system and the speaker verification system are correlated in the sense that they use the same way of representing speakers. Hence, optimizations in the speaker verification system would also lead to improvements in the speaker diarization system, which is the main approach of this thesis. Figure 1.4 demonstrates a generic system that employs both speaker diarization and speaker verification. The speaker verification system, employing the same embeddings extractor and PLDA backend used in the speaker diarization system, is primarily used for optimizing the diarization performance. In this thesis, both baseline and proposed systems are based on this generic model. It is also noted that the post-processing module is left out.
FIGURE 1.4: Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training, and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
FIGURE 1.5: Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in figure 1.4. Enrolled and impostor audio inputs are processed (front-end processing, voice activity detection, embedding extraction), each verification pair is scored with the PLDA backend, and the resulting scores are used to plot the EER. This system is primarily used to optimize the speaker diarization system; the EER threshold can be used for clustering without knowing the number of speakers in system 1.4.
1.2 Thesis Outline
This thesis is organized into 5 chapters with the following contents:
Chapter 1 The current chapter gives general information about the research interest and a general overview of speaker diarization and its correlated task, speaker verification.
Chapter 2 This chapter presents the components of a speaker diarization system, including all components of the implemented speaker verification system. Notable topics are the X-Vector and ECAPA-TDNN speaker representations.
Chapter 3 This chapter discusses evaluation metrics, used data sets, and applied methods.
Chapter 4 This chapter closely examines the experiments' results.
Chapter 5 This chapter summarizes the work in this thesis and gives some future directions.
Chapter 2
Speaker Diarization System
2.1 Front-end Processing
The very first stage of a speaker diarization system (or more generally speaking, a speech
processing system) is front-end processing. At this stage, acoustic features are curated
and processed in such a way that is considered most favorable for the system. The fea-
tures must be in balance between simplicity (not much correlated and simple enough to
be input of the learning network), and complexity (still possessing useful information,
making space for the network to learn). Afterwards, some front-end post-processing
techniques, such as speech enhancement, speech de-reverberation, speech separation and
target speaker extraction, could be performed to further enhance speech features towards
better speaker diarization performance.
2.1.1 Features Extraction
There’s a wide variety of methods to represent the speech signal parametrically, such
as linear prediction coding (LPC), perceptual linear prediction (PLP), discrete wavelet
transform (DWT), Mel filterbank coefficients (MFBCs), and Mel frequency cepstral
coefficients (MFCCs). However, in the last twenty years, the last two methods have emerged as the features of choice in the field of speech processing [9][10]. Figure 2.1 demonstrates a typical F-banks / MFCCs extraction process.
At the beginning of the F-banks / MFCCs extraction process, the input speech signal is divided into homogeneous overlapping short frames (in most cases short frames of 25 ms with overlaps of 10 ms) and windowed (usually with a Hamming / Hanning window) to reduce artifacts in the later signal transforms [12] (i.e., with a 16000 Hz sampled speech signal, each frame contains 16000 * 25/1000 = 400 samples, with an overlap of 10/25 * 400 = 160 samples with the previous frame).

FIGURE 2.1: Diagram of an F-banks / MFCCs extraction process (framing and windowing → STFT → log-amplitude → N Mel filters → MFBCs; sum and DCT → MFCCs; adapted from [11]).
Afterwards, the framed and windowed signal is analysed by a Short-Time Discrete Fourier Transform (STFT), through which the signal sample is converted from the amplitude-time domain to the amplitude-frequency domain.
Next, the y-axis (amplitude) is converted to a log scale, as this better represents the human perception of loudness [13], according to the Weber-Fechner law [14].
Then, the x-axis (frequency) is also converted, but into the mel scale [15]. The reason for this conversion is the fact that our human ears have lower resolution at higher frequencies than at lower frequencies (i.e., it is fairly easy for us to distinguish between sounds at 300 Hz and 400 Hz, but it gets much harder when we have to compare sounds at 1300 Hz and 1400 Hz, even though the difference is still 100 Hz). The mel-scale formula was discovered purely via psychological experiments and has many variations, one of which can be expressed as follows:

m = 2595 \log_{10}(1 + f/700) = 1127 \ln(1 + f/700)    (2.1)
At this stage, Mel filterbank coefficients (MFBCs) can be computed in two steps:
• Step 1: Apply a chosen number N of triangular band-pass filters linearly spaced in
Mel scale:
– The lowest and highest frequencies of the bands correspond to the lowest and highest frequencies of the initially sampled signal.
– In the mel-frequency domain, these filters have the same bandwidth, and they overlap each other by half of the bandwidth.
– Each filter is a triangular filter with the frequency response of 1 at the center
frequency, and decreases linearly towards zero till it reaches the center frequen-
cies of the two adjacent filters.
• Step 2: Compute N mean log-amplitude values of N filtered signal samples (from
the original signal sample). These values are taken as N Mel filterbank coefficients.
For example, for signal samples sampled at 16000Hz, N=10 Mel filters can be visualized
in the following figure:
FIGURE 2.2: N=10 Mel filters for signal samples sampled at 16000 Hz.
Going further, to obtain Mel frequency cepstral coefficients (MFCCs), the following steps, which continue from step 2 of the MFBCs' computation above, are performed (a code sketch follows these steps):
• Step 1: Sum all N Mel-filtered signal samples to obtain a Mel-weighted signal
sample.
• Step 2: Apply the discrete cosine transform (DCT) to transform the signal from the log-mel frequency domain to the quefrency [16] domain.
• Step 3: Take the first K coefficients (usually K=13) as the K MFCCs. The next steps are optional, to generate higher-resolution MFCCs.
• Step 4: Compute first-order and second-order time derivatives of each coefficient to yield K*2 more coefficients.
• Step 5: Compute the sum of squares of the amplitudes of the signal sample to obtain one energy coefficient. After this step, the total number of MFCCs is K*3+1 (i.e., K=13 corresponds to 40 MFCCs).
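The extraction pipeline above can be approximated in a few lines. The following is a minimal sketch using librosa, assuming a 16000 Hz mono WAV file; the file name, the choice of N=40 mel filters and K=13 cepstral coefficients are illustrative, not the exact configuration used later in this thesis.

# A minimal sketch of the F-banks / MFCCs pipeline described above, using librosa.
# Frame / overlap values follow the text (25 ms frames, 10 ms overlap at 16 kHz).
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)     # placeholder path
n_fft = int(0.025 * sr)                             # 400-sample (25 ms) frames
hop = n_fft - int(0.010 * sr)                       # 160-sample (10 ms) overlap -> 240-sample hop

# Mel filterbank coefficients (MFBCs): log-amplitude of N mel-filtered bands per frame.
mel_power = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
mfbc = librosa.power_to_db(mel_power)               # shape: (40, num_frames)

# MFCCs: DCT of the log-mel spectrum, keeping the first K=13 coefficients.
mfcc = librosa.feature.mfcc(S=mfbc, n_mfcc=13)      # shape: (13, num_frames)

# Optional high-resolution MFCCs: first/second order deltas plus a log-energy term.
delta1 = librosa.feature.delta(mfcc, order=1)
delta2 = librosa.feature.delta(mfcc, order=2)
frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
n = min(mfcc.shape[1], log_energy.shape[0])         # align frame counts before stacking
features = np.vstack([mfcc[:, :n], delta1[:, :n], delta2[:, :n], log_energy[:n]])  # K*3+1 = 40 rows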
2.1.2 Front-end Post-processing
2.1.2.1 Speech Enhancement
Speech enhancement techniques primarily focus on diminishing noise from noisy audio. These techniques include classical signal-processing based de-noising [17], deep-learning based de-noising [17][18][19][20][21], and multi-channel processing [22].
2.1.2.2 De-reverberation
De-reverberation techniques are utilized to remove the effects of reverberation from the input signal. A popular method is Weighted Prediction Error (WPE), which is the dominant method used in top-performing systems in the DiHARD and CHiME competitions [23][24][25][26][27]. The basic idea of WPE is to decompose the original signal model into an early reflection and a late reverberation. It then tries to estimate a filter that maintains the early reflection while suppressing the late reverberation, based on the maximum likelihood estimation (MLE) method [28]. The improvement WPE gives is not large, but it is solid across all tasks. It also shows additional performance improvements when applied to multi-channel signals.
2.1.2.3 Speech Separation
Speech separation is primarily useful when overlapping speech regions are significantly
large. Two main branches under this approach are:
• Deep-learning based speech separation: Some early attempts are Deep Clustering [29], Permutation Invariant Training (PIT) [30] and Conv-TasNet [31]. However, single-channel speech separation systems often produce a redundant non-speech or even a duplicated speech signal for the non-overlap regions (leakage). Leakage filtering for single-channel systems was proposed and significantly improved speaker diarization performance [32][33].
• Beam-forming based speech separation: This method appears in top-performing systems in the CHiME-6 challenge [34][35].
2.2 Voice Activity Detection
Voice activity detection (VAD), also known as speech activity detection (SAD), is a technique to detect the presence or absence of human speech in a given audio signal, and it is an indispensable component of most speech processing systems:
• In a speech synthesis system, VAD helps remove noise in the training data and thus reduces noise in the synthesized audio.
• In a speech recognition system, VAD helps drop noise frames to save computing power and reduce the number of insertion errors in the decoded texts.
• In a speaker diarization system, VAD helps generate better speaker representations, which is the most important factor affecting the whole system's performance in terms of precision.
VAD systems can be classified into two types:
• Two-phase VAD systems, which mostly comprise two parts: a feature extraction front end, where acoustic features such as MFCCs are extracted; and a classifier, where a model predicts whether the input frame is speech or not.
• ASR-based VAD systems, where VAD timestamps are inferred directly from word alignments. In this case, the ASR system precedes the VAD system.
FIGURE 2.3: Example output of a VAD system visualized in Audacity (audio editor) [36].
VAD techniques have been developed sporadically throughout the years. In 1997, Benyassine et al. presented a silence compression scheme that reduces transmission bandwidth during silence periods [37]. This system employed a VAD algorithm that was later usually referred to as the G.729B algorithm. In 1998, Sohn et al. introduced the first statistical model-based VAD that accounts for time-varying noise statistics [38]. Just one year later, they proposed an improved version with a Hidden Markov Model (HMM) [39] based hang-over scheme [40]. In 2011, Ying et al. proposed a Gaussian mixture model (GMM) [41] based VAD trained in an unsupervised learning framework. A popular implementation of this method is Google WebRTC VAD [42][43].
Later, with the rise of neural networks, in 2013 Thad Hughes introduced a recurrent neural network (RNN) based VAD [44], which was claimed to outperform existing GMM-based VADs. Then, in 2018, a team from Johns Hopkins University proposed a Time-Delay Neural Network VAD [45], which is trained using alignments from the GMM-HMM process and is known to perform much faster than RNN-based VADs.
In this thesis, Google WebRTC VAD is adopted as the main VAD technique, considering its simplicity of installation, as well as its long use by Google in production environments.
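As an illustration of how WebRTC VAD is typically driven, the following is a minimal sketch using the py-webrtcvad package. The file name is a placeholder, and the audio is assumed to be 16-bit mono PCM at one of the sample rates WebRTC VAD accepts (8000, 16000, 32000 or 48000 Hz); this is not the exact wrapper used in the thesis pipeline.

# A minimal sketch of frame-level speech/non-speech labelling with py-webrtcvad.
import wave
import webrtcvad

vad = webrtcvad.Vad(2)              # aggressiveness 0 (least) .. 3 (most)
frame_ms = 30                       # WebRTC VAD accepts 10, 20 or 30 ms frames

with wave.open("call.wav", "rb") as wf:    # placeholder path, 16-bit mono PCM assumed
    sample_rate = wf.getframerate()
    pcm = wf.readframes(wf.getnframes())

bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2   # 16-bit samples -> 2 bytes each
labels = []                                                # (start_sec, is_speech) per frame
for i in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
    frame = pcm[i:i + bytes_per_frame]
    labels.append((i / 2 / sample_rate, vad.is_speech(frame, sample_rate)))

speech_time = sum(frame_ms / 1000 for _, s in labels if s)
print(f"{speech_time:.1f} s of speech detected out of {len(labels) * frame_ms / 1000:.1f} s")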
2.3 Segmentation
In a speaker diarization system, speech segmentation breaks the input audio stream into multiple segments so that each segment can be assigned a speaker label.
The simplest method of segmentation is uniform segmentation, in which the audio input is segmented with a consistent window length and overlap length. The window must be sufficiently short to safely assume that it does not contain multiple speakers, but at the same time long enough to capture enough acoustic information (usually from 1 to 2 seconds).
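A minimal sketch of uniform segmentation over VAD output is given below; the 1.5 s window and 0.75 s shift are illustrative values within the 1-2 s range mentioned above, not the thesis configuration.

# Slice each VAD speech region into overlapping fixed-length segments.
from typing import List, Tuple

def uniform_segments(speech_regions: List[Tuple[float, float]],
                     window: float = 1.5, shift: float = 0.75) -> List[Tuple[float, float]]:
    """speech_regions: (start, end) pairs in seconds -> list of (start, end) segments."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t + window <= end:
            segments.append((t, t + window))
            t += shift
        if t < end:                       # keep the shorter tail segment, if any
            segments.append((t, end))
    return segments

# Example: two speech regions found by the VAD.
print(uniform_segments([(0.0, 3.2), (4.0, 5.0)]))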
A more complex method is speaker change point detection, in which speaker change points are detected by comparing two hypotheses: hypothesis H0, assuming both the left and right samples are from the same speaker, and hypothesis H1, assuming the two samples are from different speakers. Some notable approaches are the Kullback-Leibler 2 (KL2) algorithm [46], the Generalized Likelihood Ratio (GLR) [47], and the Bayesian Information Criterion (BIC) [48][49].
2.4 Speaker Representations
Speaker representations play a critical role in measuring the similarity of speech segments, based on which speech segments are classified into a known or unknown number of speakers.
While features such as MFCCs or MFBCs are discriminative enough for speech recog-
nition, they are considered too noisy for speaker diarization. In order to overcome this
limitation, numerous studies have been carried out.
From 2010 to 2015, the dominant approach was to train a probabilistic model (e.g., a Gaussian Mixture Model-Universal Background Model, GMM-UBM) to extract speaker representations in a new low-dimensional speaker- and channel-dependent space. A probabilistic linear discriminant analysis (PLDA) [50] model could also be trained to further improve the scoring stage. Those representations are commonly referred to as I-Vectors [51][52][53][54].
Since late 2015, deep learning has emerged as the dominant approach for this task. The main concept is to train a deep neural network (DNN) to classify all speakers in a data set, and then, in the testing stage, use its bottleneck features as a speaker representation. In that year, Heigold et al. proposed an end-to-end text-dependent speaker verification system [55] that learns speaker embeddings (commonly known as D-Vectors) based on the cosine similarity. It was developed to handle variable-length input in a text-independent verification task through a temporal pooling layer and data augmentation. The model was trained entirely on Google's proprietary datasets. D-Vectors were later improved by a long short-term memory (LSTM) [56] network and a triplet loss function [57].
In 2017, a group of researchers from Johns Hopkins University proposed a modified version of D-Vectors trained on smaller, publicly available datasets and pre-processed with a different strategy [58]. In 2018, by exploiting data augmentation, they further improved their speaker representations and referred to these representations as X-Vectors [59].
2.4.1 X-Vector Embeddings
X-Vectors are bottleneck features of a deep neural network trained to classify a large number of speakers. The training data is divided into small batches of K speakers and M speech segments (each of which must have more than T frames). The loss function (multi-class cross entropy) is as follows:

E = -\sum_{n=1}^{N} \sum_{k=1}^{K} d_{nk} \ln P(\mathrm{spk}_k \mid x^{(n)}_{1:T})    (2.2)
where:
• P(spk_k | x^{(n)}_{1:T}) is the probability of speaker k given the T input frames x^{(n)}_1, x^{(n)}_2, ..., x^{(n)}_T.
• d_{nk} is 1 if the speaker label for segment n is k, and is 0 otherwise.
The network operates at 2 levels, frame level and segment level, connected by a statistics pooling layer (as shown in figure 2.4). This multi-level structure allows the DNN to be trained with segments of different lengths. Hence, the training data is better utilized and the extracted X-Vector is more robust against variance in segment length.

FIGURE 2.4: Diagram of X-Vectors DNN (adapted from [58]).
2.4.1.1 Frame Level
At frame level, the network is essentially a time delayed neural network (TDNN) [ ]
60
with sub-sampling. The default configuration is shown in figure : Input features
2.5
are 30 MFCCs extracted from frames of 25ms with the overlaps of 10ms. The TDNN
has 5 layers with different context specifications. Layer 3, 4 and 5 are fully connected.
To account for the lack of context at the first and the last frames, speech segments are
padded at both ends.
FIGURE 2.5: Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]). Input dimension 30; layer dimensions 512, 512, 512, 512, 1500; input contexts with sub-sampling: layer 1 [-2,2], layer 2 {-2,0,2}, layer 3 {-3,0,3}, layers 4 and 5 {0}.
2.4.1.2 Segment Level
At segment level, the network is a fully-connected feed-forward DNN with a pooled input. Figure 2.6 demonstrates the default setup: all frame-level outputs h_t (t = 1, ..., T) (layer 5 of the TDNN) are aggregated to compute the mean µ and standard deviation σ:

\mu = \frac{1}{T} \sum_{t}^{T} h_t    (2.3)

\sigma = \sqrt{\frac{1}{T} \sum_{t}^{T} h_t \odot h_t - \mu \odot \mu}    (2.4)

where \odot denotes element-wise multiplication.
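A minimal PyTorch sketch of this statistics pooling layer (equations 2.3 and 2.4) is shown below; the batch size, number of frames and the 1500-dimensional frame-level output are illustrative.

# Frame-level activations (batch, T, C) are reduced to a mean/std vector of size 2*C.
import torch

def statistics_pooling(h: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """h: (batch, T, C) frame-level outputs -> (batch, 2*C) segment-level statistics."""
    mean = h.mean(dim=1)                                    # eq. (2.3)
    var = (h * h).mean(dim=1) - mean * mean                 # E[h*h] - mean*mean
    std = torch.sqrt(var.clamp(min=eps))                    # eq. (2.4)
    return torch.cat([mean, std], dim=1)

pooled = statistics_pooling(torch.randn(8, 200, 1500))
print(pooled.shape)                                         # torch.Size([8, 3000])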
Both of those statistics are concatenated into one vector that represents the whole segment. This vector is then passed through 2 fully-connected layers, each of which has a rectified linear unit (ReLU) [61]. At last, the output layer (a log-softmax classifier) gives a probability distribution over all speakers in the training data.
FIGURE 2.6: Diagram of X-Vectors' segment-level DNN (as configured in [59]): statistics pooling (dim 3000) → layer 6 (dim 512) → layer 7 (dim 512) → output (dim = total number of speakers).
As mentioned earlier, X-Vectors are essentially bottleneck features of a DNN. In this network configuration, they could be the output of layer 6 or layer 7. However, the latter is selected since it has been proven experimentally to perform better in the speaker identification task [59].
2.4.2 ECAPA-TDNN Embeddings
In late 2020, Desplanques et al. proposed ECAPA-TDNN [62], an enhanced structure based on X-Vectors' network. The basic TDNN layers are replaced with 1D convolutional layers [63] and Res2Net-with-Squeeze-Excitation (SE-Res2Net) blocks [64][65][66], while the basic statistics pooling layer is replaced with an Attentive Statistical Pooling, which utilizes both frame-wise and channel-wise attention mechanisms [67]. The complete architecture of ECAPA-TDNN is visualized in figure 2.7.
FIGURE 2.7: Complete network architecture of ECAPA-TDNN (adapted from [62]): input (80 x T) → Conv1D (k=5, d=1, + ReLU + BatchNorm) → three SE-Res2Blocks (k=3, d=2/3/4) → Conv1D (k=1, d=1, + ReLU) over the concatenated block outputs (3C x T → 1536 x T) → Attentive Stat Pooling (+ BatchNorm, 2 x 1536) → Fully Connected (+ BatchNorm, 192) → Additive Angular Margin Softmax over the training speakers.
2.4.2.1 Frame-level
At frame level, the ECAPA-TDNN network consists of 1D convolutional layers (with ReLU and optional batch normalization) and 1D Squeeze-and-Excitation blocks. The network also utilizes residual connections at a high level to reduce the effect of vanishing gradients.
2.4.2.1.1 1D Convolutional Layer
In a 1D convolution layer, instead of sliding along two dimensions as in the well-known CNNs of image processing [68][69][70], a kernel of size k and dilation d slides along the time-frame dimension. Figure 2.8 demonstrates how a 1D convolutional (1D-Conv) kernel works, where:
• k denotes the kernel size.
• d denotes the dilation spacing (i.e., if d is larger than 1, the 1D-Conv layer is dilated).
• c denotes the number of channels (i.e., the number of extracted feature coefficients, e.g., 40 MFCCs).
FIGURE 2.8: Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {-4,0,4}.
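The following is a minimal PyTorch sketch of such a dilated 1D convolution; the channel sizes are illustrative, and the padding is chosen so that the number of output frames matches the input.

# With kernel size k=3 and dilation d=4, each output frame sees frames {t-4, t, t+4},
# i.e. a TDNN layer with context {-4, 0, 4}.
import torch
import torch.nn as nn

c_in, c_out, k, d = 40, 512, 3, 4          # e.g. 40 MFCC channels in, 512 channels out
conv = nn.Conv1d(c_in, c_out, kernel_size=k, dilation=d, padding=d * (k - 1) // 2)

x = torch.randn(8, c_in, 200)              # (batch, channels, T frames)
y = conv(x)                                # padding keeps the number of frames unchanged
print(y.shape)                             # torch.Size([8, 512, 200])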
2.4.2.1.2 1D Squeeze-and-Excitation Block
For computer vision tasks, Squeeze-and-Excitation blocks [66] have proven to be very effective at improving channel inter-dependencies at a low computational cost. In the ECAPA-TDNN architecture, this approach helps re-scale the frame-level features based on global properties of the signal sample. A 1D Squeeze-and-Excitation block consists of 3 components, as shown in figure 2.9 (a code sketch follows the list):
FIGURE 2.9: A 1D-Squeeze-and-Excitation block. Different colors represent different scales for channels.
• Squeeze operation, where the frame-wise mean vector z is calculated from the inputs:

z = \frac{1}{T} \sum_{t}^{T} h_t    (2.5)

• Excitation operation, where z is used to calculate a channel-wise scale vector s through two bottleneck fully-connected layers that generate an output of the same dimension as the input:

s = \sigma(W_2 f(W_1 z + b_1) + b_2)    (2.6)

where \sigma(\cdot) denotes the sigmoid function, f(\cdot) denotes a non-linearity (e.g., a ReLU), and W_k and b_k denote the learnable weight and bias of bottleneck fully-connected layer k.

• Scale operation, where the original input frames are scaled with s:

\tilde{h}_{t,c} = s_c h_{t,c}    (2.7)
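A minimal PyTorch sketch of the 1D Squeeze-and-Excitation block (equations 2.5 to 2.7) is given below; the channel and bottleneck dimensions are illustrative.

# Squeeze to a per-channel mean, excite through a bottleneck MLP with a sigmoid,
# then rescale every frame channel-wise.
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)   # W1, b1
        self.fc2 = nn.Linear(bottleneck, channels)   # W2, b2

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, C, T)
        z = h.mean(dim=2)                                         # eq. (2.5): frame-wise mean
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))      # eq. (2.6): channel scales
        return h * s.unsqueeze(2)                                 # eq. (2.7): rescaling

out = SEBlock1d(512)(torch.randn(8, 512, 200))
print(out.shape)                                                  # torch.Size([8, 512, 200])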
2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block
In 2019, Shang-Hua Gao et al. proposed Res2Net [65], a multi-scale backbone network for computer vision tasks, based on ResNet [64]. In ECAPA-TDNN, Res2Net is integrated with an SE-Block, forming a Res2Net-with-Squeeze-Excitation (SE-Res2Net) block, to benefit from residual connections (i.e., to reduce vanishing gradients) while keeping the number of parameters at a reasonable figure. Under this setup, the number of channel subsets corresponds to the number of intermediate 1D-Conv blocks, and thus may increase the number of parameters. The SE-Block is then used to amplify attention to channels while adding only a small number of parameters. Figure 2.10 visualizes an SE-Res2Net block.
2.4.2.2 Segment-level
At segment level, ECAPA-TDNN employs a soft multi-head self-attention [67] model to calculate weighted statistics at the pooling layer, which accounts for signal samples of varied lengths. The statistic outputs (weighted mean and weighted standard deviation) are then concatenated together. The resulting vector is propagated through a single fully-connected layer with batch normalization and then through the final layer, an Additive Angular Margin Softmax (AAM-Softmax) [71] layer. The final output is an N-dimensional vector, where N is the total number of speakers in the training set (i.e., the network classifies the N speakers in the training set).

FIGURE 2.10: A Res2Net-with-Squeeze-Excitation block: the C input channels are split into s subsets of C/s channels each, processed by intermediate Conv1D blocks with hierarchical residual connections, concatenated back to C channels, and passed through an SE block before the final residual connection.
2.4.2.2.1 Attentive Statistical Pooling
In 2019, Okabe et al. proposed using an attention mechanism to give different weights to different frames in the signal sample when calculating the weighted mean and weighted standard deviation at X-Vectors' pooling layer [72]. In ECAPA-TDNN, the attention mechanism is extended further: not only over time frames, but also over channels.
The raw scalar channel- and frame-wise score e_{t,c} and its normalized value \alpha_{t,c} are calculated as follows:

e_{t,c} = v_c^T f(W h_t + b) + k_c    (2.8)

\alpha_{t,c} = \frac{\exp(e_{t,c})}{\sum_{\tau}^{T} \exp(e_{\tau,c})}    (2.9)

where:
• h_t are the activations of the last frame-level layer at time step t.
• W \in R^{R \times C} and b \in R^{R \times 1} project the activations into a representation of smaller dimension R. This projection is shared across all C channels.
• f(\cdot) denotes a non-linearity.
• v_c \in R^{R \times 1} and k_c transform the output of the non-linearity f(\cdot) into a channel-dependent scalar score.

FIGURE 2.11: Attentive Statistics Pooling (on both time frames and channels).
The normalized score \alpha_{t,c} is then used to calculate the weighted mean and weighted standard deviation vectors as follows:

\tilde{\mu}_c = \sum_{t}^{T} \alpha_{t,c} h_{t,c}    (2.10)

\tilde{\sigma}_c = \sqrt{\sum_{t}^{T} \alpha_{t,c} h^2_{t,c} - \tilde{\mu}^2_c}    (2.11)

where \tilde{\mu}_c and \tilde{\sigma}_c are respectively the channel components of the weighted mean vector \tilde{\mu} and the weighted standard deviation vector \tilde{\sigma}.
Moreover, the temporal context of the pooling layer is expanded by making the self-attention look at global properties of the signal sample (e.g., to account for noise and recording conditions). The local input h_t of equation 2.8 is concatenated with the non-weighted mean and non-weighted standard deviation of h_t across time.
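A minimal PyTorch sketch of channel- and frame-wise attentive statistics pooling (equations 2.8 to 2.11) is shown below. The dimensions are illustrative, tanh is used as the non-linearity f(·), and the global-context concatenation described in the previous paragraph is omitted for brevity.

# Per-channel attention weights over frames produce a weighted mean/std vector.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, channels: int = 1536, bottleneck: int = 128):
        super().__init__()
        self.W = nn.Conv1d(channels, bottleneck, kernel_size=1)   # W, b    (eq. 2.8)
        self.v = nn.Conv1d(bottleneck, channels, kernel_size=1)   # v_c, k_c (eq. 2.8)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, C, T) frame-level activations
        e = self.v(torch.tanh(self.W(h)))                 # scores, eq. (2.8)
        alpha = torch.softmax(e, dim=2)                   # normalize over frames, eq. (2.9)
        mean = (alpha * h).sum(dim=2)                     # eq. (2.10)
        var = (alpha * h * h).sum(dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-10))            # eq. (2.11)
        return torch.cat([mean, std], dim=1)              # (batch, 2*C)

pooled = AttentiveStatsPooling()(torch.randn(4, 1536, 200))
print(pooled.shape)                                       # torch.Size([4, 3072])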
2.5 Clustering
2.5.1 PLDA Scoring
After generating the speaker representations for each segment, a clustering algorithm is applied to make clusters of segments. The distance or the similarity between each pair of observations can be computed using a wide variety of techniques, including Euclidean distance [73], mean squared difference [74], and cosine similarity [75].
Although the similarity metrics can be calculated directly from the pairs of extracted speaker embeddings, purely statistical data reduction techniques such as Linear Discriminant Analysis (LDA) can be employed to further improve the discriminative characteristics of these features at a little computational cost. The LDA transformation can be given as in the following equation:

z = W^T x    (2.12)
where:
• x = \{x_i\} \in R^D is the D-dimensional input vector.
• z = \{z_i\} \in R^{D'} (D' \le D) is the representation of the input vector in the new latent space.
• W \in R^{D \times D'} is a linear transformation matrix.
In this case, LDA is formulated as an optimization problem: find a linear transformation W that maximizes the ratio of the between-class scattering to the within-class scattering:

W = \arg\max_{W} J(W) = \arg\max_{W} \frac{\mathrm{trace}(W^T S_B W)}{\mathrm{trace}(W^T S_W W)}    (2.13)
where:
• S_W denotes the within-class scatter matrix S_W = \sum_{i=1}^{n} (x_i - \mu_{y_i})(x_i - \mu_{y_i})^T. Here \{y_i\} are the class labels and \mu_k is the sample mean of the k-th class. S_W is positive definite.
• S_B denotes the between-class scatter matrix S_B = \sum_{k=1}^{m} n_k (\mu_k - \mu)(\mu_k - \mu)^T. Here m is the number of classes, \mu is the overall sample mean, and n_k is the number of samples in the k-th class. S_B is positive semi-definite.
FIGURE 2.12: An example of LDA transformation from 2D to 1D (taken from [76]).
However, LDA is a deterministic algorithm that works well only on seen data. This is not ideal in the case of real speaker diarization applications, where enrolled or questioned speakers are not included in the training data set.
Therefore, a probabilistic version of LDA, namely Probabilistic Linear Discriminant Analysis (PLDA) [77][78], is employed to take advantage of LDA while dealing with unseen classes. Essentially, PLDA is formulated upon LDA by representing both the class means and the data within each class with separate distributions. The chosen distribution is usually a mixture of Gaussian distributions, i.e., a Gaussian Mixture Model (GMM) [41]. Let y be a latent class variable representing the mean of a class within the GMM; then the probability of generating a sample x given the class mean y, and the prior probability of the class mean y in the same space, are given by the following equations:
P(x | y) = N(x | y, S_W)    (2.14)

P(y) = N(y | m, S_B)    (2.15)
where:
• S_W and S_B respectively denote the within-class and between-class scatter matrices as seen in equation 2.13.
• N(x | y, S_W) is a multivariate Gaussian distribution with mean y and covariance S_W.
• N(y | m, S_B) is a multivariate Gaussian distribution with mean m and covariance S_B.
As proven in [77], S_W and S_B (denoted \Phi_w and \Phi_b) can be diagonalized as follows:

V^T \Phi_w V = I    (2.16)

V^T \Phi_b V = \Psi    (2.17)

and by defining A = V^{-T}, they can be rewritten as:

S_W = A I A^T    (2.18)

S_B = A \Psi A^T    (2.19)
Then, from equation 2.14:

P(x | y) = N(x | y, S_W) = N(x | m + A v, A A^T) = m + A * N(u | v, I)    (2.20)

and from equation 2.15:

P(y) = N(y | m, S_B) = N(y | m, A \Psi A^T) = m + A * N(v | 0, \Psi)    (2.21)
Let u and v be Gaussian random variables in the latent space:

u \sim N(\cdot \mid v, I)    (2.22)

v \sim N(\cdot \mid 0, \Psi)    (2.23)
then the relationship between x, y and the latent variables u, v can be written as follows:

y = m + A v    (2.24)

x = m + A u    (2.25)
The unknown parameters of PLDA are the mean m, the covariance matrix \Psi, and the loading matrix A. All of these parameters are learnable using the maximum likelihood method. Figure 2.13 demonstrates the training process of the PLDA model.
FIGURE 2.13: Fitting the parameters of the PLDA model (taken from [77]).
The PLDA score between two given vectors u_1, u_2 in the latent space is calculated by taking the log of the likelihood ratio R based on two hypotheses: either both of the vectors belong to the same class, or they do not. R is given as:

Score = \log(R(u_1, u_2)) = \log \frac{P(u_1, u_2)}{P(u_1) P(u_2)} = \log \frac{\int P(u_1 | v) P(u_2 | v) P(v) \, dv}{P(u_1) P(u_2)}    (2.26)
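For intuition, the same-class versus different-class likelihood ratio of equation 2.26 can be written in closed form for the Gaussian PLDA model, using the between-class covariance \Phi_b = A \Psi A^T and the within-class covariance \Phi_w = A A^T. The following numpy/scipy sketch uses random stand-in parameters; it is not the Kaldi PLDA implementation used later in the experiments.

# Two-covariance PLDA log-likelihood ratio for a pair of embeddings.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(u1, u2, m, phi_b, phi_w):
    """log P(u1, u2 | same class) - log P(u1, u2 | different classes)."""
    d = len(m)
    tot = phi_b + phi_w
    cov_same = np.block([[tot, phi_b], [phi_b, tot]])          # shared class variable couples the pair
    cov_diff = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    pair = np.concatenate([u1, u2])
    mean = np.concatenate([m, m])
    return (multivariate_normal.logpdf(pair, mean, cov_same)
            - multivariate_normal.logpdf(pair, mean, cov_diff))

# Toy 2-dimensional example with stand-in model parameters.
rng = np.random.default_rng(0)
m = np.zeros(2)
phi_b = np.array([[2.0, 0.3], [0.3, 1.5]])       # between-class covariance
phi_w = np.eye(2) * 0.5                          # within-class covariance
print(plda_llr(rng.normal(size=2), rng.normal(size=2), m, phi_b, phi_w))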
2.5.2 Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge these two most similar clusters. This iterative process continues until all the clusters are merged together (figure 2.15). The clustering process can be stopped once a defined number of clusters or a decision threshold is reached. The main output of hierarchical clustering is a dendrogram [79], which shows the hierarchical relationship between the clusters (figure 2.16).
FIGURE 2.14: Agglomerative hierarchical clustering flowchart.
FIGURE 2.15: An example iterative process of agglomerative hierarchical clustering (taken from [80]).
FIGURE 2.16: Visualization of the result of hierarchical clustering (taken from [80]).
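A minimal sketch of threshold-based agglomerative clustering over segment embeddings is given below, using scipy; the embeddings, the cosine metric and the 0.5 stopping threshold are illustrative stand-ins rather than the thesis configuration.

# Build a dendrogram over segment embeddings and cut it at a decision threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))                                  # 3 fake "speakers"
embeddings = np.vstack([c + 0.1 * rng.normal(size=(10, 16)) for c in centers])

Z = linkage(embeddings, method="average", metric="cosine")          # agglomerative merges
labels = fcluster(Z, t=0.5, criterion="distance")                   # stop at a distance threshold
print(len(set(labels)))                                             # number of clusters found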
Chapter 3
Experiments
3.1 Evaluation Metrics
3.1.1 Equal Error Rate and Minimum Decision Cost Function
Equal error rate or crossover error rate (EER or CER) is the rate at which both acceptance
and rejection errors are equal. In order to find the EER of a given system, an EER plot
is created through the following steps:
• Calculate the false acceptance rate (FAR, i.e., false positive speaker identifications, or false alarms) and the false rejection rate (FRR, i.e., false negative speaker identifications, or missed detections) for a set of decision thresholds t:

FAR_t = \frac{\text{Number of False Acceptances}_t}{\text{Number of Identification Attempts}}    (3.1)

FRR_t = \frac{\text{Number of False Rejections}_t}{\text{Number of Identification Attempts}}    (3.2)

• Plot FAR and FRR against the decision threshold t. The EER is the y-value of the intersection of those lines. An example of an EER plot is shown in figure 3.1.
As the decision threshold (i.e., the sensitivity) increases, the false alarms will drop while the missed detections will rise. In this case, the configured system is more secure, since it reduces the possibility of acceptance. Conversely, when the decision threshold is lowered, the system is less secure against impostors.
FIGURE 3.1: An EER plot.
EER is originally used as the main evaluation metric in a speaker identification system. In a traditional speaker diarization pipeline, it is also used to effectively estimate the discriminative characteristics of speaker embedding extraction techniques, due to the fact that this metric is not affected by other modules such as clustering or resegmentation. The lower the EER, the better the system performs.
An important metric usually coupled with EER is the Minimum Decision Cost Function (MinDCF), representing the minimum value of a linear combination of the false alarm and missed detection rates over different thresholds:

MinDCF = \min_{t} \left( (1 - p) \cdot FAR_t + p \cdot FRR_t \right)    (3.3)
where:
• t is the decision threshold.
• FAR_t and FRR_t are calculated as in equations 3.1 and 3.2.
• p is the prior probability of the enrolled entity. Common values for p are 0.01 and 0.001.
The lower EER and MinDCF, the better a speaker verification system performs.
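A minimal numpy sketch of how EER and MinDCF (equation 3.3) can be computed from a list of verification scores is shown below; the scores and labels are random stand-ins for PLDA scores on trial pairs.

# Sweep the decision threshold over observed scores to find EER and MinDCF.
import numpy as np

def eer_mindcf(scores, labels, p_target=0.01):
    """labels: 1 for target (same-speaker) trials, 0 for nontarget trials."""
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])  # false alarms
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])   # misses
    i = np.argmin(np.abs(far - frr))                        # threshold where FAR ~= FRR
    eer = (far[i] + frr[i]) / 2
    mindcf = np.min((1 - p_target) * far + p_target * frr)  # eq. (3.3)
    return eer, mindcf, thresholds[i]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=labels * 2.0, scale=1.0)            # targets score higher on average
print(eer_mindcf(scores, labels))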
3.1.2 Diarization Error Rate
Diarization Error Rate (DER) is the most widely used metric for speaker diarization. It is measured as the fraction of time that is not attributed correctly to a speaker or to non-speech, calculated as in equation 3.4:

DER = \frac{T_{FalseAlarm} + T_{Miss} + T_{Confusion}}{T_{Scored}}    (3.4)

where:
• T_{Scored} is the total duration of the recording without overlapped speech.
• T_{FalseAlarm} is the scored time for which a hypothesized speaker is labelled as non-speech in the reference.
• T_{Miss} is the scored time for which a hypothesized non-speech segment corresponds to a reference speaker segment.
• T_{Confusion} is the scored time for which a speaker ID is assigned to the wrong speaker.
The lower the DER, the better a speaker diarization system performs.
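For intuition, the following sketch computes equation 3.4 from frame-level reference and hypothesis labels, assuming the hypothesis speaker IDs have already been mapped to the reference IDs and ignoring collars and overlapped speech (real DER scoring tools handle these details).

# Frame-level DER; labels are speaker IDs, None means non-speech.
def der(reference, hypothesis):
    assert len(reference) == len(hypothesis)
    scored = sum(1 for r in reference if r is not None)                 # T_Scored (frames)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if r is None and h is not None)
    miss = sum(1 for r, h in zip(reference, hypothesis) if r is not None and h is None)
    confusion = sum(1 for r, h in zip(reference, hypothesis)
                    if r is not None and h is not None and r != h)
    return (false_alarm + miss + confusion) / scored                    # eq. (3.4)

ref = ["A"] * 100 + [None] * 20 + ["B"] * 80      # reference labels, one per 10 ms frame
hyp = ["A"] * 90 + ["B"] * 30 + ["B"] * 80        # hypothesis labels (already mapped)
print(f"DER = {der(ref, hyp):.3f}")               # (20 + 0 + 10) / 180 ~= 0.167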
3.2 Frameworks
3.2.1 Kaldi
Kaldi is a toolkit originally written in C++, Perl, Shell and Python for speech recognition,
speaker recognition, and many others. Kaldi used to be the framework of choice among
speech processing researchers. In addition to the flexibility and high performance, Kaldi
is enriched by many reproducible state-of-the-art recipes from researchers around the
world.
Some noteworthy features of Kaldi include:
• Code-level integration with Finite State Transducers (FSTs).
• Extensive linear algebra support: Both BLAS and LAPACK are supported
• Extensible design: The algorithms are provided in the most generic form possible.
• Open license: The code is licensed under Apache 2.0, which is one of the least
restrictive licenses available.
• Complete recipes: Recipes for building a complete speech recognition system are included. These work with widely available datasets such as those provided by the Linguistic Data Consortium (LDC).
FIGURE 3.2: Kaldi logo.
FIGURE 3.3: Kaldi general architecture diagram.
3.2.2 SpeechBrain
SpeechBrain [81] is an open-source, all-in-one conversational AI toolkit based on PyTorch. Its main purpose is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. SpeechBrain provides the implementation and experimental validation of both recent and long-established speech processing models, with state-of-the-art or competitive performance on a variety of tasks (table 3.1).
TABLE 3.1: List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
3.2.3 Kal-Star
Kal-Star [82], developed by the author while working at Viettel Cyberspace Center, is a Shell/Python library wrapping around Kaldi and SpeechBrain to enhance the data pre-processing and training processes. Kal-Star provides a wide variety of tools to prepare, train and test data, mostly for the speaker verification and speaker diarization tasks (figure 1.4 and figure 1.5). Kal-Star inherits the file-based data indexing from Kaldi, which views a given data set as a folder of spk2utt, utt2spk, wav.scp and (if the data set is segmented with a VAD) segments files. Further information can be added to the folder later (e.g., the diarization result RTTM is added once the speaker diarization process is done). Figures 3.4 and 3.5 show two example Kal-Star operations on the VTR_1350 data set.
FIGURE 3.4: Filtering VTR_1350 data set by utterances' durations and number of utterances per speaker.
FIGURE 3.5: Generating 200 5-way conversations from VTR_1350 data set. The min. and max. numbers of utterances picked for each conversation are 2 and 30 respectively.
3.3 Data Sets
In this thesis, three main Vietnamese data sets are used:
• IPCC_110000: split into training and testing sets. The test split is then used directly for the speaker verification task, and used to generate mock conversations for the speaker diarization task.
• VTR_1350 and ZALO_400: used directly for the speaker verification task, and also used to generate mock conversations for the speaker diarization task. These data sets are not used in training.
3.3.1 IPCC_110000
The IPCC_110000 data set consists of 1046.37 hours of audio in a telephone environment from approximately 110000 Vietnamese speakers. Data are recorded at the Viettel Customer Service IP Contact Center (IPCC) and sampled at 8000 Hz. Most recorded utterances are from 2 to 6 seconds in length, while each speaker has from 1 to 10 utterances, making up from 10 to 60 seconds of speech. The spoken topics revolve around technical difficulties that Viettel's customers met in using mobile and internet services, as well as various questions about common knowledge, the weather, sports results, or lottery results. Table 3.2 gives an overview of this data set and figure 3.6 demonstrates how the data is distributed.
TABLE 3.2: IPCC_110000 data set overview.
  Data set: IPCC_110000
  Base sample rate: 8000 Hz
  Environment: Telephone
  # Speakers: 112837
  # Utterances: 919608
  Total duration: 1046.4 hours
FIGURE 3.6: IPCC_110000 data distributions.
Each speaker and agent has his/her own recording channel, and it is assumed that each recording has only one speaker. In reality, a telephone conversation between a customer and an agent can be interfered with by other customers or other agents joining the conversation. The latter case happens much less often than the former, since the IPCC is designed in such a way that agents have good sound isolation, and even if the customer's issue is passed to other agents, these agents would have their own recording channels. As for the case where more than one customer joins the conversation, it was tested that in 1000 randomly chosen conversations, only about 12 of them have more than one speaker (e.g., an infant interrupting his parent having a call with an IPCC agent). The ratio is only about 1.2 percent, and thus, this case has only a small negative effect on the discriminative characteristics of the trained embedding extractor.
IPCC_110000 is randomly split into 3 subsets: train, dev, and test sets. Each of the test and dev sets has 2000 speakers and about 20 hours of data. The train set contains the remaining data. Table 3.3 displays the number of speakers, number of utterances and total duration of each subset.
Split train dev test
# Speakers 108837 2000 2000
# Utterances 886172 16975 16461
Total duration (hours) 1005.1 20.94 20.39
TABLE 3.3: IPCC_110000 data subsets.
3.3.1.1 IPCC_110000 Verification Test Set
Generated from the IPCC_110000 test split, the verification test set has 3888 verification pairs, generated by the following steps:
• Step 1: With K = 3, for each speaker randomly pick K ∗ 2 utterances from this speaker's utterances pool to generate K verification pairs. The labels are all "target", meaning utterances in each pair are from the same speaker.
• Step 2: Randomly pick K = 3 other utterances from the utterances pool of the selected speaker at step 1 (if there are fewer than K utterances left in the pool, discard all picked verification pairs and skip to step 3), and K = 3 other utterances from the utterances pools of all other speakers to generate K more verification pairs. The labels are all "nontarget", meaning utterances in each pair are not from the same speaker.
An example result after step 2:
IPCC-131047_hanoi-3166 IPCC-131047_hanoi-1961 target
IPCC-131047_hanoi-1319 IPCC-131047_hanoi-3657 target
IPCC-131047_hanoi-2303 IPCC-131047_hanoi-1015 target
IPCC-131047_hanoi-2626 IPCC-203322_hanoi-2214 nontarget
IPCC-131047_hanoi-0582 IPCC-203268_hanoi-3113 nontarget
IPCC-131047_hanoi-1679 IPCC-203268_hanoi-5260 nontarget
• Step 3: Go back to step 1, until all speakers are considered for enrollment.
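As an illustration only, the following Python sketch implements the pair-generation procedure above; utt_pools maps each speaker ID to a list of its utterance IDs, and the function and variable names are hypothetical rather than part of Kal-Star:

    import random

    def generate_verification_pairs(utt_pools, K=3):
        # utt_pools: dict mapping speaker_id -> list of utterance_ids (hypothetical structure).
        pairs = []
        speakers = list(utt_pools.keys())
        for spk in speakers:                               # step 3: consider every speaker
            pool = list(utt_pools[spk])
            random.shuffle(pool)
            if len(pool) < K * 2:
                continue
            picked = [pool.pop() for _ in range(K * 2)]    # step 1: K "target" pairs
            targets = [(picked[2 * i], picked[2 * i + 1], "target") for i in range(K)]
            if len(pool) < K:                              # step 2: not enough utterances left -> discard
                continue
            enrolled = [pool.pop() for _ in range(K)]
            others = [u for s in speakers if s != spk for u in utt_pools[s]]
            imposters = random.sample(others, K)           # K utterances from all other speakers
            nontargets = list(zip(enrolled, imposters, ["nontarget"] * K))
            pairs.extend(targets + nontargets)
        return pairs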
3.3.1.2 IPCC_110000 Diarization Test Set
Since original conversations from IPCC_110000 are shuffled by channels due to internal policies that protect customer privacy, mock conversations generated from utterances of distinguished speakers are used instead. Furthermore, to expand the scope of diarization beyond 2-way conversations to conversations with more speakers, mock 3-way, 4-way, and 5-way conversations are also generated. From the test split of IPCC_110000, the following data sets are generated:
• IPCC_110000-TEST-MOCK_200_2: 200 2-way conversations.
• IPCC_110000-TEST-MOCK_200_3: 200 3-way conversations.
• IPCC_110000-TEST-MOCK_200_4: 200 4-way conversations.
• IPCC_110000-TEST-MOCK_200_5: 200 5-way conversations.
Each of the N conversations in each subset is generated by the following steps:
• Step 1: Choose a number of speakers S.
• Step 2: Randomly pick S speakers from the speakers pool of the whole data set.
• Step 3: With each picked speaker sj, randomly pick uj (where uj is a random integer in range [u_min, u_max]) utterances from the speaker's utterances pool. If there are not enough utterances in the speaker's pool, then discard the picked utterances and go back to step 2.
• Step 4: Go back to step 2 until utterances are picked from all S speakers.
• Step 5: Shuffle the list of picked utterances, then concatenate the utterances (with in-between silences of duration randomly chosen from 0.2 to 1.0 seconds) into a single audio file. The conversation ID, oracle timestamps and distinguished speaker information are bundled into an accompanying RTTM file.
The four mock conversation subsets generated from the IPCC_110000 test split mentioned above use N = 200, S ∈ {2, 3, 4, 5}, u_min = 2 and u_max = 20. A minimal sketch of this generation procedure is given below.
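The sketch assumes numpy and soundfile for audio I/O (the actual Kal-Star implementation may differ), and all function and variable names are hypothetical:

    import random
    import numpy as np
    import soundfile as sf

    def make_mock_conversation(conv_id, utt_pools, wav_paths, S=3,
                               u_min=2, u_max=20, sr=8000, out_wav="mock.wav"):
        # Steps 2-5: pick S speakers, pick utterances, shuffle, concatenate with silences.
        speakers = random.sample(list(utt_pools.keys()), S)          # step 2
        picked = []                                                  # (speaker, utterance) tuples
        for spk in speakers:                                         # step 3
            n = random.randint(u_min, u_max)
            if len(utt_pools[spk]) < n:
                return None                                          # discard and retry upstream
            picked += [(spk, u) for u in random.sample(utt_pools[spk], n)]
        random.shuffle(picked)                                       # step 5
        audio, rttm, t = [], [], 0.0
        for spk, utt in picked:
            sig, _ = sf.read(wav_paths[utt])
            dur = len(sig) / sr
            rttm.append(f"SPEAKER {conv_id} 1 {t:.2f} {dur:.2f} <NA> <NA> {spk} <NA> <NA>")
            sil = np.zeros(int(random.uniform(0.2, 1.0) * sr))       # in-between silence
            audio += [sig, sil]
            t += dur + len(sil) / sr
        sf.write(out_wav, np.concatenate(audio), sr)
        return rttm                                                  # oracle RTTM lines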
3.3.2 VTR_1350
VTR_1350 is a data set consisting of 491.1 hours of wide-band recordings, originally sampled at 16000 Hz, recorded by a selected group of 1346 broadcasters using written transcripts (as shown in Table 3.4). The recording environment is significantly less noisy than IPCC_110000's, and thus it can be considered a clean set. Most recorded utterances are from 3 to 10 seconds in length, while each speaker has from 200 to 300 utterances, making up from 10 to 33 minutes of speech. The topics are daily news topics, from politics and health care to sports. Although the number of speakers is much smaller than IPCC_110000's, the content of the recordings is much more diverse and the durations are much longer. However, VTR_1350 is significantly less natural than a normal conversation compared with IPCC_110000, due to the mentioned fact that it is recorded with planned transcripts in a controlled environment. Table 3.4 gives an overview of this data set and Figure 3.7 demonstrates how the data is distributed.
Data set VTR_1350
Base sample rate 16000 Hz
Environment Controlled recording
# Speakers 1346
# Utterances 318599
Total duration 491.1 hours
TABLE 3.4: VTR_1350 data set overview.
VTR_1350 is down-sampled to 8000 Hz and used exclusively for testing in both the speaker verification and speaker diarization tasks.
FIGURE 3.7: VTR_1350 data distributions.
3.3.2.1 VTR_1350 Verification Test Set
From the VTR_1350 utterance list, a verification list of 7902 pairs is generated using the same method described in 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
VTR_1350-yenvth4-071303 VTR_1350-yenvth4-049030 target
VTR_1350-yenvth4-047494 VTR_1350-yenvth4-053126 target
VTR_1350-yenvth4-067463 VTR_1350-yenvth4-001415 target
VTR_1350-yenvth4-050310 VTR_1350-119706-073293 nontarget
VTR_1350-yenvth4-105862 VTR_1350-oanhdtv-029068 nontarget
VTR_1350-yenvth4-001159 VTR_1350-tungdhd-076747 nontarget
3.3.2.2 VTR_1350 Diarization Test Set
VTR_1350 is a data set of 1-way conversations. Hence, to make use of this data set in the speaker diarization task, four mock conversation subsets are generated from the data set itself, based on the method and configuration described in 3.3.1.2:
• VTR_1350-MOCK_200_2: 200 2-way conversations.
• VTR_1350-MOCK_200_3: 200 3-way conversations.
• VTR_1350-MOCK_200_4: 200 4-way conversations.
• VTR_1350-MOCK_200_5: 200 5-way conversations.
3.3.3 ZALO_400
ZALO_400, issued by the ZALO AI Challenge 2020 [83], is a wide-band data set consisting of 8.7 hours of recordings, sampled at 48000 Hz, recorded by a selected group of 400 broadcasters using planned transcripts. The content and recording environment quite resemble those of VTR_1350, while the data distributions are quite different from VTR_1350's. Most utterances are from 4 to 12 seconds, while most speakers have from 15 to 40 utterances, making up from 1 to 2 minutes of speech. ZALO_400 was originally released as the train data set for the challenge. However, within the scope of this thesis, it is used exclusively for testing both the speaker verification and speaker diarization tasks. Table 3.5 gives an overview of this data set and Figure 3.8 demonstrates how the data is distributed.
Data set ZALO_400
Base sample rate 48000 Hz
Environment Controlled recording
# Speakers 400
# Utterances 10555
Total duration 8.699 hours
TABLE 3.5: ZALO_400 data set overview.
FIGURE 3.8: ZALO_400 data distributions.
3.3.3.1 ZALO_400 Verification Test Set
The verification set is generated by the same method described in section 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
424-64 424-35 target
424-46 424-45 target
424-31 424-36 target
424-49 518 nontarget
424-39 500-12 nontarget
424-15 514-30 nontarget
3.3.3.2 ZALO_400 Diarization Test Set
The diarization test set consists of four mock conversation subsets generated from the ZALO_400 data set, including:
• ZALO_400-MOCK_200_2: 200 2-way conversations.
• ZALO_400-MOCK_200_3: 200 3-way conversations.
• ZALO_400-MOCK_200_4: 200 4-way conversations.
• ZALO_400-MOCK_200_5: 200 5-way conversations.
These conversations are generated by the same method and configuration described in section 3.3.1.2.
3.4 Baseline System
The baseline system consists of three main phases: embeddings extractor training, PLDA backend training and speaker diarization. In addition to these phases, an auxiliary phase (marked with an asterisk in the diagrams), speaker verification, is included to further optimize the speaker diarization results.
3.4.1 Speaker Diarization System
The speaker diarization subsystem takes a recorded conversation as input and lets the user know the number of speakers and the timestamps of their speech within the conversation. Without further recognition, all speakers remain anonymous. The result of this subsystem can be encapsulated into a Rich Transcription Time Marked (RTTM) file [1]. The main computation pipeline, which takes a recorded conversation as input and gives an RTTM file as output, can be described in the following stages, in order:
• Stage 1 - Front-end Processing: The input recording is windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-dimensional MFCCs.
• Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from the input recording. In this process, the recording is sliced into uniform non-overlapping small chunks of 0.03 seconds and WebRTC VAD decides whether each chunk is speech or not. The threshold of this decision, called "aggressiveness", is set to its maximum level, level 4. After that, adjacent chunks recognized as speech are grouped together into a bigger speech segment. If the maximum PCM amplitude of a newly formed speech segment is smaller than 0.05, the segment is discarded. Furthermore, speech segments that are shorter than 0.2 seconds are also discarded. Finally, each speech segment is padded with 0.05 seconds at its head and at its tail; and if two consecutive speech segments overlap each other and the merged segment of those two is shorter than 15 seconds, they are merged together.
• Stage 3 - Uniform Segmentation: Each speech partition is further sub-segmented into homogeneous overlapping sub-segments of length L seconds with an overlap of L/2 seconds. The value of L is chosen between 1.5 and 4 seconds. {L : L/2} denotes this segmenting strategy (see the sketch after this list). The extracted features from stage 2 are mapped onto these sub-segments for the next stage.
• Stage 4 - Embeddings Extraction: An embeddings extractor is employed to extract embeddings from each sub-segment produced in the last stage. In the baseline system, the extractor is an X-Vectors embeddings extractor, trained on IPCC_110000 data through 3 epochs. The training process is executed using Kaldi's VoxCeleb recipe [84][59]. The data augmentation strategies are slightly changed to the following:
– Additive Noises: adding a random noise sequence to the input signal with a signal-to-noise ratio (SNR) [85] randomly chosen between 0 and 15.
– Reverberation: convolving simulated room impulse responses (RIR) [86] with the input signal.
– Additive Noises and Reverberation: combining the additive noises and reverberation augmentation on a single input signal.
– Speed Perturbation: speeding up and slowing down the input signal. To avoid changing the speaker characteristics too much, speed perturbation is restricted to a maximum of ±5%.
– Waveform dropout: replacing random chunks of the input waveform with zeros.
– Frequency dropout: filtering the input signal with random band-stop filters to add zeros in the frequency spectrum.
The output of this stage is one 128-dimensional embedding vector for each sub-segment.
• Stage 5 - PLDA Scoring: The PLDA back-end, trained on embeddings extracted with the same embeddings extractor from the last stage, is used to score the similarity (a single float value) between every pair of sub-segment embeddings, forming an affinity scoring matrix.
• Stage 6 - Agglomerative Hierarchical Clustering: Sub-segment embeddings are clustered using the Agglomerative Hierarchical Clustering (AHC) method, in which the distance function directly takes results from the affinity scoring matrix generated in the last stage (see the sketch after this list). This method supports clustering with either a known or an unknown number of speakers; the latter case requires a pre-defined decision threshold. At the end of this stage, each sub-segment embedding vector, or the sub-segment itself, is tagged with a number from 1 to K - the number of distinguished speakers in the audio input. Each sub-segment's begin time and duration, along with the speaker tag, are all recorded in the output Rich Transcription Time Marked (RTTM) file [1].
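To make stages 3 and 6 more concrete, the Python sketch below (an illustration using scipy, not the exact Kaldi-based implementation used in this thesis) performs {L : L/2} uniform sub-segmentation of one speech partition and agglomerative hierarchical clustering over a precomputed PLDA similarity matrix:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def uniform_subsegments(start, end, L=3.0):
        # Stage 3: split a speech partition [start, end] into {L : L/2} windows.
        subs, t = [], start
        while t + L <= end:
            subs.append((t, t + L))
            t += L / 2
        if t < end:                                  # keep the trailing remainder as a shorter window
            subs.append((t, end))
        return subs

    def ahc_from_similarity(sim, num_speakers=None, dist_threshold=None):
        # Stage 6: AHC on an (N x N) similarity matrix (higher = more similar).
        dist = sim.max() - sim                       # convert similarities into distances
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")
        if num_speakers is not None:                 # clustering with a known speaker count
            return fcluster(Z, t=num_speakers, criterion="maxclust")
        return fcluster(Z, t=dist_threshold, criterion="distance")  # pre-defined threshold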
3.4.2 Speaker Verification System
The main purpose of the implemented speaker verification system is to evaluate the
discriminative characteristic of extracted embeddings, without being affected by other
modules in a traditional speaker diarization system. In other words, the use of this
system is to improve and optimize the baseline diarization system in terms of speaker
representations.
From a given data set with speaker information, one can generate a verification list - a list of utterance pairs with ground truth values that tell whether each of those pairs is from the same speaker or not. A crucial assumption is that in the enrolled or questioned utterances, there is only one speaker participating.
FIGURE 3.9: Baseline speaker diarization system diagram (Phase 1: embeddings extractor training; Phase 2: PLDA backend training; Phase 3: speaker diarization with WebRTC VAD, uniform segmentation, X-Vectors extraction, PLDA scoring and agglomerative hierarchical clustering; training data: IPCC_100K train + dev).
The speaker verification system takes the verification list as input and gives a scoring list as output, which provides a similarity value for each pair in the verification list. The higher the similarity, the higher the chance that the two utterances are from the same speaker. Then, by choosing a decision threshold, which is the minimum value that the similarity needs to reach for the pair of utterances to be determined as coming from the same speaker, one can generate a list of predictions. By comparing the verification list with the prediction list, binary classification metrics [87] including the False Positive Rate (FPR) and the False Negative Rate (FNR) are calculated.
FIGURE 3.10: Baseline speaker verification system diagram (Phase *: speaker verification for optimization, reusing the X-Vectors embeddings extractor from Phase 1 and the PLDA backend from Phase 2 to score each verification pair and compute the EER and EER threshold).
By repeatedly choosing all similarity values in the scoring list as decision thresholds, a plot of the rates (FPR and FNR) against the decision threshold can be made. The intersection of the FPR and FNR lines on the plot is the Equal Error Rate (EER) (as its name suggests, it is where the rates of the two errors are equal). Another important metric, usually reported together with the EER, is the minimum detection cost function (MinDCF). The detailed computation methods for these metrics are described in section 3.1.1.
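As a minimal numpy sketch of this procedure (an illustration, not the exact evaluation script used in this work), the EER can be estimated directly from the lists of target and nontarget scores:

    import numpy as np

    def compute_eer(target_scores, nontarget_scores):
        # Sweep every score as a decision threshold and return the EER and its threshold.
        scores = np.concatenate([target_scores, nontarget_scores])
        labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
        order = np.argsort(scores)
        scores, labels = scores[order], labels[order]
        # At the threshold just above scores[i]: pairs up to i are rejected, the rest accepted.
        fnr = np.cumsum(labels) / labels.sum()                      # misses among target pairs
        fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()      # false alarms among nontargets
        idx = np.argmin(np.abs(fnr - fpr))
        return (fnr[idx] + fpr[idx]) / 2.0, scores[idx]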
The main computation pipeline, which takes a pair of utterances in the verification list
and gives their similarity, is divided into the following stages, in order:
• Stage 1 - Front-end Processing: The enrolled and the questioned utterance are
windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-
dimensional MFCCs.
• Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from the enrolled and the questioned utterances. The working configuration of WebRTC VAD is the same as the configuration described in the voice activity detection stage in section 3.4.1 (a small usage sketch follows this list).
• Stage 3 - Embeddings Extraction: The speech partitions extracted from each utterance in stage 2 are concatenated together and an embeddings extractor is employed to extract speaker embeddings. It is the same extractor used in section 3.4.1. The output of this stage is two 128-dimensional embedding vectors that respectively represent the enrolled and the questioned utterances.
• Stage 4 - PLDA Scoring: The PLDA back-end that was trained on embeddings
extracted using the same embeddings extractor from the last step is used to score the
similarity (a single float value) between the enrolled and the questioned utterances.
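The WebRTC VAD step used in both pipelines can be reproduced approximately with the Python webrtcvad wrapper; note that this wrapper exposes aggressiveness modes 0-3 (3 being the most aggressive), and the file path below is hypothetical:

    import webrtcvad
    import soundfile as sf

    def webrtc_speech_chunks(wav_path, frame_ms=30, mode=3):
        # Mark each 30 ms chunk of an 8 kHz mono 16-bit recording as speech / non-speech.
        vad = webrtcvad.Vad(mode)
        signal, sr = sf.read(wav_path, dtype="int16")
        frame_len = int(sr * frame_ms / 1000)
        decisions = []
        for i in range(0, len(signal) - frame_len + 1, frame_len):
            frame = signal[i:i + frame_len].tobytes()
            decisions.append(vad.is_speech(frame, sr))
        return decisions                              # adjacent True chunks are later merged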
3.5 Proposed System
In recent years, ECAPA-TDNN, a development over X-Vectors' neural network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. Tables 3.6 and 3.7 report how ECAPA-TDNN outperforms a strong X-Vector baseline system, as experimented in [62], in both the speaker verification task and the speaker diarization task on English corpora.
TABLE 3.6: EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
TABLE 3.7: Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal for baseline and proposed systems (taken from [88]).
In the proposed system, the X-Vectors-based extractor is replaced with an ECAPA-TDNN-based extractor, and the PLDA backend is trained with ECAPA-TDNN embeddings instead. The employed ECAPA-TDNN embeddings extractor is trained on the same data set and with the same data augmentation strategies as the X-Vectors embeddings extractor. The network architecture is kept as described in section 2.4.2. The number of MFCCs taken is reduced from 80 down to 40, the minimum learning rate is lowered by a factor of 10, and the number of epochs is doubled from 10 to 20. Figures 3.11 and 3.12 visualize the proposed system.
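For illustration, SpeechBrain exposes a pretrained ECAPA-TDNN speaker encoder of essentially the same architecture; the snippet below uses the public VoxCeleb checkpoint rather than the IPCC_110000-trained model of this thesis, and the input file name is hypothetical:

    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    # Public ECAPA-TDNN checkpoint (VoxCeleb); the thesis model is trained on IPCC_110000 instead.
    encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb",
                                             savedir="pretrained_ecapa")

    signal, sr = torchaudio.load("example_utterance.wav")    # hypothetical input file
    embedding = encoder.encode_batch(signal)                  # shape: [1, 1, 192]
    print(embedding.squeeze().shape)                          # 192-dimensional speaker embedding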
FIGURE 3.11: Proposed speaker diarization system diagram (identical to the baseline pipeline of Figure 3.9, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
FIGURE 3.12: Proposed speaker verification system diagram (identical to the baseline pipeline of Figure 3.10, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
Chapter 4
Results
4.1 Speaker Verification Task
In this task, both the baseline and proposed speaker verification sub-systems are tested with different PLDA dimension reduction configurations. The dimension reduction ratios r_i are taken from the set {0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1.00}. The corresponding reduced, or target, dimension D_i is calculated by equation 4.1, where V is the original embedding dimension, which is 128 in the case of the baseline system and 192 in the case of the proposed system. The PLDA backend is trained on the same training data set as the embeddings extractor.
D_i = 4 · ⌊ (r_i · V) / 4 ⌋                                                  (4.1)
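For example, with r_i = 0.90 and V = 192 (the proposed system), equation 4.1 gives D_i = 4 · ⌊0.90 · 192 / 4⌋ = 4 · ⌊43.2⌋ = 4 · 43 = 172, matching the target dimension listed for the 0.90 ratio in Table 4.1.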
As reported in Table 4.1, the proposed system with the ECAPA-TDNN architecture outperforms the baseline system regarding both EER and MinDCF performance in all test
cases. In the tests with IPCC_110000 test split, the proposed system gives 64.5% relative
improvement in EER, with corresponding 82.4% and 86.5% relative improvements on
MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements on MinDCF
are smaller in the tests with VTR_1350, where the proposed system gives 66.1% relative improvement in EER, with corresponding 64.7% and 63.1% relative improvements on MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements given by the proposed system in the tests with ZALO_400 are smaller than in both of the mentioned tests, but still significant: it gives 45.6% relative improvement in EER, with corresponding 22.5% and 22.5% relative improvements on MinDCF(p=0.01) and MinDCF(p=0.001) respectively.
Furthermore, both systems show a consistent degradation in terms of equal error rate (EER) as the target dimension is reduced further, with only two exceptions. The first exception occurs in the tests with the IPCC_110000 test split: the EER of the proposed system falls from 1.44% down to 1.39% and then rises to 1.54%, when the target dimension falls from 192 to 180, and then to 172. The second exception occurs in the tests with the ZALO_400 data set, where the EER of the proposed system falls from 8.08% down to 7.91% and then rises to 8.16%, when the target dimension experiences the same changes mentioned in the first exception. However, in both of these exceptions the swings are insignificant and do not affect the trend of the EER. As for MinDCF, this metric does not show a clear trend against the reduction of embedding dimensions.
IPCC_110000 (test split)
(8000Hz, K=3, # Trials=3888)
X-Vector
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 128 0.6240 0.8112 3.91
0.95 120 0.6183 0.8117 3.96
0.90 112 0.6183 0.8066 4.01
0.85 108 0.6163 0.8102 4.17
0.80 100 0.6317 0.8050 4.27
0.70 88 0.6497 0.8020 4.32
0.60 76 0.6445 0.8138 4.53
0.50 64 0.6533 0.7917 4.78
ECAPA-TDNN
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 192 0.1024 0.1085 1.44
0.95 180 0.1065 0.1080 1.44
0.90 172 0.1096 0.1096 1.39
0.85 160 0.0998 0.0998 1.54
0.80 152 0.1070 0.1070 1.54
0.70 132 0.0983 0.0983 1.65
0.60 112 0.1101 0.1101 1.85
0.50 96 0.1240 0.1240 2.01
VTR_1350
(8000Hz (resampled), K=3, # Trials=7902)
X-Vector
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 128 0.7588 0.8651 9.69
0.95 120 0.7603 0.8641 9.82
0.90 112 0.7472 0.8659 9.87
0.85 108 0.7325 0.8669 9.92
0.80 100 0.7327 0.8502 10.02
0.70 88 0.7423 0.8322 10.10
0.60 76 0.7261 0.8461 10.30
0.50 64 0.7459 0.8494 10.48
ECAPA-TDNN
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 192 0.2680 0.3192 3.29
0.95 180 0.2721 0.3680 3.49
0.90 172 0.2797 0.3936 3.54
0.85 160 0.3055 0.4052 3.59
0.80 152 0.2971 0.3941 3.62
0.70 132 0.3214 0.4432 3.70
0.60 112 0.3774 0.4450 3.77
0.50 96 0.3991 0.4938 3.77
ZALO_400
(8000Hz (resampled), K=3, # Trials=2376)
X-Vector
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 128 0.9470 0.9470 14.39
0.95 120 0.9562 0.9562 14.56
0.90 112 0.9444 0.9444 14.65
0.85 108 0.9444 0.9444 14.73
0.80 100 0.9402 0.9402 14.90
0.70 88 0.9478 0.9478 14.90
0.60 76 0.9621 0.9621 15.24
0.50 64 0.9739 0.9739 14.65
ECAPA-TDNN
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 192 0.7340 0.7340 7.83
0.95 180 0.7424 0.7424 8.08
0.90 172 0.7306 0.7306 7.91
0.85 160 0.7214 0.7214 8.16
0.80 152 0.6987 0.6987 8.16
0.70 132 0.7079 0.7079 8.67
0.60 112 0.7374 0.7374 9.34
0.50 96 0.7332 0.7332 9.93
TABLE 4.1: EER and MinDCF performance.
4.2 Speaker Diarization Task
In this task, the whole baseline and proposed systems are tested with mock conversations consisting of different numbers of engaging speakers, with different uniform sub-segmenting configurations. In this test, oracle VAD (i.e. ground-truth VAD) is used to diminish the effect of any voice activity detection module, PLDA scoring is carried out without dimension reduction, and the exact number of engaging speakers in each conversation is known before the clustering process. Results are reported in Table 4.2, where {x : y} represents a uniform segmentation configuration of windows of length x seconds with y seconds of overlap.
FIGURE 4.1: A speaker diarization output of a 3-way conversation in the VTR_1350 test set.
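For reference, the DER of a single conversation can be computed from reference and hypothesis annotations with pyannote.metrics; the snippet below is purely an illustration (the segment boundaries and labels are hypothetical, and this is not necessarily the evaluation tooling used in this thesis):

    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    reference = Annotation()                       # oracle segments from the mock RTTM
    reference[Segment(0.0, 3.1)] = "spk_A"
    reference[Segment(3.4, 6.0)] = "spk_B"

    hypothesis = Annotation()                      # system output after clustering
    hypothesis[Segment(0.0, 3.0)] = "1"
    hypothesis[Segment(3.0, 6.0)] = "2"

    der = DiarizationErrorRate()
    print(f"DER = {der(reference, hypothesis):.2%}")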
Both systems perform relatively well with the IPCC_110000 test split's mock conversations, where DERs are all below 4.5 percent. These results match the fact that these conversations are generated from the data set that is in-domain with the embedding extractor's training data set. The results with ZALO_400 mock conversations are much worse, with the DER going up to 17.15% in the case of the baseline system and 11.20% in the case of the proposed system. With VTR_1350 mock conversations, the results are worse still: the DER goes up to 24.25% in the case of the baseline system and 22.33% in the case of the proposed system.
In most cases, the proposed system with ECAPA-TDNN outperforms the baseline system, with most conversation types and all sub-segmentation configurations. For each set of conversations with the same number of participating speakers, the best DER among the different sub-segmenting configurations of the proposed system usually outperforms that of the baseline system by 30% to 70%.
Furthermore, while both systems perform better with a wider provided context (i.e. a larger sub-segmentation window size), the proposed system does not show a significantly larger relative DER reduction as the context window size grows. In other words, the proposed system's DER would be expected to decrease faster than the baseline system's DER, since ECAPA-TDNN can theoretically make better use of the wide context thanks to its attention mechanism, but it does not. This experiment suggests that in the speaker diarization task, where speech segments are sub-segmented into sub-segments shorter than 4 seconds, the attention over the time dimension is not very effective.
IPCC_110000.test - Mock Conversations
(8000Hz; 2-way, 3-way, 4-way and 5-way conversations (200 each))
# spk subseg. DER (%)
X-Vector ECAPA
2 {1.5 : 0.75} 2.93 2.66
2 {2.0 : 1.00} 2.72 2.06
2 {3.0 : 1.50} 2.54 1.70
2 {4.0 : 2.00} 2.17 1.50
3 {1.5 : 0.75} 2.65 2.65
3 {2.0 : 1.00} 2.72 1.36
3 {3.0 : 1.50} 2.34 0.93
3 {4.0 : 2.00} 2.10 0.89
# spk subseg. DER (%)
X-Vector ECAPA
4 {1.5 : 0.75} 4.35 4.33
4 {2.0 : 1.00} 4.43 2.82
4 {3.0 : 1.50} 3.53 2.36
4 {4.0 : 2.00} 2.88 2.93
5 {1.5 : 0.75} 5.27 3.88
5 {2.0 : 1.00} 3.78 2.63
5 {3.0 : 1.50} 3.26 2.91
5 {4.0 : 2.00} 3.03 3.14
VTR_1350 - Mock Conversations
(8000Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
# spk subseg. DER (%)
X-Vector ECAPA
2 {1.5 : 0.75} 11.31 18.59
2 {2.0 : 1.00} 9.32 8.41
2 {3.0 : 1.50} 5.31 2.66
2 {4.0 : 2.00} 2.45 1.31
3 {1.5 : 0.75} 17.77 20.13
3 {2.0 : 1.00} 12.52 12.80
3 {3.0 : 1.50} 8.39 5.63
3 {4.0 : 2.00} 7.08 3.42
# spk subseg. DER (%)
X-Vector ECAPA
4 {1.5 : 0.75} 21.18 21.63
4 {2.0 : 1.00} 16.54 12.71
4 {3.0 : 1.50} 10.34 5.34
4 {4.0 : 2.00} 8.20 4.44
5 {1.5 : 0.75} 24.25 22.33
5 {2.0 : 1.00} 18.24 11.70
5 {3.0 : 1.50} 12.16 4.90
5 {4.0 : 2.00} 8.63 2.95
ZALO_400 - Mock Conversations
(8000Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
# spk subseg. DER (%)
X-Vector ECAPA
2 {1.5 : 0.75} 5.62 4.86
2 {2.0 : 1.00} 4.78 3.60
2 {3.0 : 1.50} 5.40 2.72
2 {4.0 : 2.00} 6.61 6.06
3 {1.5 : 0.75} 10.85 7.11
3 {2.0 : 1.00} 10.11 6.15
3 {3.0 : 1.50} 9.72 5.77
3 {4.0 : 2.00} 10.92 7.59
# spk subseg. DER (%)
X-Vector ECAPA
4 {1.5 : 0.75} 15.35 8.85
4 {2.0 : 1.00} 12.83 6.59
4 {3.0 : 1.50} 11.06 6.14
4 {4.0 : 2.00} 12.17 6.58
5 {1.5 : 0.75} 17.15 11.20
5 {2.0 : 1.00} 13.79 8.72
5 {3.0 : 1.50} 12.31 6.76
5 {4.0 : 2.00} 12.44 7.09
TABLE 4.2: DER performance.
Chapter 5
Conclusions and Future Works
In this thesis, a new deep neural network architecture, ECAPA-TDNN, was experimented with in comparison to the baseline system based on X-Vectors and showed significant overall improvements. The proposed system outperformed the baseline on all Vietnamese data sets and on both tasks: speaker verification and speaker diarization. Thanks to the attention mechanism that operates on both the time and feature channels, the proposed network can learn which data in the context and which features are more important. This context-aware characteristic of ECAPA-TDNN is remarkably important since different languages have different ways of constructing sentences, and the word positioning in Vietnamese is totally different from that of English or French. In this sense, ECAPA-TDNN can be adapted to a wide variety of languages with different writing styles, and indeed it worked with Vietnamese conversations. The following are some highlighted pros and cons of using ECAPA-TDNN in the speaker diarization system:
Pros:
• ECAPA-TDNN provides context-aware embeddings, with attention on both time and feature channels, that work exceptionally well with Vietnamese data.
• Based entirely on the PyTorch framework, ECAPA-TDNN is much easier to train, test and customize than the Kaldi-based X-Vectors extractor.
Cons:
• Both the training and inference processes are slower due to the complexity of the network. With an NVIDIA A100 GPU, it still takes 80 hours to complete 20 training epochs on the IPCC_110000 data set.
• The network is not yet production-ready, while the X-Vectors network trained with Kaldi has long been used in production, for both the speaker verification and diarization systems.
Further research directions that can be explored to improve the understanding of ECAPA-TDNN's capability in speaker diarization include:
• Trial and error with more configurations (in this thesis, only some minor changes were made to the original network configuration).
• Exploring other types of clustering methods.
• Studying how effective the proposed system is in the case of conversations with multiple overlaps.
• Applying post-processing methods to the diarization result.
• Building a Vietnamese conversation data set based on real conversations.
Bibliography
[1] Omid Sadjadi et al. NIST 2021 Speaker Recognition Evaluation Plan. 2021. URL: https://tsapps.nist.gov/publication/get%5Fpdf.cfm?pub%5Fid=932697.
[2] David Arthur and Sergei Vassilvitskii. "k-means++: the advantages of careful seeding". In: SODA '07. 2007.
[3] Dan Pelleg and Andrew W. Moore. "X-means: Extending K-means with Efficient Estimation of the Number of Clusters". In: ICML. 2000.
[4] Aonan Zhang et al. "Fully Supervised Speaker Diarization". In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), pp. 6301–6305.
[5] Shota Horiguchi et al. "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors". In: ArXiv abs/2005.09921 (2020).
[6] Yuki Takashima et al. "End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection". In: 2021 IEEE Spoken Language Technology Workshop (SLT) (2021), pp. 849–856.
[7] Tsun-Yat Leung and Lahiru Samarakoon. "Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty". In: Interspeech 2021 (2021).
[8] Niruhan Viswarupan. K-Means Data Clustering. 2017. URL: https://towardsdatascience.com/k-means-data-clustering-bce3335d2203 (visited on 12/09/2021).
[9] Sabur Ajibola Alim and Nahrul Khair Alang Rashid. "From Natural to Artificial Intelligence - Algorithms and Applications". In: IntechOpen, 2018. Chap. 1.
[10] Urmila Shrawankar and Vilas M. Thakare. "Techniques for Feature Extraction In Speech Recognition System: A Comparative Study". In: CoRR abs/1305.1145 (2013). arXiv: 1305.1145. URL: http://arxiv.org/abs/1305.1145.
[11] Smita Magre, Pooja Janse, and Ratnadeep Deshmukh. "A Review on Feature Extraction and Noise Reduction Technique". In: (Feb. 2014).
[12] Bob Meddins. "5 - The design of FIR filters". In: Introduction to Digital Signal Processing. Ed. by Bob Meddins. Oxford: Newnes, 2000, pp. 102–136. ISBN: 978-0-7506-5048-9. DOI: https://doi.org/10.1016/B978-075065048-9/50007-6. URL: https://www.sciencedirect.com/science/article/pii/B9780750650489500076.
[13] Torben Poulsen. "Loudness of tone pulses in a free field". In: Acoustical Society of America Journal 69.6 (June 1981), pp. 1786–1790. DOI: 10.1121/1.385915.
[14] Stanislas Dehaene. "The neural basis of the Weber–Fechner law: a logarithmic mental number line". In: Trends in Cognitive Sciences 7.4 (2003), pp. 145–147.
[15] S. S. Stevens. "A Scale for the Measurement of the Psychological Magnitude Pitch". In: Acoustical Society of America Journal 8.3 (Jan. 1937), p. 185. DOI: 10.1121/1.1915893.
[16] Robert B. Randall. "A history of cepstrum analysis and its application to mechanical problems". In: Mechanical Systems and Signal Processing 97 (2017). Special Issue on Surveillance, pp. 3–19. ISSN: 0888-3270. DOI: https://doi.org/10.1016/j.ymssp.2016.12.026. URL: https://www.sciencedirect.com/science/article/pii/S0888327016305556.
[17] Philipos C. Loizou. Speech Enhancement: Theory and Practice. 2nd. USA: CRC Press, Inc., 2013. ISBN: 1466504218.
[18] Xugang Lu et al. "Speech enhancement based on deep denoising autoencoder". In: INTERSPEECH. 2013.
[19] Yong Xu et al. "A Regression Approach to Speech Enhancement Based on Deep Neural Networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.1 (2015), pp. 7–19. DOI: 10.1109/TASLP.2014.2364452.
[20] Hakan Erdogan et al. "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks". In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), pp. 708–712.
[21] Tian Gao et al. "Densely Connected Progressive Learning for LSTM-Based Speech Enhancement". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 5054–5058. DOI: 10.1109/ICASSP.2018.8461861.
[22] Desh Raj et al. Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis. 2020. arXiv: 2011.02014 [eess.AS].
[23] Gregory Sell et al. "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge". In: INTERSPEECH. 2018.
[24] Neville Ryant et al. The Second DIHARD Diarization Challenge: Dataset, task, and baselines. 2019. arXiv: 1906.07839 [eess.AS].
[25] Mireia Díez et al. "BUT System for DIHARD Speech Diarization Challenge 2018". In: INTERSPEECH. 2018.
[26] Shinji Watanabe et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings". In: Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020). 2020, pp. 1–7. DOI: 10.21437/CHiME.2020-1.
[27] Ashish Arora et al. The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge. 2020. arXiv: 2006.07898 [eess.AS].
[28] Wikipedia contributors. Maximum likelihood estimation — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=1051139067. [Online; accessed 17-November-2021]. 2021.
[29] John R. Hershey et al. Deep clustering: Discriminative embeddings for segmentation and separation. 2015. arXiv: 1508.04306 [cs.NE].
[30] Morten Kolbæk et al. Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks. 2017. arXiv: 1703.06284 [cs.SD].
[31] Yi Luo and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019), pp. 1256–1266. ISSN: 2329-9304. DOI: 10.1109/taslp.2019.2915167. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167.
[32] Xiong Xiao et al. Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020. 2020. arXiv: 2010.11458 [eess.AS].
[33] Arsha Nagrani et al. VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge. 2020. arXiv: 2012.06867 [cs.SD].
[34] Takuya Yoshioka et al. "Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks". In: Interspeech 2018 (2018). DOI: 10.21437/interspeech.2018-2284. URL: http://dx.doi.org/10.21437/Interspeech.2018-2284.
[35] Christoph Boeddecker et al. "Front-end processing for the CHiME-5 dinner party scenario". In: Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018). 2018, pp. 35–40. DOI: 10.21437/CHiME.2018-8.
[36] Wikipedia contributors. Audacity (audio editor) — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Audacity_(audio_editor)&oldid=1054771106. [Online; accessed 17-November-2021]. 2021.
[37] A. Benyassine et al. "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications". In: IEEE Communications Magazine 35.9 (1997), pp. 64–73. DOI: 10.1109/35.620527.
[38] Jongseo Sohn and Wonyong Sung. "A voice activity detector employing soft decision based noise spectrum adaptation". In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181). Vol. 1. 1998, pp. 365–368. DOI: 10.1109/ICASSP.1998.674443.
[39] Monica Franzese and Antonella Iuliano. "Hidden Markov Models". In: Encyclopedia of Bioinformatics and Computational Biology. Ed. by Shoba Ranganathan et al. Oxford: Academic Press, 2019, pp. 753–762. ISBN: 978-0-12-811432-2. DOI: https://doi.org/10.1016/B978-0-12-809633-8.20488-3. URL: https://www.sciencedirect.com/science/article/pii/B9780128096338204883.
[40] Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. "A statistical model-based voice activity detection". In: IEEE Signal Processing Letters 6.1 (1999), pp. 1–3. DOI: 10.1109/97.736233.
[41] Jacob Benesty, M Mohan Sondhi, Yiteng Huang, et al. Springer Handbook of Speech Processing. Vol. 1. Springer, 2008.
[42] Wikipedia contributors. WebRTC — Wikipedia, The Free Encyclopedia. [Online; accessed 17-November-2021]. 2021. URL: https://en.wikipedia.org/w/index.php?title=WebRTC&oldid=1053350113.
[43] Webrtc/common_audio/VAD - external/webrtc - git at google. URL: https://chromium.googlesource.com/external/webrtc/+/branch-heads/43/webrtc/common%5Faudio/vad/.
[44] Thad Hughes and Keir Mierle. "Recurrent neural networks for voice activity detection". In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013, pp. 7378–7382. DOI: 10.1109/ICASSP.2013.6639096.
[45] Jesus Lopez et al. "Advances in Speaker Recognition for Telephone and Audio-Visual Data: the JHU-MIT Submission for NIST SRE19". In: Nov. 2020, pp. 273–280. DOI: 10.21437/Odyssey.2020-39.
[46] Matthew A. Siegler. "Automatic Segmentation, Classification and Clustering of Broadcast News Audio". In: 1997.
[47] "Step-by-step and integrated approaches in broadcast news speaker diarization". In: Computer Speech & Language 20.2 (2006). Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 303–330. ISSN: 0885-2308. DOI: https://doi.org/10.1016/j.csl.2005.08.002. URL: https://www.sciencedirect.com/science/article/pii/S0885230805000471.
[48] Scott Chen. "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion". In: 1998.
[49] Perrine Delacourt and Christian Wellekens. "DISTBIC: A speaker-based segmentation for audio data indexing". In: Speech Communication 32 (Sept. 2000), pp. 111–126. DOI: 10.1016/S0167-6393(00)00027-3.
[50] Simon Prince and James H. Elder. "Probabilistic Linear Discriminant Analysis for Inferences About Identity". In: 2007 IEEE 11th International Conference on Computer Vision (2007), pp. 1–8.
[51] Daniel Garcia-Romero and Carol Y. Espy-Wilson. "Analysis of i-vector Length Normalization in Speaker Recognition Systems". In: INTERSPEECH. 2011.
[52] In: ().
[53] Gregory Sell and Daniel Garcia-Romero. "Speaker diarization with plda i-vector scoring and unsupervised calibration". In: 2014 IEEE Spoken Language Technology Workshop (SLT) (2014), pp. 413–417.
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 

Último (20)

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 

A study on improving speaker diarization system = Nghiên cứu phương pháp cải thiện chất lượng hệ thống ghi nhật ký người nói.pdf

Contents

Declaration of Authorship
Abstracts
Acknowledgements
1 Introduction
  1.1 Research Interest
  1.2 Thesis Outline
2 Speaker Diarization System
  2.1 Front-end Processing
    2.1.1 Features Extraction
    2.1.2 Front-end Post-processing
      2.1.2.1 Speech Enhancement
      2.1.2.2 De-reverberation
      2.1.2.3 Speech Separation
  2.2 Voice Activity Detection
  2.3 Segmentation
  2.4 Speaker Representations
    2.4.1 X-Vector Embeddings
      2.4.1.1 Frame Level
      2.4.1.2 Segment Level
    2.4.2 ECAPA-TDNN Embeddings
      2.4.2.1 Frame Level
      2.4.2.2 Segment Level
  2.5 Clustering
    2.5.1 PLDA Scoring
    2.5.2 Agglomerative Hierarchical Clustering
3 Experiments
  3.1 Evaluation Metrics
    3.1.1 Equal Error Rate and Minimum Decision Cost Function
    3.1.2 Diarization Error Rate
  3.2 Frameworks
    3.2.1 Kaldi
    3.2.2 SpeechBrain
    3.2.3 Kal-Star
  3.3 Data Sets
    3.3.1 IPCC_110000
    3.3.2 VTR_1350
    3.3.3 ZALO_400
  3.4 Baseline System
    3.4.1 Speaker Diarization System
    3.4.2 Speaker Verification System
  3.5 Proposed System
4 Results
  4.1 Speaker Verification Task
  4.2 Speaker Diarization Task
5 Conclusions and Future Works
List of Figures

1.1 A traditional speaker diarization system diagram.
1.2 An example speaker diarization result.
1.3 An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.
1.4 Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training, and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
1.5 Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in figure 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in the system of figure 1.4.
2.1 Diagram of an F-banks / MFCCs extraction process (adapted from [11]).
2.2 N=10 Mel filters for signal samples sampled at 16000Hz.
2.3 Example output of a VAD system visualized in Audacity (audio editor) [36].
2.4 Diagram of the X-Vectors DNN (adapted from [58]).
2.5 Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]).
2.6 Diagram of X-Vectors' segment-level DNN (as configured in [59]).
2.7 Complete network architecture of ECAPA-TDNN (adapted from [62]).
2.8 Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context {-4, 0, 4}.
2.9 A 1D Squeeze-and-Excitation block. Different colors represent different scales for channels.
2.10 A Res2Net-with-Squeeze-Excitation block.
2.11 Attentive statistics pooling (on both time frames and channels).
2.12 An example of an LDA transformation from 2D to 1D (taken from [76]).
2.13 Fitting the parameters of the PLDA model (taken from [77]).
2.14 Agglomerative hierarchical clustering flowchart.
2.15 An example iterative process of agglomerative hierarchical clustering (taken from [80]).
2.16 Visualization of the result of hierarchical clustering (taken from [80]).
3.1 An EER plot.
3.2 Kaldi logo.
3.3 Kaldi general architecture diagram.
3.4 Filtering the VTR_1350 data set by utterances' durations and number of utterances per speaker.
3.5 Generating 200 5-way conversations from the VTR_1350 data set. The min. and max. numbers of utterances picked from each conversation are 2 and 30 respectively.
3.6 IPCC_110000 data distributions.
3.7 VTR_1350 data distributions.
3.8 ZALO_400 data distributions.
3.9 Baseline speaker diarization system diagram.
3.10 Baseline speaker verification system diagram.
3.11 Proposed speaker diarization system diagram.
3.12 Proposed speaker verification system diagram.
4.1 A speaker diarization output of a 3-way conversation in the VTR_1350 test set.
List of Tables

3.1 List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
3.2 IPCC_110000 data set overview.
3.3 IPCC_110000 data subsets.
3.4 VTR_1350 data set overview.
3.5 ZALO_400 data set overview.
3.6 EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
3.7 Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal on baseline and proposed systems (taken from [88]).
4.1 EER and MinDCF performance.
4.2 DER performance.
List of Abbreviations

IPCC: IP Contact Center
DNN: Deep Neural Network
CNN: Convolutional Neural Network
TDNN: Time-Delayed Neural Network
RTTM: Rich Transcription Time Marked
RNN: Recurrent Neural Network
LPC: Linear Prediction Coding
PLP: Perceptual Linear Prediction
DWT: Discrete Wavelet Transform
MFBC: Mel Filterbank Coefficients
MFCC: Mel Frequency Cepstral Coefficients
STFT: Short-Time Discrete Fourier Transform
DCT: Discrete Cosine Transform
WPE: Weighted Prediction Error
MLE: Maximum Likelihood Estimation
PIT: Permutation Invariant Training
VAD: Voice Activity Detection
SAD: Speech Activity Detection
HMM: Hidden Markov Model
GMM: Gaussian Mixture Model
GLR: Generalized Likelihood Ratio
BIC: Bayesian Information Criterion
UBM: Universal Background Model
LDA: Linear Discriminant Analysis
PLDA: Probabilistic Linear Discriminant Analysis
LSTM: Long Short-Term Memory
SE-Res2Net: Res2Net-with-Squeeze-Excitation
ReLU: Rectified Linear Unit
AAM: Additive Angular Margin
AHC: Agglomerative Hierarchical Clustering
EER: Equal Error Rate
CER: Crossover Error Rate
FAR: False Acceptance Rate
FRR: False Rejection Rate
TPR: True Positive Rate
FPR: False Positive Rate
FNR: False Negative Rate
MinDCF: Minimum Decision Cost Function
DER: Diarization Error Rate
PCM: Pulse-Code Modulation
SNR: Signal-to-Noise Ratio
Chapter 1
Introduction

1.1 Research Interest

Speaker diarization, usually referred to as "who spoke when", is the method of dividing a conversation that often includes a number of speakers into segments spoken by the same speaker. This task is especially important to the Viettel IP contact center (IPCC) automatic quality assurance system, where hundreds of thousands of calls are processed every day and human resources are limited and costly. In scenarios where only single-channel recordings are provided, speaker diarization, integrated within a speaker recognition system, helps distinguish between agents and customers within each support call and gives further useful insights (e.g. agent attitude and customer satisfaction). Nevertheless, speaker diarization can also be applied in analyzing other forms of recorded conversations such as meetings, medical therapy sessions, court sessions, and talk shows.

Figure 1.1: A traditional speaker diarization system diagram.

A traditional speaker diarization system (figure 1.1) is built from six modules: front-end processing, voice activity detection, segmentation, speaker representation, clustering, and post-processing. All output information, including the number of speakers and the beginning time and duration of each of their speech segments, is encapsulated in the form of a Rich Transcription Time Marked (RTTM) file [1] (figure 1.2).
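For illustration, the sketch below serializes a toy diarization result into RTTM SPEAKER records, one line per speech segment; the recording name, channel, timestamps and speaker labels are made-up values, not output of the systems studied here.

# Illustrative sketch: writing diarization output as RTTM "SPEAKER" records.
# The recording name, times and speaker labels are made-up values.
def to_rttm(segments, recording="call_0001"):
    """segments: list of (start_seconds, duration_seconds, speaker_label)."""
    lines = ["SPEAKER {} 1 {:.2f} {:.2f} <NA> <NA> {} <NA> <NA>".format(
                 recording, start, dur, spk)
             for start, dur, spk in segments]
    return "\n".join(lines)

print(to_rttm([(0.00, 3.25, "agent"), (3.40, 2.10, "customer")]))
# SPEAKER call_0001 1 0.00 3.25 <NA> <NA> agent <NA> <NA>
# SPEAKER call_0001 1 3.40 2.10 <NA> <NA> customer <NA> <NA>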
Figure 1.2: An example speaker diarization result.

An important factor that affects speaker diarization accuracy is the number of participating speakers in the conversation. This number may be revealed to or hidden from the system before the diarization process, depending on the nature of the conversations. An example of the case where it is revealed is a check-up call between a doctor and a patient, or a support call between a customer and an agent, which is usually a conversation between only two people (i.e. a 2-way conversation), assuming no new speaker interrupts or joins the conversation. By acknowledging that the conversation has only a defined number of speakers, the speaker diarization system can simply slice the recorded conversation into smaller speech partitions and classify them into a known number of clusters (e.g. using k-means [2]).

However, in case the number of speakers is unknown to the system, the system must guess it first. The guessing ends when a stopping criterion (or a decision threshold) is met. For example, in a company meeting with shareholders, although the number of people participating in the meeting is on record, the number of people who actually speak in that meeting is unknown: while most people might remain silent throughout the meeting, only board members would take the mic. In this case, the diarization system can employ an unsupervised clustering method (e.g. X-means [3]) or a supervised clustering method (e.g. UIS-RNN [4]).

In fact, multiple attempts to build an end-to-end speaker diarization system without a clustering module have been made in [5], [6], and [7]. However, this thesis only focuses on a traditional speaker diarization system, which employs a speaker clustering module. Figure 1.3 demonstrates an example clustering result.
Figure 1.3: An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.

In this case, extracting speaker representations with discriminative characteristics is extremely important, since these representations, as input data, have a huge influence on the accuracy of the clustering stage. The discriminative characteristics of a speaker-representation method can be tested indirectly via the speaker diarization system, which is the main focus of this thesis. On the other hand, it can be tested directly via a simple speaker verification system. A speaker verification system verifies the identity of a questioned speaker by comparing the voice data that supposedly belongs to him with his enrolled voice data. If the similarity between the enrolled and the input data is lower than a determined threshold, the impostor gets rejected.

In summary, the speaker diarization system and the speaker verification system are correlated in the sense that they use the same way of representing speakers. Hence, optimizations in the speaker verification system also lead to improvements in the speaker diarization system, which is the main approach of this thesis. Figure 1.4 demonstrates a generic system that employs both speaker diarization and speaker verification. The speaker verification system, employing the same embeddings extractor and PLDA backend used in the speaker diarization system, is primarily used for optimizing the diarization performance. In this thesis, both baseline and proposed systems are based on this generic model. It is also noted that the post-processing module is left out.
Figure 1.4: Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training, and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
Figure 1.5: Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in figure 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in the system of figure 1.4.
1.2 Thesis Outline

This thesis is organized into 5 chapters with the following contents:

Chapter 1: The current chapter gives general information about the research interest and a general overview of speaker diarization and its correlated task, speaker verification.

Chapter 2: This chapter presents the components of a speaker diarization system, including all components of the implemented speaker verification system. Notable topics are the X-Vector and ECAPA-TDNN speaker representations.

Chapter 3: This chapter discusses evaluation metrics, used data sets, and applied methods.

Chapter 4: This chapter closely examines the experiments' results.

Chapter 5: This chapter summarizes the work in this thesis and gives some future directions.
Chapter 2
Speaker Diarization System

2.1 Front-end Processing

The very first stage of a speaker diarization system (or, more generally speaking, a speech processing system) is front-end processing. At this stage, acoustic features are curated and processed in a way that is considered most favorable for the system. The features must strike a balance between simplicity (not much correlated and simple enough to be input to the learning network) and complexity (still possessing useful information, leaving room for the network to learn). Afterwards, some front-end post-processing techniques, such as speech enhancement, speech de-reverberation, speech separation and target speaker extraction, can be performed to further enhance the speech features towards better speaker diarization performance.

2.1.1 Features Extraction

There is a wide variety of methods to represent the speech signal parametrically, such as linear prediction coding (LPC), perceptual linear prediction (PLP), the discrete wavelet transform (DWT), Mel filterbank coefficients (MFBCs), and Mel frequency cepstral coefficients (MFCCs). However, in the last twenty years, the last two methods have emerged as the features of choice in the field of speech processing [9][10]. Figure 2.1 demonstrates a typical F-banks / MFCCs extraction process.

At the beginning of the F-banks / MFCCs extraction process, the input speech signal is divided into homogeneous overlapping short frames (in most cases short frames of 25ms with overlaps of 10ms) and windowed (usually with a Hamming / Hanning window) to reduce artifacts in the later signal transforms [12] (i.e. with a 16000Hz sampled speech signal, each frame contains 16000 * 25/1000 = 400 samples, with an overlap of 10/25 * 400 = 160 samples with the previous frame).

Figure 2.1: Diagram of an F-banks / MFCCs extraction process (adapted from [11]): framing and windowing, STFT, log-amplitude, N Mel filters, then summation for MFBCs or DCT for MFCCs.

Afterwards, the framed and windowed signal is analysed by a Short-Time Discrete Fourier Transform (STFT), through which the signal sample is converted from the amplitude-time domain to the amplitude-frequency domain. Next, the y-axis (amplitude) is converted to a log scale, as it better represents the human perception of loudness [13], according to the Weber-Fechner Law [14].

Then, the x-axis (frequency) is also converted, but into the mel scale [15]. The reason for this conversion is the fact that our human ears have lower resolution at higher frequencies than at lower frequencies (i.e. it is fairly easy for us to distinguish between sounds at 300Hz and 400Hz, but it gets much harder when we have to compare sounds at 1300Hz and 1400Hz, even though the difference is still 100Hz). The mel-scale formula was discovered purely via psychological experiments and has many variations, one of which can be expressed as follows:

m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) = 1127 \ln\left(1 + \frac{f}{700}\right)    (2.1)

At this stage, Mel filterbank coefficients (MFBCs) can be computed after 2 steps:

• Step 1: Apply a chosen number N of triangular band-pass filters linearly spaced on the Mel scale:
  - The lowest and highest frequencies of the bands correspond to the lowest and highest frequencies of the initially sampled signal.
  - In the mel-frequency domain, these filters have the same bandwidth, and they overlap each other by half of the bandwidth.
  - Each filter is a triangular filter with a frequency response of 1 at the center frequency, decreasing linearly towards zero until it reaches the center frequencies of the two adjacent filters.
• Step 2: Compute the N mean log-amplitude values of the N filtered signal samples (from the original signal sample). These values are taken as the N Mel filterbank coefficients.

For example, for signal samples sampled at 16000Hz, N=10 Mel filters can be visualized in figure 2.2.

Figure 2.2: N=10 Mel filters for signal samples sampled at 16000Hz.

Going further, to obtain Mel frequency cepstral coefficients (MFCCs), the following steps, which continue from step 2 of the MFBC computation above, are done:

• Step 1: Sum all N Mel-filtered signal samples to obtain a Mel-weighted signal sample.
• Step 2: Apply the discrete cosine transform (DCT) to transform the signal from the log-mel frequency domain to the quefrency [16] domain.
• Step 3: Take the first K coefficients (usually K=13) as K MFCCs. The next steps are optional and generate high-resolution MFCCs.
• Step 4: Compute the first-order and second-order time derivatives of each coefficient to yield K*2 more coefficients.
• Step 5: Compute the sum of squares of the amplitude of the signal sample to obtain one energy coefficient.

After this step, the total number of MFCCs is K*3+1 (i.e. K=13 corresponds to 40 MFCCs).
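The pipeline above can be condensed into a short script. Below is a minimal NumPy sketch of the described steps (framing and windowing, STFT power spectrum, N triangular Mel filters, log-amplitude, DCT). The 25 ms / 10 ms framing matches the text, while N=24 filters, a 512-point FFT and K=13 are illustrative assumptions; in practice, Kaldi's or SpeechBrain's feature extractors are used instead.

# Minimal NumPy sketch of the F-banks / MFCCs pipeline described above.
# N=24 filters, a 512-point FFT and K=13 are illustrative choices.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # equation (2.1)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the Mel scale (Step 1 above).
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

def mfbc_and_mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=24, K=13):
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)                    # windowing
    power = np.abs(np.fft.rfft(frames, n=512, axis=1)) ** 2    # STFT power spectrum
    mfbc = np.log(power @ mel_filterbank(n_filters, 512, sr).T + 1e-10)  # log Mel energies
    mfcc = dct(mfbc, type=2, axis=1, norm="ortho")[:, :K]      # DCT, keep first K
    return mfbc, mfcc

if __name__ == "__main__":
    audio = np.random.randn(16000)                             # 1 s of dummy audio at 16 kHz
    mfbc, mfcc = mfbc_and_mfcc(audio)
    print(mfbc.shape, mfcc.shape)                              # (frames, 24), (frames, 13)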
2.1.2 Front-end Post-processing

2.1.2.1 Speech Enhancement

Speech enhancement techniques primarily focus on diminishing noise in noisy audio. These techniques include classical signal-processing-based de-noising [17]; deep-learning-based de-noising [17][18][19][20][21]; and multi-channel processing [22].

2.1.2.2 De-reverberation

De-reverberation techniques are utilized to remove the effects of reverberation from the input signal. A popular method is Weighted Prediction Error (WPE), which is the dominant method used in top-performing systems in the DiHARD and CHiME competitions [23][24][25][26][27]. The basic idea of WPE is to decompose the original signal model into an early reflection and a late reverberation. It then tries to estimate a filter that maintains the early reflection while suppressing the late reverberation, based on the maximum likelihood estimation (MLE) method [28]. The improvement WPE gives is not large, but it is consistent across all tasks. It also shows additional performance improvements when applied to multi-channel signals.

2.1.2.3 Speech Separation

Speech separation is primarily useful when overlapping speech regions are significantly large. The two main branches under this approach are:

• Deep-learning based speech separation: Some early attempts are Deep Clustering [29], Permutation Invariant Training (PIT) [30] and Conv-TasNet [31]. However, single-channel speech separation systems often produce a redundant non-speech or even a duplicated speech signal for the non-overlap regions (leakage). Leakage filtering for single-channel systems was proposed and significantly improved speaker diarization performance [32][33].
• Beam-forming based speech separation: This method appears in top-performing systems in the CHiME-6 challenge [34][35].

2.2 Voice Activity Detection

Voice activity detection (VAD), also known as speech activity detection (SAD), is a technique to detect the presence or absence of human speech in a given audio signal, and it is an indispensable component of most speech processing systems:

• In a speech synthesis system, VAD helps remove noise in training data and thus reduces noise in the synthesized audio.
• In a speech recognition system, VAD helps drop noise frames to save computing power and reduce the number of insertion errors in decoded texts.
• In a speaker diarization system, VAD helps generate better speaker representations, which is the most important factor affecting the whole system's performance in terms of precision.

VAD systems can be classified into two types:

• Two-phase VAD systems, which mostly comprise two parts: a feature extraction front end, where acoustic features such as MFCCs are extracted; and a classifier, where a model predicts whether the input frame is speech or not.
• ASR-based VAD systems, where VAD timestamps are inferred directly from word alignments. In this case, the ASR system precedes the VAD system.

Figure 2.3: Example output of a VAD system visualized in Audacity (audio editor) [36].

VAD techniques have been developed sporadically throughout the years. In 1997, Benyassine et al. presented a silence compression scheme that reduces transmission bandwidth during silence periods [37]. This system employed a VAD algorithm that was later usually referred to as the G729B algorithm. In 1998, Sohn et al. introduced the first statistical model-based VAD that accounts for time-varying noise statistics [38]. Just one year later, they proposed an improved version with a Hidden Markov Model (HMM) [39] based hang-over scheme [40]. In 2011, Ying et al. proposed a Gaussian mixture model (GMM) [41] based VAD trained in an unsupervised learning framework. A popular implementation of this method is Google WebRTC VAD [42][43].

Later, with the rise of neural networks, in 2013, Thad Hughes introduced a recurrent neural network (RNN) based VAD [44], which was claimed to outperform existing GMM-based VADs. Then, in 2018, a team from Johns Hopkins University proposed a Time-Delay Neural Network VAD [45], which is trained using alignments from the GMM-HMM process and is known to run much faster than RNN-based VAD. In this thesis, Google WebRTC VAD is adopted as the main VAD technique, considering its simplicity of installation as well as its long use by Google in production environments. A short usage sketch is given below.
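As an illustration, the following is a minimal sketch of driving WebRTC VAD from Python through the webrtcvad package; the 30 ms frame size and aggressiveness level 2 are illustrative settings rather than the exact configuration used in this work.

# Minimal sketch: frame-level speech/non-speech decisions with the webrtcvad
# Python package. 30 ms frames and aggressiveness 2 are illustrative choices.
import webrtcvad

def vad_flags(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30,
              aggressiveness: int = 2):
    vad = webrtcvad.Vad(aggressiveness)                      # 0 (least) .. 3 (most aggressive)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2     # 16-bit mono PCM
    flags = []
    for start in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
        frame = pcm16[start:start + frame_bytes]
        flags.append(vad.is_speech(frame, sample_rate))      # True = speech frame
    return flags

if __name__ == "__main__":
    silence = b"\x00\x00" * 16000                            # 1 s of all-zero samples at 16 kHz
    print(sum(vad_flags(silence)), "frames flagged as speech")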
2.3 Segmentation

In a speaker diarization system, speech segmentation breaks the input audio stream into multiple segments so that each segment can be assigned a speaker label. The simplest method of segmentation is uniform segmentation, in which the audio input is segmented with a consistent window length and overlap length. The window must be sufficiently short to safely assume that it does not contain multiple speakers, but at the same time long enough to capture enough acoustic information (usually from 1 to 2 seconds).

A more complex method is speaker change point detection, in which speaker change points are detected by comparing two hypotheses: hypothesis H0, assuming both the left and right samples are from the same speaker, and hypothesis H1, assuming the two samples are from different speakers. Some notable approaches are the Kullback-Leibler 2 (KL2) algorithm [46], the Generalized Likelihood Ratio (GLR) [47], and the Bayesian Information Criterion (BIC) [48][49].
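A minimal sketch of uniform segmentation over the speech regions returned by the VAD is given below; the 1.5 s window and 0.75 s shift are illustrative values within the 1-2 second range mentioned above.

# Minimal sketch of uniform segmentation: cut each speech region returned by
# the VAD into fixed-length, overlapping sub-segments. The 1.5 s window and
# 0.75 s shift are illustrative values.
def uniform_segments(speech_regions, window=1.5, shift=0.75):
    """speech_regions: list of (start, end) times in seconds."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t + window <= end:
            segments.append((t, t + window))
            t += shift
        if not segments or segments[-1][1] < end:            # keep the leftover tail
            segments.append((max(start, end - window), end))
    return segments

# Example: one 4-second speech region -> overlapping 1.5 s segments.
print(uniform_segments([(0.0, 4.0)]))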
2.4 Speaker Representations

Speaker representations play a critical role in measuring the similarity of speech segments, based on which the segments are classified into a known or unknown number of speakers.

While features such as MFCCs or MFBCs are discriminative enough for speech recognition, they are considered too noisy for speaker diarization. In order to overcome this limitation, numerous studies have been carried out.

From 2010 to 2015, the dominant approach was to train a probabilistic model (e.g. a Gaussian Mixture Model-Universal Background Model, GMM-UBM) to extract speaker representations in a new low-dimensional, speaker- and channel-dependent space. A probabilistic linear discriminant analysis (PLDA) [50] could also be trained to further improve the scoring stage. Those representations are commonly referred to as I-Vectors [51][52][53][54].

Since late 2015, deep learning has emerged as the dominant approach for this task. The main concept is to train a deep neural network (DNN) to classify all speakers in a data set and then, at the testing stage, use its bottleneck features as a speaker representation. That year, Heigold et al. proposed an end-to-end text-dependent speaker verification system [55] that learns speaker embeddings (commonly known as D-Vectors) based on the cosine similarity. It was developed to handle variable-length input in a text-independent verification task through a temporal pooling layer and data augmentation. The model was trained entirely on Google's proprietary datasets. D-Vectors were later improved by a long short-term memory (LSTM) [56] network and a triplet loss function [57].

In 2017, a group of researchers from Johns Hopkins University proposed a modified version of D-Vectors trained on smaller, publicly available datasets and pre-processed with a different strategy [58]. In 2018, by exploiting data augmentation, they further improved their speaker representations and referred to these representations as X-Vectors [59].
2.4.1 X-Vector Embeddings

X-Vectors are bottleneck features of a deep neural network trained to classify a large number of speakers. The training data is divided into small batches of K speakers and M speech segments (each of which must have more than T frames). The loss function (multi-class cross entropy) is as follows:

E = -\sum_{n=1}^{N} \sum_{k=1}^{K} d_{nk} \ln P\left(\mathrm{spk}_k \mid x^{(n)}_{1:T}\right)    (2.2)

where:

• P(spk_k | x^{(n)}_{1:T}) is the probability of speaker k given the T input frames x^{(n)}_1, x^{(n)}_2, ..., x^{(n)}_T of segment n.
• d_{nk} is 1 if the speaker label for segment n is k, and is 0 otherwise.

The network operates at 2 levels, frame level and segment level, connected by a statistics pooling layer (as shown in figure 2.4). This multi-level structure allows the DNN to be trained with segments of different lengths. Hence, the training data is better utilized and the extracted X-Vector is more robust against the variance of segment length.

Figure 2.4: Diagram of the X-Vectors DNN (adapted from [58]).
2.4.1.1 Frame Level

At frame level, the network is essentially a time-delayed neural network (TDNN) [60] with sub-sampling. The default configuration is shown in figure 2.5: the input features are 30 MFCCs extracted from frames of 25ms with overlaps of 10ms. The TDNN has 5 layers with different context specifications. Layers 3, 4 and 5 are fully connected. To account for the lack of context at the first and last frames, speech segments are padded at both ends.

Figure 2.5: Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]). Layer contexts:
Layer 1 (dim=512): input context [-2,2], with sub-sampling [-2,2]
Layer 2 (dim=512): input context [-2,2], with sub-sampling {-2,0,2}
Layer 3 (dim=512): input context [-3,3], with sub-sampling {-3,0,3}
Layer 4 (dim=512): input context {0}, with sub-sampling {0}
Layer 5 (dim=1500): input context {0}, with sub-sampling {0}

2.4.1.2 Segment Level

At segment level, the network is a fully-connected feed-forward DNN with a pooled input. Figure 2.6 demonstrates the default setup: all frame-level outputs h_t (t = 1, ..., T) (layer 5 of the TDNN) are aggregated to compute the mean \mu and standard deviation \sigma:

\mu = \frac{1}{T} \sum_{t}^{T} h_t    (2.3)

\sigma = \sqrt{\frac{1}{T} \sum_{t}^{T} h_t \odot h_t - \mu \odot \mu}    (2.4)
Both of those statistics are concatenated together into one vector that represents the whole segment. This vector is then passed through 2 fully-connected layers, each of which has a rectified linear unit (ReLU) [61]. At last, the output layer (a log-softmax classifier) gives a probability distribution over all speakers in the training data.

Figure 2.6: Diagram of X-Vectors' segment-level DNN (as configured in [59]): statistics pooling (dim=3000), layer 6 (dim=512), layer 7 (dim=512), output (dim = total number of speakers).

As mentioned earlier, X-Vectors are essentially bottleneck features of a DNN. In this network configuration, they could be the output of layer 6 or layer 7. However, the latter is selected since it is proven experimentally to perform better in the speaker identification task [59].
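The statistics pooling of equations (2.3) and (2.4) can be expressed in a few lines of PyTorch. The sketch below mirrors the idea (mean and standard deviation over frames, concatenated), not Kaldi's exact implementation; the batch shapes are illustrative.

# Minimal PyTorch sketch of the statistics pooling layer in equations (2.3)-(2.4).
import torch

def statistics_pooling(h: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """h: (batch, T, C) frame-level activations -> (batch, 2*C) pooled vector."""
    mean = h.mean(dim=1)                                   # equation (2.3)
    var = (h * h).mean(dim=1) - mean * mean                # E[h^2] - E[h]^2
    std = torch.sqrt(torch.clamp(var, min=eps))            # equation (2.4)
    return torch.cat([mean, std], dim=1)

# Example: 3 segments of 200 frames with 1500-dim frame features -> (3, 3000),
# matching the 3000-dim pooled vector fed to layer 6.
pooled = statistics_pooling(torch.randn(3, 200, 1500))
print(pooled.shape)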
  • 41. 2.4. Speaker Representations 17 Conv1D (k=5, d=1) (+ ReLU + BatchNorm) SE-Res2Block (k=3, d=2) SE-Res2Block (k=3, d=3) SE-Res2Block (k=3, d=4) Conv1D (k=1, d=1) (+ ReLU) + Attentive Stat Pooling (+ BatchNorm) C T x C T x 80 T x Fully Connected (+ BatchNorm) Additive Angular Margin Softmax 1536 T x C T x 3 C T x ( x ) 2 1536 T x x 192 1 x num. speakers 1 x INPUT OUTPUT Frame Level Segment Level F 2.7: Complete network architecture of ECAPA-TDNN (adapted IGURE from [ ]). 62 2.4.2.1 Frame-level At frame-level, ECAPA-TDNN network consists of 1D Convolutional layers (with ReLU and optional batch normalization), and 1D Squeeze-and-Excitation blocks. The network also utilizes residual connections at high-level to reduce the effect of vanishing gradients. 2.4.2.1.1 1D Convolutional Layer In a 1D convolution layer, instead of sliding along two dimensions as in well-known CNNs in image processing [ ][ ][ ], a kernel of size k and dilation d slides along 68 69 70 dimension of time frames. Figure demonstrate how a 1D Convolutional (1D-Conv) 2.8 kernel works. Where: • k denotes kernel size. • d denotes dilation spacing (i.e: if d is larger than 1 the 1D-Conv layer is dilated).
  • 42. 18 Chapter 2. Speaker Diarization System • c denotes number of channels (i.e: number of extracted feature coefficients, e.g: 40 MFCCs) frames d: dialation spacing k: kernel size c: number of channels F 2.8: Kernel sliding across speech frames in a dilated 1D-CNN IGURE layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {3,0,3}. 2.4.2.1.2 1D Squeeze-and-Excitation Block For computer vision tasks, Squeeze-and-Excitation blocks [ ] has proven to be very ef- 66 fective in improving channel inter-dependencies at low computational cost. In ECAPA- TDNN architecture, this approach help re-scaling the frame-level features basing on global properties of the signal sample. A 1D-Squeeze-and-Excitation block consists of 3 components, as shown in figure : 2.9 C: number of channels Fsqueeze( ) · F excitation( , ) · W Fscale( , ) · · T frames ht z s C T frames F 2.9: A 1D-Squeeze-and-Excitation block. Different colors repre- IGURE sent different scales for channels.
  • 43. 2.4. Speaker Representations 19 • Squeeze operation, where frame-wise mean vector is calculated from inputs: z z = 1 T T ∑ t ht (2.5) • Excitation operation, where is used to calculate channel-wise scale vector through z two bottle-neck fully connected layers that generate outputs of the same dimension as inputs’: s W = ( σ 2 f (W1z b + 1)+b 2) (2.6) where denotes the sigmoid function; denotes a non-linearity (e.g: a ReLU), σ( ) · f( ) · Wk and b k denotes the learnable weight and bias of bottle-neck fully connected layer k. • Scale operation, where the original input frames are scaled with : s h̃t c , = s t c , ht c , (2.7) 2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block In 2019, Shang-Hua Gao et. all proposed Res2Net [ ], a multi-scale backbone network 65 for computer vision tasks, based on ResNet [ ]. This computer In ECAPA-TDNN, 64 Res2Net is integrated with a SE-Block, forming a Res2Net-with-Squeeze-Excitation (SE-Res2Net) block, to benefit from residual connections (i.e: to reduce vanishing gra- dients) why keeping the number of parameters at a reasonable figure. Under this setup, the number of channel subsets corresponds to number of intermediate 1D-COnv blocks, and thus, may increase the number of parameters. The SE-Block is then used to amplify attention to channels while adding only a small number of parameters. Figure 2.10 visualizes a SE-Res2Net block. 2.4.2.2 Segment-level At segment level, ECAPA-TDNN employs a soft multi-head self-attention [ ] model to 67 calculate weighted statistics at the pooling layer, which accounts for signal samples of varied lengths. The statistic outputs (weighted mean and weighted standard deviation) are then concatenated together. The result vector is propagated through a single fully- connected layer with batch normalization, and then, the final layer, Additive Angular
  • 44. 20 Chapter 2. Speaker Diarization System C channels T frames Conv1D + + T T subset concat T + Conv1D Conv1D SE Block INPUT OUTPUT s subsets; C/s channels each Conv1D C Conv1D C F 2.10: A Res2Net-with-Squeeze-Excitation Block. IGURE Margin Softmax (AAM-Softmax) [ ] layer. The final output is a N-dimension vector 71 where N is the total number of speakers in the training set (i.e: to classify N speakers in the training set). 2.4.2.2.1 Attentive Statistical Pooling In 2019, Okabe et al. proposed using attention mechanism to give different weights to different frames in the signal sample to calculate weighted mean and weighted standard deviation at X-Vectors’ pooling layer [ ]. In ECAPA-TDNN, the attention mechanism 72 is extended further: not only on time frames, but also on channels: The raw scalar channel-and-frame-wise score et c , and its normalized value αt c , , is calcu- lated as follow: et c , = v T c f (Wht + )+ b k c (2.8) αt c , = exp(et c , ) ∑T τ exp(eτ,c ) (2.9) where: • ht are the activations of the last layer frame at time step . t
  • 45. 2.4. Speaker Representations 21 C: number of channels Attention Model Fscale( , ) · · T frames ht C T frames A F 2.11: Attentive Statistics Pooling (on both time frames and chan- IGURE nels). • W ∈ R R×C and b ∈ R R×1 project the activation into a representation of smaller dimension (R). This projection is shared across all C channels. • f ( ) · denotes a non-linearity. • vc ∈ RR×C and kc transform the output of the non-linearity to a channel- f ( ) · dependent scalar score. The normalized score αt c , is then used to calculate weighted mean and weighted standard deviations vectors as follow: µ̃c = T ∑ t αt c , ht c , (2.10) σ̃c = s T ∑ t αt c , h2 t c , − µ̃2 c (2.11) where µ̃c and σ̃c are respectively the channel components of weighted mean and weighted standard deviation vectors µ̃ and σ̃. Moreover, the temporal context of the pooling layer is expanded by making the self- attention to look at global properties of the signal sample (e,g: to account for noise and recording conditions). The local input ht of equation are concatenated with the 2.8 non-weighted mean and non-weighted standard deviation of ht itself.
  • 46. 22 Chapter 2. Speaker Diarization System 2.5 Clustering 2.5.1 PLDA Scoring After generating the speaker representations for each segment, a clustering algorithm is applied to make clusters of segments. The distance or the similarity between each pair of observations can be computed using a wide variety of techniques, including Euclidean distance [ ], mean squared difference [ ], and cosine similarity [ ]. 73 74 75 Although the similarity metrics can be calculated directly from the pairs of extracted speaker embeddings, purely-statistical data reduction techniques such as Linear Discrim- inant Analysis (LDA) can be employed to further improve the discriminative character- istics of these features with a little computation cost. The LDA transformation can be given as in the following equation: z W = T ∗x (2.12) where: • x = {x i} ∈ RD is the D-dimension input vector. • z = {z i} ∈ RD0 (D0 ≤ D) is the representation of input vector in the new latent space. • W ∈ RDxD0 is a linear transformation matrix. In this case, LDA is formulated as an optimization problem to find a linear transfor- mation that maximize the ratio of the between-class scattering to the within-class W scattering: W = argmax W J( ) = W argmax W trace  WTSBW  trace(WT SWW) (2.13) where: • SW denotes within class scatter matrix SW = ∑n i=1 (xi − µ yi )(xi − µ yi )T . Here {yi} are class labels and µk is the sample mean of the k-th class. SW is positive definite. • SB denotes between-class scatter matrix SB = ∑m k=1 nk(µ k− µ µ )( k − µ) T . Here m is the number of classes, is the overall sample mean, and µ nk is the number of samples in the k-th class. SB is positive semi-definite.
  • 47. 2.5. Clustering 23 F 2.12: An example of LDA transformation from 2D to 1D (taken IGURE from [ ]). 76 However, LDA is a deterministic algorithm that works only well on seen data. This is not ideal in case of real speaker diarization applications, where enrolled or questioned speakers are not included in the training data set. Therefore, a probabilistic version of LDA, namely Probabilistic Linear Discriminant Analysis (PLDA) [ ][ ], is employed to take advantages of LDA while dealing with 77 78 unseen classes. Essentially, PLDA is formulated upon LDA by representing both the sample mean of the class and the data between class itself with separate distributions. The chosen distribution is usually a mixture of Gaussian distributions, i.e a Gaussian Mixture Model (GMM) [ ]. Let is a latent class variable representing the mean of a 41 y class within the GMM, then probability of generating same given class mean and the x y prior probability of class mean in the same space are given by the following equations: y P N ( ) = x y | (x y | ,S W) (2.14) P N m ( ) = y (y| ,S B) (2.15) where • SW and SB respectively denote the within-class and between-class scatter matrices as seen in equation . 2.13 • N(x y | ,SW) is a multivariate Gaussian distribution with mean and variance y SW.
  • 48. • $\mathcal{N}(y \mid m, S_B)$ is a multivariate Gaussian distribution with mean $m$ and variance $S_B$.
  As proven in [77], $S_W$ and $S_B$ can be diagonalized as follows:
  $$V^{T} \Phi_w V = I \qquad (2.16)$$
  $$V^{T} \Phi_b V = \Psi \qquad (2.17)$$
  and, by defining $A = V^{-T}$, they can be rewritten as:
  $$S_W = A I A^{T} \qquad (2.18)$$
  $$S_B = A \Psi A^{T} \qquad (2.19)$$
  Then, from equation 2.14:
  $$P(x \mid y) = \mathcal{N}(x \mid y, S_W) = \mathcal{N}\!\left(x \mid m + A v,\, A A^{T}\right) = m + A * \mathcal{N}(u \mid v, I) \qquad (2.20)$$
  and from equation 2.15:
  $$P(y) = \mathcal{N}(y \mid m, S_B) = \mathcal{N}\!\left(y \mid m,\, A \Psi A^{T}\right) = m + A * \mathcal{N}(v \mid 0, \Psi) \qquad (2.21)$$
  Let $u$ and $v$ be Gaussian random variables in the latent space:
  $$u \sim \mathcal{N}(\cdot \mid v, I) \qquad (2.22)$$
  $$v \sim \mathcal{N}(\cdot \mid 0, \Psi) \qquad (2.23)$$
  then the relationship between $x$, $y$ and the latent variables $u$, $v$ can be written as follows:
  $$y = m + A v \qquad (2.24)$$
  $$x = m + A u \qquad (2.25)$$
  • 49. The unknown parameters of PLDA are the mean $m$, the covariance matrix $\Psi$, and the loading matrix $A$. All of these parameters are learnable using the maximum likelihood method. Figure 2.13 demonstrates the training process of the PLDA model.
  Figure 2.13: Fitting the parameters of the PLDA model (taken from [77]).
  The PLDA score $R$ between two given vectors $u_1, u_2$ in the latent space is calculated by taking the log of the likelihood ratio based on two hypotheses: whether both of the vectors belong to the same class, or not. $R$ is given as:
  $$\text{Score} = \log\left(R(u_1, u_2)\right) = \log\frac{P(u_1, u_2)}{P(u_1)\, P(u_2)} = \log\frac{\int P(u_1 \mid v)\, P(u_2 \mid v)\, P(v)\, dv}{P(u_1)\, P(u_2)} \qquad (2.26)$$
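For intuition, equation 2.26 has a closed form once the embeddings have been mapped into the latent space of equations 2.22-2.25 (i.e. $u = A^{-1}(x - m)$) and $\Psi$ is diagonal: under the same-speaker hypothesis the two latent vectors share the class variable $v$, which couples their joint covariance. The sketch below evaluates the two hypotheses directly with Gaussian densities; it is a hedged illustration with a hypothetical function name, not the Kaldi PLDA scorer used later in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal


def plda_score(u1: np.ndarray, u2: np.ndarray, psi: np.ndarray) -> float:
    """Log-likelihood ratio of eq. (2.26) for the simplified PLDA model:
    u | v ~ N(v, I), v ~ N(0, diag(psi)), with u already in the latent space."""
    d = len(psi)
    within = np.eye(d)
    between = np.diag(psi)
    # Same-speaker hypothesis: u1 and u2 share v, so their joint covariance
    # contains the between-class term in the off-diagonal blocks.
    joint_cov = np.block([[between + within, between],
                          [between, between + within]])
    log_same = multivariate_normal.logpdf(np.concatenate([u1, u2]),
                                          mean=np.zeros(2 * d), cov=joint_cov)
    # Different-speaker hypothesis: the two vectors are independent.
    log_diff = (multivariate_normal.logpdf(u1, mean=np.zeros(d), cov=between + within)
                + multivariate_normal.logpdf(u2, mean=np.zeros(d), cov=between + within))
    return float(log_same - log_diff)
```

A higher score means the two embeddings are more likely to come from the same speaker; collecting the scores for all sub-segment pairs yields the affinity matrix used by the clustering stage below.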
  • 50. 2.5.2 Agglomerative Hierarchical Clustering
  Agglomerative hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge these two most similar clusters. This iterative process continues until all the clusters are merged together (figure 2.15). The clustering process can be stopped once a defined number of clusters or a decision threshold is reached. The main output of hierarchical clustering is a dendrogram [79], which shows the hierarchical relationship between the clusters (figure 2.16); a minimal SciPy-based sketch of this procedure is given after the figures below.
  Figure 2.14: Agglomerative hierarchical clustering flowchart.
  • 51. Figure 2.15: An example iterative process of agglomerative hierarchical clustering (taken from [80]).
  Figure 2.16: Visualization of the result of hierarchical clustering (taken from [80]).
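The sketch below clusters sub-segments from a pre-computed pairwise similarity matrix (for example the PLDA affinity matrix) using SciPy's hierarchical clustering utilities. Function and parameter names are illustrative assumptions; the linkage rule and threshold handling are one reasonable choice, not necessarily the configuration used in this thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_segments(similarity: np.ndarray, n_speakers: int = None,
                     threshold: float = 0.0) -> np.ndarray:
    """Toy AHC over sub-segments given a symmetric similarity matrix
    (higher = more similar). Returns a speaker label (1..K) per sub-segment."""
    # Turn similarities into distances and condense them for scipy.
    distance = np.max(similarity) - similarity
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")   # build the dendrogram
    if n_speakers is not None:
        # Known number of speakers: cut the dendrogram into exactly K clusters.
        return fcluster(tree, t=n_speakers, criterion="maxclust")
    # Unknown number of speakers: cut at a pre-defined similarity threshold,
    # converted here into the corresponding distance.
    return fcluster(tree, t=np.max(similarity) - threshold, criterion="distance")
```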
  • 53. Chapter 3
  Experiments
  3.1 Evaluation Metrics
  3.1.1 Equal Error Rate and Minimum Decision Cost Function
  Equal error rate or crossover error rate (EER or CER) is the rate at which the acceptance and rejection errors are equal. In order to find the EER of a given system, an EER plot is created through the following steps:
  • Calculate the false acceptance rate (FAR - i.e. false positive speaker identification, or false alarm) and the false rejection rate (FRR - i.e. false negative speaker identification, or missed detection) for a set of decision thresholds $t$:
  $$\text{FAR}_t = \frac{\text{Number of False Acceptances}_t}{\text{Number of Identification Attempts}} \qquad (3.1)$$
  $$\text{FRR}_t = \frac{\text{Number of False Rejections}_t}{\text{Number of Identification Attempts}} \qquad (3.2)$$
  • Plot FAR and FRR against the decision threshold $t$. The EER is the y-value of the intersection of those lines. An example of an EER plot is shown in figure 3.1.
  As the decision threshold (i.e. sensitivity) increases, the false alarms drop while the missed detections rise: the configured system becomes more secure by reducing the possibility of acceptance. Conversely, when the decision threshold is lowered, the system is less secure against impostors.
  • 54. Figure 3.1: An EER plot.
  EER was originally used as the main evaluation metric for speaker identification systems. In a traditional speaker diarization system, it is also used to effectively estimate the discriminative characteristics of the speaker embedding extraction technique, since this metric is not affected by other modules such as clustering or resegmentation. The lower the EER, the better the system performs.
  An important metric usually coupled with EER is the Minimum Decision Cost Function (MinDCF), representing the minimum value of the linear combination of the false alarms and missed detections over all thresholds (a toy computation of both metrics is sketched after the list below):
  $$\text{MinDCF} = \min_{t}\left((1 - p) * \text{FAR}_t + p * \text{FRR}_t\right) \qquad (3.3)$$
  where:
  • $t$ is the decision threshold.
  • $\text{FAR}_t$ and $\text{FRR}_t$ are calculated as in equations 3.1 and 3.2.
  • $p$ is the prior probability of the enrolled entity. Common values for $p$ are 0.01 and 0.001.
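The following NumPy sketch follows equations 3.1-3.3 literally, sweeping every observed score as a decision threshold. It is a simplified illustration (the function name is hypothetical, and it keeps the thesis form of the cost function, in which the two error types are weighted only by the prior).

```python
import numpy as np


def eer_and_min_dcf(scores: np.ndarray, labels: np.ndarray, p_target: float = 0.01):
    """Compute EER and MinDCF from verification scores.
    labels: 1 for "target" pairs, 0 for "nontarget" pairs."""
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # false acceptance rate (eq. 3.1)
        frrs.append(np.mean(~accept[labels == 1]))   # false rejection rate (eq. 3.2)
    fars, frrs = np.array(fars), np.array(frrs)
    # EER: the point where the FAR and FRR curves cross.
    idx = np.argmin(np.abs(fars - frrs))
    eer = (fars[idx] + frrs[idx]) / 2
    # MinDCF: minimum prior-weighted error over all thresholds (eq. 3.3).
    min_dcf = np.min((1 - p_target) * fars + p_target * frrs)
    return eer, min_dcf
```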
  • 55. The lower the EER and MinDCF, the better a speaker verification system performs.
  3.1.2 Diarization Error Rate
  Diarization Error Rate (DER) is the most widely used metric for speaker diarization. It is measured as the fraction of time that is not attributed correctly to a speaker or to non-speech, calculated as in equation 3.4:
  $$\text{DER} = \frac{T_{\text{FalseAlarm}} + T_{\text{Miss}} + T_{\text{Confusion}}}{T_{\text{Scored}}} \qquad (3.4)$$
  where:
  • $T_{\text{Scored}}$ is the total duration of the recording without overlapped speech.
  • $T_{\text{FalseAlarm}}$ is the scored time during which a hypothesized speaker is labelled as non-speech in the reference.
  • $T_{\text{Miss}}$ is the scored time during which a hypothesized non-speech segment corresponds to a reference speaker segment.
  • $T_{\text{Confusion}}$ is the scored time during which a speaker ID is assigned to the wrong speaker.
  The lower the DER, the better a speaker diarization system performs.
  3.2 Frameworks
  3.2.1 Kaldi
  Kaldi is a toolkit originally written in C++, Perl, Shell and Python for speech recognition, speaker recognition, and many other tasks. Kaldi used to be the framework of choice among speech processing researchers. In addition to its flexibility and high performance, Kaldi is enriched by many reproducible state-of-the-art recipes from researchers around the world. Some noteworthy features of Kaldi include:
  • Code-level integration with Finite State Transducers (FSTs).
  • 56. • Extensive linear algebra support: both BLAS and LAPACK are supported.
  • Extensible design: the algorithms are provided in the most generic form possible.
  • Open license: the code is licensed under Apache 2.0, one of the least restrictive licenses available.
  • Complete recipes: recipes for building complete speech recognition systems are included. These work with widely available datasets such as those provided by the Linguistic Data Consortium (LDC).
  Figure 3.2: Kaldi logo.
  Figure 3.3: Kaldi general architecture diagram.
  3.2.2 SpeechBrain
  SpeechBrain [81] is an open-source, all-in-one conversational AI toolkit based on PyTorch. Its main purpose is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems
  • 57. for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. SpeechBrain provides the implementation and experimental validation of both recent and long-established speech processing models with state-of-the-art or competitive performance on a variety of tasks (table 3.1).
  Table 3.1: List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
  3.2.3 Kal-Star
  Kal-Star [82], developed by the author while working at Viettel Cyberspace Center, is a Shell/Python library wrapping around Kaldi and SpeechBrain to streamline data pre-processing and training processes. Kal-Star provides a wide variety of tools to prepare, train and test data, mostly for the speaker verification and speaker diarization tasks (Figure 3.4 and Figure 3.5). Kal-Star inherits the file-based data indexing from Kaldi, which treats a given data set as a folder of spk2utt, utt2spk, wav.scp and (if the data set is segmented with a VAD) segments files. Further information can be added to the folder later (e.g. the diarization result RTTM is added once the speaker diarization process is done).
  • 58. Figure 3.4: Filtering the VTR_1350 data set by utterance duration and number of utterances per speaker.
  Figure 3.5: Generating 200 5-way conversations from the VTR_1350 data set. The max. and min. numbers of utterances picked from each conversation are 2 and 30 respectively.
  3.3 Data Sets
  In this thesis, three main Vietnamese data sets are used:
  • IPCC_110000: split for training and testing. The test split is then used directly for the speaker verification task, and used to generate mock conversations for the speaker diarization task.
  • VTR_1350 and ZALO_400: used directly for the speaker verification task, and also used to generate mock conversations for the speaker diarization task. These data sets are not used in training.
  • 59. 3.3.1 IPCC_110000
  The IPCC_110000 data set consists of 1046.37 hours of audio in a telephone environment from approximately 110000 Vietnamese speakers. Data are recorded at the Viettel Customer Service IP Contact Center (IPCC) and sampled at 8000 Hertz. Most recorded utterances are from 2 to 6 seconds in length, while each speaker has from 1 to 10 utterances, making up from 10 to 60 seconds of speech. The spoken topics revolve around technical difficulties that Viettel's customers meet in using mobile and internet services, as well as everyday questions about common knowledge, the weather, sport results, or lottery results. Table 3.2 gives an overview of this data set and figure 3.6 demonstrates how the data is distributed.
  Table 3.2: IPCC_110000 data set overview.
  Data set: IPCC_110000 | Base sample rate: 8000 Hz | Environment: Telephone | # Speakers: 112837 | # Utterances: 919608 | Total duration: 1046.4 hours
  Figure 3.6: IPCC_110000 data distributions.
  Each customer and agent has his/her own recording channel, and it is assumed that each recording has only one speaker. In reality, a telephone conversation between a customer and an agent can be interfered with by other customers or other agents joining the conversation. The latter case happens much less often than the former, since the IPCC is designed in such a way that agents have good sound isolation, and even if the customer's issue is passed to other agents, these agents would have their own recording channels. As for the case where more than one customer joins the conversation, it was observed
  • 60. that in 1000 randomly chosen conversations, only about 12 have more than one speaker (e.g. an infant interrupting a parent who is on a call with an IPCC agent). The ratio is only about 1.2 percent, and thus this case can be mitigated with little negative effect on the discriminative characteristics of the trained embedding extractor.
  IPCC_110000 is randomly split into 3 subsets: train, dev, and test sets. Each of the test and dev sets has 2000 speakers and about 20 hours of data. The train set contains the remaining data. Table 3.3 displays the number of speakers, number of utterances and total duration of each subset.
  Table 3.3: IPCC_110000 data subsets.
  Split: train / dev / test
  # Speakers: 108837 / 2000 / 2000
  # Utterances: 886172 / 16975 / 16461
  Total duration (hours): 1005.1 / 20.94 / 20.39
  3.3.1.1 IPCC_110000 Verification Test Set
  Generated from the IPCC_110000 test split, the verification test set has 3888 verification pairs, generated by the following steps:
  • Step 1: With K = 3, for each speaker randomly pick K * 2 utterances from this speaker's utterance pool to generate K verification pairs. The labels are "target", meaning the utterances in each pair are from the same speaker.
  • Step 2: Randomly pick K = 3 other utterances from the utterance pool of the selected speaker at step 1 (if there are fewer than K utterances left in the pool, discard all picked verification pairs and skip to step 3), and K = 3 other utterances from the utterance pools of all other speakers, to generate K more verification pairs. The labels are "nontarget", meaning the utterances in each pair are not from the same speaker. An example result after step 2:
  IPCC-131047_hanoi-3166 IPCC-131047_hanoi-1961 target
  IPCC-131047_hanoi-1319 IPCC-131047_hanoi-3657 target
  IPCC-131047_hanoi-2303 IPCC-131047_hanoi-1015 target
  • 61. IPCC-131047_hanoi-2626 IPCC-203322_hanoi-2214 nontarget
  IPCC-131047_hanoi-0582 IPCC-203268_hanoi-3113 nontarget
  IPCC-131047_hanoi-1679 IPCC-203268_hanoi-5260 nontarget
  • Step 3: Go back to step 1, until all speakers have been considered for enrollment.
  3.3.1.2 IPCC_110000 Diarization Test Set
  Since original conversations from IPCC_110000 are shuffled by channels due to internal policies that protect customer privacy, mock conversations generated from utterances of distinct speakers are used instead. Furthermore, to expand the scope of the diarization beyond 2-way conversations to conversations with more speakers, mock 3-way, 4-way, and 5-way conversations are also generated. From the test split of IPCC_110000, the following data sets are generated:
  • IPCC_110000-TEST-MOCK_200_2: 200 2-way conversations.
  • IPCC_110000-TEST-MOCK_200_3: 200 3-way conversations.
  • IPCC_110000-TEST-MOCK_200_4: 200 4-way conversations.
  • IPCC_110000-TEST-MOCK_200_5: 200 5-way conversations.
  Each of the N conversations in each subset is generated by the following steps:
  • Step 1: Choose a number of speakers S.
  • Step 2: Randomly pick a speaker from the speaker pool of the whole data set.
  • Step 3: For the picked speaker s_j, randomly pick u_j (where u_j is a random integer in the range [u_min, u_max]) utterances from that speaker's utterance pool. If there are not enough utterances in the speaker's pool, discard the picked utterances and go back to step 2.
  • Step 4: Go back to step 2 until utterances have been picked for all S speakers.
  • 62. • Step 5: Shuffle the list of picked utterances, then concatenate the utterances together (with in-between silences of durations randomly chosen between 0.2 and 1.0 seconds) into a single audio file. The conversation ID, oracle timestamps and speaker information are bundled into an accompanying RTTM file.
  The four mock conversation subsets generated from the IPCC_110000 test split mentioned above are generated with N = 200, S ∈ {2, 3, 4, 5}, u_min = 2 and u_max = 20.
  3.3.2 VTR_1350
  VTR_1350 is a data set consisting of 491.1 hours of wide-band recordings, originally sampled at 16000 Hz, recorded by a selected group of 1346 broadcasters using written transcripts (as shown in table 3.4). The recording environment is significantly less noisy than IPCC_110000's, and thus it can be considered a clean set. Most recorded utterances are from 3 to 10 seconds in length, while each speaker has from 200 to 300 utterances, making up from 10 to 33 minutes of speech. The topics are daily news topics, from politics and health care to sports. Although the number of speakers is much smaller than IPCC's, the content of the recordings is much more diverse and the durations are much longer. However, VTR_1350 is significantly less natural than IPCC_110000 as conversational data, due to the mentioned fact that it is recorded with planned transcripts in a controlled environment. Table 3.4 gives an overview of this data set and figure 3.7 demonstrates how the data is distributed.
  Table 3.4: VTR_1350 data set overview.
  Data set: VTR_1350 | Base sample rate: 16000 Hz | Environment: Controlled recording | # Speakers: 1346 | # Utterances: 318599 | Total duration: 491.1 hours
  VTR_1350 is down-sampled to 8000 Hz and used exclusively for testing in both the speaker verification and speaker diarization tasks.
  • 63. Figure 3.7: VTR_1350 data distributions.
  3.3.2.1 VTR_1350 Verification Test Set
  From the VTR_1350 utterance list, a verification list of 7902 pairs is generated using the same method described in section 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
  VTR_1350-yenvth4-071303 VTR_1350-yenvth4-049030 target
  VTR_1350-yenvth4-047494 VTR_1350-yenvth4-053126 target
  VTR_1350-yenvth4-067463 VTR_1350-yenvth4-001415 target
  VTR_1350-yenvth4-050310 VTR_1350-119706-073293 nontarget
  VTR_1350-yenvth4-105862 VTR_1350-oanhdtv-029068 nontarget
  VTR_1350-yenvth4-001159 VTR_1350-tungdhd-076747 nontarget
  3.3.2.2 VTR_1350 Diarization Test Set
  VTR_1350 is a data set of 1-way conversations. Hence, to make use of this data set in the speaker diarization task, four mock conversation subsets are generated from the data set itself, based on the method and configuration described in section 3.3.1.2 (a condensed sketch of that procedure follows the list below):
  • VTR_1350-MOCK_200_2: 200 2-way conversations.
  • VTR_1350-MOCK_200_3: 200 3-way conversations.
  • VTR_1350-MOCK_200_4: 200 4-way conversations.
  • VTR_1350-MOCK_200_5: 200 5-way conversations.
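As an illustration, the mock-conversation procedure of section 3.3.1.2, applied here to VTR_1350, can be condensed into the sketch below. The helper name, the soundfile-based audio I/O and the exact RTTM field layout are assumptions, and all utterances are assumed to already be at the target sample rate.

```python
import random
import numpy as np
import soundfile as sf  # assumed audio I/O library


def make_mock_conversation(utts_by_speaker: dict, conv_id: str, n_speakers: int,
                           u_min: int = 2, u_max: int = 20, sample_rate: int = 8000):
    """utts_by_speaker maps a speaker ID to a list of utterance wav paths."""
    # Steps 1-4: pick speakers and a random number of utterances per speaker.
    speakers = random.sample(list(utts_by_speaker), n_speakers)
    picked = []
    for spk in speakers:
        pool = utts_by_speaker[spk]
        u_j = random.randint(u_min, u_max)
        if len(pool) < u_j:
            return None  # a full run would re-pick the speaker instead
        picked += [(spk, path) for path in random.sample(pool, u_j)]
    # Step 5: shuffle, concatenate with random silences, and record RTTM entries.
    random.shuffle(picked)
    audio, rttm, cursor = [], [], 0.0
    for spk, path in picked:
        wav, _ = sf.read(path)
        duration = len(wav) / sample_rate
        silence = np.zeros(int(random.uniform(0.2, 1.0) * sample_rate))
        audio += [wav, silence]
        rttm.append(f"SPEAKER {conv_id} 1 {cursor:.2f} {duration:.2f} <NA> <NA> {spk} <NA> <NA>")
        cursor += duration + len(silence) / sample_rate
    sf.write(f"{conv_id}.wav", np.concatenate(audio), sample_rate)
    return rttm
```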
  • 64. 3.3.3 ZALO_400
  ZALO_400, issued by the ZALO AI Challenge 2020 [83], is a wide-band data set consisting of 8.7 hours of recordings, sampled at 48000 Hz, recorded by a selected group of 400 broadcasters using planned transcripts. The content and recording environment quite resemble those of VTR_1350, while the data distributions are quite different from VTR_1350's. Most utterances are from 4 to 12 seconds, while most speakers contribute from 15 to 40 utterances, making up from 1 to 2 minutes of speech. ZALO_400 was originally released as the train data set for the challenge. However, within the scope of this thesis, it is used exclusively for testing both the speaker verification and speaker diarization tasks. Table 3.5 gives an overview of this data set and figure 3.8 demonstrates how the data is distributed.
  Table 3.5: ZALO_400 data set overview.
  Data set: ZALO_400 | Base sample rate: 48000 Hz | Environment: Controlled recording | # Speakers: 400 | # Utterances: 10555 | Total duration: 8.699 hours
  Figure 3.8: ZALO_400 data distributions.
  3.3.3.1 ZALO_400 Verification Test Set
  The verification set is generated by the same method described in section 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
  • 65. 424-64 424-35 target
  424-46 424-45 target
  424-31 424-36 target
  424-49 518 nontarget
  424-39 500-12 nontarget
  424-15 514-30 nontarget
  3.3.3.2 ZALO_400 Diarization Test Set
  The diarization test set consists of four mock conversation subsets generated from the ZALO_400 data set, including:
  • ZALO_400-MOCK_200_2: 200 2-way conversations.
  • ZALO_400-MOCK_200_3: 200 3-way conversations.
  • ZALO_400-MOCK_200_4: 200 4-way conversations.
  • ZALO_400-MOCK_200_5: 200 5-way conversations.
  These conversations are generated by the same method and configuration described in section 3.3.1.2.
  3.4 Baseline System
  The baseline system consists of three main phases: embeddings extractor training, PLDA backend training and speaker diarization. In addition to these phases, an additional phase (Phase *), speaker verification, is included to further optimize speaker diarization results.
  3.4.1 Speaker Diarization System
  The speaker diarization subsystem takes a recorded conversation as input and lets the user know the number of speakers and the timestamps of their speech within the conversation. Without further recognition, all speakers remain anonymous. The result of this subsystem can be encapsulated into a Rich Transcription Time Marked (RTTM) file [1]. The main computation pipeline, which takes a recorded conversation as input and gives an RTTM file as output, can be described in the following stages, in order:
  • 66. • Stage 1 - Front-end Processing: The input recording is windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-dimensional MFCCs.
  • Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from the input recording. In this process, the recording is sliced into uniform non-overlapping small chunks of 0.03 seconds and WebRTC VAD decides whether each chunk is speech or not. The threshold of this decision, called "aggressiveness", is set to its maximum level - level 4. After that, adjacent chunks recognized as speech are grouped together into a bigger speech segment. If the maximum PCM amplitude of a newly formed speech segment is smaller than 0.05, the segment is discarded. Furthermore, speech segments shorter than 0.2 seconds are also discarded. Finally, each speech segment is padded with 0.05 seconds at its head and at its tail; and if two consecutive speech segments overlap each other and the merged segment of those two is shorter than 15 seconds, they are merged together.
  • Stage 3 - Uniform Segmentation: Each speech partition is further sub-segmented into homogeneous overlapping sub-segments of length L seconds with an overlap of L/2 seconds. The value of L is chosen between 1.5 and 4 seconds; {L : L/2} denotes this segmenting strategy. The features extracted in stage 1 for the speech partitions of stage 2 are mapped onto these sub-segments for the next stage.
  • Stage 4 - Embeddings Extraction: An embeddings extractor is employed to extract an embedding from each sub-segment produced in the last stage. In the baseline system, the extractor is an X-Vectors embeddings extractor, trained on IPCC_110000 data for 3 epochs. The training process is executed using Kaldi's VoxCeleb recipe [84][59]. The data augmentation strategies are slightly changed to the following:
  – Additive Noises: adding a random noise sequence to the input signal with a signal-to-noise ratio (SNR) [85] randomly chosen between 0 and 15.
  – Reverberation: convolving simulated room impulse responses (RIR) [86] with the input signal.
  – Additive Noises and Reverberation: combining the additive noise and reverberation augmentations on a single input signal.
  • 67. – Speed Perturbation: speeding up and slowing down the input signal. To avoid changing the speaker characteristics too much, speed perturbation is restricted to a maximum of ±5%.
  – Waveform dropout: replacing random chunks of the input waveform with zeros.
  – Frequency dropout: filtering the input signal with random band-stop filters to add zeros in the frequency spectrum.
  The output of this stage is a 128-dimensional embedding vector for each sub-segment.
  • Stage 5 - PLDA Scoring: The PLDA back-end, trained on embeddings extracted with the same embeddings extractor from the last stage, is used to score the similarity (a single float value) between each pair of sub-segment embeddings, producing an affinity scoring matrix.
  • Stage 6 - Agglomerative Hierarchical Clustering: Sub-segment embeddings are clustered using the Agglomerative Hierarchical Clustering (AHC) method, in which the distance function directly takes results from the affinity scoring matrix generated in the last stage. This method supports clustering with a known number of speakers or an unknown number of speakers; the latter case requires a pre-defined decision threshold. At the end of this stage, each sub-segment embedding vector, or the sub-segment itself, is tagged with a number from 1 to K - the number of distinguished speakers in the audio input. Each sub-segment's begin time and duration, along with the speaker tag, are recorded in the output Rich Transcription Time Marked (RTTM) file [1].
  3.4.2 Speaker Verification System
  The main purpose of the implemented speaker verification system is to evaluate the discriminative characteristics of the extracted embeddings, without being affected by other modules in a traditional speaker diarization system. In other words, this system is used to improve and optimize the baseline diarization system in terms of speaker representations.
  From a given data set with speaker information, one can generate a verification list - a list of utterance pairs with ground truth values that tell whether each pair is from the same speaker or not.
  • 68. Figure 3.9: Baseline speaker diarization system diagram (Phase 1: embeddings extractor training and Phase 2: PLDA backend training, both on the IPCC_100K train + dev data; Phase 3: speaker diarization with front-end processing, WebRTC VAD, uniform segmentation, X-Vectors embedding extraction, PLDA scoring into an affinity scoring matrix, agglomerative hierarchical clustering and RTTM output).
  A crucial assumption is that in the enrolled or questioned utterances, there is only one participating speaker. The speaker verification system takes the verification list as input and gives a scoring list as output, which assigns a similarity value to each pair in the verification list. The higher the similarity, the higher the chance that the two utterances are from the same speaker. Then, by choosing a decision threshold - the minimum value that the similarity needs to reach for the pair of utterances to be deemed from the same speaker - one can generate a list of predictions. By comparing the verification list with the prediction list, binary classification metrics [87] including False Positive Rate (FPR) and False Negative Rate (FNR) are calculated.
  • 69. Figure 3.10: Baseline speaker verification system diagram (Phase *: speaker verification for optimization - verification pairs generated from the testing data are scored one by one with the X-Vectors extractor from Phase 1 and the PLDA backend from Phase 2, and the resulting verification scores are used for EER plotting and to derive the EER threshold).
  By repeatedly choosing every similarity value in the scoring list as the decision threshold, a plot of the rates (FPR and FNR) against the decision threshold can be made. The intersection of the FPR and FNR lines on the plot is the Equal Error Rate (EER) (as its name suggests, it is the point where the error rates are equal). Another important metric, usually calculated at the EER point on the graph, is the Minimum Decision Cost Function (MinDCF). The detailed computation methods for these metrics are described in section 3.1.1.
  The main computation pipeline, which takes a pair of utterances from the verification list and gives their similarity, is divided into the following stages, in order:
  • Stage 1 - Front-end Processing: The enrolled and the questioned utterance are windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-dimensional MFCCs.
  • 70. • Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from each utterance. The working configuration of WebRTC VAD is the same as the configuration described in the voice activity detection stage in section 3.4.1.
  • Stage 3 - Embeddings Extraction: The speech partitions extracted in stage 2 are concatenated together for each utterance, and an embeddings extractor is employed to extract speaker embeddings. It is the same extractor used in section 3.4.1. The output of this stage is one 128-dimensional embedding vector for each of the two utterances.
  • Stage 4 - PLDA Scoring: The PLDA back-end, trained on embeddings extracted with the same embeddings extractor from the last step, is used to score the similarity (a single float value) between the enrolled and the questioned utterances.
  3.5 Proposed System
  In recent years, ECAPA-TDNN, a development over X-Vectors' neural network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. Tables 3.6 and 3.7 report how ECAPA-TDNN outperforms a strong X-Vector baseline system, as experimented in [62], on both the speaker verification task and the speaker diarization task with English corpora.
  Table 3.6: EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
  • 71. Table 3.7: Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal for the baseline and proposed systems (taken from [88]).
  In the proposed system, the X-Vectors-based extractor is replaced with an ECAPA-TDNN-based extractor, and the PLDA backend is trained on ECAPA-TDNN embeddings instead. The employed ECAPA-TDNN embeddings extractor is trained on the same data set, and with the same data augmentation strategies, as the X-Vectors embeddings extractor. The network architecture is kept as described in section 2.4.2. The number of MFCCs taken is reduced from 80 down to 40, the minimum learning rate is lowered by 10 times, and the number of epochs is doubled from 10 to 20. Figures 3.11 and 3.12 visualize the proposed speaker diarization and speaker verification systems respectively.
  • 72. Figure 3.11: Proposed speaker diarization system diagram (identical to the baseline pipeline of figure 3.9, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
  • 73. Figure 3.12: Proposed speaker verification system diagram (identical to the baseline pipeline of figure 3.10, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
  • 75. Chapter 4
  Results
  4.1 Speaker Verification Task
  In this task, both the baseline and proposed speaker verification sub-systems are tested with different PLDA dimension reduction configurations. The dimension reduction ratios $r_i$ take values in $\{0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1.00\}$. The corresponding reduced, or target, dimension $D_i$ is calculated as in equation 4.1, where $V$ is the original embedding dimension: 128 in the case of the baseline system, and 192 in the case of the proposed system. The PLDA backend is trained on the same training data set as the embeddings extractor.
  $$D_i = 4 * \left\lfloor \frac{r_i * V}{4} \right\rfloor \qquad (4.1)$$
  As reported in table 4.1, the proposed system with the ECAPA-TDNN architecture outperforms the baseline system in both EER and MinDCF performance in all test cases. In the tests with the IPCC_110000 test split, the proposed system gives a 64.5% relative improvement in EER, with corresponding 82.4% and 86.5% relative improvements in MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements in MinDCF are smaller in the tests with VTR_1350, where the proposed system gives a 66.1% relative improvement in EER, with corresponding 64.7% and 63.1% relative improvements in MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements given by the proposed system in the tests with ZALO_400 are smaller than in both of the mentioned tests, but still significant: it gives a 45.6% relative improvement in EER, with corresponding 22.5% and 22.5% relative improvements in MinDCF(p=0.01) and MinDCF(p=0.001) respectively.
  • 76. Furthermore, both systems show a consistent degradation in equal error rate (EER) as the dimension reduction ratio decreases, with only two exceptions. The first exception occurs in the tests with the IPCC_110000 test split: the EER of the proposed system falls from 1.44% down to 1.39% and then rises up to 1.54% as the target dimension falls from 180 to 172 and then to 160. The second exception occurs in the tests with the ZALO_400 data set, where the EER of the proposed system falls from 8.08% down to 7.91% and then rises up to 8.16% as the target dimension goes through the same changes mentioned in the first exception. However, in both of these exceptions, the swings are insignificant and do not affect the overall trend of the EER. As for MinDCF, this metric does not show a clear trend against the reduction of the embedding dimension.
  In summary, the proposed system shows significant improvements over the baseline system, and the embedding dimension should not be further reduced in the PLDA scoring stage.
  • 77. Table 4.1: EER and MinDCF performance.
  IPCC_110000 (test split) (8000 Hz, K=3, # Trials=3888)
  X-Vector:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 128 | 0.6240 | 0.8112 | 3.91
  0.95 | 120 | 0.6183 | 0.8117 | 3.96
  0.90 | 112 | 0.6183 | 0.8066 | 4.01
  0.85 | 108 | 0.6163 | 0.8102 | 4.17
  0.80 | 100 | 0.6317 | 0.8050 | 4.27
  0.70 | 88 | 0.6497 | 0.8020 | 4.32
  0.60 | 76 | 0.6445 | 0.8138 | 4.53
  0.50 | 64 | 0.6533 | 0.7917 | 4.78
  ECAPA-TDNN:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 192 | 0.1024 | 0.1085 | 1.44
  0.95 | 180 | 0.1065 | 0.1080 | 1.44
  0.90 | 172 | 0.1096 | 0.1096 | 1.39
  0.85 | 160 | 0.0998 | 0.0998 | 1.54
  0.80 | 152 | 0.1070 | 0.1070 | 1.54
  0.70 | 132 | 0.0983 | 0.0983 | 1.65
  0.60 | 112 | 0.1101 | 0.1101 | 1.85
  0.50 | 96 | 0.1240 | 0.1240 | 2.01
  VTR_1350 (8000 Hz (resampled), K=3, # Trials=7902)
  X-Vector:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 128 | 0.7588 | 0.8651 | 9.69
  0.95 | 120 | 0.7603 | 0.8641 | 9.82
  0.90 | 112 | 0.7472 | 0.8659 | 9.87
  0.85 | 108 | 0.7325 | 0.8669 | 9.92
  0.80 | 100 | 0.7327 | 0.8502 | 10.02
  0.70 | 88 | 0.7423 | 0.8322 | 10.10
  0.60 | 76 | 0.7261 | 0.8461 | 10.30
  0.50 | 64 | 0.7459 | 0.8494 | 10.48
  ECAPA-TDNN:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 192 | 0.2680 | 0.3192 | 3.29
  0.95 | 180 | 0.2721 | 0.3680 | 3.49
  0.90 | 172 | 0.2797 | 0.3936 | 3.54
  0.85 | 160 | 0.3055 | 0.4052 | 3.59
  0.80 | 152 | 0.2971 | 0.3941 | 3.62
  0.70 | 132 | 0.3214 | 0.4432 | 3.70
  0.60 | 112 | 0.3774 | 0.4450 | 3.77
  0.50 | 96 | 0.3991 | 0.4938 | 3.77
  ZALO_400 (8000 Hz (resampled), K=3, # Trials=2376)
  X-Vector:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 128 | 0.9470 | 0.9470 | 14.39
  0.95 | 120 | 0.9562 | 0.9562 | 14.56
  0.90 | 112 | 0.9444 | 0.9444 | 14.65
  0.85 | 108 | 0.9444 | 0.9444 | 14.73
  0.80 | 100 | 0.9402 | 0.9402 | 14.90
  0.70 | 88 | 0.9478 | 0.9478 | 14.90
  0.60 | 76 | 0.9621 | 0.9621 | 15.24
  0.50 | 64 | 0.9739 | 0.9739 | 14.65
  ECAPA-TDNN:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 192 | 0.7340 | 0.7340 | 7.83
  0.95 | 180 | 0.7424 | 0.7424 | 8.08
  0.90 | 172 | 0.7306 | 0.7306 | 7.91
  0.85 | 160 | 0.7214 | 0.7214 | 8.16
  0.80 | 152 | 0.6987 | 0.6987 | 8.16
  0.70 | 132 | 0.7079 | 0.7079 | 8.67
  0.60 | 112 | 0.7374 | 0.7374 | 9.34
  0.50 | 96 | 0.7332 | 0.7332 | 9.93
  • 78. 4.2 Speaker Diarization Task
  In this task, the whole baseline and proposed systems are tested with mock conversations consisting of different numbers of engaging speakers, under different uniform sub-segmenting configurations. In this test, oracle VAD (i.e. ground-truth VAD) is used to remove the effect of any voice activity detection module, PLDA scoring is carried out without dimension reduction, and the exact number of engaging speakers in each conversation is known before the clustering process. Results are reported in table 4.2, where {x : y} represents a uniform segmentation configuration of windows of length x seconds with y seconds of overlap.
  Figure 4.1: A speaker diarization output of a 3-way conversation in the VTR_1350 test set.
  Both systems perform relatively well on the IPCC_110000 test split's mock conversations, where DERs are all below 4.5 percent. This result matches the fact that these conversations are generated from the data set that is in-domain with the embedding extractor's training data. The results with ZALO_400 mock conversations are much worse, with the DER going up to 17.15% in the case of the baseline system and 11.20% in the case of the proposed system. With VTR_1350 mock conversations, the results are even worse than that: the DER goes up to 24.25% in the case of the baseline system and 22.33% in the case of the proposed system.
  In most cases, the proposed system with ECAPA-TDNN outperforms the baseline system, with most conversation types and all sub-segmentation configurations. In each set of conversations with the same number of participating speakers, the best DER among the different sub-segmenting configurations of the proposed system usually outperforms that of the baseline system by 30% to 70%.
  • 79. Furthermore, while both systems perform better with a wider provided context (i.e. a larger sub-segmentation window size), the proposed system does not show a significantly faster relative DER reduction as the context window size grows. In other words, the proposed system's DER would be expected to decrease faster than the baseline system's DER, since ECAPA-TDNN theoretically makes better use of the wide context thanks to its attention mechanism, yet this is not observed. This experiment indicates that in the speaker diarization task, where speech segments are sub-segmented into sub-segments shorter than 4 seconds, the attention over the time channel is not very effective.
  In conclusion, the proposed system gives a significant improvement in DER performance over the baseline system.
  • 80. Table 4.2: DER performance.
  IPCC_110000.test - Mock Conversations (8000 Hz; 2-way, 3-way, 4-way and 5-way conversations (200 each))
  # spk | subseg. | DER (%) X-Vector | DER (%) ECAPA
  2 | {1.5 : 0.75} | 2.93 | 2.66
  2 | {2.0 : 1.00} | 2.72 | 2.06
  2 | {3.0 : 1.50} | 2.54 | 1.70
  2 | {4.0 : 2.00} | 2.17 | 1.50
  3 | {1.5 : 0.75} | 2.65 | 2.65
  3 | {2.0 : 1.00} | 2.72 | 1.36
  3 | {3.0 : 1.50} | 2.34 | 0.93
  3 | {4.0 : 2.00} | 2.10 | 0.89
  4 | {1.5 : 0.75} | 4.35 | 4.33
  4 | {2.0 : 1.00} | 4.43 | 2.82
  4 | {3.0 : 1.50} | 3.53 | 2.36
  4 | {4.0 : 2.00} | 2.88 | 2.93
  5 | {1.5 : 0.75} | 5.27 | 3.88
  5 | {2.0 : 1.00} | 3.78 | 2.63
  5 | {3.0 : 1.50} | 3.26 | 2.91
  5 | {4.0 : 2.00} | 3.03 | 3.14
  VTR_1350 - Mock Conversations (8000 Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
  # spk | subseg. | DER (%) X-Vector | DER (%) ECAPA
  2 | {1.5 : 0.75} | 11.31 | 18.59
  2 | {2.0 : 1.00} | 9.32 | 8.41
  2 | {3.0 : 1.50} | 5.31 | 2.66
  2 | {4.0 : 2.00} | 2.45 | 1.31
  3 | {1.5 : 0.75} | 17.77 | 20.13
  3 | {2.0 : 1.00} | 12.52 | 12.80
  3 | {3.0 : 1.50} | 8.39 | 5.63
  3 | {4.0 : 2.00} | 7.08 | 3.42
  4 | {1.5 : 0.75} | 21.18 | 21.63
  4 | {2.0 : 1.00} | 16.54 | 12.71
  4 | {3.0 : 1.50} | 10.34 | 5.34
  4 | {4.0 : 2.00} | 8.20 | 4.44
  5 | {1.5 : 0.75} | 24.25 | 22.33
  5 | {2.0 : 1.00} | 18.24 | 11.70
  5 | {3.0 : 1.50} | 12.16 | 4.90
  5 | {4.0 : 2.00} | 8.63 | 2.95
  ZALO_400 - Mock Conversations (8000 Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
  # spk | subseg. | DER (%) X-Vector | DER (%) ECAPA
  2 | {1.5 : 0.75} | 5.62 | 4.86
  2 | {2.0 : 1.00} | 4.78 | 3.60
  2 | {3.0 : 1.50} | 5.40 | 2.72
  2 | {4.0 : 2.00} | 6.61 | 6.06
  3 | {1.5 : 0.75} | 10.85 | 7.11
  3 | {2.0 : 1.00} | 10.11 | 6.15
  3 | {3.0 : 1.50} | 9.72 | 5.77
  3 | {4.0 : 2.00} | 10.92 | 7.59
  4 | {1.5 : 0.75} | 15.35 | 8.85
  4 | {2.0 : 1.00} | 12.83 | 6.59
  4 | {3.0 : 1.50} | 11.06 | 6.14
  4 | {4.0 : 2.00} | 12.17 | 6.58
  5 | {1.5 : 0.75} | 17.15 | 11.20
  5 | {2.0 : 1.00} | 13.79 | 8.72
  5 | {3.0 : 1.50} | 12.31 | 6.76
  5 | {4.0 : 2.00} | 12.44 | 7.09
  • 81. Chapter 5
  Conclusions and Future Works
  In this thesis, a new deep neural network architecture, ECAPA-TDNN, was experimented with in comparison to the baseline system based on X-Vectors, and showed significant overall improvements. The proposed system outperformed the baseline on all Vietnamese data sets and on both tasks: speaker verification and speaker diarization. Thanks to the attention mechanism that operates on both time and feature channels, the proposed network can learn which data in the context and which features are more important. This context-aware property of ECAPA-TDNN is remarkably important, since different languages have different ways of constructing sentences, and word positioning in Vietnamese is totally different from that of English or French. In this sense, ECAPA-TDNN can be adapted to a wide variety of languages with different writing styles, and indeed it worked with Vietnamese conversations. The following are some highlighted pros and cons of using ECAPA-TDNN in the speaker diarization system:
  Pros:
  • ECAPA-TDNN provides context-aware embeddings with attention on both time and feature channels that work exceptionally well with Vietnamese data.
  • Based entirely on the PyTorch framework, ECAPA-TDNN is much easier to train, test and customize than in Kaldi.
  Cons:
  • Both the training and inference processes are slower due to the complexity of the network. Even with an NVIDIA A100 GPU, it takes 80 hours to complete 20 training epochs with the IPCC_110000 data set.
  • 82. • The network is not yet production-ready, while the X-Vectors network trained with Kaldi has long been used in production, for both the speaker verification and diarization systems.
  Further research directions that can be explored to improve the understanding of the capability of ECAPA-TDNN in speaker diarization include:
  • Trial and error with more configurations (in this thesis, only some minor changes were made to the original network configuration).
  • Exploring other types of clustering methods.
  • Studying how effective the proposed system is in the case of conversations with multiple overlaps.
  • Applying post-processing methods to the diarization result.
  • Building a Vietnamese conversation data set based on real conversations.
  • 83. Bibliography
  [1] Omid Sadjadi et al. NIST 2021 Speaker Recognition Evaluation Plan. 2021. URL: https://tsapps.nist.gov/publication/get%5Fpdf.cfm?pub%5Fid=932697.
  [2] David Arthur and Sergei Vassilvitskii. "k-means++: the advantages of careful seeding". In: SODA '07. 2007.
  [3] Dan Pelleg and Andrew W. Moore. "X-means: Extending K-means with Efficient Estimation of the Number of Clusters". In: ICML. 2000.
  [4] Aonan Zhang et al. "Fully Supervised Speaker Diarization". In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), pp. 6301–6305.
  [5] Shota Horiguchi et al. "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors". In: ArXiv abs/2005.09921 (2020).
  [6] Yuki Takashima et al. "End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection". In: 2021 IEEE Spoken Language Technology Workshop (SLT) (2021), pp. 849–856.
  [7] Tsun-Yat Leung and Lahiru Samarakoon. "Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty". In: Interspeech 2021 (2021).
  [8] Niruhan Viswarupan. K-Means Data Clustering. 2017. URL: https://towardsdatascience.com/k-means-data-clustering-bce3335d2203 (visited on 12/09/2021).
  [9] Sabur Ajibola Alim and Nahrul Khair Alang Rashid. "From Natural to Artificial Intelligence - Algorithms and Applications". In: IntechOpen, 2018. Chap. 1.
  [10] Urmila Shrawankar and Vilas M. Thakare. "Techniques for Feature Extraction In Speech Recognition System: A Comparative Study". In: CoRR abs/1305.1145 (2013). arXiv: 1305.1145. URL: http://arxiv.org/abs/1305.1145.
  • 84. [11] Smita Magre, Pooja Janse, and Ratnadeep Deshmukh. "A Review on Feature Extraction and Noise Reduction Technique". In: (Feb. 2014).
  [12] Bob Meddins. "5 - The design of FIR filters". In: Introduction to Digital Signal Processing. Ed. by Bob Meddins. Oxford: Newnes, 2000, pp. 102–136. ISBN: 978-0-7506-5048-9. DOI: https://doi.org/10.1016/B978-075065048-9/50007-6. URL: https://www.sciencedirect.com/science/article/pii/B9780750650489500076.
  [13] Torben Poulsen. "Loudness of tone pulses in a free field". In: Acoustical Society of America Journal 69.6 (June 1981), pp. 1786–1790. DOI: 10.1121/1.385915.
  [14] Stanislas Dehaene. "The neural basis of the Weber–Fechner law: a logarithmic mental number line". In: Trends in Cognitive Sciences 7.4 (2003), pp. 145–147.
  [15] S. S. Stevens. "A Scale for the Measurement of the Psychological Magnitude Pitch". In: Acoustical Society of America Journal 8.3 (Jan. 1937), p. 185. DOI: 10.1121/1.1915893.
  [16] Robert B. Randall. "A history of cepstrum analysis and its application to mechanical problems". In: Mechanical Systems and Signal Processing 97 (2017). Special Issue on Surveillance, pp. 3–19. ISSN: 0888-3270. DOI: https://doi.org/10.1016/j.ymssp.2016.12.026. URL: https://www.sciencedirect.com/science/article/pii/S0888327016305556.
  [17] Philipos C. Loizou. Speech Enhancement: Theory and Practice. 2nd. USA: CRC Press, Inc., 2013. ISBN: 1466504218.
  [18] Xugang Lu et al. "Speech enhancement based on deep denoising autoencoder". In: INTERSPEECH. 2013.
  [19] Yong Xu et al. "A Regression Approach to Speech Enhancement Based on Deep Neural Networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.1 (2015), pp. 7–19. DOI: 10.1109/TASLP.2014.2364452.
  [20] Hakan Erdogan et al. "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks". In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), pp. 708–712.
  [21] Tian Gao et al. "Densely Connected Progressive Learning for LSTM-Based Speech Enhancement". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 5054–5058. DOI: 10.1109/ICASSP.2018.8461861.
  • 85. [22] Desh Raj et al. Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis. 2020. arXiv: 2011.02014 [eess.AS].
  [23] Gregory Sell et al. "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge". In: INTERSPEECH. 2018.
  [24] Neville Ryant et al. The Second DIHARD Diarization Challenge: Dataset, task, and baselines. 2019. arXiv: 1906.07839 [eess.AS].
  [25] Mireia Díez et al. "BUT System for DIHARD Speech Diarization Challenge 2018". In: INTERSPEECH. 2018.
  [26] Shinji Watanabe et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings". In: Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020). 2020, pp. 1–7. DOI: 10.21437/CHiME.2020-1.
  [27] Ashish Arora et al. The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge. 2020. arXiv: 2006.07898 [eess.AS].
  [28] Wikipedia contributors. Maximum likelihood estimation — Wikipedia, The Free Encyclopedia. 2021. URL: https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=1051139067 (accessed 17-November-2021).
  [29] John R. Hershey et al. Deep clustering: Discriminative embeddings for segmentation and separation. 2015. arXiv: 1508.04306 [cs.NE].
  [30] Morten Kolbæk et al. Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks. 2017. arXiv: 1703.06284 [cs.SD].
  [31] Yi Luo and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019), pp. 1256–1266. ISSN: 2329-9304. DOI: 10.1109/taslp.2019.2915167. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167.
  [32] Xiong Xiao et al. Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020. 2020. arXiv: 2010.11458 [eess.AS].
  • 86. [33] Arsha Nagrani et al. VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge. 2020. arXiv: 2012.06867 [cs.SD].
  [34] Takuya Yoshioka et al. "Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks". In: Interspeech 2018 (2018). DOI: 10.21437/interspeech.2018-2284. URL: http://dx.doi.org/10.21437/Interspeech.2018-2284.
  [35] Christoph Boeddecker et al. "Front-end processing for the CHiME-5 dinner party scenario". In: Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018). 2018, pp. 35–40. DOI: 10.21437/CHiME.2018-8.
  [36] Wikipedia contributors. Audacity (audio editor) — Wikipedia, The Free Encyclopedia. 2021. URL: https://en.wikipedia.org/w/index.php?title=Audacity_(audio_editor)&oldid=1054771106 (accessed 17-November-2021).
  [37] A. Benyassine et al. "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications". In: IEEE Communications Magazine 35.9 (1997), pp. 64–73. DOI: 10.1109/35.620527.
  [38] Jongseo Sohn and Wonyong Sung. "A voice activity detector employing soft decision based noise spectrum adaptation". In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181). Vol. 1. 1998, pp. 365–368. DOI: 10.1109/ICASSP.1998.674443.
  [39] Monica Franzese and Antonella Iuliano. "Hidden Markov Models". In: Encyclopedia of Bioinformatics and Computational Biology. Ed. by Shoba Ranganathan et al. Oxford: Academic Press, 2019, pp. 753–762. ISBN: 978-0-12-811432-2. DOI: https://doi.org/10.1016/B978-0-12-809633-8.20488-3. URL: https://www.sciencedirect.com/science/article/pii/B9780128096338204883.
  [40] Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. "A statistical model-based voice activity detection". In: IEEE Signal Processing Letters 6.1 (1999), pp. 1–3. DOI: 10.1109/97.736233.
  [41] Jacob Benesty, M. Mohan Sondhi, Yiteng Huang, et al. Springer Handbook of Speech Processing. Vol. 1. Springer, 2008.
  • 87. [42] Wikipedia contributors. WebRTC — Wikipedia, The Free Encyclopedia. 2021. URL: https://en.wikipedia.org/w/index.php?title=WebRTC&oldid=1053350113 (accessed 17-November-2021).
  [43] Webrtc/common_audio/VAD - external/webrtc - git at Google. URL: https://chromium.googlesource.com/external/webrtc/+/branch-heads/43/webrtc/common%5Faudio/vad/.
  [44] Thad Hughes and Keir Mierle. "Recurrent neural networks for voice activity detection". In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013, pp. 7378–7382. DOI: 10.1109/ICASSP.2013.6639096.
  [45] Jesus Lopez et al. "Advances in Speaker Recognition for Telephone and Audio-Visual Data: the JHU-MIT Submission for NIST SRE19". In: Nov. 2020, pp. 273–280. DOI: 10.21437/Odyssey.2020-39.
  [46] Matthew A. Siegler. "Automatic Segmentation, Classification and Clustering of Broadcast News Audio". In: 1997.
  [47] "Step-by-step and integrated approaches in broadcast news speaker diarization". In: Computer Speech & Language 20.2 (2006). Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 303–330. ISSN: 0885-2308. DOI: https://doi.org/10.1016/j.csl.2005.08.002. URL: https://www.sciencedirect.com/science/article/pii/S0885230805000471.
  [48] Scott Chen. "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion". In: 1998.
  [49] Perrine Delacourt and Christian Wellekens. "DISTBIC: A speaker-based segmentation for audio data indexing". In: Speech Communication 32 (Sept. 2000), pp. 111–126. DOI: 10.1016/S0167-6393(00)00027-3.
  [50] Simon Prince and James H. Elder. "Probabilistic Linear Discriminant Analysis for Inferences About Identity". In: 2007 IEEE 11th International Conference on Computer Vision (2007), pp. 1–8.
  [51] Daniel Garcia-Romero and Carol Y. Espy-Wilson. "Analysis of i-vector Length Normalization in Speaker Recognition Systems". In: INTERSPEECH. 2011.
  [52] In: ().
  [53] Gregory Sell and Daniel Garcia-Romero. "Speaker diarization with PLDA i-vector scoring and unsupervised calibration". In: 2014 IEEE Spoken Language Technology Workshop (SLT) (2014), pp. 413–417.