HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
A Study on Improving
Speaker Diarization System
TUNG LAM NGUYEN
lamfm95@gmail.com
Dept. of Control Engineering and Automation
Supervisor: Dr. T. Anh Xuan Tran
School: School of Electrical Engineering
Hanoi, March 1, 2022
Declaration of Authorship
I, Tung Lam NGUYEN, declare that this thesis titled, “A Study on Improving Speaker
Diarization System” and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree at
this University.
• Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
• Where I have consulted the published work of others, this is always clearly at-
tributed.
• Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
Signed:
Date:
“I’m not much but I’m all I have.”
- Philip K Dick, Martian Time-Slip
Abstract
Speaker diarization is the method of dividing a conversation into segments spoken by the same speaker, usually referred to as “who spoke when”. At Viettel, this task is especially important to the IP contact center (IPCC) automatic quality assurance system, by which hundreds of thousands of calls are processed every day. Integrated within a speaker recognition system, speaker diarization helps distinguish between agents and customers within each support call and gives further useful insights (e.g., agent attitude and customer satisfaction). The key to performing this task accurately is to learn discriminative speaker representations. X-Vectors, bottleneck features of a time-delayed neural network (TDNN), have emerged as the speaker representations of choice for many speaker diarization systems. On the other hand, ECAPA-TDNN, a recent development over X-Vectors' neural network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. Therefore, the aim of this work is to explore the capability of ECAPA-TDNN versus X-Vectors in the current Vietnamese speaker diarization system. Both baseline and proposed systems are evaluated in two tasks: speaker verification, to evaluate the discriminative characteristics of the speaker representations; and speaker diarization, to evaluate how these speaker representations affect the whole complex system. The data used include private data sets (IPCC_110000, VTR_1350) and a public data set (ZALO_400). In general, the conducted experiments show that the proposed system outperforms the baseline system on all tasks and on all data sets.
Acknowledgements
First and foremost, I would like to express my deep gratitude to my main supervisor, Dr. T. Anh Xuan Tran. Without her outstanding guidance and patience, I would never have finished this thesis.
I would like to thank Dr. Van Hai Do, Mr. Nhat Minh Le and colleagues at Viettel Cyberspace Center, whose kindness and tremendous technical assistance made my days working on this thesis much easier.
Finally, huge thanks to my friends for giving me stress relief on weekends, and to my family, who did most of the cooking so I would have more time to work on this thesis.
Hanoi, March 1, 2022
Contents
Declaration of Authorship
Abstract
Acknowledgements
1 Introduction
1.1 Research Interest
1.2 Thesis Outline
2 Speaker Diarization System
2.1 Front-end Processing
2.1.1 Features Extraction
2.1.2 Front-end Post-processing
2.1.2.1 Speech Enhancement
2.1.2.2 De-reverberation
2.1.2.3 Speech Separation
2.2 Voice Activity Detection
2.3 Segmentation
2.4 Speaker Representations
2.4.1 X-Vector Embeddings
2.4.1.1 Frame Level
2.4.1.2 Segment Level
2.4.2 ECAPA-TDNN Embeddings
2.4.2.1 Frame-level
2.4.2.1.1 1D Convolutional Layer
2.4.2.1.2 1D Squeeze-and-Excitation Block
2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block
2.4.2.2 Segment-level
2.4.2.2.1 Attentive Statistical Pooling
2.5 Clustering
2.5.1 PLDA Scoring
2.5.2 Agglomerative Hierarchical Clustering
3 Experiments
3.1 Evaluation Metrics
3.1.1 Equal Error Rate and Minimum Decision Cost Function
3.1.2 Diarization Error Rate
3.2 Frameworks
3.2.1 Kaldi
3.2.2 SpeechBrain
3.2.3 Kal-Star
3.3 Data Sets
3.3.1 IPCC_110000
3.3.1.1 IPCC_110000 Verification Test Set
3.3.1.2 IPCC_110000 Diarization Test Set
3.3.2 VTR_1350
3.3.2.1 VTR_1350 Verification Test Set
3.3.2.2 VTR_1350 Diarization Test Set
3.3.3 ZALO_400
3.3.3.1 ZALO_400 Verification Test Set
3.3.3.2 ZALO_400 Diarization Test Set
3.4 Baseline System
3.4.1 Speaker Diarization System
3.4.2 Speaker Verification System
3.5 Proposed System
4 Results
4.1 Speaker Verification Task
4.2 Speaker Diarization Task
5 Conclusions and Future Works
List of Figures
1.1 A traditional speaker diarization system diagram.
1.2 An example speaker diarization result.
1.3 An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.
1.4 Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
1.5 Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in system 1.4.
2.1 Diagram of an F-banks / MFCCs extraction process (adapted from [11]).
2.2 N=10 Mel filters for signal samples sampled at 16000 Hz.
2.3 Example output of a VAD system visualized in Audacity (audio editor) [36].
2.4 Diagram of X-Vectors DNN (adapted from [58]).
2.5 Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]).
2.6 Diagram of X-Vectors' segment-level DNN (as configured in [59]).
2.7 Complete network architecture of ECAPA-TDNN (adapted from [62]).
2.8 Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {-4,0,4}.
2.9 A 1D-Squeeze-and-Excitation block. Different colors represent different scales for channels.
2.10 A Res2Net-with-Squeeze-Excitation block.
2.11 Attentive Statistics Pooling (on both time frames and channels).
2.12 An example of LDA transformation from 2D to 1D (taken from [76]).
2.13 Fitting the parameters of the PLDA model (taken from [77]).
2.14 Agglomerative hierarchical clustering flowchart.
2.15 An example iterative process of agglomerative hierarchical clustering (taken from [80]).
2.16 Visualization of the result of hierarchical clustering (taken from [80]).
3.1 An EER plot.
3.2 Kaldi logo.
3.3 Kaldi general architecture diagram.
3.4 Filtering VTR_1350 data set by utterances' durations and number of utterances per speaker.
3.5 Generating 200 5-way conversations from VTR_1350 data set. The min. and max. numbers of utterances picked for each conversation are 2 and 30 respectively.
3.6 IPCC_110000 data distributions.
3.7 VTR_1350 data distributions.
3.8 ZALO_400 data distributions.
3.9 Baseline speaker diarization system diagram.
3.10 Baseline speaker verification system diagram.
3.11 Proposed speaker diarization system diagram.
3.12 Proposed speaker verification system diagram.
4.1 A speaker diarization output of a 3-way conversation in VTR_1350 test set.
List of Tables
3.1 List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
3.2 IPCC_110000 data set overview.
3.3 IPCC_110000 data subsets.
3.4 VTR_1350 data set overview.
3.5 ZALO_400 data set overview.
3.6 EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
3.7 Diarization Error Rates (DERs) on AMI dataset using the beamformed array signal on baseline and proposed systems (taken from [88]).
4.1 EER and MinDCF performance.
4.2 DER performance.
List of Abbreviations
IPCC IP Contact Center
DNN Deep Neural Network
CNN Convolutional Neural Network
TDNN Time-Delayed Neural Network
RTTM Rich Transcription Time Marked
RNN Recurrent Neural Network
LPC Linear Prediction Coding
PLP Perceptual Linear Prediction
DWT Discrete Wavelet Transform
MFBC Mel Filterbank Coefficients
MFCC Mel Frequency Cepstral Coefficients
STFT Short-time Discrete Fourier Transform
DCT Discrete Cosine Transform
WPE Weighted Prediction Error
MLE Maximum Likelihood Estimation
PIT Permutation Invariant Training
VAD Voice Activity Detection
SAD Speech Activity Detection
HMM Hidden Markov Model
GMM Gaussian Mixture Model
GLR Generalized Likelihood Ratio
BIC Bayesian Information Criterion
UBM Universal Background Model
LDA Linear Discriminant Analysis
PLDA Probabilistic Linear Discriminant Analysis
LSTM Long Short-Term Memory
SE-Res2Net Res2Net-with-Squeeze-Excitation
ReLU Rectified Linear Unit
AAM Additive Angular Margin
AHC Agglomerative Hierarchical Clustering
EER Equal Error Rate
CER Crossover Error Rate
FAR False Acceptance Rate
FRR False Rejection Rate
TPR True Positive Rate
FPR False Positive Rate
FNR False Negative Rate
MinDCF Minimum Decision Cost Function
DER Diarization Error Rate
PCM Pulse-Code Modulation
SNR Signal-to-Noise Ratio
Chapter 1
Introduction
1.1 Research Interest
Speaker diarization, usually referred to as "who spoke when", is the method of dividing a conversation that often includes a number of speakers into segments spoken by the same speaker. This task is especially important to the Viettel IP contact center (IPCC) automatic quality assurance system, where hundreds of thousands of calls are processed every day and the human resources are limited and costly. In scenarios where only single-channel recordings are provided, speaker diarization, integrated within a speaker recognition system, helps distinguish between agents and customers within each support call and gives further useful insights (e.g., agent attitude and customer satisfaction). Nevertheless, speaker diarization can also be applied to analyzing other forms of recorded conversations such as meetings, medical therapy sessions, court sessions, and talk shows.
FIGURE 1.1: A traditional speaker diarization system diagram (audio input → front-end processing → voice activity detection → segmentation → speaker representation → clustering → post-processing → diarization output).
A traditional speaker diarization system (figure 1.1) is built from six modules: front-end processing, voice activity detection, segmentation, speaker representation, clustering, and post-processing. All output information, including the number of speakers and the beginning time and duration of each of their speech segments, is encapsulated in the form of a Rich Transcription Time Marked (RTTM) file [1] (figure 1.2).
FIGURE 1.2: An example speaker diarization result.
An important factor that affects the speaker diarization accuracy is the number of par-
ticipating speakers in the conversation. This number could be revealed or hidden from
the system before the diarization process, depending on the nature of the conversations.
An example of the case where it’s revealed is a check-up call between a doctor and a pa-
tient, or a support call between a customer and an agent, which is usually a conversation
between only two people (i.e: a 2-way conversation), assuming there’s no new speaker
interrupting or joining the conversation.
By acknowledging that the conversation has only a defined number of speakers, the speaker diarization system can simply slice the recorded conversation into smaller speech partitions and classify them into a known number of clusters (e.g., using k-means [2]). However, in the case that the number of speakers is unknown to the system, the system must guess it first. The guessing ends when a stopping criterion (or a decision threshold) is met. For example, in a company meeting with shareholders, although the number of people participating in the meeting is on record, the number of people who actually speak in that meeting is unknown: while most people might remain silent throughout the meeting, only board members would take the mic. In this case, the diarization system can employ an unsupervised clustering method (e.g., X-means [3]) or a supervised clustering method (e.g., UIS-RNN [4]).
In fact, multiple attempts to build an end-to-end speaker diarization system without a clustering module have been made in [5], [6], and [7]. However, this thesis only focuses on a traditional speaker diarization system, which employs a speaker clustering module. Figure 1.3 demonstrates an example clustering result.
FIGURE 1.3: An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.
In this case, extracting speaker representations with discriminative characteristics is extremely important, since these representations, as input data, have a huge influence on the accuracy of the clustering stage. The discriminative characteristics of a speaker representation method can be tested indirectly via the speaker diarization system, which is the main focus of this thesis. Alternatively, they can be tested directly via a simple speaker verification system. A speaker verification system verifies the identity of a questioned speaker by comparing the voice data that supposedly belongs to him with his enrolled voice data. If the similarity between the enrolled data and the input is lower than a determined threshold, the impostor gets rejected.
In summary, the speaker diarization system and the speaker verification system are correlated in the sense that they use the same way of representing speakers. Hence, optimizations in the speaker verification system would also lead to improvements in the speaker diarization system, which is the main approach of this thesis. Figure 1.4 demonstrates a generic system that employs both speaker diarization and speaker verification. The speaker verification system, employing the same embeddings extractor and PLDA backend used in the speaker diarization system, is primarily used for optimizing the diarization performance. In this thesis, both baseline and proposed systems are based on this generic model. It is also noted that the post-processing module is left out.
FIGURE 1.4: Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training, and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
FIGURE 1.5: Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in figure 1.4. Enrolled and impostor audio inputs are processed (front-end processing, voice activity detection, embedding extraction), each verification pair is scored with the PLDA backend, and the resulting scores are used to plot the EER. This system is primarily used to optimize the speaker diarization system; the EER threshold can be used for clustering without knowing the number of speakers in system 1.4.
1.2 Thesis Outline
This thesis is organized into 5 chapters with the following contents:
Chapter 1 The current chapter gives general information about the research interest and a general overview of speaker diarization and its correlated task, speaker verification.
Chapter 2 This chapter presents the components of a speaker diarization system, including all components of the implemented speaker verification system. Notable topics are the X-Vector and ECAPA-TDNN speaker representations.
Chapter 3 This chapter discusses evaluation metrics, used data sets, and applied methods.
Chapter 4 This chapter closely examines the experiments' results.
Chapter 5 This chapter summarizes the work in this thesis and gives some future directions.
Chapter 2
Speaker Diarization System
2.1 Front-end Processing
The very first stage of a speaker diarization system (or more generally speaking, a speech
processing system) is front-end processing. At this stage, acoustic features are curated
and processed in such a way that is considered most favorable for the system. The fea-
tures must be in balance between simplicity (not much correlated and simple enough to
be input of the learning network), and complexity (still possessing useful information,
making space for the network to learn). Afterwards, some front-end post-processing
techniques, such as speech enhancement, speech de-reverberation, speech separation and
target speaker extraction, could be performed to further enhance speech features towards
better speaker diarization performance.
2.1.1 Features Extraction
There’s a wide variety of methods to represent the speech signal parametrically, such
as linear prediction coding (LPC), perceptual linear prediction (PLP), discrete wavelet
transform (DWT), Mel filterbank coefficients (MFBCs), and Mel frequency cepstral
coefficients (MFCCs). However, in the last twenty years, the last two methods have emerged as the features of choice in the field of speech processing [9][10]. Figure 2.1 demonstrates a typical F-banks / MFCCs extraction process.
At the beginning of the F-banks / MFCCs extraction process, the input speech signal is divided into homogeneous overlapping short frames (in most cases short frames of 25 ms with overlaps of 10 ms) and windowed (usually with a Hamming / Hanning window) to reduce artifacts in the later signal transforms [12] (i.e., with a 16000 Hz sampled speech signal, each frame contains 16000 * 25/1000 = 400 samples, with an overlap of 10/25 * 400 = 160 samples with the previous frame).

FIGURE 2.1: Diagram of an F-banks / MFCCs extraction process (framing and windowing → STFT → log-amplitude → N Mel filters → MFBCs; sum and DCT → MFCCs; adapted from [11]).
Afterwards, the framed and windowed signal is analysed by a Short-Time Discrete Fourier Transform (STFT), through which the signal sample is converted from the amplitude-time domain to the amplitude-frequency domain.
Next, the y-axis (amplitude) is converted to a log scale, as this better represents the human perception of loudness [13], according to the Weber-Fechner law [14].
Then, the x-axis (frequency) is also converted, but into the mel scale [15]. The reason for this conversion is the fact that our human ears have lower resolution at higher frequencies than at lower frequencies (i.e., it is fairly easy for us to distinguish between sounds at 300 Hz and 400 Hz, but it gets much harder when we have to compare sounds at 1300 Hz and 1400 Hz, even though the difference is still 100 Hz). The mel-scale formula was discovered purely via psychological experiments and has many variations, one of which can be expressed as follows:

m = 2595 \log_{10}(1 + f/700) = 1127 \ln(1 + f/700)    (2.1)
At this stage, Mel filterbank coefficients (MFBCs) can be computed in two steps:
• Step 1: Apply a chosen number N of triangular band-pass filters linearly spaced in
Mel scale:
– The lowest and highest frequencies of the bands correspond to the lowest and highest frequencies of the initially sampled signal.
– In the mel-frequency domain, these filters have the same bandwidth, and they overlap each other by half of the bandwidth.
– Each filter is a triangular filter with the frequency response of 1 at the center
frequency, and decreases linearly towards zero till it reaches the center frequen-
cies of the two adjacent filters.
• Step 2: Compute N mean log-amplitude values of N filtered signal samples (from
the original signal sample). These values are taken as N Mel filterbank coefficients.
For example, for signal samples sampled at 16000Hz, N=10 Mel filters can be visualized
in the following figure:
FIGURE 2.2: N=10 Mel filters for signal samples sampled at 16000 Hz.
Going further, to obtain Mel frequency cepstral coefficients (MFCCs), the following steps, which continue from step 2 of the MFBCs' computation above, are performed (a code sketch follows these steps):
• Step 1: Sum all N Mel-filtered signal samples to obtain a Mel-weighted signal
sample.
• Step 2: Apply the discrete cosine transform (DCT) to transform the signal from the log-mel frequency domain to the quefrency [16] domain.
• Step 3: Take the first K coefficients (usually K=13) as the K MFCCs. The next steps are optional, to generate higher-resolution MFCCs.
• Step 4: Compute first-order and second-order time derivatives of each coefficient to yield K*2 more coefficients.
• Step 5: Compute the sum of squares of the amplitudes of the signal sample to obtain one energy coefficient. After this step, the total number of MFCCs is K*3+1 (i.e., K=13 corresponds to 40 MFCCs).
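The extraction pipeline above can be approximated in a few lines. The following is a minimal sketch using librosa, assuming a 16000 Hz mono WAV file; the file name, the choice of N=40 mel filters and K=13 cepstral coefficients are illustrative, not the exact configuration used later in this thesis.

# A minimal sketch of the F-banks / MFCCs pipeline described above, using librosa.
# Frame / overlap values follow the text (25 ms frames, 10 ms overlap at 16 kHz).
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)     # placeholder path
n_fft = int(0.025 * sr)                             # 400-sample (25 ms) frames
hop = n_fft - int(0.010 * sr)                       # 160-sample (10 ms) overlap -> 240-sample hop

# Mel filterbank coefficients (MFBCs): log-amplitude of N mel-filtered bands per frame.
mel_power = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=40)
mfbc = librosa.power_to_db(mel_power)               # shape: (40, num_frames)

# MFCCs: DCT of the log-mel spectrum, keeping the first K=13 coefficients.
mfcc = librosa.feature.mfcc(S=mfbc, n_mfcc=13)      # shape: (13, num_frames)

# Optional high-resolution MFCCs: first/second order deltas plus a log-energy term.
delta1 = librosa.feature.delta(mfcc, order=1)
delta2 = librosa.feature.delta(mfcc, order=2)
frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
log_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)
n = min(mfcc.shape[1], log_energy.shape[0])         # align frame counts before stacking
features = np.vstack([mfcc[:, :n], delta1[:, :n], delta2[:, :n], log_energy[:n]])  # K*3+1 = 40 rows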
2.1.2 Front-end Post-processing
2.1.2.1 Speech Enhancement
Speech enhancement techniques primarily focus on diminishing noise from noisy audio. These techniques include classical signal-processing based de-noising [17], deep-learning based de-noising [17][18][19][20][21], and multi-channel processing [22].
2.1.2.2 De-reverberation
De-reverberation techniques are utilized to remove the effects of reverberation from the input signal. A popular method is Weighted Prediction Error (WPE), which is the dominant method used in top-performing systems in the DiHARD and CHiME competitions [23][24][25][26][27]. The basic idea of WPE is to decompose the original signal model into an early reflection and a late reverberation. It then tries to estimate a filter that maintains the early reflection while suppressing the late reverberation, based on the maximum likelihood estimation (MLE) method [28]. The improvement WPE gives is not large, but it is solid across all tasks. It also shows additional performance improvements when applied to multi-channel signals.
2.1.2.3 Speech Separation
Speech separation is primarily useful when overlapping speech regions are significantly
large. Two main branches under this approach are:
• Deep-learning based speech separation: Some early attempts are Deep Clustering [29], Permutation Invariant Training (PIT) [30] and Conv-TasNet [31]. However, single-channel speech separation systems often produce a redundant non-speech or even a duplicated speech signal for the non-overlap regions (leakage). Leakage filtering for single-channel systems was proposed and significantly improved speaker diarization performance [32][33].
• Beam-forming based speech separation: This method appears in top-performing systems in the CHiME-6 challenge [34][35].
2.2 Voice Activity Detection
Voice activity detection (VAD), also known as speech activity detection (SAD), is a technique to detect the presence or absence of human speech in a given audio signal, and it is an indispensable component of most speech processing systems:
• In a speech synthesis system, VAD helps remove noise in the training data and thus reduces noise in the synthesized audio.
• In a speech recognition system, VAD helps drop noise frames to save computing power and reduce the number of insertion errors in the decoded texts.
• In a speaker diarization system, VAD helps generate better speaker representations, which is the most important factor affecting the whole system's performance in terms of precision.
VAD systems can be classified into two types:
• Two-phase VAD systems, which mostly comprise two parts: a feature extraction front end, where acoustic features such as MFCCs are extracted; and a classifier, where a model predicts whether the input frame is speech or not.
• ASR-based VAD systems, where VAD timestamps are inferred directly from word alignments. In this case, the ASR system precedes the VAD system.
FIGURE 2.3: Example output of a VAD system visualized in Audacity (audio editor) [36].
VAD techniques have been developed sporadically throughout the years. In 1997, Benyassine et al. presented a silence compression scheme that reduces transmission bandwidth during silence periods [37]. This system employed a VAD algorithm that was later usually referred to as the G.729B algorithm. In 1998, Sohn et al. introduced the first statistical model-based VAD that accounts for time-varying noise statistics [38]. Just one year later, they proposed an improved version with a Hidden Markov Model (HMM) [39] based hang-over scheme [40]. In 2011, Ying et al. proposed a Gaussian mixture model (GMM) [41] based VAD trained in an unsupervised learning framework. A popular implementation of this method is Google WebRTC VAD [42][43].
Later, with the rise of neural networks, in 2013 Thad Hughes introduced a recurrent neural network (RNN) based VAD [44], which was claimed to outperform existing GMM-based VADs. Then, in 2018, a team from Johns Hopkins University proposed a Time-Delay Neural Network VAD [45], which is trained using alignments from the GMM-HMM process and is known to perform much faster than RNN-based VADs.
In this thesis, Google WebRTC VAD is adopted as the main VAD technique, considering its simplicity of installation, as well as its long use by Google in production environments.
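As an illustration of how WebRTC VAD is typically driven, the following is a minimal sketch using the py-webrtcvad package. The file name is a placeholder, and the audio is assumed to be 16-bit mono PCM at one of the sample rates WebRTC VAD accepts (8000, 16000, 32000 or 48000 Hz); this is not the exact wrapper used in the thesis pipeline.

# A minimal sketch of frame-level speech/non-speech labelling with py-webrtcvad.
import wave
import webrtcvad

vad = webrtcvad.Vad(2)              # aggressiveness 0 (least) .. 3 (most)
frame_ms = 30                       # WebRTC VAD accepts 10, 20 or 30 ms frames

with wave.open("call.wav", "rb") as wf:    # placeholder path, 16-bit mono PCM assumed
    sample_rate = wf.getframerate()
    pcm = wf.readframes(wf.getnframes())

bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2   # 16-bit samples -> 2 bytes each
labels = []                                                # (start_sec, is_speech) per frame
for i in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
    frame = pcm[i:i + bytes_per_frame]
    labels.append((i / 2 / sample_rate, vad.is_speech(frame, sample_rate)))

speech_time = sum(frame_ms / 1000 for _, s in labels if s)
print(f"{speech_time:.1f} s of speech detected out of {len(labels) * frame_ms / 1000:.1f} s")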
2.3 Segmentation
In a speaker diarization system, speech segmentation breaks the input audio stream into multiple segments so that each segment can be assigned a speaker label.
The simplest method of segmentation is uniform segmentation, in which the audio input is segmented with a consistent window length and overlap length. The window must be sufficiently short to safely assume that it does not contain multiple speakers, but at the same time long enough to capture enough acoustic information (usually from 1 to 2 seconds).
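A minimal sketch of uniform segmentation over VAD output is given below; the 1.5 s window and 0.75 s shift are illustrative values within the 1-2 s range mentioned above, not the thesis configuration.

# Slice each VAD speech region into overlapping fixed-length segments.
from typing import List, Tuple

def uniform_segments(speech_regions: List[Tuple[float, float]],
                     window: float = 1.5, shift: float = 0.75) -> List[Tuple[float, float]]:
    """speech_regions: (start, end) pairs in seconds -> list of (start, end) segments."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t + window <= end:
            segments.append((t, t + window))
            t += shift
        if t < end:                       # keep the shorter tail segment, if any
            segments.append((t, end))
    return segments

# Example: two speech regions found by the VAD.
print(uniform_segments([(0.0, 3.2), (4.0, 5.0)]))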
A more complex method is speaker change point detection, in which speaker change points are detected by comparing two hypotheses: hypothesis H0, assuming both the left and right samples are from the same speaker, and hypothesis H1, assuming the two samples are from different speakers. Some notable approaches are the Kullback-Leibler 2 (KL2) algorithm [46], the Generalized Likelihood Ratio (GLR) [47], and the Bayesian Information Criterion (BIC) [48][49].
2.4 Speaker Representations
Speaker representations play a critical role in measuring the similarity of speech segments, based on which speech segments are classified into a known or unknown number of speakers.
While features such as MFCCs or MFBCs are discriminative enough for speech recog-
nition, they are considered too noisy for speaker diarization. In order to overcome this
limitation, numerous studies have been carried out.
From 2010 to 2015, the dominant approach was to train a probabilistic model (e.g., a Gaussian Mixture Model-Universal Background Model, GMM-UBM) to extract speaker representations in a new low-dimensional speaker- and channel-dependent space. A probabilistic linear discriminant analysis (PLDA) [50] model could also be trained to further improve the scoring stage. Those representations are commonly referred to as I-Vectors [51][52][53][54].
Since late 2015, deep learning has emerged as the dominant approach for this task. The main concept is to train a deep neural network (DNN) to classify all speakers in a data set, and then, in the testing stage, use its bottleneck features as a speaker representation. In that year, Heigold et al. proposed an end-to-end text-dependent speaker verification system [55] that learns speaker embeddings (commonly known as D-Vectors) based on the cosine similarity. It was developed to handle variable-length input in a text-independent verification task through a temporal pooling layer and data augmentation. The model was trained entirely on Google's proprietary datasets. D-Vectors were later improved by a long short-term memory (LSTM) [56] network and a triplet loss function [57].
In 2017, a group of researchers from Johns Hopkins University proposed a modified version of D-Vectors trained on smaller, publicly available datasets and pre-processed with a different strategy [58]. In 2018, by exploiting data augmentation, they further improved their speaker representations and referred to these representations as X-Vectors [59].
2.4.1 X-Vector Embeddings
X-Vectors are bottleneck features of a deep neural network trained to classify a large number of speakers. The training data is divided into small batches of K speakers and M speech segments (each of which must have more than T frames). The loss function (multi-class cross entropy) is as follows:

E = -\sum_{n=1}^{N} \sum_{k=1}^{K} d_{nk} \ln P(\mathrm{spk}_k \mid x^{(n)}_{1:T})    (2.2)
where:
• P(spk_k | x^{(n)}_{1:T}) is the probability of speaker k given the T input frames x^{(n)}_1, x^{(n)}_2, ..., x^{(n)}_T.
• d_{nk} is 1 if the speaker label for segment n is k, and is 0 otherwise.
The network operates at 2 levels, frame level and segment level, connected by a statistics pooling layer (as shown in figure 2.4). This multi-level structure allows the DNN to be trained with segments of different lengths. Hence, the training data is better utilized and the extracted X-Vector is more robust against variance in segment length.

FIGURE 2.4: Diagram of X-Vectors DNN (adapted from [58]).
2.4.1.1 Frame Level
At frame level, the network is essentially a time delayed neural network (TDNN) [ ]
60
with sub-sampling. The default configuration is shown in figure : Input features
2.5
are 30 MFCCs extracted from frames of 25ms with the overlaps of 10ms. The TDNN
has 5 layers with different context specifications. Layer 3, 4 and 5 are fully connected.
To account for the lack of context at the first and the last frames, speech segments are
padded at both ends.
FIGURE 2.5: Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]). Input dimension 30; layer dimensions 512, 512, 512, 512, 1500; input contexts with sub-sampling: layer 1 [-2,2], layer 2 {-2,0,2}, layer 3 {-3,0,3}, layers 4 and 5 {0}.
2.4.1.2 Segment Level
At segment level, the network is a fully-connected feed-forward DNN with a pooled input. Figure 2.6 demonstrates the default setup: all frame-level outputs h_t (t = 1, ..., T) (layer 5 of the TDNN) are aggregated to compute the mean µ and standard deviation σ:

\mu = \frac{1}{T} \sum_{t}^{T} h_t    (2.3)

\sigma = \sqrt{\frac{1}{T} \sum_{t}^{T} h_t \odot h_t - \mu \odot \mu}    (2.4)

where \odot denotes element-wise multiplication.
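A minimal PyTorch sketch of this statistics pooling layer (equations 2.3 and 2.4) is shown below; the batch size, number of frames and the 1500-dimensional frame-level output are illustrative.

# Frame-level activations (batch, T, C) are reduced to a mean/std vector of size 2*C.
import torch

def statistics_pooling(h: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """h: (batch, T, C) frame-level outputs -> (batch, 2*C) segment-level statistics."""
    mean = h.mean(dim=1)                                    # eq. (2.3)
    var = (h * h).mean(dim=1) - mean * mean                 # E[h*h] - mean*mean
    std = torch.sqrt(var.clamp(min=eps))                    # eq. (2.4)
    return torch.cat([mean, std], dim=1)

pooled = statistics_pooling(torch.randn(8, 200, 1500))
print(pooled.shape)                                         # torch.Size([8, 3000])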
Both of those statistics are concatenated into one vector that represents the whole segment. This vector is then passed through 2 fully-connected layers, each of which has a rectified linear unit (ReLU) [61]. At last, the output layer (a log-softmax classifier) gives a probability distribution over all speakers in the training data.
FIGURE 2.6: Diagram of X-Vectors' segment-level DNN (as configured in [59]): statistics pooling (dim 3000) → layer 6 (dim 512) → layer 7 (dim 512) → output (dim = total number of speakers).
As mentioned earlier, X-Vectors are essentially bottleneck features of a DNN. In this network configuration, they could be the output of layer 6 or layer 7. However, the latter is selected since it has been proven experimentally to perform better in the speaker identification task [59].
2.4.2 ECAPA-TDNN Embeddings
In late 2020, Desplanques et al. proposed ECAPA-TDNN [62], an enhanced structure based on X-Vectors' network. The basic TDNN layers are replaced with 1D convolutional layers [63] and Res2Net-with-Squeeze-Excitation (SE-Res2Net) blocks [64][65][66], while the basic statistics pooling layer is replaced with an Attentive Statistical Pooling, which utilizes both frame-wise and channel-wise attention mechanisms [67]. The complete architecture of ECAPA-TDNN is visualized in figure 2.7.
FIGURE 2.7: Complete network architecture of ECAPA-TDNN (adapted from [62]): input (80 x T) → Conv1D (k=5, d=1, + ReLU + BatchNorm) → three SE-Res2Blocks (k=3, d=2/3/4) → Conv1D (k=1, d=1, + ReLU) over the concatenated block outputs (3C x T → 1536 x T) → Attentive Stat Pooling (+ BatchNorm, 2 x 1536) → Fully Connected (+ BatchNorm, 192) → Additive Angular Margin Softmax over the training speakers.
2.4.2.1 Frame-level
At frame level, the ECAPA-TDNN network consists of 1D convolutional layers (with ReLU and optional batch normalization) and 1D Squeeze-and-Excitation blocks. The network also utilizes residual connections at a high level to reduce the effect of vanishing gradients.
2.4.2.1.1 1D Convolutional Layer
In a 1D convolution layer, instead of sliding along two dimensions as in the well-known CNNs of image processing [68][69][70], a kernel of size k and dilation d slides along the time-frame dimension. Figure 2.8 demonstrates how a 1D convolutional (1D-Conv) kernel works, where:
• k denotes the kernel size.
• d denotes the dilation spacing (i.e., if d is larger than 1, the 1D-Conv layer is dilated).
• c denotes the number of channels (i.e., the number of extracted feature coefficients, e.g., 40 MFCCs).
FIGURE 2.8: Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {-4,0,4}.
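The following is a minimal PyTorch sketch of such a dilated 1D convolution; the channel sizes are illustrative, and the padding is chosen so that the number of output frames matches the input.

# With kernel size k=3 and dilation d=4, each output frame sees frames {t-4, t, t+4},
# i.e. a TDNN layer with context {-4, 0, 4}.
import torch
import torch.nn as nn

c_in, c_out, k, d = 40, 512, 3, 4          # e.g. 40 MFCC channels in, 512 channels out
conv = nn.Conv1d(c_in, c_out, kernel_size=k, dilation=d, padding=d * (k - 1) // 2)

x = torch.randn(8, c_in, 200)              # (batch, channels, T frames)
y = conv(x)                                # padding keeps the number of frames unchanged
print(y.shape)                             # torch.Size([8, 512, 200])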
2.4.2.1.2 1D Squeeze-and-Excitation Block
For computer vision tasks, Squeeze-and-Excitation blocks [66] have proven to be very effective at improving channel inter-dependencies at a low computational cost. In the ECAPA-TDNN architecture, this approach helps re-scale the frame-level features based on global properties of the signal sample. A 1D Squeeze-and-Excitation block consists of 3 components, as shown in figure 2.9 (a code sketch follows the list):
FIGURE 2.9: A 1D-Squeeze-and-Excitation block. Different colors represent different scales for channels.
• Squeeze operation, where the frame-wise mean vector z is calculated from the inputs:

z = \frac{1}{T} \sum_{t}^{T} h_t    (2.5)

• Excitation operation, where z is used to calculate a channel-wise scale vector s through two bottleneck fully-connected layers that generate an output of the same dimension as the input:

s = \sigma(W_2 f(W_1 z + b_1) + b_2)    (2.6)

where \sigma(\cdot) denotes the sigmoid function, f(\cdot) denotes a non-linearity (e.g., a ReLU), and W_k and b_k denote the learnable weight and bias of bottleneck fully-connected layer k.

• Scale operation, where the original input frames are scaled with s:

\tilde{h}_{t,c} = s_c h_{t,c}    (2.7)
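A minimal PyTorch sketch of the 1D Squeeze-and-Excitation block (equations 2.5 to 2.7) is given below; the channel and bottleneck dimensions are illustrative.

# Squeeze to a per-channel mean, excite through a bottleneck MLP with a sigmoid,
# then rescale every frame channel-wise.
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(channels, bottleneck)   # W1, b1
        self.fc2 = nn.Linear(bottleneck, channels)   # W2, b2

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, C, T)
        z = h.mean(dim=2)                                         # eq. (2.5): frame-wise mean
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))      # eq. (2.6): channel scales
        return h * s.unsqueeze(2)                                 # eq. (2.7): rescaling

out = SEBlock1d(512)(torch.randn(8, 512, 200))
print(out.shape)                                                  # torch.Size([8, 512, 200])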
2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block
In 2019, Shang-Hua Gao et al. proposed Res2Net [65], a multi-scale backbone network for computer vision tasks, based on ResNet [64]. In ECAPA-TDNN, Res2Net is integrated with an SE-Block, forming a Res2Net-with-Squeeze-Excitation (SE-Res2Net) block, to benefit from residual connections (i.e., to reduce vanishing gradients) while keeping the number of parameters at a reasonable figure. Under this setup, the number of channel subsets corresponds to the number of intermediate 1D-Conv blocks, and thus may increase the number of parameters. The SE-Block is then used to amplify attention to channels while adding only a small number of parameters. Figure 2.10 visualizes an SE-Res2Net block.
2.4.2.2 Segment-level
At segment level, ECAPA-TDNN employs a soft multi-head self-attention [67] model to calculate weighted statistics at the pooling layer, which accounts for signal samples of varied lengths. The statistic outputs (weighted mean and weighted standard deviation) are then concatenated together. The resulting vector is propagated through a single fully-connected layer with batch normalization and then through the final layer, an Additive Angular Margin Softmax (AAM-Softmax) [71] layer. The final output is an N-dimensional vector, where N is the total number of speakers in the training set (i.e., the network classifies the N speakers in the training set).

FIGURE 2.10: A Res2Net-with-Squeeze-Excitation block: the C input channels are split into s subsets of C/s channels each, processed by intermediate Conv1D blocks with hierarchical residual connections, concatenated back to C channels, and passed through an SE block before the final residual connection.
2.4.2.2.1 Attentive Statistical Pooling
In 2019, Okabe et al. proposed using an attention mechanism to give different weights to different frames in the signal sample when calculating the weighted mean and weighted standard deviation at X-Vectors' pooling layer [72]. In ECAPA-TDNN, the attention mechanism is extended further: not only over time frames, but also over channels.
The raw scalar channel- and frame-wise score e_{t,c} and its normalized value \alpha_{t,c} are calculated as follows:

e_{t,c} = v_c^T f(W h_t + b) + k_c    (2.8)

\alpha_{t,c} = \frac{\exp(e_{t,c})}{\sum_{\tau}^{T} \exp(e_{\tau,c})}    (2.9)

where:
• h_t are the activations of the last frame-level layer at time step t.
• W \in R^{R \times C} and b \in R^{R \times 1} project the activations into a representation of smaller dimension R. This projection is shared across all C channels.
• f(\cdot) denotes a non-linearity.
• v_c \in R^{R \times 1} and k_c transform the output of the non-linearity f(\cdot) into a channel-dependent scalar score.

FIGURE 2.11: Attentive Statistics Pooling (on both time frames and channels).
The normalized score \alpha_{t,c} is then used to calculate the weighted mean and weighted standard deviation vectors as follows:

\tilde{\mu}_c = \sum_{t}^{T} \alpha_{t,c} h_{t,c}    (2.10)

\tilde{\sigma}_c = \sqrt{\sum_{t}^{T} \alpha_{t,c} h^2_{t,c} - \tilde{\mu}^2_c}    (2.11)

where \tilde{\mu}_c and \tilde{\sigma}_c are respectively the channel components of the weighted mean vector \tilde{\mu} and the weighted standard deviation vector \tilde{\sigma}.
Moreover, the temporal context of the pooling layer is expanded by making the self-attention look at global properties of the signal sample (e.g., to account for noise and recording conditions). The local input h_t of equation 2.8 is concatenated with the non-weighted mean and non-weighted standard deviation of h_t across time.
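A minimal PyTorch sketch of channel- and frame-wise attentive statistics pooling (equations 2.8 to 2.11) is shown below. The dimensions are illustrative, tanh is used as the non-linearity f(·), and the global-context concatenation described in the previous paragraph is omitted for brevity.

# Per-channel attention weights over frames produce a weighted mean/std vector.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, channels: int = 1536, bottleneck: int = 128):
        super().__init__()
        self.W = nn.Conv1d(channels, bottleneck, kernel_size=1)   # W, b    (eq. 2.8)
        self.v = nn.Conv1d(bottleneck, channels, kernel_size=1)   # v_c, k_c (eq. 2.8)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, C, T) frame-level activations
        e = self.v(torch.tanh(self.W(h)))                 # scores, eq. (2.8)
        alpha = torch.softmax(e, dim=2)                   # normalize over frames, eq. (2.9)
        mean = (alpha * h).sum(dim=2)                     # eq. (2.10)
        var = (alpha * h * h).sum(dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-10))            # eq. (2.11)
        return torch.cat([mean, std], dim=1)              # (batch, 2*C)

pooled = AttentiveStatsPooling()(torch.randn(4, 1536, 200))
print(pooled.shape)                                       # torch.Size([4, 3072])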
2.5 Clustering
2.5.1 PLDA Scoring
After generating the speaker representations for each segment, a clustering algorithm is applied to make clusters of segments. The distance or the similarity between each pair of observations can be computed using a wide variety of techniques, including Euclidean distance [73], mean squared difference [74], and cosine similarity [75].
Although the similarity metrics can be calculated directly from the pairs of extracted speaker embeddings, purely statistical data reduction techniques such as Linear Discriminant Analysis (LDA) can be employed to further improve the discriminative characteristics of these features at a little computational cost. The LDA transformation can be given as in the following equation:

z = W^T x    (2.12)
where:
• x = \{x_i\} \in R^D is the D-dimensional input vector.
• z = \{z_i\} \in R^{D'} (D' \le D) is the representation of the input vector in the new latent space.
• W \in R^{D \times D'} is a linear transformation matrix.
In this case, LDA is formulated as an optimization problem: find a linear transformation W that maximizes the ratio of the between-class scattering to the within-class scattering:

W = \arg\max_{W} J(W) = \arg\max_{W} \frac{\mathrm{trace}(W^T S_B W)}{\mathrm{trace}(W^T S_W W)}    (2.13)
where:
• S_W denotes the within-class scatter matrix S_W = \sum_{i=1}^{n} (x_i - \mu_{y_i})(x_i - \mu_{y_i})^T. Here \{y_i\} are the class labels and \mu_k is the sample mean of the k-th class. S_W is positive definite.
• S_B denotes the between-class scatter matrix S_B = \sum_{k=1}^{m} n_k (\mu_k - \mu)(\mu_k - \mu)^T. Here m is the number of classes, \mu is the overall sample mean, and n_k is the number of samples in the k-th class. S_B is positive semi-definite.
FIGURE 2.12: An example of LDA transformation from 2D to 1D (taken from [76]).
However, LDA is a deterministic algorithm that works well only on seen data. This is not ideal in the case of real speaker diarization applications, where enrolled or questioned speakers are not included in the training data set.
Therefore, a probabilistic version of LDA, namely Probabilistic Linear Discriminant Analysis (PLDA) [77][78], is employed to take advantage of LDA while dealing with unseen classes. Essentially, PLDA is formulated upon LDA by representing both the class means and the data within each class with separate distributions. The chosen distribution is usually a mixture of Gaussian distributions, i.e., a Gaussian Mixture Model (GMM) [41]. Let y be a latent class variable representing the mean of a class within the GMM; then the probability of generating a sample x given the class mean y, and the prior probability of the class mean y in the same space, are given by the following equations:
P(x | y) = N(x | y, S_W)    (2.14)

P(y) = N(y | m, S_B)    (2.15)
where:
• S_W and S_B respectively denote the within-class and between-class scatter matrices as seen in equation 2.13.
• N(x | y, S_W) is a multivariate Gaussian distribution with mean y and covariance S_W.
• N(y | m, S_B) is a multivariate Gaussian distribution with mean m and covariance S_B.
As proven in [77], S_W and S_B (denoted \Phi_w and \Phi_b) can be diagonalized as follows:

V^T \Phi_w V = I    (2.16)

V^T \Phi_b V = \Psi    (2.17)

and by defining A = V^{-T}, they can be rewritten as:

S_W = A I A^T    (2.18)

S_B = A \Psi A^T    (2.19)
Then, from equation 2.14:

P(x | y) = N(x | y, S_W) = N(x | m + A v, A A^T) = m + A * N(u | v, I)    (2.20)

and from equation 2.15:

P(y) = N(y | m, S_B) = N(y | m, A \Psi A^T) = m + A * N(v | 0, \Psi)    (2.21)
Let u and v be Gaussian random variables in the latent space:

u \sim N(\cdot \mid v, I)    (2.22)

v \sim N(\cdot \mid 0, \Psi)    (2.23)
then the relationship between x, y and the latent variables u, v can be written as follows:

y = m + A v    (2.24)

x = m + A u    (2.25)
The unknown parameters of PLDA are the mean m, the covariance matrix \Psi, and the loading matrix A. All of these parameters are learnable using the maximum likelihood method. Figure 2.13 demonstrates the training process of the PLDA model.
FIGURE 2.13: Fitting the parameters of the PLDA model (taken from [77]).
The PLDA score between two given vectors u_1, u_2 in the latent space is calculated by taking the log of the likelihood ratio R based on two hypotheses: either both of the vectors belong to the same class, or they do not. R is given as:

Score = \log(R(u_1, u_2)) = \log \frac{P(u_1, u_2)}{P(u_1) P(u_2)} = \log \frac{\int P(u_1 | v) P(u_2 | v) P(v) \, dv}{P(u_1) P(u_2)}    (2.26)
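For intuition, the same-class versus different-class likelihood ratio of equation 2.26 can be written in closed form for the Gaussian PLDA model, using the between-class covariance \Phi_b = A \Psi A^T and the within-class covariance \Phi_w = A A^T. The following numpy/scipy sketch uses random stand-in parameters; it is not the Kaldi PLDA implementation used later in the experiments.

# Two-covariance PLDA log-likelihood ratio for a pair of embeddings.
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(u1, u2, m, phi_b, phi_w):
    """log P(u1, u2 | same class) - log P(u1, u2 | different classes)."""
    d = len(m)
    tot = phi_b + phi_w
    cov_same = np.block([[tot, phi_b], [phi_b, tot]])          # shared class variable couples the pair
    cov_diff = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    pair = np.concatenate([u1, u2])
    mean = np.concatenate([m, m])
    return (multivariate_normal.logpdf(pair, mean, cov_same)
            - multivariate_normal.logpdf(pair, mean, cov_diff))

# Toy 2-dimensional example with stand-in model parameters.
rng = np.random.default_rng(0)
m = np.zeros(2)
phi_b = np.array([[2.0, 0.3], [0.3, 1.5]])       # between-class covariance
phi_w = np.eye(2) * 0.5                          # within-class covariance
print(plda_llr(rng.normal(size=2), rng.normal(size=2), m, phi_b, phi_w))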
2.5.2 Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge these two most similar clusters. This iterative process continues until all the clusters are merged together (figure 2.15). The clustering process can be stopped once a defined number of clusters or a decision threshold is reached. The main output of hierarchical clustering is a dendrogram [79], which shows the hierarchical relationship between the clusters (figure 2.16).
FIGURE 2.14: Agglomerative hierarchical clustering flowchart.
FIGURE 2.15: An example iterative process of agglomerative hierarchical clustering (taken from [80]).
FIGURE 2.16: Visualization of the result of hierarchical clustering (taken from [80]).
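A minimal sketch of threshold-based agglomerative clustering over segment embeddings is given below, using scipy; the embeddings, the cosine metric and the 0.5 stopping threshold are illustrative stand-ins rather than the thesis configuration.

# Build a dendrogram over segment embeddings and cut it at a decision threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))                                  # 3 fake "speakers"
embeddings = np.vstack([c + 0.1 * rng.normal(size=(10, 16)) for c in centers])

Z = linkage(embeddings, method="average", metric="cosine")          # agglomerative merges
labels = fcluster(Z, t=0.5, criterion="distance")                   # stop at a distance threshold
print(len(set(labels)))                                             # number of clusters found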
Chapter 3
Experiments
3.1 Evaluation Metrics
3.1.1 Equal Error Rate and Minimum Decision Cost Function
Equal error rate or crossover error rate (EER or CER) is the rate at which both acceptance
and rejection errors are equal. In order to find the EER of a given system, an EER plot
is created through the following steps:
• Calculate the false acceptance rate (FAR, i.e., false positive speaker identifications, or false alarms) and the false rejection rate (FRR, i.e., false negative speaker identifications, or missed detections) for a set of decision thresholds t:

FAR_t = \frac{\text{Number of False Acceptances}_t}{\text{Number of Identification Attempts}}    (3.1)

FRR_t = \frac{\text{Number of False Rejections}_t}{\text{Number of Identification Attempts}}    (3.2)

• Plot FAR and FRR against the decision threshold t. The EER is the y-value of the intersection of those lines. An example of an EER plot is shown in figure 3.1.
As the decision threshold (i.e., the sensitivity) increases, the false alarms will drop while the missed detections will rise. In this case, the configured system is more secure, since it reduces the possibility of acceptance. Conversely, when the decision threshold is lowered, the system is less secure against impostors.
FIGURE 3.1: An EER plot.
EER is originally used as the main evaluation metric in a speaker identification system. In a traditional speaker diarization pipeline, it is also used to effectively estimate the discriminative characteristics of speaker embedding extraction techniques, due to the fact that this metric is not affected by other modules such as clustering or resegmentation. The lower the EER, the better the system performs.
An important metric usually coupled with EER is the Minimum Decision Cost Function (MinDCF), representing the minimum value of a linear combination of the false alarm and missed detection rates over different thresholds:

MinDCF = \min_{t} \left( (1 - p) \cdot FAR_t + p \cdot FRR_t \right)    (3.3)
where:
• t is the decision threshold.
• FAR_t and FRR_t are calculated as in equations 3.1 and 3.2.
• p is the prior probability of the enrolled entity. Common values for p are 0.01 and 0.001.
The lower EER and MinDCF, the better a speaker verification system performs.
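A minimal numpy sketch of how EER and MinDCF (equation 3.3) can be computed from a list of verification scores is shown below; the scores and labels are random stand-ins for PLDA scores on trial pairs.

# Sweep the decision threshold over observed scores to find EER and MinDCF.
import numpy as np

def eer_mindcf(scores, labels, p_target=0.01):
    """labels: 1 for target (same-speaker) trials, 0 for nontarget trials."""
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])  # false alarms
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])   # misses
    i = np.argmin(np.abs(far - frr))                        # threshold where FAR ~= FRR
    eer = (far[i] + frr[i]) / 2
    mindcf = np.min((1 - p_target) * far + p_target * frr)  # eq. (3.3)
    return eer, mindcf, thresholds[i]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=labels * 2.0, scale=1.0)            # targets score higher on average
print(eer_mindcf(scores, labels))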
3.1.2 Diarization Error Rate
Diarization Error Rate (DER) is the most widely used metric for speaker diarization. It is measured as the fraction of time that is not attributed correctly to a speaker or to non-speech, calculated as in equation 3.4:

DER = \frac{T_{FalseAlarm} + T_{Miss} + T_{Confusion}}{T_{Scored}}    (3.4)

where:
• T_{Scored} is the total duration of the recording without overlapped speech.
• T_{FalseAlarm} is the scored time for which a hypothesized speaker is labelled as non-speech in the reference.
• T_{Miss} is the scored time for which a hypothesized non-speech segment corresponds to a reference speaker segment.
• T_{Confusion} is the scored time for which a speaker ID is assigned to the wrong speaker.
The lower the DER, the better a speaker diarization system performs.
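For intuition, the following sketch computes equation 3.4 from frame-level reference and hypothesis labels, assuming the hypothesis speaker IDs have already been mapped to the reference IDs and ignoring collars and overlapped speech (real DER scoring tools handle these details).

# Frame-level DER; labels are speaker IDs, None means non-speech.
def der(reference, hypothesis):
    assert len(reference) == len(hypothesis)
    scored = sum(1 for r in reference if r is not None)                 # T_Scored (frames)
    false_alarm = sum(1 for r, h in zip(reference, hypothesis) if r is None and h is not None)
    miss = sum(1 for r, h in zip(reference, hypothesis) if r is not None and h is None)
    confusion = sum(1 for r, h in zip(reference, hypothesis)
                    if r is not None and h is not None and r != h)
    return (false_alarm + miss + confusion) / scored                    # eq. (3.4)

ref = ["A"] * 100 + [None] * 20 + ["B"] * 80      # reference labels, one per 10 ms frame
hyp = ["A"] * 90 + ["B"] * 30 + ["B"] * 80        # hypothesis labels (already mapped)
print(f"DER = {der(ref, hyp):.3f}")               # (20 + 0 + 10) / 180 ~= 0.167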
3.2 Frameworks
3.2.1 Kaldi
Kaldi is a toolkit originally written in C++, Perl, Shell and Python for speech recognition,
speaker recognition, and many others. Kaldi used to be the framework of choice among
speech processing researchers. In addition to the flexibility and high performance, Kaldi
is enriched by many reproducible state-of-the-art recipes from researchers around the
world.
Some noteworthy features of Kaldi include:
• Code-level integration with Finite State Transducers (FSTs).
• Extensive linear algebra support: Both BLAS and LAPACK are supported
• Extensible design: The algorithms are provided in the most generic form possible.
• Open license: The code is licensed under Apache 2.0, which is one of the least
restrictive licenses available.
• Complete recipes: Recipes for building a complete speech recognition system are included. These work with widely available datasets such as those provided by the Linguistic Data Consortium (LDC).
FIGURE 3.2: Kaldi logo.
FIGURE 3.3: Kaldi general architecture diagram.
3.2.2 SpeechBrain
SpeechBrain [81] is an open-source, all-in-one conversational AI toolkit based on PyTorch. Its main purpose is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. SpeechBrain provides the implementation and experimental validation of both recent and long-established speech processing models, with state-of-the-art or competitive performance on a variety of tasks (table 3.1).
TABLE 3.1: List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
3.2.3 Kal-Star
Kal-Star [82], developed by the author while working at Viettel Cyberspace Center, is a Shell/Python library wrapping around Kaldi and SpeechBrain to enhance the data pre-processing and training processes. Kal-Star provides a wide variety of tools to prepare, train and test data, mostly for the speaker verification and speaker diarization tasks (figure 1.4 and figure 1.5). Kal-Star inherits the file-based data indexing from Kaldi, which views a given data set as a folder of spk2utt, utt2spk, wav.scp and (if the data set is segmented with a VAD) segments files. Further information can be added to the folder later (e.g., the diarization result RTTM is added once the speaker diarization process is done). Figures 3.4 and 3.5 show two example Kal-Star operations on the VTR_1350 data set.
FIGURE 3.4: Filtering VTR_1350 data set by utterances' durations and number of utterances per speaker.
FIGURE 3.5: Generating 200 5-way conversations from VTR_1350 data set. The min. and max. numbers of utterances picked for each conversation are 2 and 30 respectively.
3.3 Data Sets
In this thesis, three main Vietnamese data sets are used:
• IPCC_110000: split into training and testing sets. The test split is then used directly for the speaker verification task, and used to generate mock conversations for the speaker diarization task.
• VTR_1350 and ZALO_400: used directly for the speaker verification task, and also used to generate mock conversations for the speaker diarization task. These data sets are not used in training.
3.3.1 IPCC_110000
The IPCC_110000 data set consists of 1046.37 hours of audio in a telephone environment from approximately 110000 Vietnamese speakers. Data are recorded at the Viettel Customer Service IP Contact Center (IPCC) and sampled at 8000 Hz. Most recorded utterances are from 2 to 6 seconds in length, while each speaker has from 1 to 10 utterances, making up from 10 to 60 seconds of speech. The spoken topics revolve around technical difficulties that Viettel's customers met in using mobile and internet services, as well as various questions about common knowledge, the weather, sports results, or lottery results. Table 3.2 gives an overview of this data set and figure 3.6 demonstrates how the data is distributed.
TABLE 3.2: IPCC_110000 data set overview.
  Data set: IPCC_110000
  Base sample rate: 8000 Hz
  Environment: Telephone
  # Speakers: 112837
  # Utterances: 919608
  Total duration: 1046.4 hours
FIGURE 3.6: IPCC_110000 data distributions.
Each speaker and agent has his/her own recording channel, and it is assumed that each recording has only one speaker. In reality, a telephone conversation between a customer and an agent can be interfered with by other customers or other agents joining the conversation. The latter case happens much less often than the former, since the IPCC is designed in such a way that agents have good sound isolation, and even if the customer's issue is passed to other agents, these agents would have their own recording channels. As for the case where more than one customer joins the conversation, it was tested that in 1000 randomly chosen conversations, only about 12 of them have more than one speaker (e.g., an infant interrupting his parent having a call with an IPCC agent). The ratio is only about 1.2 percent, and thus, this case has only a small negative effect on the discriminative characteristics of the trained embedding extractor.
IPCC_110000 is randomly split into 3 subsets: train, dev, and test sets. Each of the test and dev sets has 2000 speakers and about 20 hours of data. The train set contains the remaining data. Table 3.3 displays the number of speakers, number of utterances and total duration of each subset.
Split train dev test
# Speakers 108837 2000 2000
# Utterances 886172 16975 16461
Total duration (hours) 1005.1 20.94 20.39
TABLE 3.3: IPCC_110000 data subsets.
3.3.1.1 IPCC_110000 Verification Test Set
Generated from the IPCC_110000 test split, the verification test set has 3888 verification pairs, generated by the following steps:
• Step 1: With K = 3, for each speaker randomly pick K ∗ 2 utterances from this speaker's utterances pool to generate K verification pairs. The labels are all "target", meaning utterances in each pair are from the same speaker.
• Step 2: Randomly pick K = 3 other utterances from the utterances pool of the selected speaker at step 1 (if there are fewer than K utterances left in the pool, discard all picked verification pairs and skip to step 3), and K = 3 other utterances from the utterances pools of all other speakers to generate K more verification pairs. The labels are all "nontarget", meaning utterances in each pair are not from the same speaker.
An example result after step 2:
IPCC-131047_hanoi-3166 IPCC-131047_hanoi-1961 target
IPCC-131047_hanoi-1319 IPCC-131047_hanoi-3657 target
IPCC-131047_hanoi-2303 IPCC-131047_hanoi-1015 target
IPCC-131047_hanoi-2626 IPCC-203322_hanoi-2214 nontarget
IPCC-131047_hanoi-0582 IPCC-203268_hanoi-3113 nontarget
IPCC-131047_hanoi-1679 IPCC-203268_hanoi-5260 nontarget
• Step 3: Go back to step 1, until all speakers are considered for enrollment.
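As an illustration only, the following Python sketch implements the pair-generation procedure above; utt_pools maps each speaker ID to a list of its utterance IDs, and the function and variable names are hypothetical rather than part of Kal-Star:

    import random

    def generate_verification_pairs(utt_pools, K=3):
        # utt_pools: dict mapping speaker_id -> list of utterance_ids (hypothetical structure).
        pairs = []
        speakers = list(utt_pools.keys())
        for spk in speakers:                               # step 3: consider every speaker
            pool = list(utt_pools[spk])
            random.shuffle(pool)
            if len(pool) < K * 2:
                continue
            picked = [pool.pop() for _ in range(K * 2)]    # step 1: K "target" pairs
            targets = [(picked[2 * i], picked[2 * i + 1], "target") for i in range(K)]
            if len(pool) < K:                              # step 2: not enough utterances left -> discard
                continue
            enrolled = [pool.pop() for _ in range(K)]
            others = [u for s in speakers if s != spk for u in utt_pools[s]]
            imposters = random.sample(others, K)           # K utterances from all other speakers
            nontargets = list(zip(enrolled, imposters, ["nontarget"] * K))
            pairs.extend(targets + nontargets)
        return pairs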
3.3.1.2 IPCC_110000 Diarization Test Set
Since original conversations from IPCC_110000 are shuffled by channels due to internal policies that protect customer privacy, mock conversations generated from utterances of distinguished speakers are used instead. Furthermore, to expand the scope of diarization beyond 2-way conversations to conversations with more speakers, mock 3-way, 4-way, and 5-way conversations are also generated. From the test split of IPCC_110000, the following data sets are generated:
• IPCC_110000-TEST-MOCK_200_2: 200 2-way conversations.
• IPCC_110000-TEST-MOCK_200_3: 200 3-way conversations.
• IPCC_110000-TEST-MOCK_200_4: 200 4-way conversations.
• IPCC_110000-TEST-MOCK_200_5: 200 5-way conversations.
Each of the N conversations in each subset is generated by the following steps:
• Step 1: Choose a number of speakers S.
• Step 2: Randomly pick S speakers from the speakers pool of the whole data set.
• Step 3: With each picked speaker sj, randomly pick uj (where uj is a random integer in range [u_min, u_max]) utterances from the speaker's utterances pool. If there are not enough utterances in the speaker's pool, then discard the picked utterances and go back to step 2.
• Step 4: Go back to step 2 until utterances are picked from all S speakers.
• Step 5: Shuffle the list of picked utterances, then concatenate the utterances (with in-between silences of duration randomly chosen from 0.2 to 1.0 seconds) into a single audio file. The conversation ID, oracle timestamps and distinguished speaker information are bundled into an accompanying RTTM file.
The four mock conversation subsets generated from the IPCC_110000 test split mentioned above use N = 200, S ∈ {2, 3, 4, 5}, u_min = 2 and u_max = 20. A minimal sketch of this generation procedure is given below.
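The sketch assumes numpy and soundfile for audio I/O (the actual Kal-Star implementation may differ), and all function and variable names are hypothetical:

    import random
    import numpy as np
    import soundfile as sf

    def make_mock_conversation(conv_id, utt_pools, wav_paths, S=3,
                               u_min=2, u_max=20, sr=8000, out_wav="mock.wav"):
        # Steps 2-5: pick S speakers, pick utterances, shuffle, concatenate with silences.
        speakers = random.sample(list(utt_pools.keys()), S)          # step 2
        picked = []                                                  # (speaker, utterance) tuples
        for spk in speakers:                                         # step 3
            n = random.randint(u_min, u_max)
            if len(utt_pools[spk]) < n:
                return None                                          # discard and retry upstream
            picked += [(spk, u) for u in random.sample(utt_pools[spk], n)]
        random.shuffle(picked)                                       # step 5
        audio, rttm, t = [], [], 0.0
        for spk, utt in picked:
            sig, _ = sf.read(wav_paths[utt])
            dur = len(sig) / sr
            rttm.append(f"SPEAKER {conv_id} 1 {t:.2f} {dur:.2f} <NA> <NA> {spk} <NA> <NA>")
            sil = np.zeros(int(random.uniform(0.2, 1.0) * sr))       # in-between silence
            audio += [sig, sil]
            t += dur + len(sil) / sr
        sf.write(out_wav, np.concatenate(audio), sr)
        return rttm                                                  # oracle RTTM lines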
3.3.2 VTR_1350
VTR_1350 is a data set consisting of 491.1 hours of wide-band recordings, originally sampled at 16000 Hz, recorded by a selected group of 1346 broadcasters using written transcripts (as shown in Table 3.4). The recording environment is significantly less noisy than IPCC_110000's, and thus it can be considered a clean set. Most recorded utterances are from 3 to 10 seconds in length, while each speaker has from 200 to 300 utterances, making up from 10 to 33 minutes of speech. The topics are daily news topics, from politics and health care to sports. Although the number of speakers is much smaller than IPCC_110000's, the content of the recordings is much more diverse and the durations are much longer. However, VTR_1350 is significantly less natural than a normal conversation compared with IPCC_110000, due to the mentioned fact that it is recorded with planned transcripts in a controlled environment. Table 3.4 gives an overview of this data set and Figure 3.7 demonstrates how the data is distributed.
Data set VTR_1350
Base sample rate 16000 Hz
Environment Controlled recording
# Speakers 1346
# Utterances 318599
Total duration 491.1 hours
TABLE 3.4: VTR_1350 data set overview.
VTR_1350 is down-sampled to 8000 Hz and used exclusively for testing in both the speaker verification and speaker diarization tasks.
FIGURE 3.7: VTR_1350 data distributions.
3.3.2.1 VTR_1350 Verification Test Set
From the VTR_1350 utterance list, a verification list of 7902 pairs is generated using the same method described in 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
VTR_1350-yenvth4-071303 VTR_1350-yenvth4-049030 target
VTR_1350-yenvth4-047494 VTR_1350-yenvth4-053126 target
VTR_1350-yenvth4-067463 VTR_1350-yenvth4-001415 target
VTR_1350-yenvth4-050310 VTR_1350-119706-073293 nontarget
VTR_1350-yenvth4-105862 VTR_1350-oanhdtv-029068 nontarget
VTR_1350-yenvth4-001159 VTR_1350-tungdhd-076747 nontarget
3.3.2.2 VTR_1350 Diarization Test Set
VTR_1350 is a data set of 1-way conversations. Hence, to make use of this data set in the speaker diarization task, four mock conversation subsets are generated from the data set itself, based on the method and configuration described in 3.3.1.2:
• VTR_1350-MOCK_200_2: 200 2-way conversations.
• VTR_1350-MOCK_200_3: 200 3-way conversations.
• VTR_1350-MOCK_200_4: 200 4-way conversations.
• VTR_1350-MOCK_200_5: 200 5-way conversations.
3.3.3 ZALO_400
ZALO_400, issued by the ZALO AI Challenge 2020 [83], is a wide-band data set consisting of 8.7 hours of recordings, sampled at 48000 Hz, recorded by a selected group of 400 broadcasters using planned transcripts. The content and recording environment quite resemble those of VTR_1350, while the data distributions are quite different from VTR_1350's. Most utterances are from 4 to 12 seconds, while most speakers have from 15 to 40 utterances, making up from 1 to 2 minutes of speech. ZALO_400 was originally released as the train data set for the challenge. However, within the scope of this thesis, it is used exclusively for testing both the speaker verification and speaker diarization tasks. Table 3.5 gives an overview of this data set and Figure 3.8 demonstrates how the data is distributed.
Data set ZALO_400
Base sample rate 48000 Hz
Environment Controlled recording
# Speakers 400
# Utterances 10555
Total duration 8.699 hours
TABLE 3.5: ZALO_400 data set overview.
FIGURE 3.8: ZALO_400 data distributions.
3.3.3.1 ZALO_400 Verification Test Set
The verification set is generated by the same method described in section 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
424-64 424-35 target
424-46 424-45 target
424-31 424-36 target
424-49 518 nontarget
424-39 500-12 nontarget
424-15 514-30 nontarget
3.3.3.2 ZALO_400 Diarization Test Set
The diarization test set consists of four mock conversation subsets generated from the ZALO_400 data set, including:
• ZALO_400-MOCK_200_2: 200 2-way conversations.
• ZALO_400-MOCK_200_3: 200 3-way conversations.
• ZALO_400-MOCK_200_4: 200 4-way conversations.
• ZALO_400-MOCK_200_5: 200 5-way conversations.
These conversations are generated by the same method and configuration described in section 3.3.1.2.
3.4 Baseline System
The baseline system consists of three main phases: embeddings extractor training, PLDA backend training and speaker diarization. In addition to these phases, an auxiliary phase (marked with an asterisk in the diagrams), speaker verification, is included to further optimize the speaker diarization results.
3.4.1 Speaker Diarization System
The speaker diarization subsystem takes a recorded conversation as input and lets the user know the number of speakers and the timestamps of their speech within the conversation. Without further recognition, all speakers remain anonymous. The result of this subsystem can be encapsulated into a Rich Transcription Time Marked (RTTM) file [1]. The main computation pipeline, which takes a recorded conversation as input and gives an RTTM file as output, can be described in the following stages, in order:
• Stage 1 - Front-end Processing: The input recording is windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-dimensional MFCCs.
• Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from the input recording. In this process, the recording is sliced into uniform non-overlapping small chunks of 0.03 seconds and WebRTC VAD decides whether each chunk is speech or not. The threshold of this decision, called "aggressiveness", is set to its maximum level, level 4. After that, adjacent chunks recognized as speech are grouped together into a bigger speech segment. If the maximum PCM amplitude of a newly formed speech segment is smaller than 0.05, the segment is discarded. Furthermore, speech segments that are shorter than 0.2 seconds are also discarded. Finally, each speech segment is padded with 0.05 seconds at its head and at its tail; and if two consecutive speech segments overlap each other and the merged segment of those two is shorter than 15 seconds, they are merged together.
• Stage 3 - Uniform Segmentation: Each speech partition is further sub-segmented into homogeneous overlapping sub-segments of length L seconds with an overlap of L/2 seconds. The value of L is chosen between 1.5 and 4 seconds. {L : L/2} denotes this segmenting strategy (see the sketch after this list). The extracted features from stage 2 are mapped onto these sub-segments for the next stage.
• Stage 4 - Embeddings Extraction: An embeddings extractor is employed to extract embeddings from each sub-segment produced in the last stage. In the baseline system, the extractor is an X-Vectors embeddings extractor, trained on IPCC_110000 data through 3 epochs. The training process is executed using Kaldi's VoxCeleb recipe [84][59]. The data augmentation strategies are slightly changed to the following:
– Additive Noises: adding a random noise sequence to the input signal with a signal-to-noise ratio (SNR) [85] randomly chosen between 0 and 15.
– Reverberation: convolving simulated room impulse responses (RIR) [86] with the input signal.
– Additive Noises and Reverberation: combining the additive noises and reverberation augmentation on a single input signal.
– Speed Perturbation: speeding up and slowing down the input signal. To avoid changing the speaker characteristics too much, speed perturbation is restricted to a maximum of ±5%.
– Waveform dropout: replacing random chunks of the input waveform with zeros.
– Frequency dropout: filtering the input signal with random band-stop filters to add zeros in the frequency spectrum.
The output of this stage is one 128-dimensional embedding vector for each sub-segment.
• Stage 5 - PLDA Scoring: The PLDA back-end, trained on embeddings extracted with the same embeddings extractor from the last stage, is used to score the similarity (a single float value) between every pair of sub-segment embeddings, forming an affinity scoring matrix.
• Stage 6 - Agglomerative Hierarchical Clustering: Sub-segment embeddings are clustered using the Agglomerative Hierarchical Clustering (AHC) method, in which the distance function directly takes results from the affinity scoring matrix generated in the last stage (see the sketch after this list). This method supports clustering with either a known or an unknown number of speakers; the latter case requires a pre-defined decision threshold. At the end of this stage, each sub-segment embedding vector, or the sub-segment itself, is tagged with a number from 1 to K - the number of distinguished speakers in the audio input. Each sub-segment's begin time and duration, along with the speaker tag, are all recorded in the output Rich Transcription Time Marked (RTTM) file [1].
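To make stages 3 and 6 more concrete, the Python sketch below (an illustration using scipy, not the exact Kaldi-based implementation used in this thesis) performs {L : L/2} uniform sub-segmentation of one speech partition and agglomerative hierarchical clustering over a precomputed PLDA similarity matrix:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def uniform_subsegments(start, end, L=3.0):
        # Stage 3: split a speech partition [start, end] into {L : L/2} windows.
        subs, t = [], start
        while t + L <= end:
            subs.append((t, t + L))
            t += L / 2
        if t < end:                                  # keep the trailing remainder as a shorter window
            subs.append((t, end))
        return subs

    def ahc_from_similarity(sim, num_speakers=None, dist_threshold=None):
        # Stage 6: AHC on an (N x N) similarity matrix (higher = more similar).
        dist = sim.max() - sim                       # convert similarities into distances
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")
        if num_speakers is not None:                 # clustering with a known speaker count
            return fcluster(Z, t=num_speakers, criterion="maxclust")
        return fcluster(Z, t=dist_threshold, criterion="distance")  # pre-defined threshold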
3.4.2 Speaker Verification System
The main purpose of the implemented speaker verification system is to evaluate the
discriminative characteristic of extracted embeddings, without being affected by other
modules in a traditional speaker diarization system. In other words, the use of this
system is to improve and optimize the baseline diarization system in terms of speaker
representations.
From a given data set with speaker information, one can generate a verification list - a list of utterance pairs with ground truth values that tell whether each of those pairs is from the same speaker or not. A crucial assumption is that in the enrolled or questioned utterances, there is only one speaker participating.
FIGURE 3.9: Baseline speaker diarization system diagram (Phase 1: embeddings extractor training; Phase 2: PLDA backend training; Phase 3: speaker diarization with WebRTC VAD, uniform segmentation, X-Vectors extraction, PLDA scoring and agglomerative hierarchical clustering; training data: IPCC_100K train + dev).
The speaker verification system takes the verification list as input and gives a scoring list as output, which provides a similarity value for each pair in the verification list. The higher the similarity, the higher the chance that the two utterances are from the same speaker. Then, by choosing a decision threshold, which is the minimum value that the similarity needs to reach for the pair of utterances to be determined as coming from the same speaker, one can generate a list of predictions. By comparing the verification list with the prediction list, binary classification metrics [87] including the False Positive Rate (FPR) and the False Negative Rate (FNR) are calculated.
FIGURE 3.10: Baseline speaker verification system diagram (Phase *: speaker verification for optimization, reusing the X-Vectors embeddings extractor from Phase 1 and the PLDA backend from Phase 2 to score each verification pair and compute the EER and EER threshold).
By repeatedly choosing all similarity values in the scoring list as decision thresholds, a plot of the rates (FPR and FNR) against the decision threshold can be made. The intersection of the FPR and FNR lines on the plot is the Equal Error Rate (EER) (as its name suggests, it is where the rates of the two errors are equal). Another important metric, usually reported together with the EER, is the minimum detection cost function (MinDCF). The detailed computation methods for these metrics are described in section 3.1.1.
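As a minimal numpy sketch of this procedure (an illustration, not the exact evaluation script used in this work), the EER can be estimated directly from the lists of target and nontarget scores:

    import numpy as np

    def compute_eer(target_scores, nontarget_scores):
        # Sweep every score as a decision threshold and return the EER and its threshold.
        scores = np.concatenate([target_scores, nontarget_scores])
        labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
        order = np.argsort(scores)
        scores, labels = scores[order], labels[order]
        # At the threshold just above scores[i]: pairs up to i are rejected, the rest accepted.
        fnr = np.cumsum(labels) / labels.sum()                      # misses among target pairs
        fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()      # false alarms among nontargets
        idx = np.argmin(np.abs(fnr - fpr))
        return (fnr[idx] + fpr[idx]) / 2.0, scores[idx]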
The main computation pipeline, which takes a pair of utterances in the verification list
and gives their similarity, is divided into the following stages, in order:
• Stage 1 - Front-end Processing: The enrolled and the questioned utterance are
windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-
dimensional MFCCs.
• Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from the enrolled and the questioned utterances. The working configuration of WebRTC VAD is the same as the configuration described in the voice activity detection stage in section 3.4.1 (a small usage sketch follows this list).
• Stage 3 - Embeddings Extraction: The speech partitions extracted from each utterance in stage 2 are concatenated together and an embeddings extractor is employed to extract speaker embeddings. It is the same extractor used in section 3.4.1. The output of this stage is two 128-dimensional embedding vectors that respectively represent the enrolled and the questioned utterances.
• Stage 4 - PLDA Scoring: The PLDA back-end that was trained on embeddings
extracted using the same embeddings extractor from the last step is used to score the
similarity (a single float value) between the enrolled and the questioned utterances.
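The WebRTC VAD step used in both pipelines can be reproduced approximately with the Python webrtcvad wrapper; note that this wrapper exposes aggressiveness modes 0-3 (3 being the most aggressive), and the file path below is hypothetical:

    import webrtcvad
    import soundfile as sf

    def webrtc_speech_chunks(wav_path, frame_ms=30, mode=3):
        # Mark each 30 ms chunk of an 8 kHz mono 16-bit recording as speech / non-speech.
        vad = webrtcvad.Vad(mode)
        signal, sr = sf.read(wav_path, dtype="int16")
        frame_len = int(sr * frame_ms / 1000)
        decisions = []
        for i in range(0, len(signal) - frame_len + 1, frame_len):
            frame = signal[i:i + frame_len].tobytes()
            decisions.append(vad.is_speech(frame, sr))
        return decisions                              # adjacent True chunks are later merged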
3.5 Proposed System
In recent years, ECAPA-TDNN, a development over X-Vectors' neural network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. Tables 3.6 and 3.7 report how ECAPA-TDNN outperforms a strong X-Vector baseline system, as experimented in [62], in both the speaker verification task and the speaker diarization task on English corpora.
TABLE 3.6: EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
TABLE 3.7: Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal for baseline and proposed systems (taken from [88]).
In the proposed system, the X-Vectors-based extractor is replaced with an ECAPA-TDNN-based extractor, and the PLDA backend is trained with ECAPA-TDNN embeddings instead. The employed ECAPA-TDNN embeddings extractor is trained on the same data set and with the same data augmentation strategies as the X-Vectors embeddings extractor. The network architecture is kept as described in section 2.4.2. The number of MFCCs taken is reduced from 80 down to 40, the minimum learning rate is lowered by a factor of 10, and the number of epochs is doubled from 10 to 20. Figures 3.11 and 3.12 visualize the proposed system.
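For illustration, SpeechBrain exposes a pretrained ECAPA-TDNN speaker encoder of essentially the same architecture; the snippet below uses the public VoxCeleb checkpoint rather than the IPCC_110000-trained model of this thesis, and the input file name is hypothetical:

    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    # Public ECAPA-TDNN checkpoint (VoxCeleb); the thesis model is trained on IPCC_110000 instead.
    encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb",
                                             savedir="pretrained_ecapa")

    signal, sr = torchaudio.load("example_utterance.wav")    # hypothetical input file
    embedding = encoder.encode_batch(signal)                  # shape: [1, 1, 192]
    print(embedding.squeeze().shape)                          # 192-dimensional speaker embedding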
FIGURE 3.11: Proposed speaker diarization system diagram (identical to the baseline pipeline of Figure 3.9, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
FIGURE 3.12: Proposed speaker verification system diagram (identical to the baseline pipeline of Figure 3.10, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
Chapter 4
Results
4.1 Speaker Verification Task
In this task, both the baseline and proposed speaker verification sub-systems are tested with different PLDA dimension reduction configurations. The dimension reduction ratios r_i are taken from the set {0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1.00}. The corresponding reduced, or target, dimension D_i is calculated by equation 4.1, where V is the original embedding dimension, which is 128 in the case of the baseline system and 192 in the case of the proposed system. The PLDA backend is trained on the same training data set as the embeddings extractor.
D_i = 4 · ⌊ (r_i · V) / 4 ⌋                                                  (4.1)
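For example, with r_i = 0.90 and V = 192 (the proposed system), equation 4.1 gives D_i = 4 · ⌊0.90 · 192 / 4⌋ = 4 · ⌊43.2⌋ = 4 · 43 = 172, matching the target dimension listed for the 0.90 ratio in Table 4.1.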
As reported in Table 4.1, the proposed system with the ECAPA-TDNN architecture outperforms the baseline system regarding both EER and MinDCF performance in all test
cases. In the tests with IPCC_110000 test split, the proposed system gives 64.5% relative
improvement in EER, with corresponding 82.4% and 86.5% relative improvements on
MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements on MinDCF
are smaller in the tests with VTR_1350, where the proposed system gives 66.1% relative improvement in EER, with corresponding 64.7% and 63.1% relative improvements on MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements given by the proposed system in the tests with ZALO_400 are smaller than in both of the mentioned tests, but still significant: it gives 45.6% relative improvement in EER, with corresponding 22.5% and 22.5% relative improvements on MinDCF(p=0.01) and MinDCF(p=0.001) respectively.
Furthermore, both systems show a consistent degradation in terms of equal error rate (EER) as the target dimension is reduced further, with only two exceptions. The first exception occurs in the tests with the IPCC_110000 test split: the EER of the proposed system falls from 1.44% down to 1.39% and then rises to 1.54%, when the target dimension falls from 192 to 180, and then to 172. The second exception occurs in the tests with the ZALO_400 data set, where the EER of the proposed system falls from 8.08% down to 7.91% and then rises to 8.16%, when the target dimension experiences the same changes mentioned in the first exception. However, in both of these exceptions the swings are insignificant and do not affect the trend of the EER. As for MinDCF, this metric does not show a clear trend against the reduction of embedding dimensions.
IPCC_110000 (test split)
(8000Hz, K=3, # Trials=3888)
X-Vector
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 128 0.6240 0.8112 3.91
0.95 120 0.6183 0.8117 3.96
0.90 112 0.6183 0.8066 4.01
0.85 108 0.6163 0.8102 4.17
0.80 100 0.6317 0.8050 4.27
0.70 88 0.6497 0.8020 4.32
0.60 76 0.6445 0.8138 4.53
0.50 64 0.6533 0.7917 4.78
ECAPA-TDNN
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 192 0.1024 0.1085 1.44
0.95 180 0.1065 0.1080 1.44
0.90 172 0.1096 0.1096 1.39
0.85 160 0.0998 0.0998 1.54
0.80 152 0.1070 0.1070 1.54
0.70 132 0.0983 0.0983 1.65
0.60 112 0.1101 0.1101 1.85
0.50 96 0.1240 0.1240 2.01
VTR_1350
(8000Hz (resampled), K=3, # Trials=7902)
X-Vector
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 128 0.7588 0.8651 9.69
0.95 120 0.7603 0.8641 9.82
0.90 112 0.7472 0.8659 9.87
0.85 108 0.7325 0.8669 9.92
0.80 100 0.7327 0.8502 10.02
0.70 88 0.7423 0.8322 10.10
0.60 76 0.7261 0.8461 10.30
0.50 64 0.7459 0.8494 10.48
ECAPA-TDNN
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 192 0.2680 0.3192 3.29
0.95 180 0.2721 0.3680 3.49
0.90 172 0.2797 0.3936 3.54
0.85 160 0.3055 0.4052 3.59
0.80 152 0.2971 0.3941 3.62
0.70 132 0.3214 0.4432 3.70
0.60 112 0.3774 0.4450 3.77
0.50 96 0.3991 0.4938 3.77
ZALO_400
(8000Hz (resampled), K=3, # Trials=2376)
X-Vector
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 128 0.9470 0.9470 14.39
0.95 120 0.9562 0.9562 14.56
0.90 112 0.9444 0.9444 14.65
0.85 108 0.9444 0.9444 14.73
0.80 100 0.9402 0.9402 14.90
0.70 88 0.9478 0.9478 14.90
0.60 76 0.9621 0.9621 15.24
0.50 64 0.9739 0.9739 14.65
ECAPA-TDNN
PLDA MinDCF EER
ratio dim p=0.01 p=0.001 (%)
1.00 192 0.7340 0.7340 7.83
0.95 180 0.7424 0.7424 8.08
0.90 172 0.7306 0.7306 7.91
0.85 160 0.7214 0.7214 8.16
0.80 152 0.6987 0.6987 8.16
0.70 132 0.7079 0.7079 8.67
0.60 112 0.7374 0.7374 9.34
0.50 96 0.7332 0.7332 9.93
TABLE 4.1: EER and MinDCF performance.
4.2 Speaker Diarization Task
In this task, the whole baseline and proposed systems are tested with mock conversations consisting of different numbers of engaging speakers, with different uniform sub-segmenting configurations. In this test, oracle VAD (i.e. ground-truth VAD) is used to diminish the effect of any voice activity detection module, PLDA scoring is carried out without dimension reduction, and the exact number of engaging speakers in each conversation is known before the clustering process. Results are reported in Table 4.2, where {x : y} represents a uniform segmentation configuration of windows of length x seconds with y seconds of overlap.
FIGURE 4.1: A speaker diarization output of a 3-way conversation in the VTR_1350 test set.
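For reference, the DER of a single conversation can be computed from reference and hypothesis annotations with pyannote.metrics; the snippet below is purely an illustration (the segment boundaries and labels are hypothetical, and this is not necessarily the evaluation tooling used in this thesis):

    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    reference = Annotation()                       # oracle segments from the mock RTTM
    reference[Segment(0.0, 3.1)] = "spk_A"
    reference[Segment(3.4, 6.0)] = "spk_B"

    hypothesis = Annotation()                      # system output after clustering
    hypothesis[Segment(0.0, 3.0)] = "1"
    hypothesis[Segment(3.0, 6.0)] = "2"

    der = DiarizationErrorRate()
    print(f"DER = {der(reference, hypothesis):.2%}")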
Both systems perform relatively well with the IPCC_110000 test split's mock conversations, where DERs are all below 4.5 percent. These results match the fact that these conversations are generated from the data set that is in-domain with the embedding extractor's training data set. The results with ZALO_400 mock conversations are much worse, with the DER going up to 17.15% in the case of the baseline system and 11.20% in the case of the proposed system. With VTR_1350 mock conversations, the results are worse still: the DER goes up to 24.25% in the case of the baseline system and 22.33% in the case of the proposed system.
In most cases, the proposed system with ECAPA-TDNN outperforms the baseline system, with most conversation types and all sub-segmentation configurations. For each set of conversations with the same number of participating speakers, the best DER among the different sub-segmenting configurations of the proposed system usually outperforms that of the baseline system by 30% to 70%.
Furthermore, while both systems perform better with a wider provided context (i.e. a larger sub-segmentation window size), the proposed system does not show a significantly larger relative DER reduction as the context window size grows. In other words, the proposed system's DER would be expected to decrease faster than the baseline system's DER, since ECAPA-TDNN can theoretically make better use of the wide context thanks to its attention mechanism, but it does not. This experiment suggests that in the speaker diarization task, where speech segments are sub-segmented into sub-segments shorter than 4 seconds, the attention over the time dimension is not very effective.
IPCC_110000.test - Mock Conversations
(8000Hz; 2-way, 3-way, 4-way and 5-way conversations (200 each))
# spk subseg. DER (%)
X-Vector ECAPA
2 {1.5 : 0.75} 2.93 2.66
2 {2.0 : 1.00} 2.72 2.06
2 {3.0 : 1.50} 2.54 1.70
2 {4.0 : 2.00} 2.17 1.50
3 {1.5 : 0.75} 2.65 2.65
3 {2.0 : 1.00} 2.72 1.36
3 {3.0 : 1.50} 2.34 0.93
3 {4.0 : 2.00} 2.10 0.89
# spk subseg. DER (%)
X-Vector ECAPA
4 {1.5 : 0.75} 4.35 4.33
4 {2.0 : 1.00} 4.43 2.82
4 {3.0 : 1.50} 3.53 2.36
4 {4.0 : 2.00} 2.88 2.93
5 {1.5 : 0.75} 5.27 3.88
5 {2.0 : 1.00} 3.78 2.63
5 {3.0 : 1.50} 3.26 2.91
5 {4.0 : 2.00} 3.03 3.14
VTR_1350 - Mock Conversations
(8000Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
# spk subseg. DER (%)
X-Vector ECAPA
2 {1.5 : 0.75} 11.31 18.59
2 {2.0 : 1.00} 9.32 8.41
2 {3.0 : 1.50} 5.31 2.66
2 {4.0 : 2.00} 2.45 1.31
3 {1.5 : 0.75} 17.77 20.13
3 {2.0 : 1.00} 12.52 12.80
3 {3.0 : 1.50} 8.39 5.63
3 {4.0 : 2.00} 7.08 3.42
# spk subseg. DER (%)
X-Vector ECAPA
4 {1.5 : 0.75} 21.18 21.63
4 {2.0 : 1.00} 16.54 12.71
4 {3.0 : 1.50} 10.34 5.34
4 {4.0 : 2.00} 8.20 4.44
5 {1.5 : 0.75} 24.25 22.33
5 {2.0 : 1.00} 18.24 11.70
5 {3.0 : 1.50} 12.16 4.90
5 {4.0 : 2.00} 8.63 2.95
ZALO_400 - Mock Conversations
(8000Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
# spk subseg. DER (%)
X-Vector ECAPA
2 {1.5 : 0.75} 5.62 4.86
2 {2.0 : 1.00} 4.78 3.60
2 {3.0 : 1.50} 5.40 2.72
2 {4.0 : 2.00} 6.61 6.06
3 {1.5 : 0.75} 10.85 7.11
3 {2.0 : 1.00} 10.11 6.15
3 {3.0 : 1.50} 9.72 5.77
3 {4.0 : 2.00} 10.92 7.59
# spk subseg. DER (%)
X-Vector ECAPA
4 {1.5 : 0.75} 15.35 8.85
4 {2.0 : 1.00} 12.83 6.59
4 {3.0 : 1.50} 11.06 6.14
4 {4.0 : 2.00} 12.17 6.58
5 {1.5 : 0.75} 17.15 11.20
5 {2.0 : 1.00} 13.79 8.72
5 {3.0 : 1.50} 12.31 6.76
5 {4.0 : 2.00} 12.44 7.09
TABLE 4.2: DER performance.
Chapter 5
Conclusions and Future Works
In this thesis, a new deep neural network architecture, ECAPA-TDNN, was experimented with in comparison to the baseline system based on X-Vectors and showed significant overall improvements. The proposed system outperformed the baseline on all Vietnamese data sets and on both tasks: speaker verification and speaker diarization. Thanks to the attention mechanism that operates on both the time and feature channels, the proposed network can learn which data in the context and which features are more important. This context-aware characteristic of ECAPA-TDNN is remarkably important since different languages have different ways of constructing sentences, and the word positioning in Vietnamese is totally different from that of English or French. In this sense, ECAPA-TDNN can be adapted to a wide variety of languages with different writing styles, and indeed it worked with Vietnamese conversations. The following are some highlighted pros and cons of using ECAPA-TDNN in the speaker diarization system:
Pros:
• ECAPA-TDNN provides context-aware embeddings, with attention on both time and feature channels, that work exceptionally well with Vietnamese data.
• Based entirely on the PyTorch framework, ECAPA-TDNN is much easier to train, test and customize than the Kaldi-based X-Vectors extractor.
Cons:
• Both the training and inference processes are slower due to the complexity of the network. With an NVIDIA A100 GPU, it still takes 80 hours to complete 20 training epochs on the IPCC_110000 data set.
• The network is not yet production-ready, while the X-Vectors network trained with Kaldi has long been used in production, for both the speaker verification and diarization systems.
Further research directions that can be explored to improve the understanding of ECAPA-TDNN's capability in speaker diarization include:
• Trial and error with more configurations (in this thesis, only some minor changes were made to the original network configuration).
• Exploring other types of clustering methods.
• Studying how effective the proposed system is in the case of conversations with multiple overlaps.
• Applying post-processing methods to the diarization result.
• Building a Vietnamese conversation data set based on real conversations.
Bibliography
[1] Omid Sadjadi et al. NIST 2021 Speaker Recognition Evaluation Plan. 2021. URL: https://tsapps.nist.gov/publication/get%5Fpdf.cfm?pub%5Fid=932697.
[2] David Arthur and Sergei Vassilvitskii. "k-means++: the advantages of careful seeding". In: SODA '07. 2007.
[3] Dan Pelleg and Andrew W. Moore. "X-means: Extending K-means with Efficient Estimation of the Number of Clusters". In: ICML. 2000.
[4] Aonan Zhang et al. "Fully Supervised Speaker Diarization". In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), pp. 6301–6305.
[5] Shota Horiguchi et al. "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors". In: ArXiv abs/2005.09921 (2020).
[6] Yuki Takashima et al. "End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection". In: 2021 IEEE Spoken Language Technology Workshop (SLT) (2021), pp. 849–856.
[7] Tsun-Yat Leung and Lahiru Samarakoon. "Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty". In: Interspeech 2021 (2021).
[8] Niruhan Viswarupan. K-Means Data Clustering. 2017. URL: https://towardsdatascience.com/k-means-data-clustering-bce3335d2203 (visited on 12/09/2021).
[9] Sabur Ajibola Alim and Nahrul Khair Alang Rashid. "From Natural to Artificial Intelligence - Algorithms and Applications". In: IntechOpen, 2018. Chap. 1.
[10] Urmila Shrawankar and Vilas M. Thakare. "Techniques for Feature Extraction In Speech Recognition System: A Comparative Study". In: CoRR abs/1305.1145 (2013). arXiv: 1305.1145. URL: http://arxiv.org/abs/1305.1145.
[11] Smita Magre, Pooja Janse, and Ratnadeep Deshmukh. "A Review on Feature Extraction and Noise Reduction Technique". In: (Feb. 2014).
[12] Bob Meddins. "5 - The design of FIR filters". In: Introduction to Digital Signal Processing. Ed. by Bob Meddins. Oxford: Newnes, 2000, pp. 102–136. ISBN: 978-0-7506-5048-9. DOI: https://doi.org/10.1016/B978-075065048-9/50007-6. URL: https://www.sciencedirect.com/science/article/pii/B9780750650489500076.
[13] Torben Poulsen. "Loudness of tone pulses in a free field". In: Acoustical Society of America Journal 69.6 (June 1981), pp. 1786–1790. DOI: 10.1121/1.385915.
[14] Stanislas Dehaene. "The neural basis of the Weber–Fechner law: a logarithmic mental number line". In: Trends in Cognitive Sciences 7.4 (2003), pp. 145–147.
[15] S. S. Stevens. "A Scale for the Measurement of the Psychological Magnitude Pitch". In: Acoustical Society of America Journal 8.3 (Jan. 1937), p. 185. DOI: 10.1121/1.1915893.
[16] Robert B. Randall. "A history of cepstrum analysis and its application to mechanical problems". In: Mechanical Systems and Signal Processing 97 (2017). Special Issue on Surveillance, pp. 3–19. ISSN: 0888-3270. DOI: https://doi.org/10.1016/j.ymssp.2016.12.026. URL: https://www.sciencedirect.com/science/article/pii/S0888327016305556.
[17] Philipos C. Loizou. Speech Enhancement: Theory and Practice. 2nd. USA: CRC Press, Inc., 2013. ISBN: 1466504218.
[18] Xugang Lu et al. "Speech enhancement based on deep denoising autoencoder". In: INTERSPEECH. 2013.
[19] Yong Xu et al. "A Regression Approach to Speech Enhancement Based on Deep Neural Networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.1 (2015), pp. 7–19. DOI: 10.1109/TASLP.2014.2364452.
[20] Hakan Erdogan et al. "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks". In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), pp. 708–712.
[21] Tian Gao et al. "Densely Connected Progressive Learning for LSTM-Based Speech Enhancement". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 5054–5058. DOI: 10.1109/ICASSP.2018.8461861.
[22] Desh Raj et al. Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis. 2020. arXiv: 2011.02014 [eess.AS].
[23] Gregory Sell et al. "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge". In: INTERSPEECH. 2018.
[24] Neville Ryant et al. The Second DIHARD Diarization Challenge: Dataset, task, and baselines. 2019. arXiv: 1906.07839 [eess.AS].
[25] Mireia Díez et al. "BUT System for DIHARD Speech Diarization Challenge 2018". In: INTERSPEECH. 2018.
[26] Shinji Watanabe et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings". In: Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020). 2020, pp. 1–7. DOI: 10.21437/CHiME.2020-1.
[27] Ashish Arora et al. The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge. 2020. arXiv: 2006.07898 [eess.AS].
[28] Wikipedia contributors. Maximum likelihood estimation — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=1051139067. [Online; accessed 17-November-2021]. 2021.
[29] John R. Hershey et al. Deep clustering: Discriminative embeddings for segmentation and separation. 2015. arXiv: 1508.04306 [cs.NE].
[30] Morten Kolbæk et al. Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks. 2017. arXiv: 1703.06284 [cs.SD].
[31] Yi Luo and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019), pp. 1256–1266. ISSN: 2329-9304. DOI: 10.1109/taslp.2019.2915167. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167.
[32] Xiong Xiao et al. Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020. 2020. arXiv: 2010.11458 [eess.AS].
[33] Arsha Nagrani et al. VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge. 2020. arXiv: 2012.06867 [cs.SD].
[34] Takuya Yoshioka et al. "Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks". In: Interspeech 2018 (2018). DOI: 10.21437/interspeech.2018-2284. URL: http://dx.doi.org/10.21437/Interspeech.2018-2284.
[35] Christoph Boeddecker et al. "Front-end processing for the CHiME-5 dinner party scenario". In: Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018). 2018, pp. 35–40. DOI: 10.21437/CHiME.2018-8.
[36] Wikipedia contributors. Audacity (audio editor) — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Audacity_(audio_editor)&oldid=1054771106. [Online; accessed 17-November-2021]. 2021.
[37] A. Benyassine et al. "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications". In: IEEE Communications Magazine 35.9 (1997), pp. 64–73. DOI: 10.1109/35.620527.
[38] Jongseo Sohn and Wonyong Sung. "A voice activity detector employing soft decision based noise spectrum adaptation". In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181). Vol. 1. 1998, pp. 365–368. DOI: 10.1109/ICASSP.1998.674443.
[39] Monica Franzese and Antonella Iuliano. "Hidden Markov Models". In: Encyclopedia of Bioinformatics and Computational Biology. Ed. by Shoba Ranganathan et al. Oxford: Academic Press, 2019, pp. 753–762. ISBN: 978-0-12-811432-2. DOI: https://doi.org/10.1016/B978-0-12-809633-8.20488-3. URL: https://www.sciencedirect.com/science/article/pii/B9780128096338204883.
[40] Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. "A statistical model-based voice activity detection". In: IEEE Signal Processing Letters 6.1 (1999), pp. 1–3. DOI: 10.1109/97.736233.
[41] Jacob Benesty, M Mohan Sondhi, Yiteng Huang, et al. Springer Handbook of Speech Processing. Vol. 1. Springer, 2008.
[42] Wikipedia contributors. WebRTC — Wikipedia, The Free Encyclopedia. [Online; accessed 17-November-2021]. 2021. URL: https://en.wikipedia.org/w/index.php?title=WebRTC&oldid=1053350113.
[43] Webrtc/common_audio/VAD - external/webrtc - git at google. URL: https://chromium.googlesource.com/external/webrtc/+/branch-heads/43/webrtc/common%5Faudio/vad/.
[44] Thad Hughes and Keir Mierle. "Recurrent neural networks for voice activity detection". In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013, pp. 7378–7382. DOI: 10.1109/ICASSP.2013.6639096.
[45] Jesus Lopez et al. "Advances in Speaker Recognition for Telephone and Audio-Visual Data: the JHU-MIT Submission for NIST SRE19". In: Nov. 2020, pp. 273–280. DOI: 10.21437/Odyssey.2020-39.
[46] Matthew A. Siegler. "Automatic Segmentation, Classification and Clustering of Broadcast News Audio". In: 1997.
[47] "Step-by-step and integrated approaches in broadcast news speaker diarization". In: Computer Speech & Language 20.2 (2006). Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 303–330. ISSN: 0885-2308. DOI: https://doi.org/10.1016/j.csl.2005.08.002. URL: https://www.sciencedirect.com/science/article/pii/S0885230805000471.
[48] Scott Chen. "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion". In: 1998.
[49] Perrine Delacourt and Christian Wellekens. "DISTBIC: A speaker-based segmentation for audio data indexing". In: Speech Communication 32 (Sept. 2000), pp. 111–126. DOI: 10.1016/S0167-6393(00)00027-3.
[50] Simon Prince and James H. Elder. "Probabilistic Linear Discriminant Analysis for Inferences About Identity". In: 2007 IEEE 11th International Conference on Computer Vision (2007), pp. 1–8.
[51] Daniel Garcia-Romero and Carol Y. Espy-Wilson. "Analysis of i-vector Length Normalization in Speaker Recognition Systems". In: INTERSPEECH. 2011.
[52] In: ().
[53] Gregory Sell and Daniel Garcia-Romero. "Speaker diarization with plda i-vector scoring and unsupervised calibration". In: 2014 IEEE Spoken Language Technology Workshop (SLT) (2014), pp. 413–417.
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...JojoEDelaCruz
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 

Último (20)

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 

A study on improving speaker diarization system = Nghiên cứu phương pháp cải thiện chất lượng hệ thống ghi nhật ký người nói.pdf

Contents

Declaration of Authorship
Abstracts
Acknowledgements
1 Introduction
  1.1 Research Interest
  1.2 Thesis Outline
2 Speaker Diarization System
  2.1 Front-end Processing
    2.1.1 Features Extraction
    2.1.2 Front-end Post-processing
      2.1.2.1 Speech Enhancement
      2.1.2.2 De-reverberation
      2.1.2.3 Speech Separation
  2.2 Voice Activity Detection
  2.3 Segmentation
  2.4 Speaker Representations
    2.4.1 X-Vector Embeddings
      2.4.1.1 Frame Level
      2.4.1.2 Segment Level
    2.4.2 ECAPA-TDNN Embeddings
      2.4.2.1 Frame Level
      2.4.2.2 Segment Level
  2.5 Clustering
    2.5.1 PLDA Scoring
    2.5.2 Agglomerative Hierarchical Clustering
3 Experiments
  3.1 Evaluation Metrics
    3.1.1 Equal Error Rate and Minimum Decision Cost Function
    3.1.2 Diarization Error Rate
  3.2 Frameworks
    3.2.1 Kaldi
    3.2.2 SpeechBrain
    3.2.3 Kal-Star
  3.3 Data Sets
    3.3.1 IPCC_110000
    3.3.2 VTR_1350
    3.3.3 ZALO_400
  3.4 Baseline System
    3.4.1 Speaker Diarization System
    3.4.2 Speaker Verification System
  3.5 Proposed System
4 Results
  4.1 Speaker Verification Task
  4.2 Speaker Diarization Task
5 Conclusions and Future Works
List of Figures

1.1 A traditional speaker diarization system diagram.
1.2 An example speaker diarization result.
1.3 An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.
1.4 Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training, and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
1.5 Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in figure 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in the system of figure 1.4.
2.1 Diagram of an F-banks / MFCCs extraction process (adapted from [11]).
2.2 N=10 Mel filters for signal samples sampled at 16000Hz.
2.3 Example output of a VAD system visualized in Audacity (audio editor) [36].
2.4 Diagram of the X-Vectors DNN (adapted from [58]).
2.5 Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]).
2.6 Diagram of X-Vectors' segment-level DNN (as configured in [59]).
2.7 Complete network architecture of ECAPA-TDNN (adapted from [62]).
2.8 Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context {-4, 0, 4}.
2.9 A 1D Squeeze-and-Excitation block. Different colors represent different scales for channels.
2.10 A Res2Net-with-Squeeze-Excitation block.
2.11 Attentive statistics pooling (on both time frames and channels).
2.12 An example of an LDA transformation from 2D to 1D (taken from [76]).
2.13 Fitting the parameters of the PLDA model (taken from [77]).
2.14 Agglomerative hierarchical clustering flowchart.
2.15 An example iterative process of agglomerative hierarchical clustering (taken from [80]).
2.16 Visualization of the result of hierarchical clustering (taken from [80]).
3.1 An EER plot.
3.2 Kaldi logo.
3.3 Kaldi general architecture diagram.
3.4 Filtering the VTR_1350 data set by utterances' durations and number of utterances per speaker.
3.5 Generating 200 5-way conversations from the VTR_1350 data set. The min. and max. numbers of utterances picked from each conversation are 2 and 30 respectively.
3.6 IPCC_110000 data distributions.
3.7 VTR_1350 data distributions.
3.8 ZALO_400 data distributions.
3.9 Baseline speaker diarization system diagram.
3.10 Baseline speaker verification system diagram.
3.11 Proposed speaker diarization system diagram.
3.12 Proposed speaker verification system diagram.
4.1 A speaker diarization output of a 3-way conversation in the VTR_1350 test set.
List of Tables

3.1 List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
3.2 IPCC_110000 data set overview.
3.3 IPCC_110000 data subsets.
3.4 VTR_1350 data set overview.
3.5 ZALO_400 data set overview.
3.6 EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
3.7 Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal on baseline and proposed systems (taken from [88]).
4.1 EER and MinDCF performance.
4.2 DER performance.
List of Abbreviations

IPCC: IP Contact Center
DNN: Deep Neural Network
CNN: Convolutional Neural Network
TDNN: Time-Delayed Neural Network
RTTM: Rich Transcription Time Marked
RNN: Recurrent Neural Network
LPC: Linear Prediction Coding
PLP: Perceptual Linear Prediction
DWT: Discrete Wavelet Transform
MFBC: Mel Filterbank Coefficients
MFCC: Mel Frequency Cepstral Coefficients
STFT: Short-Time Discrete Fourier Transform
DCT: Discrete Cosine Transform
WPE: Weighted Prediction Error
MLE: Maximum Likelihood Estimation
PIT: Permutation Invariant Training
VAD: Voice Activity Detection
SAD: Speech Activity Detection
HMM: Hidden Markov Model
GMM: Gaussian Mixture Model
GLR: Generalized Likelihood Ratio
BIC: Bayesian Information Criterion
UBM: Universal Background Model
LDA: Linear Discriminant Analysis
PLDA: Probabilistic Linear Discriminant Analysis
LSTM: Long Short-Term Memory
SE-Res2Net: Res2Net-with-Squeeze-Excitation
ReLU: Rectified Linear Unit
AAM: Additive Angular Margin
AHC: Agglomerative Hierarchical Clustering
EER: Equal Error Rate
CER: Crossover Error Rate
FAR: False Acceptance Rate
FRR: False Rejection Rate
TPR: True Positive Rate
FPR: False Positive Rate
FNR: False Negative Rate
MinDCF: Minimum Decision Cost Function
DER: Diarization Error Rate
PCM: Pulse-Code Modulation
SNR: Signal-to-Noise Ratio
Chapter 1
Introduction

1.1 Research Interest

Speaker diarization, usually referred to as "who spoke when", is the method of dividing a conversation that often includes a number of speakers into segments spoken by the same speaker. This task is especially important to the Viettel IP contact center (IPCC) automatic quality assurance system, where hundreds of thousands of calls are processed every day and human resources are limited and costly. In scenarios where only single-channel recordings are provided, speaker diarization, integrated within a speaker recognition system, helps distinguish between agents and customers within each support call and gives further useful insights (e.g. agent attitude and customer satisfaction). Nevertheless, speaker diarization can also be applied in analyzing other forms of recorded conversations such as meetings, medical therapy sessions, court sessions, and talk shows.

Figure 1.1: A traditional speaker diarization system diagram.

A traditional speaker diarization system (figure 1.1) is built from six modules: front-end processing, voice activity detection, segmentation, speaker representation, clustering, and post-processing. All output information, including the number of speakers and the beginning time and duration of each of their speech segments, is encapsulated in the form of a Rich Transcription Time Marked (RTTM) file [1] (figure 1.2).
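For illustration, the sketch below serializes a toy diarization result into RTTM SPEAKER records, one line per speech segment; the recording name, channel, timestamps and speaker labels are made-up values, not output of the systems studied here.

# Illustrative sketch: writing diarization output as RTTM "SPEAKER" records.
# The recording name, times and speaker labels are made-up values.
def to_rttm(segments, recording="call_0001"):
    """segments: list of (start_seconds, duration_seconds, speaker_label)."""
    lines = ["SPEAKER {} 1 {:.2f} {:.2f} <NA> <NA> {} <NA> <NA>".format(
                 recording, start, dur, spk)
             for start, dur, spk in segments]
    return "\n".join(lines)

print(to_rttm([(0.00, 3.25, "agent"), (3.40, 2.10, "customer")]))
# SPEAKER call_0001 1 0.00 3.25 <NA> <NA> agent <NA> <NA>
# SPEAKER call_0001 1 3.40 2.10 <NA> <NA> customer <NA> <NA>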
Figure 1.2: An example speaker diarization result.

An important factor that affects speaker diarization accuracy is the number of participating speakers in the conversation. This number may be revealed to or hidden from the system before the diarization process, depending on the nature of the conversations. An example of the case where it is revealed is a check-up call between a doctor and a patient, or a support call between a customer and an agent, which is usually a conversation between only two people (i.e. a 2-way conversation), assuming no new speaker interrupts or joins the conversation. By acknowledging that the conversation has only a defined number of speakers, the speaker diarization system can simply slice the recorded conversation into smaller speech partitions and classify them into a known number of clusters (e.g. using k-means [2]).

However, in case the number of speakers is unknown to the system, the system must guess it first. The guessing ends when a stopping criterion (or a decision threshold) is met. For example, in a company meeting with shareholders, although the number of people participating in the meeting is on record, the number of people who actually speak in that meeting is unknown: while most people might remain silent throughout the meeting, only board members would take the mic. In this case, the diarization system can employ an unsupervised clustering method (e.g. X-means [3]) or a supervised clustering method (e.g. UIS-RNN [4]).

In fact, multiple attempts to build an end-to-end speaker diarization system without a clustering module have been made in [5], [6], and [7]. However, this thesis only focuses on a traditional speaker diarization system, which employs a speaker clustering module. Figure 1.3 demonstrates an example clustering result.
Figure 1.3: An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in 2D dimension.

In this case, extracting speaker representations with discriminative characteristics is extremely important, since these representations, as input data, have a huge influence on the accuracy of the clustering stage. The discriminative characteristics of a speaker-representation method can be tested indirectly via the speaker diarization system, which is the main focus of this thesis. On the other hand, it can be tested directly via a simple speaker verification system. A speaker verification system verifies the identity of a questioned speaker by comparing the voice data that supposedly belongs to him with his enrolled voice data. If the similarity between the enrolled and the input data is lower than a determined threshold, the impostor gets rejected.

In summary, the speaker diarization system and the speaker verification system are correlated in the sense that they use the same way of representing speakers. Hence, optimizations in the speaker verification system also lead to improvements in the speaker diarization system, which is the main approach of this thesis. Figure 1.4 demonstrates a generic system that employs both speaker diarization and speaker verification. The speaker verification system, employing the same embeddings extractor and PLDA backend used in the speaker diarization system, is primarily used for optimizing the diarization performance. In this thesis, both baseline and proposed systems are based on this generic model. It is also noted that the post-processing module is left out.
Figure 1.4: Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training, and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with.
Figure 1.5: Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in figure 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in the system of figure 1.4.
1.2 Thesis Outline

This thesis is organized into 5 chapters with the following contents:

Chapter 1: The current chapter gives general information about the research interest and a general overview of speaker diarization and its correlated task, speaker verification.

Chapter 2: This chapter presents the components of a speaker diarization system, including all components of the implemented speaker verification system. Notable topics are the X-Vector and ECAPA-TDNN speaker representations.

Chapter 3: This chapter discusses evaluation metrics, used data sets, and applied methods.

Chapter 4: This chapter closely examines the experiments' results.

Chapter 5: This chapter summarizes the work in this thesis and gives some future directions.
Chapter 2
Speaker Diarization System

2.1 Front-end Processing

The very first stage of a speaker diarization system (or, more generally speaking, a speech processing system) is front-end processing. At this stage, acoustic features are curated and processed in a way that is considered most favorable for the system. The features must strike a balance between simplicity (not much correlated and simple enough to be input to the learning network) and complexity (still possessing useful information, leaving room for the network to learn). Afterwards, some front-end post-processing techniques, such as speech enhancement, speech de-reverberation, speech separation and target speaker extraction, can be performed to further enhance the speech features towards better speaker diarization performance.

2.1.1 Features Extraction

There is a wide variety of methods to represent the speech signal parametrically, such as linear prediction coding (LPC), perceptual linear prediction (PLP), the discrete wavelet transform (DWT), Mel filterbank coefficients (MFBCs), and Mel frequency cepstral coefficients (MFCCs). However, in the last twenty years, the last two methods have emerged as the features of choice in the field of speech processing [9][10]. Figure 2.1 demonstrates a typical F-banks / MFCCs extraction process.

At the beginning of the F-banks / MFCCs extraction process, the input speech signal is divided into homogeneous overlapping short frames (in most cases short frames of 25ms with overlaps of 10ms) and windowed (usually with a Hamming / Hanning window) to reduce artifacts in the later signal transforms [12] (i.e. with a 16000Hz sampled speech signal, each frame contains 16000 * 25/1000 = 400 samples, with an overlap of 10/25 * 400 = 160 samples with the previous frame).

Figure 2.1: Diagram of an F-banks / MFCCs extraction process (adapted from [11]): framing and windowing, STFT, log-amplitude, N Mel filters, then summation for MFBCs or DCT for MFCCs.

Afterwards, the framed and windowed signal is analysed by a Short-Time Discrete Fourier Transform (STFT), through which the signal sample is converted from the amplitude-time domain to the amplitude-frequency domain. Next, the y-axis (amplitude) is converted to a log scale, as it better represents the human perception of loudness [13], according to the Weber-Fechner Law [14].

Then, the x-axis (frequency) is also converted, but into the mel scale [15]. The reason for this conversion is the fact that our human ears have lower resolution at higher frequencies than at lower frequencies (i.e. it is fairly easy for us to distinguish between sounds at 300Hz and 400Hz, but it gets much harder when we have to compare sounds at 1300Hz and 1400Hz, even though the difference is still 100Hz). The mel-scale formula was discovered purely via psychological experiments and has many variations, one of which can be expressed as follows:

m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) = 1127 \ln\left(1 + \frac{f}{700}\right)    (2.1)

At this stage, Mel filterbank coefficients (MFBCs) can be computed after 2 steps:

• Step 1: Apply a chosen number N of triangular band-pass filters linearly spaced on the Mel scale:
  - The lowest and highest frequencies of the bands correspond to the lowest and highest frequencies of the initially sampled signal.
  - In the mel-frequency domain, these filters have the same bandwidth, and they overlap each other by half of the bandwidth.
  - Each filter is a triangular filter with a frequency response of 1 at the center frequency, decreasing linearly towards zero until it reaches the center frequencies of the two adjacent filters.
• Step 2: Compute the N mean log-amplitude values of the N filtered signal samples (from the original signal sample). These values are taken as the N Mel filterbank coefficients.

For example, for signal samples sampled at 16000Hz, N=10 Mel filters can be visualized in figure 2.2.

Figure 2.2: N=10 Mel filters for signal samples sampled at 16000Hz.

Going further, to obtain Mel frequency cepstral coefficients (MFCCs), the following steps, which continue from step 2 of the MFBC computation above, are done:

• Step 1: Sum all N Mel-filtered signal samples to obtain a Mel-weighted signal sample.
• Step 2: Apply the discrete cosine transform (DCT) to transform the signal from the log-mel frequency domain to the quefrency [16] domain.
• Step 3: Take the first K coefficients (usually K=13) as K MFCCs. The next steps are optional and generate high-resolution MFCCs.
• Step 4: Compute the first-order and second-order time derivatives of each coefficient to yield K*2 more coefficients.
• Step 5: Compute the sum of squares of the amplitude of the signal sample to obtain one energy coefficient.

After this step, the total number of MFCCs is K*3+1 (i.e. K=13 corresponds to 40 MFCCs).
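The pipeline above can be condensed into a short script. Below is a minimal NumPy sketch of the described steps (framing and windowing, STFT power spectrum, N triangular Mel filters, log-amplitude, DCT). The 25 ms / 10 ms framing matches the text, while N=24 filters, a 512-point FFT and K=13 are illustrative assumptions; in practice, Kaldi's or SpeechBrain's feature extractors are used instead.

# Minimal NumPy sketch of the F-banks / MFCCs pipeline described above.
# N=24 filters, a 512-point FFT and K=13 are illustrative choices.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)          # equation (2.1)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the Mel scale (Step 1 above).
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

def mfbc_and_mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=24, K=13):
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)                    # windowing
    power = np.abs(np.fft.rfft(frames, n=512, axis=1)) ** 2    # STFT power spectrum
    mfbc = np.log(power @ mel_filterbank(n_filters, 512, sr).T + 1e-10)  # log Mel energies
    mfcc = dct(mfbc, type=2, axis=1, norm="ortho")[:, :K]      # DCT, keep first K
    return mfbc, mfcc

if __name__ == "__main__":
    audio = np.random.randn(16000)                             # 1 s of dummy audio at 16 kHz
    mfbc, mfcc = mfbc_and_mfcc(audio)
    print(mfbc.shape, mfcc.shape)                              # (frames, 24), (frames, 13)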
2.1.2 Front-end Post-processing

2.1.2.1 Speech Enhancement

Speech enhancement techniques primarily focus on diminishing noise in noisy audio. These techniques include classical signal-processing-based de-noising [17]; deep-learning-based de-noising [17][18][19][20][21]; and multi-channel processing [22].

2.1.2.2 De-reverberation

De-reverberation techniques are utilized to remove the effects of reverberation from the input signal. A popular method is Weighted Prediction Error (WPE), which is the dominant method used in top-performing systems in the DiHARD and CHiME competitions [23][24][25][26][27]. The basic idea of WPE is to decompose the original signal model into an early reflection and a late reverberation. It then tries to estimate a filter that maintains the early reflection while suppressing the late reverberation, based on the maximum likelihood estimation (MLE) method [28]. The improvement WPE gives is not large, but it is consistent across all tasks. It also shows additional performance improvements when applied to multi-channel signals.

2.1.2.3 Speech Separation

Speech separation is primarily useful when overlapping speech regions are significantly large. The two main branches under this approach are:

• Deep-learning based speech separation: Some early attempts are Deep Clustering [29], Permutation Invariant Training (PIT) [30] and Conv-TasNet [31]. However, single-channel speech separation systems often produce a redundant non-speech or even a duplicated speech signal for the non-overlap regions (leakage). Leakage filtering for single-channel systems was proposed and significantly improved speaker diarization performance [32][33].
• Beam-forming based speech separation: This method appears in top-performing systems in the CHiME-6 challenge [34][35].

2.2 Voice Activity Detection

Voice activity detection (VAD), also known as speech activity detection (SAD), is a technique to detect the presence or absence of human speech in a given audio signal, and it is an indispensable component of most speech processing systems:

• In a speech synthesis system, VAD helps remove noise in training data and thus reduces noise in the synthesized audio.
• In a speech recognition system, VAD helps drop noise frames to save computing power and reduce the number of insertion errors in decoded texts.
• In a speaker diarization system, VAD helps generate better speaker representations, which is the most important factor affecting the whole system's performance in terms of precision.

VAD systems can be classified into two types:

• Two-phase VAD systems, which mostly comprise two parts: a feature extraction front end, where acoustic features such as MFCCs are extracted; and a classifier, where a model predicts whether the input frame is speech or not.
• ASR-based VAD systems, where VAD timestamps are inferred directly from word alignments. In this case, the ASR system precedes the VAD system.

Figure 2.3: Example output of a VAD system visualized in Audacity (audio editor) [36].

VAD techniques have been developed sporadically throughout the years. In 1997, Benyassine et al. presented a silence compression scheme that reduces transmission bandwidth during silence periods [37]. This system employed a VAD algorithm that was later usually referred to as the G729B algorithm. In 1998, Sohn et al. introduced the first statistical model-based VAD that accounts for time-varying noise statistics [38]. Just one year later, they proposed an improved version with a Hidden Markov Model (HMM) [39] based hang-over scheme [40]. In 2011, Ying et al. proposed a Gaussian mixture model (GMM) [41] based VAD trained in an unsupervised learning framework. A popular implementation of this method is Google WebRTC VAD [42][43].

Later, with the rise of neural networks, in 2013, Thad Hughes introduced a recurrent neural network (RNN) based VAD [44], which was claimed to outperform existing GMM-based VADs. Then, in 2018, a team from Johns Hopkins University proposed a Time-Delay Neural Network VAD [45], which is trained using alignments from the GMM-HMM process and is known to run much faster than RNN-based VAD. In this thesis, Google WebRTC VAD is adopted as the main VAD technique, considering its simplicity of installation as well as its long use by Google in production environments. A short usage sketch is given below.
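As an illustration, the following is a minimal sketch of driving WebRTC VAD from Python through the webrtcvad package; the 30 ms frame size and aggressiveness level 2 are illustrative settings rather than the exact configuration used in this work.

# Minimal sketch: frame-level speech/non-speech decisions with the webrtcvad
# Python package. 30 ms frames and aggressiveness 2 are illustrative choices.
import webrtcvad

def vad_flags(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 30,
              aggressiveness: int = 2):
    vad = webrtcvad.Vad(aggressiveness)                      # 0 (least) .. 3 (most aggressive)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2     # 16-bit mono PCM
    flags = []
    for start in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
        frame = pcm16[start:start + frame_bytes]
        flags.append(vad.is_speech(frame, sample_rate))      # True = speech frame
    return flags

if __name__ == "__main__":
    silence = b"\x00\x00" * 16000                            # 1 s of all-zero samples at 16 kHz
    print(sum(vad_flags(silence)), "frames flagged as speech")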
2.3 Segmentation

In a speaker diarization system, speech segmentation breaks the input audio stream into multiple segments so that each segment can be assigned a speaker label. The simplest method of segmentation is uniform segmentation, in which the audio input is segmented with a consistent window length and overlap length. The window must be sufficiently short to safely assume that it does not contain multiple speakers, but at the same time long enough to capture enough acoustic information (usually from 1 to 2 seconds).

A more complex method is speaker change point detection, in which speaker change points are detected by comparing two hypotheses: hypothesis H0, assuming both the left and right samples are from the same speaker, and hypothesis H1, assuming the two samples are from different speakers. Some notable approaches are the Kullback-Leibler 2 (KL2) algorithm [46], the Generalized Likelihood Ratio (GLR) [47], and the Bayesian Information Criterion (BIC) [48][49].
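A minimal sketch of uniform segmentation over the speech regions returned by the VAD is given below; the 1.5 s window and 0.75 s shift are illustrative values within the 1-2 second range mentioned above.

# Minimal sketch of uniform segmentation: cut each speech region returned by
# the VAD into fixed-length, overlapping sub-segments. The 1.5 s window and
# 0.75 s shift are illustrative values.
def uniform_segments(speech_regions, window=1.5, shift=0.75):
    """speech_regions: list of (start, end) times in seconds."""
    segments = []
    for start, end in speech_regions:
        t = start
        while t + window <= end:
            segments.append((t, t + window))
            t += shift
        if not segments or segments[-1][1] < end:            # keep the leftover tail
            segments.append((max(start, end - window), end))
    return segments

# Example: one 4-second speech region -> overlapping 1.5 s segments.
print(uniform_segments([(0.0, 4.0)]))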
2.4 Speaker Representations

Speaker representations play a critical role in measuring the similarity of speech segments, based on which the segments are classified into a known or unknown number of speakers.

While features such as MFCCs or MFBCs are discriminative enough for speech recognition, they are considered too noisy for speaker diarization. In order to overcome this limitation, numerous studies have been carried out.

From 2010 to 2015, the dominant approach was to train a probabilistic model (e.g. a Gaussian Mixture Model-Universal Background Model, GMM-UBM) to extract speaker representations in a new low-dimensional, speaker- and channel-dependent space. A probabilistic linear discriminant analysis (PLDA) [50] could also be trained to further improve the scoring stage. Those representations are commonly referred to as I-Vectors [51][52][53][54].

Since late 2015, deep learning has emerged as the dominant approach for this task. The main concept is to train a deep neural network (DNN) to classify all speakers in a data set and then, at the testing stage, use its bottleneck features as a speaker representation. That year, Heigold et al. proposed an end-to-end text-dependent speaker verification system [55] that learns speaker embeddings (commonly known as D-Vectors) based on the cosine similarity. It was developed to handle variable-length input in a text-independent verification task through a temporal pooling layer and data augmentation. The model was trained entirely on Google's proprietary datasets. D-Vectors were later improved by a long short-term memory (LSTM) [56] network and a triplet loss function [57].

In 2017, a group of researchers from Johns Hopkins University proposed a modified version of D-Vectors trained on smaller, publicly available datasets and pre-processed with a different strategy [58]. In 2018, by exploiting data augmentation, they further improved their speaker representations and referred to these representations as X-Vectors [59].
2.4.1 X-Vector Embeddings

X-Vectors are bottleneck features of a deep neural network trained to classify a large number of speakers. The training data is divided into small batches of K speakers and M speech segments (each of which must have more than T frames). The loss function (multi-class cross entropy) is as follows:

E = -\sum_{n=1}^{N} \sum_{k=1}^{K} d_{nk} \ln P\left(\mathrm{spk}_k \mid x^{(n)}_{1:T}\right)    (2.2)

where:

• P(spk_k | x^{(n)}_{1:T}) is the probability of speaker k given the T input frames x^{(n)}_1, x^{(n)}_2, ..., x^{(n)}_T of segment n.
• d_{nk} is 1 if the speaker label for segment n is k, and is 0 otherwise.

The network operates at 2 levels, frame level and segment level, connected by a statistics pooling layer (as shown in figure 2.4). This multi-level structure allows the DNN to be trained with segments of different lengths. Hence, the training data is better utilized and the extracted X-Vector is more robust against the variance of segment length.

Figure 2.4: Diagram of the X-Vectors DNN (adapted from [58]).
2.4.1.1 Frame Level

At frame level, the network is essentially a time-delayed neural network (TDNN) [60] with sub-sampling. The default configuration is shown in figure 2.5: the input features are 30 MFCCs extracted from frames of 25ms with overlaps of 10ms. The TDNN has 5 layers with different context specifications. Layers 3, 4 and 5 are fully connected. To account for the lack of context at the first and last frames, speech segments are padded at both ends.

Figure 2.5: Diagram of X-Vectors' frame-level TDNN with sub-sampling (as configured in [59]). Layer contexts:
Layer 1 (dim=512): input context [-2,2], with sub-sampling [-2,2]
Layer 2 (dim=512): input context [-2,2], with sub-sampling {-2,0,2}
Layer 3 (dim=512): input context [-3,3], with sub-sampling {-3,0,3}
Layer 4 (dim=512): input context {0}, with sub-sampling {0}
Layer 5 (dim=1500): input context {0}, with sub-sampling {0}

2.4.1.2 Segment Level

At segment level, the network is a fully-connected feed-forward DNN with a pooled input. Figure 2.6 demonstrates the default setup: all frame-level outputs h_t (t = 1, ..., T) (layer 5 of the TDNN) are aggregated to compute the mean \mu and standard deviation \sigma:

\mu = \frac{1}{T} \sum_{t}^{T} h_t    (2.3)

\sigma = \sqrt{\frac{1}{T} \sum_{t}^{T} h_t \odot h_t - \mu \odot \mu}    (2.4)
Both of those statistics are concatenated together into one vector that represents the whole segment. This vector is then passed through 2 fully-connected layers, each of which has a rectified linear unit (ReLU) [61]. At last, the output layer (a log-softmax classifier) gives a probability distribution over all speakers in the training data.

Figure 2.6: Diagram of X-Vectors' segment-level DNN (as configured in [59]): statistics pooling (dim=3000), layer 6 (dim=512), layer 7 (dim=512), output (dim = total number of speakers).

As mentioned earlier, X-Vectors are essentially bottleneck features of a DNN. In this network configuration, they could be the output of layer 6 or layer 7. However, the latter is selected since it is proven experimentally to perform better in the speaker identification task [59].
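The statistics pooling of equations (2.3) and (2.4) can be expressed in a few lines of PyTorch. The sketch below mirrors the idea (mean and standard deviation over frames, concatenated), not Kaldi's exact implementation; the batch shapes are illustrative.

# Minimal PyTorch sketch of the statistics pooling layer in equations (2.3)-(2.4).
import torch

def statistics_pooling(h: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """h: (batch, T, C) frame-level activations -> (batch, 2*C) pooled vector."""
    mean = h.mean(dim=1)                                   # equation (2.3)
    var = (h * h).mean(dim=1) - mean * mean                # E[h^2] - E[h]^2
    std = torch.sqrt(torch.clamp(var, min=eps))            # equation (2.4)
    return torch.cat([mean, std], dim=1)

# Example: 3 segments of 200 frames with 1500-dim frame features -> (3, 3000),
# matching the 3000-dim pooled vector fed to layer 6.
pooled = statistics_pooling(torch.randn(3, 200, 1500))
print(pooled.shape)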
  • 41. 2.4. Speaker Representations 17 Conv1D (k=5, d=1) (+ ReLU + BatchNorm) SE-Res2Block (k=3, d=2) SE-Res2Block (k=3, d=3) SE-Res2Block (k=3, d=4) Conv1D (k=1, d=1) (+ ReLU) + Attentive Stat Pooling (+ BatchNorm) C T x C T x 80 T x Fully Connected (+ BatchNorm) Additive Angular Margin Softmax 1536 T x C T x 3 C T x ( x ) 2 1536 T x x 192 1 x num. speakers 1 x INPUT OUTPUT Frame Level Segment Level F 2.7: Complete network architecture of ECAPA-TDNN (adapted IGURE from [ ]). 62 2.4.2.1 Frame-level At frame-level, ECAPA-TDNN network consists of 1D Convolutional layers (with ReLU and optional batch normalization), and 1D Squeeze-and-Excitation blocks. The network also utilizes residual connections at high-level to reduce the effect of vanishing gradients. 2.4.2.1.1 1D Convolutional Layer In a 1D convolution layer, instead of sliding along two dimensions as in well-known CNNs in image processing [ ][ ][ ], a kernel of size k and dilation d slides along 68 69 70 dimension of time frames. Figure demonstrate how a 1D Convolutional (1D-Conv) 2.8 kernel works. Where: • k denotes kernel size. • d denotes dilation spacing (i.e: if d is larger than 1 the 1D-Conv layer is dilated).
  • 42. 18 Chapter 2. Speaker Diarization System • c denotes number of channels (i.e: number of extracted feature coefficients, e.g: 40 MFCCs) frames d: dialation spacing k: kernel size c: number of channels F 2.8: Kernel sliding across speech frames in a dilated 1D-CNN IGURE layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {3,0,3}. 2.4.2.1.2 1D Squeeze-and-Excitation Block For computer vision tasks, Squeeze-and-Excitation blocks [ ] has proven to be very ef- 66 fective in improving channel inter-dependencies at low computational cost. In ECAPA- TDNN architecture, this approach help re-scaling the frame-level features basing on global properties of the signal sample. A 1D-Squeeze-and-Excitation block consists of 3 components, as shown in figure : 2.9 C: number of channels Fsqueeze( ) · F excitation( , ) · W Fscale( , ) · · T frames ht z s C T frames F 2.9: A 1D-Squeeze-and-Excitation block. Different colors repre- IGURE sent different scales for channels.
  • 43. 2.4. Speaker Representations 19 • Squeeze operation, where frame-wise mean vector is calculated from inputs: z z = 1 T T ∑ t ht (2.5) • Excitation operation, where is used to calculate channel-wise scale vector through z two bottle-neck fully connected layers that generate outputs of the same dimension as inputs’: s W = ( σ 2 f (W1z b + 1)+b 2) (2.6) where denotes the sigmoid function; denotes a non-linearity (e.g: a ReLU), σ( ) · f( ) · Wk and b k denotes the learnable weight and bias of bottle-neck fully connected layer k. • Scale operation, where the original input frames are scaled with : s h̃t c , = s t c , ht c , (2.7) 2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block In 2019, Shang-Hua Gao et. all proposed Res2Net [ ], a multi-scale backbone network 65 for computer vision tasks, based on ResNet [ ]. This computer In ECAPA-TDNN, 64 Res2Net is integrated with a SE-Block, forming a Res2Net-with-Squeeze-Excitation (SE-Res2Net) block, to benefit from residual connections (i.e: to reduce vanishing gra- dients) why keeping the number of parameters at a reasonable figure. Under this setup, the number of channel subsets corresponds to number of intermediate 1D-COnv blocks, and thus, may increase the number of parameters. The SE-Block is then used to amplify attention to channels while adding only a small number of parameters. Figure 2.10 visualizes a SE-Res2Net block. 2.4.2.2 Segment-level At segment level, ECAPA-TDNN employs a soft multi-head self-attention [ ] model to 67 calculate weighted statistics at the pooling layer, which accounts for signal samples of varied lengths. The statistic outputs (weighted mean and weighted standard deviation) are then concatenated together. The result vector is propagated through a single fully- connected layer with batch normalization, and then, the final layer, Additive Angular
  • 44. 20 Chapter 2. Speaker Diarization System C channels T frames Conv1D + + T T subset concat T + Conv1D Conv1D SE Block INPUT OUTPUT s subsets; C/s channels each Conv1D C Conv1D C F 2.10: A Res2Net-with-Squeeze-Excitation Block. IGURE Margin Softmax (AAM-Softmax) [ ] layer. The final output is a N-dimension vector 71 where N is the total number of speakers in the training set (i.e: to classify N speakers in the training set). 2.4.2.2.1 Attentive Statistical Pooling In 2019, Okabe et al. proposed using attention mechanism to give different weights to different frames in the signal sample to calculate weighted mean and weighted standard deviation at X-Vectors’ pooling layer [ ]. In ECAPA-TDNN, the attention mechanism 72 is extended further: not only on time frames, but also on channels: The raw scalar channel-and-frame-wise score et c , and its normalized value αt c , , is calcu- lated as follow: et c , = v T c f (Wht + )+ b k c (2.8) αt c , = exp(et c , ) ∑T τ exp(eτ,c ) (2.9) where: • ht are the activations of the last layer frame at time step . t
  • 45. 2.4. Speaker Representations 21 C: number of channels Attention Model Fscale( , ) · · T frames ht C T frames A F 2.11: Attentive Statistics Pooling (on both time frames and chan- IGURE nels). • W ∈ R R×C and b ∈ R R×1 project the activation into a representation of smaller dimension (R). This projection is shared across all C channels. • f ( ) · denotes a non-linearity. • vc ∈ RR×C and kc transform the output of the non-linearity to a channel- f ( ) · dependent scalar score. The normalized score αt c , is then used to calculate weighted mean and weighted standard deviations vectors as follow: µ̃c = T ∑ t αt c , ht c , (2.10) σ̃c = s T ∑ t αt c , h2 t c , − µ̃2 c (2.11) where µ̃c and σ̃c are respectively the channel components of weighted mean and weighted standard deviation vectors µ̃ and σ̃. Moreover, the temporal context of the pooling layer is expanded by making the self- attention to look at global properties of the signal sample (e,g: to account for noise and recording conditions). The local input ht of equation are concatenated with the 2.8 non-weighted mean and non-weighted standard deviation of ht itself.
  • 46. 22 Chapter 2. Speaker Diarization System 2.5 Clustering 2.5.1 PLDA Scoring After generating the speaker representations for each segment, a clustering algorithm is applied to make clusters of segments. The distance or the similarity between each pair of observations can be computed using a wide variety of techniques, including Euclidean distance [ ], mean squared difference [ ], and cosine similarity [ ]. 73 74 75 Although the similarity metrics can be calculated directly from the pairs of extracted speaker embeddings, purely-statistical data reduction techniques such as Linear Discrim- inant Analysis (LDA) can be employed to further improve the discriminative character- istics of these features with a little computation cost. The LDA transformation can be given as in the following equation: z W = T ∗x (2.12) where: • x = {x i} ∈ RD is the D-dimension input vector. • z = {z i} ∈ RD0 (D0 ≤ D) is the representation of input vector in the new latent space. • W ∈ RDxD0 is a linear transformation matrix. In this case, LDA is formulated as an optimization problem to find a linear transfor- mation that maximize the ratio of the between-class scattering to the within-class W scattering: W = argmax W J( ) = W argmax W trace  WTSBW  trace(WT SWW) (2.13) where: • SW denotes within class scatter matrix SW = ∑n i=1 (xi − µ yi )(xi − µ yi )T . Here {yi} are class labels and µk is the sample mean of the k-th class. SW is positive definite. • SB denotes between-class scatter matrix SB = ∑m k=1 nk(µ k− µ µ )( k − µ) T . Here m is the number of classes, is the overall sample mean, and µ nk is the number of samples in the k-th class. SB is positive semi-definite.
  • 47. 2.5. Clustering 23 F 2.12: An example of LDA transformation from 2D to 1D (taken IGURE from [ ]). 76 However, LDA is a deterministic algorithm that works only well on seen data. This is not ideal in case of real speaker diarization applications, where enrolled or questioned speakers are not included in the training data set. Therefore, a probabilistic version of LDA, namely Probabilistic Linear Discriminant Analysis (PLDA) [ ][ ], is employed to take advantages of LDA while dealing with 77 78 unseen classes. Essentially, PLDA is formulated upon LDA by representing both the sample mean of the class and the data between class itself with separate distributions. The chosen distribution is usually a mixture of Gaussian distributions, i.e a Gaussian Mixture Model (GMM) [ ]. Let is a latent class variable representing the mean of a 41 y class within the GMM, then probability of generating same given class mean and the x y prior probability of class mean in the same space are given by the following equations: y P N ( ) = x y | (x y | ,S W) (2.14) P N m ( ) = y (y| ,S B) (2.15) where • SW and SB respectively denote the within-class and between-class scatter matrices as seen in equation . 2.13 • N(x y | ,SW) is a multivariate Gaussian distribution with mean and variance y SW.
  • 48. • $\mathcal{N}(y \mid m, S_B)$ is a multivariate Gaussian distribution with mean $m$ and variance $S_B$.
  As proven in [77], $S_W$ and $S_B$ can be diagonalized as follows:
  $$V^{T} \Phi_w V = I \qquad (2.16)$$
  $$V^{T} \Phi_b V = \Psi \qquad (2.17)$$
  and, by defining $A = V^{-T}$, they can be rewritten as:
  $$S_W = A I A^{T} \qquad (2.18)$$
  $$S_B = A \Psi A^{T} \qquad (2.19)$$
  Then, from equation 2.14:
  $$P(x \mid y) = \mathcal{N}(x \mid y, S_W) = \mathcal{N}\!\left(x \mid m + A v,\, A A^{T}\right) = m + A * \mathcal{N}(u \mid v, I) \qquad (2.20)$$
  and from equation 2.15:
  $$P(y) = \mathcal{N}(y \mid m, S_B) = \mathcal{N}\!\left(y \mid m,\, A \Psi A^{T}\right) = m + A * \mathcal{N}(v \mid 0, \Psi) \qquad (2.21)$$
  Let $u$ and $v$ be Gaussian random variables in the latent space:
  $$u \sim \mathcal{N}(\cdot \mid v, I) \qquad (2.22)$$
  $$v \sim \mathcal{N}(\cdot \mid 0, \Psi) \qquad (2.23)$$
  then the relationship between $x$, $y$ and the latent variables $u$, $v$ can be written as follows:
  $$y = m + A v \qquad (2.24)$$
  $$x = m + A u \qquad (2.25)$$
  • 49. The unknown parameters of PLDA are the mean $m$, the covariance matrix $\Psi$, and the loading matrix $A$. All of these parameters are learnable using the maximum likelihood method. Figure 2.13 demonstrates the training process of the PLDA model.
  Figure 2.13: Fitting the parameters of the PLDA model (taken from [77]).
  The PLDA score $R$ between two given vectors $u_1, u_2$ in the latent space is calculated by taking the log of the likelihood ratio based on two hypotheses: whether both of the vectors belong to the same class, or not. $R$ is given as:
  $$\text{Score} = \log\left(R(u_1, u_2)\right) = \log\frac{P(u_1, u_2)}{P(u_1)\, P(u_2)} = \log\frac{\int P(u_1 \mid v)\, P(u_2 \mid v)\, P(v)\, dv}{P(u_1)\, P(u_2)} \qquad (2.26)$$
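For intuition, equation 2.26 has a closed form once the embeddings have been mapped into the latent space of equations 2.22-2.25 (i.e. $u = A^{-1}(x - m)$) and $\Psi$ is diagonal: under the same-speaker hypothesis the two latent vectors share the class variable $v$, which couples their joint covariance. The sketch below evaluates the two hypotheses directly with Gaussian densities; it is a hedged illustration with a hypothetical function name, not the Kaldi PLDA scorer used later in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal


def plda_score(u1: np.ndarray, u2: np.ndarray, psi: np.ndarray) -> float:
    """Log-likelihood ratio of eq. (2.26) for the simplified PLDA model:
    u | v ~ N(v, I), v ~ N(0, diag(psi)), with u already in the latent space."""
    d = len(psi)
    within = np.eye(d)
    between = np.diag(psi)
    # Same-speaker hypothesis: u1 and u2 share v, so their joint covariance
    # contains the between-class term in the off-diagonal blocks.
    joint_cov = np.block([[between + within, between],
                          [between, between + within]])
    log_same = multivariate_normal.logpdf(np.concatenate([u1, u2]),
                                          mean=np.zeros(2 * d), cov=joint_cov)
    # Different-speaker hypothesis: the two vectors are independent.
    log_diff = (multivariate_normal.logpdf(u1, mean=np.zeros(d), cov=between + within)
                + multivariate_normal.logpdf(u2, mean=np.zeros(d), cov=between + within))
    return float(log_same - log_diff)
```

A higher score means the two embeddings are more likely to come from the same speaker; collecting the scores for all sub-segment pairs yields the affinity matrix used by the clustering stage below.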
  • 50. 2.5.2 Agglomerative Hierarchical Clustering
  Agglomerative hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge these two most similar clusters. This iterative process continues until all the clusters are merged together (figure 2.15). The clustering process can be stopped once a defined number of clusters or a decision threshold is reached. The main output of hierarchical clustering is a dendrogram [79], which shows the hierarchical relationship between the clusters (figure 2.16); a minimal SciPy-based sketch of this procedure is given after the figures below.
  Figure 2.14: Agglomerative hierarchical clustering flowchart.
  • 51. Figure 2.15: An example iterative process of agglomerative hierarchical clustering (taken from [80]).
  Figure 2.16: Visualization of the result of hierarchical clustering (taken from [80]).
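The sketch below clusters sub-segments from a pre-computed pairwise similarity matrix (for example the PLDA affinity matrix) using SciPy's hierarchical clustering utilities. Function and parameter names are illustrative assumptions; the linkage rule and threshold handling are one reasonable choice, not necessarily the configuration used in this thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_segments(similarity: np.ndarray, n_speakers: int = None,
                     threshold: float = 0.0) -> np.ndarray:
    """Toy AHC over sub-segments given a symmetric similarity matrix
    (higher = more similar). Returns a speaker label (1..K) per sub-segment."""
    # Turn similarities into distances and condense them for scipy.
    distance = np.max(similarity) - similarity
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")   # build the dendrogram
    if n_speakers is not None:
        # Known number of speakers: cut the dendrogram into exactly K clusters.
        return fcluster(tree, t=n_speakers, criterion="maxclust")
    # Unknown number of speakers: cut at a pre-defined similarity threshold,
    # converted here into the corresponding distance.
    return fcluster(tree, t=np.max(similarity) - threshold, criterion="distance")
```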
  • 53. Chapter 3
  Experiments
  3.1 Evaluation Metrics
  3.1.1 Equal Error Rate and Minimum Decision Cost Function
  Equal error rate or crossover error rate (EER or CER) is the rate at which the acceptance and rejection errors are equal. In order to find the EER of a given system, an EER plot is created through the following steps:
  • Calculate the false acceptance rate (FAR - i.e. false positive speaker identification, or false alarm) and the false rejection rate (FRR - i.e. false negative speaker identification, or missed detection) for a set of decision thresholds $t$:
  $$\text{FAR}_t = \frac{\text{Number of False Acceptances}_t}{\text{Number of Identification Attempts}} \qquad (3.1)$$
  $$\text{FRR}_t = \frac{\text{Number of False Rejections}_t}{\text{Number of Identification Attempts}} \qquad (3.2)$$
  • Plot FAR and FRR against the decision threshold $t$. The EER is the y-value of the intersection of those lines. An example of an EER plot is shown in figure 3.1.
  As the decision threshold (i.e. sensitivity) increases, the false alarms drop while the missed detections rise: the configured system becomes more secure by reducing the possibility of acceptance. Conversely, when the decision threshold is lowered, the system is less secure against impostors.
  • 54. Figure 3.1: An EER plot.
  EER was originally used as the main evaluation metric for speaker identification systems. In a traditional speaker diarization system, it is also used to effectively estimate the discriminative characteristics of the speaker embedding extraction technique, since this metric is not affected by other modules such as clustering or resegmentation. The lower the EER, the better the system performs.
  An important metric usually coupled with EER is the Minimum Decision Cost Function (MinDCF), representing the minimum value of the linear combination of the false alarms and missed detections over all thresholds (a toy computation of both metrics is sketched after the list below):
  $$\text{MinDCF} = \min_{t}\left((1 - p) * \text{FAR}_t + p * \text{FRR}_t\right) \qquad (3.3)$$
  where:
  • $t$ is the decision threshold.
  • $\text{FAR}_t$ and $\text{FRR}_t$ are calculated as in equations 3.1 and 3.2.
  • $p$ is the prior probability of the enrolled entity. Common values for $p$ are 0.01 and 0.001.
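The following NumPy sketch follows equations 3.1-3.3 literally, sweeping every observed score as a decision threshold. It is a simplified illustration (the function name is hypothetical, and it keeps the thesis form of the cost function, in which the two error types are weighted only by the prior).

```python
import numpy as np


def eer_and_min_dcf(scores: np.ndarray, labels: np.ndarray, p_target: float = 0.01):
    """Compute EER and MinDCF from verification scores.
    labels: 1 for "target" pairs, 0 for "nontarget" pairs."""
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # false acceptance rate (eq. 3.1)
        frrs.append(np.mean(~accept[labels == 1]))   # false rejection rate (eq. 3.2)
    fars, frrs = np.array(fars), np.array(frrs)
    # EER: the point where the FAR and FRR curves cross.
    idx = np.argmin(np.abs(fars - frrs))
    eer = (fars[idx] + frrs[idx]) / 2
    # MinDCF: minimum prior-weighted error over all thresholds (eq. 3.3).
    min_dcf = np.min((1 - p_target) * fars + p_target * frrs)
    return eer, min_dcf
```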
  • 55. The lower the EER and MinDCF, the better a speaker verification system performs.
  3.1.2 Diarization Error Rate
  Diarization Error Rate (DER) is the most widely used metric for speaker diarization. It is measured as the fraction of time that is not attributed correctly to a speaker or to non-speech, calculated as in equation 3.4:
  $$\text{DER} = \frac{T_{\text{FalseAlarm}} + T_{\text{Miss}} + T_{\text{Confusion}}}{T_{\text{Scored}}} \qquad (3.4)$$
  where:
  • $T_{\text{Scored}}$ is the total duration of the recording without overlapped speech.
  • $T_{\text{FalseAlarm}}$ is the scored time during which a hypothesized speaker is labelled as non-speech in the reference.
  • $T_{\text{Miss}}$ is the scored time during which a hypothesized non-speech segment corresponds to a reference speaker segment.
  • $T_{\text{Confusion}}$ is the scored time during which a speaker ID is assigned to the wrong speaker.
  The lower the DER, the better a speaker diarization system performs.
  3.2 Frameworks
  3.2.1 Kaldi
  Kaldi is a toolkit originally written in C++, Perl, Shell and Python for speech recognition, speaker recognition, and many other tasks. Kaldi used to be the framework of choice among speech processing researchers. In addition to its flexibility and high performance, Kaldi is enriched by many reproducible state-of-the-art recipes from researchers around the world. Some noteworthy features of Kaldi include:
  • Code-level integration with Finite State Transducers (FSTs).
  • 56. • Extensive linear algebra support: both BLAS and LAPACK are supported.
  • Extensible design: the algorithms are provided in the most generic form possible.
  • Open license: the code is licensed under Apache 2.0, one of the least restrictive licenses available.
  • Complete recipes: recipes for building complete speech recognition systems are included. These work with widely available datasets such as those provided by the Linguistic Data Consortium (LDC).
  Figure 3.2: Kaldi logo.
  Figure 3.3: Kaldi general architecture diagram.
  3.2.2 SpeechBrain
  SpeechBrain [81] is an open-source, all-in-one conversational AI toolkit based on PyTorch. Its main purpose is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems
  • 57. for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. SpeechBrain provides the implementation and experimental validation of both recent and long-established speech processing models with state-of-the-art or competitive performance on a variety of tasks (table 3.1).
  Table 3.1: List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]).
  3.2.3 Kal-Star
  Kal-Star [82], developed by the author while working at Viettel Cyberspace Center, is a Shell/Python library wrapping around Kaldi and SpeechBrain to streamline data pre-processing and training processes. Kal-Star provides a wide variety of tools to prepare, train and test data, mostly for the speaker verification and speaker diarization tasks (Figure 3.4 and Figure 3.5). Kal-Star inherits the file-based data indexing from Kaldi, which treats a given data set as a folder of spk2utt, utt2spk, wav.scp and (if the data set is segmented with a VAD) segments files. Further information can be added to the folder later (e.g. the diarization result RTTM is added once the speaker diarization process is done).
  • 58. Figure 3.4: Filtering the VTR_1350 data set by utterance duration and number of utterances per speaker.
  Figure 3.5: Generating 200 5-way conversations from the VTR_1350 data set. The max. and min. numbers of utterances picked from each conversation are 2 and 30 respectively.
  3.3 Data Sets
  In this thesis, three main Vietnamese data sets are used:
  • IPCC_110000: split for training and testing. The test split is then used directly for the speaker verification task, and used to generate mock conversations for the speaker diarization task.
  • VTR_1350 and ZALO_400: used directly for the speaker verification task, and also used to generate mock conversations for the speaker diarization task. These data sets are not used in training.
  • 59. 3.3.1 IPCC_110000
  The IPCC_110000 data set consists of 1046.37 hours of audio in a telephone environment from approximately 110000 Vietnamese speakers. Data are recorded at the Viettel Customer Service IP Contact Center (IPCC) and sampled at 8000 Hertz. Most recorded utterances are from 2 to 6 seconds in length, while each speaker has from 1 to 10 utterances, making up from 10 to 60 seconds of speech. The spoken topics revolve around technical difficulties that Viettel's customers meet in using mobile and internet services, as well as everyday questions about common knowledge, the weather, sport results, or lottery results. Table 3.2 gives an overview of this data set and figure 3.6 demonstrates how the data is distributed.
  Table 3.2: IPCC_110000 data set overview.
  Data set: IPCC_110000 | Base sample rate: 8000 Hz | Environment: Telephone | # Speakers: 112837 | # Utterances: 919608 | Total duration: 1046.4 hours
  Figure 3.6: IPCC_110000 data distributions.
  Each customer and agent has his/her own recording channel, and it is assumed that each recording has only one speaker. In reality, a telephone conversation between a customer and an agent can be interfered with by other customers or other agents joining the conversation. The latter case happens much less often than the former, since the IPCC is designed in such a way that agents have good sound isolation, and even if the customer's issue is passed to other agents, these agents would have their own recording channels. As for the case where more than one customer joins the conversation, it was observed
  • 60. that in 1000 randomly chosen conversations, only about 12 have more than one speaker (e.g. an infant interrupting a parent who is on a call with an IPCC agent). The ratio is only about 1.2 percent, and thus this case can be mitigated with little negative effect on the discriminative characteristics of the trained embedding extractor.
  IPCC_110000 is randomly split into 3 subsets: train, dev, and test sets. Each of the test and dev sets has 2000 speakers and about 20 hours of data. The train set contains the remaining data. Table 3.3 displays the number of speakers, number of utterances and total duration of each subset.
  Table 3.3: IPCC_110000 data subsets.
  Split: train / dev / test
  # Speakers: 108837 / 2000 / 2000
  # Utterances: 886172 / 16975 / 16461
  Total duration (hours): 1005.1 / 20.94 / 20.39
  3.3.1.1 IPCC_110000 Verification Test Set
  Generated from the IPCC_110000 test split, the verification test set has 3888 verification pairs, generated by the following steps:
  • Step 1: With K = 3, for each speaker randomly pick K * 2 utterances from this speaker's utterance pool to generate K verification pairs. The labels are "target", meaning the utterances in each pair are from the same speaker.
  • Step 2: Randomly pick K = 3 other utterances from the utterance pool of the selected speaker at step 1 (if there are fewer than K utterances left in the pool, discard all picked verification pairs and skip to step 3), and K = 3 other utterances from the utterance pools of all other speakers, to generate K more verification pairs. The labels are "nontarget", meaning the utterances in each pair are not from the same speaker. An example result after step 2:
  IPCC-131047_hanoi-3166 IPCC-131047_hanoi-1961 target
  IPCC-131047_hanoi-1319 IPCC-131047_hanoi-3657 target
  IPCC-131047_hanoi-2303 IPCC-131047_hanoi-1015 target
  • 61. IPCC-131047_hanoi-2626 IPCC-203322_hanoi-2214 nontarget
  IPCC-131047_hanoi-0582 IPCC-203268_hanoi-3113 nontarget
  IPCC-131047_hanoi-1679 IPCC-203268_hanoi-5260 nontarget
  • Step 3: Go back to step 1, until all speakers have been considered for enrollment.
  3.3.1.2 IPCC_110000 Diarization Test Set
  Since original conversations from IPCC_110000 are shuffled by channels due to internal policies that protect customer privacy, mock conversations generated from utterances of distinct speakers are used instead. Furthermore, to expand the scope of the diarization beyond 2-way conversations to conversations with more speakers, mock 3-way, 4-way, and 5-way conversations are also generated. From the test split of IPCC_110000, the following data sets are generated:
  • IPCC_110000-TEST-MOCK_200_2: 200 2-way conversations.
  • IPCC_110000-TEST-MOCK_200_3: 200 3-way conversations.
  • IPCC_110000-TEST-MOCK_200_4: 200 4-way conversations.
  • IPCC_110000-TEST-MOCK_200_5: 200 5-way conversations.
  Each of the N conversations in each subset is generated by the following steps:
  • Step 1: Choose a number of speakers S.
  • Step 2: Randomly pick a speaker from the speaker pool of the whole data set.
  • Step 3: For the picked speaker s_j, randomly pick u_j (where u_j is a random integer in the range [u_min, u_max]) utterances from that speaker's utterance pool. If there are not enough utterances in the speaker's pool, discard the picked utterances and go back to step 2.
  • Step 4: Go back to step 2 until utterances have been picked for all S speakers.
  • 62. • Step 5: Shuffle the list of picked utterances, then concatenate the utterances together (with in-between silences of durations randomly chosen between 0.2 and 1.0 seconds) into a single audio file. The conversation ID, oracle timestamps and speaker information are bundled into an accompanying RTTM file.
  The four mock conversation subsets generated from the IPCC_110000 test split mentioned above are generated with N = 200, S ∈ {2, 3, 4, 5}, u_min = 2 and u_max = 20.
  3.3.2 VTR_1350
  VTR_1350 is a data set consisting of 491.1 hours of wide-band recordings, originally sampled at 16000 Hz, recorded by a selected group of 1346 broadcasters using written transcripts (as shown in table 3.4). The recording environment is significantly less noisy than IPCC_110000's, and thus it can be considered a clean set. Most recorded utterances are from 3 to 10 seconds in length, while each speaker has from 200 to 300 utterances, making up from 10 to 33 minutes of speech. The topics are daily news topics, from politics and health care to sports. Although the number of speakers is much smaller than IPCC's, the content of the recordings is much more diverse and the durations are much longer. However, VTR_1350 is significantly less natural than IPCC_110000 as conversational data, due to the mentioned fact that it is recorded with planned transcripts in a controlled environment. Table 3.4 gives an overview of this data set and figure 3.7 demonstrates how the data is distributed.
  Table 3.4: VTR_1350 data set overview.
  Data set: VTR_1350 | Base sample rate: 16000 Hz | Environment: Controlled recording | # Speakers: 1346 | # Utterances: 318599 | Total duration: 491.1 hours
  VTR_1350 is down-sampled to 8000 Hz and used exclusively for testing in both the speaker verification and speaker diarization tasks.
  • 63. Figure 3.7: VTR_1350 data distributions.
  3.3.2.1 VTR_1350 Verification Test Set
  From the VTR_1350 utterance list, a verification list of 7902 pairs is generated using the same method described in section 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
  VTR_1350-yenvth4-071303 VTR_1350-yenvth4-049030 target
  VTR_1350-yenvth4-047494 VTR_1350-yenvth4-053126 target
  VTR_1350-yenvth4-067463 VTR_1350-yenvth4-001415 target
  VTR_1350-yenvth4-050310 VTR_1350-119706-073293 nontarget
  VTR_1350-yenvth4-105862 VTR_1350-oanhdtv-029068 nontarget
  VTR_1350-yenvth4-001159 VTR_1350-tungdhd-076747 nontarget
  3.3.2.2 VTR_1350 Diarization Test Set
  VTR_1350 is a data set of 1-way conversations. Hence, to make use of this data set in the speaker diarization task, four mock conversation subsets are generated from the data set itself, based on the method and configuration described in section 3.3.1.2 (a condensed sketch of that procedure follows the list below):
  • VTR_1350-MOCK_200_2: 200 2-way conversations.
  • VTR_1350-MOCK_200_3: 200 3-way conversations.
  • VTR_1350-MOCK_200_4: 200 4-way conversations.
  • VTR_1350-MOCK_200_5: 200 5-way conversations.
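As an illustration, the mock-conversation procedure of section 3.3.1.2, applied here to VTR_1350, can be condensed into the sketch below. The helper name, the soundfile-based audio I/O and the exact RTTM field layout are assumptions, and all utterances are assumed to already be at the target sample rate.

```python
import random
import numpy as np
import soundfile as sf  # assumed audio I/O library


def make_mock_conversation(utts_by_speaker: dict, conv_id: str, n_speakers: int,
                           u_min: int = 2, u_max: int = 20, sample_rate: int = 8000):
    """utts_by_speaker maps a speaker ID to a list of utterance wav paths."""
    # Steps 1-4: pick speakers and a random number of utterances per speaker.
    speakers = random.sample(list(utts_by_speaker), n_speakers)
    picked = []
    for spk in speakers:
        pool = utts_by_speaker[spk]
        u_j = random.randint(u_min, u_max)
        if len(pool) < u_j:
            return None  # a full run would re-pick the speaker instead
        picked += [(spk, path) for path in random.sample(pool, u_j)]
    # Step 5: shuffle, concatenate with random silences, and record RTTM entries.
    random.shuffle(picked)
    audio, rttm, cursor = [], [], 0.0
    for spk, path in picked:
        wav, _ = sf.read(path)
        duration = len(wav) / sample_rate
        silence = np.zeros(int(random.uniform(0.2, 1.0) * sample_rate))
        audio += [wav, silence]
        rttm.append(f"SPEAKER {conv_id} 1 {cursor:.2f} {duration:.2f} <NA> <NA> {spk} <NA> <NA>")
        cursor += duration + len(silence) / sample_rate
    sf.write(f"{conv_id}.wav", np.concatenate(audio), sample_rate)
    return rttm
```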
  • 64. 3.3.3 ZALO_400
  ZALO_400, issued by the ZALO AI Challenge 2020 [83], is a wide-band data set consisting of 8.7 hours of recordings, sampled at 48000 Hz, recorded by a selected group of 400 broadcasters using planned transcripts. The content and recording environment quite resemble those of VTR_1350, while the data distributions are quite different from VTR_1350's. Most utterances are from 4 to 12 seconds, while most speakers contribute from 15 to 40 utterances, making up from 1 to 2 minutes of speech. ZALO_400 was originally released as the train data set for the challenge. However, within the scope of this thesis, it is used exclusively for testing both the speaker verification and speaker diarization tasks. Table 3.5 gives an overview of this data set and figure 3.8 demonstrates how the data is distributed.
  Table 3.5: ZALO_400 data set overview.
  Data set: ZALO_400 | Base sample rate: 48000 Hz | Environment: Controlled recording | # Speakers: 400 | # Utterances: 10555 | Total duration: 8.699 hours
  Figure 3.8: ZALO_400 data distributions.
  3.3.3.1 ZALO_400 Verification Test Set
  The verification set is generated by the same method described in section 3.3.1.1, with K = 3 (targets/non-targets per speaker). The following are the first 6 lines of the verification list:
  • 65. 424-64 424-35 target
  424-46 424-45 target
  424-31 424-36 target
  424-49 518 nontarget
  424-39 500-12 nontarget
  424-15 514-30 nontarget
  3.3.3.2 ZALO_400 Diarization Test Set
  The diarization test set consists of four mock conversation subsets generated from the ZALO_400 data set, including:
  • ZALO_400-MOCK_200_2: 200 2-way conversations.
  • ZALO_400-MOCK_200_3: 200 3-way conversations.
  • ZALO_400-MOCK_200_4: 200 4-way conversations.
  • ZALO_400-MOCK_200_5: 200 5-way conversations.
  These conversations are generated by the same method and configuration described in section 3.3.1.2.
  3.4 Baseline System
  The baseline system consists of three main phases: embeddings extractor training, PLDA backend training and speaker diarization. In addition to these phases, an additional phase (Phase *), speaker verification, is included to further optimize speaker diarization results.
  3.4.1 Speaker Diarization System
  The speaker diarization subsystem takes a recorded conversation as input and lets the user know the number of speakers and the timestamps of their speech within the conversation. Without further recognition, all speakers remain anonymous. The result of this subsystem can be encapsulated into a Rich Transcription Time Marked (RTTM) file [1]. The main computation pipeline, which takes a recorded conversation as input and gives an RTTM file as output, can be described in the following stages, in order:
  • 66. • Stage 1 - Front-end Processing: The input recording is windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-dimensional MFCCs.
  • Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from the input recording. In this process, the recording is sliced into uniform non-overlapping small chunks of 0.03 seconds and WebRTC VAD decides whether each chunk is speech or not. The threshold of this decision, called "aggressiveness", is set to its maximum level - level 4. After that, adjacent chunks recognized as speech are grouped together into a bigger speech segment. If the maximum PCM amplitude of a newly formed speech segment is smaller than 0.05, the segment is discarded. Furthermore, speech segments shorter than 0.2 seconds are also discarded. Finally, each speech segment is padded with 0.05 seconds at its head and at its tail; and if two consecutive speech segments overlap each other and the merged segment of those two is shorter than 15 seconds, they are merged together.
  • Stage 3 - Uniform Segmentation: Each speech partition is further sub-segmented into homogeneous overlapping sub-segments of length L seconds with an overlap of L/2 seconds. The value of L is chosen between 1.5 and 4 seconds; {L : L/2} denotes this segmenting strategy. The features extracted in stage 1 for the speech partitions of stage 2 are mapped onto these sub-segments for the next stage.
  • Stage 4 - Embeddings Extraction: An embeddings extractor is employed to extract an embedding from each sub-segment produced in the last stage. In the baseline system, the extractor is an X-Vectors embeddings extractor, trained on IPCC_110000 data for 3 epochs. The training process is executed using Kaldi's VoxCeleb recipe [84][59]. The data augmentation strategies are slightly changed to the following:
  – Additive Noises: adding a random noise sequence to the input signal with a signal-to-noise ratio (SNR) [85] randomly chosen between 0 and 15.
  – Reverberation: convolving simulated room impulse responses (RIR) [86] with the input signal.
  – Additive Noises and Reverberation: combining the additive noise and reverberation augmentations on a single input signal.
  • 67. – Speed Perturbation: speeding up and slowing down the input signal. To avoid changing the speaker characteristics too much, speed perturbation is restricted to a maximum of ±5%.
  – Waveform dropout: replacing random chunks of the input waveform with zeros.
  – Frequency dropout: filtering the input signal with random band-stop filters to add zeros in the frequency spectrum.
  The output of this stage is a 128-dimensional embedding vector for each sub-segment.
  • Stage 5 - PLDA Scoring: The PLDA back-end, trained on embeddings extracted with the same embeddings extractor from the last stage, is used to score the similarity (a single float value) between each pair of sub-segment embeddings, producing an affinity scoring matrix.
  • Stage 6 - Agglomerative Hierarchical Clustering: Sub-segment embeddings are clustered using the Agglomerative Hierarchical Clustering (AHC) method, in which the distance function directly takes results from the affinity scoring matrix generated in the last stage. This method supports clustering with a known number of speakers or an unknown number of speakers; the latter case requires a pre-defined decision threshold. At the end of this stage, each sub-segment embedding vector, or the sub-segment itself, is tagged with a number from 1 to K - the number of distinguished speakers in the audio input. Each sub-segment's begin time and duration, along with the speaker tag, are recorded in the output Rich Transcription Time Marked (RTTM) file [1].
  3.4.2 Speaker Verification System
  The main purpose of the implemented speaker verification system is to evaluate the discriminative characteristics of the extracted embeddings, without being affected by other modules in a traditional speaker diarization system. In other words, this system is used to improve and optimize the baseline diarization system in terms of speaker representations.
  From a given data set with speaker information, one can generate a verification list - a list of utterance pairs with ground truth values that tell whether each pair is from the same speaker or not.
  • 68. Figure 3.9: Baseline speaker diarization system diagram (Phase 1: embeddings extractor training and Phase 2: PLDA backend training, both on the IPCC_100K train + dev data; Phase 3: speaker diarization with front-end processing, WebRTC VAD, uniform segmentation, X-Vectors embedding extraction, PLDA scoring into an affinity scoring matrix, agglomerative hierarchical clustering and RTTM output).
  A crucial assumption is that in the enrolled or questioned utterances, there is only one participating speaker. The speaker verification system takes the verification list as input and gives a scoring list as output, which assigns a similarity value to each pair in the verification list. The higher the similarity, the higher the chance that the two utterances are from the same speaker. Then, by choosing a decision threshold - the minimum value that the similarity needs to reach for the pair of utterances to be deemed from the same speaker - one can generate a list of predictions. By comparing the verification list with the prediction list, binary classification metrics [87] including False Positive Rate (FPR) and False Negative Rate (FNR) are calculated.
  • 69. Figure 3.10: Baseline speaker verification system diagram (Phase *: speaker verification for optimization - verification pairs generated from the testing data are scored one by one with the X-Vectors extractor from Phase 1 and the PLDA backend from Phase 2, and the resulting verification scores are used for EER plotting and to derive the EER threshold).
  By repeatedly choosing every similarity value in the scoring list as the decision threshold, a plot of the rates (FPR and FNR) against the decision threshold can be made. The intersection of the FPR and FNR lines on the plot is the Equal Error Rate (EER) (as its name suggests, it is the point where the error rates are equal). Another important metric, usually calculated at the EER point on the graph, is the Minimum Decision Cost Function (MinDCF). The detailed computation methods for these metrics are described in section 3.1.1.
  The main computation pipeline, which takes a pair of utterances from the verification list and gives their similarity, is divided into the following stages, in order:
  • Stage 1 - Front-end Processing: The enrolled and the questioned utterance are windowed by a 10ms Hamming window with a 10ms frame shift to extract 30-dimensional MFCCs.
  • 70. • Stage 2 - Voice Activity Detection: WebRTC VAD [43] is utilized to extract speech partitions from each utterance. The working configuration of WebRTC VAD is the same as the configuration described in the voice activity detection stage in section 3.4.1.
  • Stage 3 - Embeddings Extraction: The speech partitions extracted in stage 2 are concatenated together for each utterance, and an embeddings extractor is employed to extract speaker embeddings. It is the same extractor used in section 3.4.1. The output of this stage is one 128-dimensional embedding vector for each of the two utterances.
  • Stage 4 - PLDA Scoring: The PLDA back-end, trained on embeddings extracted with the same embeddings extractor from the last step, is used to score the similarity (a single float value) between the enrolled and the questioned utterances.
  3.5 Proposed System
  In recent years, ECAPA-TDNN, a development over X-Vectors' neural network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. Tables 3.6 and 3.7 report how ECAPA-TDNN outperforms a strong X-Vector baseline system, as experimented in [62], on both the speaker verification task and the speaker diarization task with English corpora.
  Table 3.6: EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]).
  • 71. Table 3.7: Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal for the baseline and proposed systems (taken from [88]).
  In the proposed system, the X-Vectors-based extractor is replaced with an ECAPA-TDNN-based extractor, and the PLDA backend is trained on ECAPA-TDNN embeddings instead. The employed ECAPA-TDNN embeddings extractor is trained on the same data set, and with the same data augmentation strategies, as the X-Vectors embeddings extractor. The network architecture is kept as described in section 2.4.2. The number of MFCCs taken is reduced from 80 down to 40, the minimum learning rate is lowered by 10 times, and the number of epochs is doubled from 10 to 20. Figures 3.11 and 3.12 visualize the proposed speaker diarization and speaker verification systems respectively.
  • 72. Figure 3.11: Proposed speaker diarization system diagram (identical to the baseline pipeline of figure 3.9, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
  • 73. Figure 3.12: Proposed speaker verification system diagram (identical to the baseline pipeline of figure 3.10, with the X-Vectors extractor replaced by an ECAPA-TDNN extractor).
  • 75. Chapter 4
  Results
  4.1 Speaker Verification Task
  In this task, both the baseline and proposed speaker verification sub-systems are tested with different PLDA dimension reduction configurations. The dimension reduction ratios $r_i$ take values in $\{0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1.00\}$. The corresponding reduced, or target, dimension $D_i$ is calculated as in equation 4.1, where $V$ is the original embedding dimension: 128 in the case of the baseline system, and 192 in the case of the proposed system. The PLDA backend is trained on the same training data set as the embeddings extractor.
  $$D_i = 4 * \left\lfloor \frac{r_i * V}{4} \right\rfloor \qquad (4.1)$$
  As reported in table 4.1, the proposed system with the ECAPA-TDNN architecture outperforms the baseline system in both EER and MinDCF performance in all test cases. In the tests with the IPCC_110000 test split, the proposed system gives a 64.5% relative improvement in EER, with corresponding 82.4% and 86.5% relative improvements in MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements in MinDCF are smaller in the tests with VTR_1350, where the proposed system gives a 66.1% relative improvement in EER, with corresponding 64.7% and 63.1% relative improvements in MinDCF(p=0.01) and MinDCF(p=0.001) respectively. The improvements given by the proposed system in the tests with ZALO_400 are smaller than in both of the mentioned tests, but still significant: it gives a 45.6% relative improvement in EER, with corresponding 22.5% and 22.5% relative improvements in MinDCF(p=0.01) and MinDCF(p=0.001) respectively.
  • 76. Furthermore, both systems show a consistent degradation in equal error rate (EER) as the dimension reduction ratio decreases, with only two exceptions. The first exception occurs in the tests with the IPCC_110000 test split: the EER of the proposed system falls from 1.44% down to 1.39% and then rises up to 1.54% as the target dimension falls from 180 to 172 and then to 160. The second exception occurs in the tests with the ZALO_400 data set, where the EER of the proposed system falls from 8.08% down to 7.91% and then rises up to 8.16% as the target dimension goes through the same changes mentioned in the first exception. However, in both of these exceptions, the swings are insignificant and do not affect the overall trend of the EER. As for MinDCF, this metric does not show a clear trend against the reduction of the embedding dimension.
  In summary, the proposed system shows significant improvements over the baseline system, and the embedding dimension should not be further reduced in the PLDA scoring stage.
  • 77. Table 4.1: EER and MinDCF performance.
  IPCC_110000 (test split) (8000 Hz, K=3, # Trials=3888)
  X-Vector:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 128 | 0.6240 | 0.8112 | 3.91
  0.95 | 120 | 0.6183 | 0.8117 | 3.96
  0.90 | 112 | 0.6183 | 0.8066 | 4.01
  0.85 | 108 | 0.6163 | 0.8102 | 4.17
  0.80 | 100 | 0.6317 | 0.8050 | 4.27
  0.70 | 88 | 0.6497 | 0.8020 | 4.32
  0.60 | 76 | 0.6445 | 0.8138 | 4.53
  0.50 | 64 | 0.6533 | 0.7917 | 4.78
  ECAPA-TDNN:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 192 | 0.1024 | 0.1085 | 1.44
  0.95 | 180 | 0.1065 | 0.1080 | 1.44
  0.90 | 172 | 0.1096 | 0.1096 | 1.39
  0.85 | 160 | 0.0998 | 0.0998 | 1.54
  0.80 | 152 | 0.1070 | 0.1070 | 1.54
  0.70 | 132 | 0.0983 | 0.0983 | 1.65
  0.60 | 112 | 0.1101 | 0.1101 | 1.85
  0.50 | 96 | 0.1240 | 0.1240 | 2.01
  VTR_1350 (8000 Hz (resampled), K=3, # Trials=7902)
  X-Vector:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 128 | 0.7588 | 0.8651 | 9.69
  0.95 | 120 | 0.7603 | 0.8641 | 9.82
  0.90 | 112 | 0.7472 | 0.8659 | 9.87
  0.85 | 108 | 0.7325 | 0.8669 | 9.92
  0.80 | 100 | 0.7327 | 0.8502 | 10.02
  0.70 | 88 | 0.7423 | 0.8322 | 10.10
  0.60 | 76 | 0.7261 | 0.8461 | 10.30
  0.50 | 64 | 0.7459 | 0.8494 | 10.48
  ECAPA-TDNN:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 192 | 0.2680 | 0.3192 | 3.29
  0.95 | 180 | 0.2721 | 0.3680 | 3.49
  0.90 | 172 | 0.2797 | 0.3936 | 3.54
  0.85 | 160 | 0.3055 | 0.4052 | 3.59
  0.80 | 152 | 0.2971 | 0.3941 | 3.62
  0.70 | 132 | 0.3214 | 0.4432 | 3.70
  0.60 | 112 | 0.3774 | 0.4450 | 3.77
  0.50 | 96 | 0.3991 | 0.4938 | 3.77
  ZALO_400 (8000 Hz (resampled), K=3, # Trials=2376)
  X-Vector:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 128 | 0.9470 | 0.9470 | 14.39
  0.95 | 120 | 0.9562 | 0.9562 | 14.56
  0.90 | 112 | 0.9444 | 0.9444 | 14.65
  0.85 | 108 | 0.9444 | 0.9444 | 14.73
  0.80 | 100 | 0.9402 | 0.9402 | 14.90
  0.70 | 88 | 0.9478 | 0.9478 | 14.90
  0.60 | 76 | 0.9621 | 0.9621 | 15.24
  0.50 | 64 | 0.9739 | 0.9739 | 14.65
  ECAPA-TDNN:
  ratio | PLDA dim | MinDCF p=0.01 | MinDCF p=0.001 | EER (%)
  1.00 | 192 | 0.7340 | 0.7340 | 7.83
  0.95 | 180 | 0.7424 | 0.7424 | 8.08
  0.90 | 172 | 0.7306 | 0.7306 | 7.91
  0.85 | 160 | 0.7214 | 0.7214 | 8.16
  0.80 | 152 | 0.6987 | 0.6987 | 8.16
  0.70 | 132 | 0.7079 | 0.7079 | 8.67
  0.60 | 112 | 0.7374 | 0.7374 | 9.34
  0.50 | 96 | 0.7332 | 0.7332 | 9.93
  • 78. 4.2 Speaker Diarization Task
  In this task, the whole baseline and proposed systems are tested with mock conversations consisting of different numbers of engaging speakers, under different uniform sub-segmenting configurations. In this test, oracle VAD (i.e. ground-truth VAD) is used to remove the effect of any voice activity detection module, PLDA scoring is carried out without dimension reduction, and the exact number of engaging speakers in each conversation is known before the clustering process. Results are reported in table 4.2, where {x : y} represents a uniform segmentation configuration of windows of length x seconds with y seconds of overlap.
  Figure 4.1: A speaker diarization output of a 3-way conversation in the VTR_1350 test set.
  Both systems perform relatively well on the IPCC_110000 test split's mock conversations, where DERs are all below 4.5 percent. This result matches the fact that these conversations are generated from the data set that is in-domain with the embedding extractor's training data. The results with ZALO_400 mock conversations are much worse, with the DER going up to 17.15% in the case of the baseline system and 11.20% in the case of the proposed system. With VTR_1350 mock conversations, the results are even worse than that: the DER goes up to 24.25% in the case of the baseline system and 22.33% in the case of the proposed system.
  In most cases, the proposed system with ECAPA-TDNN outperforms the baseline system, with most conversation types and all sub-segmentation configurations. In each set of conversations with the same number of participating speakers, the best DER among the different sub-segmenting configurations of the proposed system usually outperforms that of the baseline system by 30% to 70%.
  • 79. Furthermore, while both systems perform better with a wider provided context (i.e. a larger sub-segmentation window size), the proposed system does not show a significantly faster relative DER reduction as the context window size grows. In other words, the proposed system's DER would be expected to decrease faster than the baseline system's DER, since ECAPA-TDNN theoretically makes better use of the wide context thanks to its attention mechanism, yet this is not observed. This experiment indicates that in the speaker diarization task, where speech segments are sub-segmented into sub-segments shorter than 4 seconds, the attention over the time channel is not very effective.
  In conclusion, the proposed system gives a significant improvement in DER performance over the baseline system.
  • 80. Table 4.2: DER performance.
  IPCC_110000.test - Mock Conversations (8000 Hz; 2-way, 3-way, 4-way and 5-way conversations (200 each))
  # spk | subseg. | DER (%) X-Vector | DER (%) ECAPA
  2 | {1.5 : 0.75} | 2.93 | 2.66
  2 | {2.0 : 1.00} | 2.72 | 2.06
  2 | {3.0 : 1.50} | 2.54 | 1.70
  2 | {4.0 : 2.00} | 2.17 | 1.50
  3 | {1.5 : 0.75} | 2.65 | 2.65
  3 | {2.0 : 1.00} | 2.72 | 1.36
  3 | {3.0 : 1.50} | 2.34 | 0.93
  3 | {4.0 : 2.00} | 2.10 | 0.89
  4 | {1.5 : 0.75} | 4.35 | 4.33
  4 | {2.0 : 1.00} | 4.43 | 2.82
  4 | {3.0 : 1.50} | 3.53 | 2.36
  4 | {4.0 : 2.00} | 2.88 | 2.93
  5 | {1.5 : 0.75} | 5.27 | 3.88
  5 | {2.0 : 1.00} | 3.78 | 2.63
  5 | {3.0 : 1.50} | 3.26 | 2.91
  5 | {4.0 : 2.00} | 3.03 | 3.14
  VTR_1350 - Mock Conversations (8000 Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
  # spk | subseg. | DER (%) X-Vector | DER (%) ECAPA
  2 | {1.5 : 0.75} | 11.31 | 18.59
  2 | {2.0 : 1.00} | 9.32 | 8.41
  2 | {3.0 : 1.50} | 5.31 | 2.66
  2 | {4.0 : 2.00} | 2.45 | 1.31
  3 | {1.5 : 0.75} | 17.77 | 20.13
  3 | {2.0 : 1.00} | 12.52 | 12.80
  3 | {3.0 : 1.50} | 8.39 | 5.63
  3 | {4.0 : 2.00} | 7.08 | 3.42
  4 | {1.5 : 0.75} | 21.18 | 21.63
  4 | {2.0 : 1.00} | 16.54 | 12.71
  4 | {3.0 : 1.50} | 10.34 | 5.34
  4 | {4.0 : 2.00} | 8.20 | 4.44
  5 | {1.5 : 0.75} | 24.25 | 22.33
  5 | {2.0 : 1.00} | 18.24 | 11.70
  5 | {3.0 : 1.50} | 12.16 | 4.90
  5 | {4.0 : 2.00} | 8.63 | 2.95
  ZALO_400 - Mock Conversations (8000 Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations (200 each))
  # spk | subseg. | DER (%) X-Vector | DER (%) ECAPA
  2 | {1.5 : 0.75} | 5.62 | 4.86
  2 | {2.0 : 1.00} | 4.78 | 3.60
  2 | {3.0 : 1.50} | 5.40 | 2.72
  2 | {4.0 : 2.00} | 6.61 | 6.06
  3 | {1.5 : 0.75} | 10.85 | 7.11
  3 | {2.0 : 1.00} | 10.11 | 6.15
  3 | {3.0 : 1.50} | 9.72 | 5.77
  3 | {4.0 : 2.00} | 10.92 | 7.59
  4 | {1.5 : 0.75} | 15.35 | 8.85
  4 | {2.0 : 1.00} | 12.83 | 6.59
  4 | {3.0 : 1.50} | 11.06 | 6.14
  4 | {4.0 : 2.00} | 12.17 | 6.58
  5 | {1.5 : 0.75} | 17.15 | 11.20
  5 | {2.0 : 1.00} | 13.79 | 8.72
  5 | {3.0 : 1.50} | 12.31 | 6.76
  5 | {4.0 : 2.00} | 12.44 | 7.09
  • 81. Chapter 5
  Conclusions and Future Works
  In this thesis, a new deep neural network architecture, ECAPA-TDNN, was experimented with in comparison to the baseline system based on X-Vectors, and showed significant overall improvements. The proposed system outperformed the baseline on all Vietnamese data sets and on both tasks: speaker verification and speaker diarization. Thanks to the attention mechanism that operates on both time and feature channels, the proposed network can learn which data in the context and which features are more important. This context-aware property of ECAPA-TDNN is remarkably important, since different languages have different ways of constructing sentences, and word positioning in Vietnamese is totally different from that of English or French. In this sense, ECAPA-TDNN can be adapted to a wide variety of languages with different writing styles, and indeed it worked with Vietnamese conversations. The following are some highlighted pros and cons of using ECAPA-TDNN in the speaker diarization system:
  Pros:
  • ECAPA-TDNN provides context-aware embeddings with attention on both time and feature channels that work exceptionally well with Vietnamese data.
  • Based entirely on the PyTorch framework, ECAPA-TDNN is much easier to train, test and customize than in Kaldi.
  Cons:
  • Both the training and inference processes are slower due to the complexity of the network. Even with an NVIDIA A100 GPU, it takes 80 hours to complete 20 training epochs with the IPCC_110000 data set.
  • 82. • The network is not yet production-ready, while the X-Vectors network trained with Kaldi has long been used in production, for both the speaker verification and diarization systems.
  Further research directions that can be explored to improve the understanding of the capability of ECAPA-TDNN in speaker diarization include:
  • Trial and error with more configurations (in this thesis, only some minor changes were made to the original network configuration).
  • Exploring other types of clustering methods.
  • Studying how effective the proposed system is in the case of conversations with multiple overlaps.
  • Applying post-processing methods to the diarization result.
  • Building a Vietnamese conversation data set based on real conversations.
  • 83. Bibliography
  [1] Omid Sadjadi et al. NIST 2021 Speaker Recognition Evaluation Plan. 2021. URL: https://tsapps.nist.gov/publication/get%5Fpdf.cfm?pub%5Fid=932697.
  [2] David Arthur and Sergei Vassilvitskii. "k-means++: the advantages of careful seeding". In: SODA '07. 2007.
  [3] Dan Pelleg and Andrew W. Moore. "X-means: Extending K-means with Efficient Estimation of the Number of Clusters". In: ICML. 2000.
  [4] Aonan Zhang et al. "Fully Supervised Speaker Diarization". In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), pp. 6301–6305.
  [5] Shota Horiguchi et al. "End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors". In: ArXiv abs/2005.09921 (2020).
  [6] Yuki Takashima et al. "End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection". In: 2021 IEEE Spoken Language Technology Workshop (SLT) (2021), pp. 849–856.
  [7] Tsun-Yat Leung and Lahiru Samarakoon. "Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty". In: Interspeech 2021 (2021).
  [8] Niruhan Viswarupan. K-Means Data Clustering. 2017. URL: https://towardsdatascience.com/k-means-data-clustering-bce3335d2203 (visited on 12/09/2021).
  [9] Sabur Ajibola Alim and Nahrul Khair Alang Rashid. "From Natural to Artificial Intelligence - Algorithms and Applications". In: IntechOpen, 2018. Chap. 1.
  [10] Urmila Shrawankar and Vilas M. Thakare. "Techniques for Feature Extraction In Speech Recognition System: A Comparative Study". In: CoRR abs/1305.1145 (2013). arXiv: 1305.1145. URL: http://arxiv.org/abs/1305.1145.
  • 84. [11] Smita Magre, Pooja Janse, and Ratnadeep Deshmukh. "A Review on Feature Extraction and Noise Reduction Technique". In: (Feb. 2014).
  [12] Bob Meddins. "5 - The design of FIR filters". In: Introduction to Digital Signal Processing. Ed. by Bob Meddins. Oxford: Newnes, 2000, pp. 102–136. ISBN: 978-0-7506-5048-9. DOI: https://doi.org/10.1016/B978-075065048-9/50007-6. URL: https://www.sciencedirect.com/science/article/pii/B9780750650489500076.
  [13] Torben Poulsen. "Loudness of tone pulses in a free field". In: Acoustical Society of America Journal 69.6 (June 1981), pp. 1786–1790. DOI: 10.1121/1.385915.
  [14] Stanislas Dehaene. "The neural basis of the Weber–Fechner law: a logarithmic mental number line". In: Trends in Cognitive Sciences 7.4 (2003), pp. 145–147.
  [15] S. S. Stevens. "A Scale for the Measurement of the Psychological Magnitude Pitch". In: Acoustical Society of America Journal 8.3 (Jan. 1937), p. 185. DOI: 10.1121/1.1915893.
  [16] Robert B. Randall. "A history of cepstrum analysis and its application to mechanical problems". In: Mechanical Systems and Signal Processing 97 (2017). Special Issue on Surveillance, pp. 3–19. ISSN: 0888-3270. DOI: https://doi.org/10.1016/j.ymssp.2016.12.026. URL: https://www.sciencedirect.com/science/article/pii/S0888327016305556.
  [17] Philipos C. Loizou. Speech Enhancement: Theory and Practice. 2nd. USA: CRC Press, Inc., 2013. ISBN: 1466504218.
  [18] Xugang Lu et al. "Speech enhancement based on deep denoising autoencoder". In: INTERSPEECH. 2013.
  [19] Yong Xu et al. "A Regression Approach to Speech Enhancement Based on Deep Neural Networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.1 (2015), pp. 7–19. DOI: 10.1109/TASLP.2014.2364452.
  [20] Hakan Erdogan et al. "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks". In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), pp. 708–712.
  [21] Tian Gao et al. "Densely Connected Progressive Learning for LSTM-Based Speech Enhancement". In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018, pp. 5054–5058. DOI: 10.1109/ICASSP.2018.8461861.
  • 85. [22] Desh Raj et al. Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis. 2020. arXiv: 2011.02014 [eess.AS].
  [23] Gregory Sell et al. "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge". In: INTERSPEECH. 2018.
  [24] Neville Ryant et al. The Second DIHARD Diarization Challenge: Dataset, task, and baselines. 2019. arXiv: 1906.07839 [eess.AS].
  [25] Mireia Díez et al. "BUT System for DIHARD Speech Diarization Challenge 2018". In: INTERSPEECH. 2018.
  [26] Shinji Watanabe et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings". In: Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020). 2020, pp. 1–7. DOI: 10.21437/CHiME.2020-1.
  [27] Ashish Arora et al. The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge. 2020. arXiv: 2006.07898 [eess.AS].
  [28] Wikipedia contributors. Maximum likelihood estimation — Wikipedia, The Free Encyclopedia. 2021. URL: https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=1051139067 (accessed 17-November-2021).
  [29] John R. Hershey et al. Deep clustering: Discriminative embeddings for segmentation and separation. 2015. arXiv: 1508.04306 [cs.NE].
  [30] Morten Kolbæk et al. Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks. 2017. arXiv: 1703.06284 [cs.SD].
  [31] Yi Luo and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019), pp. 1256–1266. ISSN: 2329-9304. DOI: 10.1109/taslp.2019.2915167. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167.
  [32] Xiong Xiao et al. Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020. 2020. arXiv: 2010.11458 [eess.AS].
  • 86. [33] Arsha Nagrani et al. VoxSRC 2020: The Second VoxCeleb Speaker Recognition Challenge. 2020. arXiv: 2012.06867 [cs.SD].
  [34] Takuya Yoshioka et al. "Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks". In: Interspeech 2018 (2018). DOI: 10.21437/interspeech.2018-2284. URL: http://dx.doi.org/10.21437/Interspeech.2018-2284.
  [35] Christoph Boeddecker et al. "Front-end processing for the CHiME-5 dinner party scenario". In: Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018). 2018, pp. 35–40. DOI: 10.21437/CHiME.2018-8.
  [36] Wikipedia contributors. Audacity (audio editor) — Wikipedia, The Free Encyclopedia. 2021. URL: https://en.wikipedia.org/w/index.php?title=Audacity_(audio_editor)&oldid=1054771106 (accessed 17-November-2021).
  [37] A. Benyassine et al. "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications". In: IEEE Communications Magazine 35.9 (1997), pp. 64–73. DOI: 10.1109/35.620527.
  [38] Jongseo Sohn and Wonyong Sung. "A voice activity detector employing soft decision based noise spectrum adaptation". In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181). Vol. 1. 1998, pp. 365–368. DOI: 10.1109/ICASSP.1998.674443.
  [39] Monica Franzese and Antonella Iuliano. "Hidden Markov Models". In: Encyclopedia of Bioinformatics and Computational Biology. Ed. by Shoba Ranganathan et al. Oxford: Academic Press, 2019, pp. 753–762. ISBN: 978-0-12-811432-2. DOI: https://doi.org/10.1016/B978-0-12-809633-8.20488-3. URL: https://www.sciencedirect.com/science/article/pii/B9780128096338204883.
  [40] Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. "A statistical model-based voice activity detection". In: IEEE Signal Processing Letters 6.1 (1999), pp. 1–3. DOI: 10.1109/97.736233.
  [41] Jacob Benesty, M. Mohan Sondhi, Yiteng Huang, et al. Springer Handbook of Speech Processing. Vol. 1. Springer, 2008.
  • 87. [42] Wikipedia contributors. WebRTC — Wikipedia, The Free Encyclopedia. 2021. URL: https://en.wikipedia.org/w/index.php?title=WebRTC&oldid=1053350113 (accessed 17-November-2021).
  [43] Webrtc/common_audio/VAD - external/webrtc - git at Google. URL: https://chromium.googlesource.com/external/webrtc/+/branch-heads/43/webrtc/common%5Faudio/vad/.
  [44] Thad Hughes and Keir Mierle. "Recurrent neural networks for voice activity detection". In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013, pp. 7378–7382. DOI: 10.1109/ICASSP.2013.6639096.
  [45] Jesus Lopez et al. "Advances in Speaker Recognition for Telephone and Audio-Visual Data: the JHU-MIT Submission for NIST SRE19". In: Nov. 2020, pp. 273–280. DOI: 10.21437/Odyssey.2020-39.
  [46] Matthew A. Siegler. "Automatic Segmentation, Classification and Clustering of Broadcast News Audio". In: 1997.
  [47] "Step-by-step and integrated approaches in broadcast news speaker diarization". In: Computer Speech & Language 20.2 (2006). Odyssey 2004: The Speaker and Language Recognition Workshop, pp. 303–330. ISSN: 0885-2308. DOI: https://doi.org/10.1016/j.csl.2005.08.002. URL: https://www.sciencedirect.com/science/article/pii/S0885230805000471.
  [48] Scott Chen. "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion". In: 1998.
  [49] Perrine Delacourt and Christian Wellekens. "DISTBIC: A speaker-based segmentation for audio data indexing". In: Speech Communication 32 (Sept. 2000), pp. 111–126. DOI: 10.1016/S0167-6393(00)00027-3.
  [50] Simon Prince and James H. Elder. "Probabilistic Linear Discriminant Analysis for Inferences About Identity". In: 2007 IEEE 11th International Conference on Computer Vision (2007), pp. 1–8.
  [51] Daniel Garcia-Romero and Carol Y. Espy-Wilson. "Analysis of i-vector Length Normalization in Speaker Recognition Systems". In: INTERSPEECH. 2011.
  [52] In: ().
  [53] Gregory Sell and Daniel Garcia-Romero. "Speaker diarization with PLDA i-vector scoring and unsupervised calibration". In: 2014 IEEE Spoken Language Technology Workshop (SLT) (2014), pp. 413–417.