Deep Learning Based Voice Activity Detection and Speech Enhancement

Speech Recognition Front-End:
Voice Activity Detection & Speech
Enhancement
Juntae Kim, Ph. D Candidate
School of Electrical Engineering
KAIST
For NAVER

Voice Activity
Detection
She had your dark suit in greasy
wash water all year.
Local device (smart speaker, robot)
Overview
End Point
Detection
Speech
Enhancement
Speech
Recognition
Server
Today’s topic

Voice Activity Detection Using an Adaptive
Context Attention Model
Kim, Juntae, and Minsoo Hahn. "Voice Activity Detection Using an Adaptive Context Attention Model." IEEE Signal Processing Letters (2018).
The most famous VAD
repository in github

Voice activity detection (VAD)
Objective: From incoming signal, detecting the speech signal only.
Important Points for VAD:
① Robustness to the various real-world noise environments.
② Robustness to the distance variation.
③ Compact computational cost with low-latency.
Conventional methods:
① Statistical signal processing based approaches.  Assume DFT coefficients of speech and noise signal to
Gaussian random variable and conduct the decision by calculating the likelihood ratio.
② Feature engineering based approaches.  Harmonicity, energy, zero-crossing rate, entropy and etc.
③ Traditional machine learning based approaches.  SVM, LDA, KNN and etc.

Deep learning based VAD
Research branches for deep-learning based VAD
Acoustic
Feature
Extraction
Neural network:
DNN, CNN,
LSTM
Decision
Which acoustic features are useful for VAD?
 These approaches show outstanding performance,
however, generally needs to multiple features so that
computation cost can be increased.
Can we directly use this raw-waveform for neural network?  Generally raw-waveform based approach needs high
computational cost even the performance improvement is slight.
Which neural network architecture make the VAD robust to various noise
environments?  Generally many researchers try to apply state-of-the-art
architectures from other field such as LSTM, CNN and DNN but such architecture
show the trade-off according to the noise types.
Which neural network architecture can effectively use the context
information of speech signal for VAD?

Boosted deep neural network (bDNN)
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
𝑥 𝑛 𝑥 𝑛+1
𝑥 𝑛−1
𝑥 𝑛+1+𝑢 𝑥 𝑛+𝑊
Inputs:
Frames subsampled with W=19, u=9 are used as
input.
Combine with average method
Extended outputs through time are used as
output.
Loss : Mean squared error
bDNN shows outstanding performance by
adopting the boosting strategy.  However it
only can use fixed context information.

Boosted deep neural network (bDNN)
bDNN shows outstanding performance by
adopting the boosting strategy.  However it
only can use fixed context information.
In order to solve this problem, Zhang et el.
built multi-stacking model with several
bDNNs that have various context size of inputs.
This structure shows state-of-the-art
performance but the computation cost was 11
times higher than single bDNN. However, this
result implies that if we use the context
information adaptively, we can get some
performance improvement.

Can a single model adaptively use the context information (CI) according to noise types and SNRs?
In noisy acoustic features, there is no ground truth what is proper usage of CI.
Let’s use the reinforcement-like method.
From the input acoustic features, repeatedly find the proper CI usage.
If found CI usage make the classification correct, give some rewards to model.
Motivation

Adaptive context attention model (ACAM)
• Decoder: Determine which context is important. (attention)
• Core network: Given previous hidden state (𝐡 𝑚, 𝑡−1) and the input information with noise
environment (𝐠 𝑚,𝑡), propose the next action to the succeeding module.
• Encoder: Aggregate the information of the results of current action.
Start
No
Yes
Stop
𝑚, ,
𝑚, 𝑡 ( 𝑚, 𝑡−1
𝑚, 𝑡 ( 𝑚, 𝑚, 𝑡
𝑚, 𝑡 ( 𝑚, 𝑡, 𝑚, 𝑡
𝐡 𝑚, 𝑡 (𝐡 𝑚, 𝑡−1, 𝐠 𝑚, 𝑡 𝑚 (𝐡 𝑚, 𝑡
⋯
Core network
Attention Attention
𝐡 𝑚, 𝑡−1 𝐡 𝑚, 𝑡
𝐡 𝑚, 𝑇
𝛒 𝑚, 𝑡
𝑚, 𝑡 [0.05, 0. , 0. , 0.5, 0. , 0. , 0.05] 𝑚, 𝑡+1 [0.05, 0. , 0.05 0.6 0.05, 0. , 0.05]
𝛒 𝑚, 𝑡+1 𝑚
𝐠 𝑚,𝑡 𝐠 𝑚,𝑡

Experimental Setup
Training Phase (20 h)
• Speech dataset: 4,620 utterances from the training set in the TIMIT corpus.
• Noise dataset: 20,000 types from Sound effect library.
• Noise addition was conducted with a randomly selected SNR between −10 and 12 dB.
Test phase
D1 dataset (2.4 h)
• Speech dataset: 192 utterances from the training set in the TIMIT corpus.
• Noise dataset: 15 types of noise from NOISEX-92 corpus. (jet cockpit1, jet cockpit2, destroyer engine,
destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank,
machine gun, pink, Volvo, speech babble, and white).
• Noise addition was conducted with SNRs −5, 0, 5, 10 dB.
D2 dataset (2 h)
• Real world dataset recorded by Galaxy 8.
D3 dataset (72 h)
• Youtube dataset recorded in real-world.

Experimental Setup
Acoustic features: Multi-resolution cochleagram features (MRCG)

ACAM 1: Fix the attention.
ACAM 2: Train the model with 𝐽𝑠𝑣 only.
ACAM 3: Train the model with 𝐽.
D1: TIMIT+NoiseX92 (2.4h), D2: Recorded dataset(2h), D3: YouTube Dataset(72h)
Performance Measure: area under the ROC curve (AUC) in %.
Number of parameters for each model.
HFCL MSFI DNN bDNN LSTM ACAM
# param. ▪ ~2 k ~3015 k ~3018 k ~2097 k ~953 k
Computation time (ms) for each model. The number in bracket is the MRCG extraction time.
HFCL MSFI DNN bDNN LSTM ACAM
The average p
rocessing tim
e per second
of signal
9.04 67.73 0.88 + (206) 9.24 + (206) 31.61 + (206) 10.12 + (206)
Experimental Results – Investigation of the TSN framework

End-to-End Speech Enhancement Using
Boosting-Based Two-Step Neural Network
Kim, Juntae, and Minsoo Hahn. “End-to-End Speech Enhancement Using Boosting-Based Two-Step Neural Network" submitted to IEEE Signal Processing Letters (2018).

Speech enhancement (SE)
Objective: From incoming noisy speech signal, removing the noise signal, while conserving the speech signal.
𝑥 ( )  ො 𝑥
Important Points for SE:
① Perceptual quality (related with speech distortion).
② Noise reduction.
③ Computational cost.
Conventional methods:
)( )H w
ˆ( )y t( )y t
( ) 1
( )
( ) ( ) 1 ( ) / ( )
x
x n n x
S w
H w
S w S w S w S w
 
 
   2 2
( ) [ ( )] , ( ) [ ( )]x nS w E F x t S w E F n t 
* Assumption:
x(t), n(t) are independent.
Estimated from silence region (by
conducting VAD)
How we can find the H(w)?  Minimum mean squared error estimation (MMSE).
2
minimize [( ( ) ( ) ( )) ]E Y w H w X w
Trade off

Deep learning based SE
 ˆ |t tf x Y
 ˆ |t tf x Y
Method 1: Directly maps the noisy log power spectral features (LPS) to clean LPS.
where ොxt ∈ ℝN is an enhanced LPS vector, yt ∈ ℝN is a noisy LPS vector, t is the frame index, N is
the feature dimension, f ( ∙ | θ) denotes the neural network-based function.
Xu, Yong, et al. "A regression approach to speech enhancement based on deep neural networks." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.1 (2015): 7-19.
1. Fully Convolutional Neural Network
2. Deep Neural Network
3. Etc.

Deep learning based SE
Method 2: SE is carried out by adopting the masking method.
where ොxt(k), xt k , and yt k are the kth element of ොxt, xt, and yt, respectively, and xt∈ℝN is a clean
LPS vector, k is frequency bin, and mt k is a mask value for kth element. The IRM in second line is
the most widely used one for the mask.
          
 1/2
ˆ , 2 log ty kmask
t t t tx k f x k y k m k e

   
      1/2 t tx k y kIRM
tm k e
 

Clean/Noisy
samples
Noisy
samples
Feature
extraction
DNN
Training
Mask
extractionx t , y(t)
y(t)
m(t)
Feature
extraction
Reconstructi
on
ොx(t)
Training stage
Enhancement stage
Y
Y
𝐘
ෝm(t)

Deep learning based SE – Summary
Conventional
Approaches
Method 2
LSTM
Method 1
Fully convolutional network
(FCN)
GAN
Multitask learning
Convolutional LSTM
DNN with skip connection
Multitask learning
Ideal ratio mask (IRM)
Complex ideal ratio mask (cIRM)
MMSE
Wiener filtering
Optimally modified log-
spectral amplitude
Minimum variance distortionless Response
(MVDR)
Two-step network (TSN)
Conventional Approaches
Pros:
① Good! but only in specific noise environment (stationary noise).
Cons:
① Vulnerable in non-stationary noise environment
② Computation cost is relatively high (matrix inversion operation)
③ Model size is small (only have to save the impulse response H(w)).
Method 1
Pros:
① Good! compared to conventional approaches. But some muffled sound (smoothing effect in the spectrogram)
is investigated.
② Strong performance in unseen noise environments.
③ Computation cost is cheap if we use simple acoustic features such as log-power spectra and simple
architecture such as DNN.
Cons:
① In order to reduce the speech distortion, GAN based SE methods were proposed. But training is quite hard
because of over-regularization.
② According to used architecture, there is some trade-off between noise types.
③ Model size is relatively big (many parameters.).
④ Some neural network types (bi-directional RNN) prevent the online enhancement.
⑤ Phase reconstruction seems to be hard in this framework.
Method 2 compared to Method 1
Pros:
① Phase reconstruction become easier than method 1.
② There is some performance improvement if we use method 1
framework. (Some paper deny this fact.)
Cons:
① Additional masking operation is needed.
② Even though the neural network can model any arbitrary function,
there aren’t solid reason the masking method outperform the
method 1. According to used architecture, there is some trade-off
between noise types.

Two-Step Network
Clean/Noisy
samples
Noisy
samples
Feature
extraction
Prior Net
Training
x t , y(t)
Feature
extraction
Reconstructi
on
ොx(t)
Training stage
Enhancement stage
Y
Y
𝐗, 𝐘
Post Net
Training
Prior Net
Enhancing
Post Net
Enhancing
𝐗∗, 𝐘
   *
ˆ , , ,mask post
t t t t tf f x x y X Y
Proposed Method: SE is carried out in end-to-end manner but implicitly considering the masking method (Our model
directly maps noisy features to clean features).
y(t)
Why end-to-end?
① We can share acoustic features with following modules (speech recognition system).
② We can save computation cost (no additional masking operation).
③ We can fully exploit the potential modeling power of neural network.

Two-Step Network
( )pri
g  pri-NN
y 𝑡−1 y 𝑡y 𝑡−2
y 𝑡 y 𝑡+1y 𝑡−1
y 𝑡+1 y 𝑡+2y 𝑡
෤𝐱 𝑡−1 ෤𝐱 𝑡
1
෤𝐱 𝑡−2
−1
෤𝐱 𝑡 ෤𝐱 𝑡+1
1
෤𝐱 𝑡−1
−1
෤𝐱 𝑡+1 ෤𝐱 𝑡+2
1
෤𝐱 𝑡
−1
𝐗 𝑡
∗
𝐘𝑡
ො𝐱 𝑡
post-NN
𝐕𝑡
,CI
tX
pred
tX
  *
,pri
t t k k
g

 
X Y
  * ( )
,m pred CI
t t m k t tm k

 
   
 X x X X
Why multiple outputs for pri-NN?
 We can get multiple predictions Xt
pred
for xt.
From multiple predictions, we can adopt the
boosting method.
Simplest boosting method:
( )1
ˆ m
t t
mM
 x x

Two-Step Network
( )pri
g  pri-NN
෤𝐱 𝑡−1 ෤𝐱 𝑡
1
෤𝐱 𝑡−2
−1
෤𝐱 𝑡 ෤𝐱 𝑡+1
1
෤𝐱 𝑡−1
−1
෤𝐱 𝑡+1 ෤𝐱 𝑡+2
1
෤𝐱 𝑡
−1
𝐗 𝑡
∗
𝐘𝑡
ො𝐱 𝑡
post-NN
𝐕𝑡
,CI
tX
pred
tX
( )1
ˆ m
t t
mM
 x x
෤𝐱 𝑡
෤𝐱 𝑡
1
෤𝐱 𝑡
−1
y 𝑡+1 y 𝑡+2y 𝑡y 𝑡−1 y 𝑡y 𝑡−2
Drawback 1: are from different context. 
The data distribution across the frequency dimension can be
different each other.
(1) (2) (3)
, ,t t tx x x
Input contexts are different.

Two-Step Network
( )pri
g  pri-NN
෤𝐱 𝑡−1 ෤𝐱 𝑡
1
෤𝐱 𝑡−2
−1
෤𝐱 𝑡 ෤𝐱 𝑡+1
1
෤𝐱 𝑡−1
−1
෤𝐱 𝑡+1 ෤𝐱 𝑡+2
1
෤𝐱 𝑡
−1
𝐗 𝑡
∗
𝐘𝑡
ො𝐱 𝑡
post-NN
𝐕𝑡
,CI
tX
pred
tX
Drawback 2: We cannot use Xt
CI which is highly related with 𝐱 𝑡,
because it corresponds to neighboring frames.
( )1
ˆ m
t t
mM
 x x
Drawback 3: We cannot use 𝐘t  We cannot implicitly model
the masking method.
   *
ˆ , , ,mask post
t t t t tf f x x y X Y
Also, according to noise types, 𝐘t can have intact clean speech
features in some frequency band.

Two-Step Network
( )pri
g  pri-NN
෤𝐱 𝑡−1 ෤𝐱 𝑡
1
෤𝐱 𝑡−2
−1
෤𝐱 𝑡 ෤𝐱 𝑡+1
1
෤𝐱 𝑡−1
−1
෤𝐱 𝑡+1 ෤𝐱 𝑡+2
1
෤𝐱 𝑡
−1
𝐗 𝑡
∗
𝐘𝑡
ො𝐱 𝑡
post-NN
𝐕𝑡
,CI
tX
pred
tX
Idea: Try to use all the useful information Xt
pred
, Xt
CI, Yt that
related with xt.
( )1
ˆ m
t t
mM
 x x
     2 1 2 2*
, ( ) .
N
t t tconcat set
   
 V X Y
However, we cannot use the simple averaging method because
of Xt
CI, Yt .
Convolution based boosting for filter out some noisy
information while aggregating target-related information from
each feature vectors.

Two-Step Network
( )pri
g  pri-NN
෤𝐱 𝑡−1 ෤𝐱 𝑡
1
෤𝐱 𝑡−2
−1
෤𝐱 𝑡 ෤𝐱 𝑡+1
1
෤𝐱 𝑡−1
−1
෤𝐱 𝑡+1 ෤𝐱 𝑡+2
1
෤𝐱 𝑡
−1
𝐗 𝑡
∗
𝐘𝑡
ො𝐱 𝑡
post-NN
𝐕𝑡
,CI
tX
pred
tX
1
,l l l
t t

 H K H
     1
1 1
, , , 1, ,
l l
I S
l l l
t t
i u
f j u i j f u i 
 
 
    
 
H K H
where Ht
l∈ℝN×O l
and Kl∈ℝS l×I l×O l
are the output
feature maps and the convolutional kernel respectively, S l and
I l and O l denote the size of the convolutional kernel and the
number of input and output feature maps, respectively, from the
l-th convolutional layer.
( )1
ˆ m
t t
mM
 x x
Idea: Convolution based boosting for filter out some noisy
information while aggregating target-related information from
each feature vectors.

Two-Step Network
( )pri
g  pri-NN
෤𝐱 𝑡−1 ෤𝐱 𝑡
1
෤𝐱 𝑡−2
−1
෤𝐱 𝑡 ෤𝐱 𝑡+1
1
෤𝐱 𝑡−1
−1
෤𝐱 𝑡+1 ෤𝐱 𝑡+2
1
෤𝐱 𝑡
−1
𝐗 𝑡
∗
𝐘𝑡
ො𝐱 𝑡
post-NN
𝐕𝑡
,CI
tX
pred
tX
Loss function
22
ˆ ,prior
loss post pri t t t t F
t
J J J       x x X X
 , , , , ,t t t t  X x x x
prior
tX
If λ is set to 0, we cannot consider that Xt
*
contains Xt
pred
because there is no evidence that Xt
pred
includes the
prediction of xt, which means that the pri-NN loses the
characteristic as weak-predictors. Thus, the boosting effect
of TSN from multiple predictions would be negated.

Experimental Setup
Training Phase
• Speech dataset: 4,620 utterances from the training set in the TIMIT corpus.
• Noise dataset: 100 types of noise from HU dataset, 50 types from Sound effect library.
• Noise addition was conducted with a randomly selected SNR between −5 and 20 dB.
• We repeat this procedure until the length of the entire training dataset is approximately 50 h.
• 90% of the training dataset is used in training the model and the remaining 10% in validation.
Test phase
D1 dataset
• Speech dataset: 192 utterances from the training set in the TIMIT corpus.
• Noise dataset: 15 types of noise from NOISEX-92 corpus. (jet cockpit1, jet cockpit2, destroyer engine,
destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank,
machine gun, pink, Volvo, speech babble, and white).
• Noise addition was conducted with SNRs −5, 0, 5, 10 dB.
D2 dataset
• Real world dataset recorded by Galaxy 8.

Experimental Setup
Additional Information
• Sampling rate: 8kHz.
• Window shift and length: 10 and 25ms
• Log-power spectra (LPS) were used for acoustic features.
• Z-score-normalization was conducted across the frequency dimension for LPS.
• When reconstructing the waveform, the noisy phase information was used.
Baseline Methods
• Deep neural network (DNN)
• DNN with skip connection (sDNN)
• Ideal ratio mask with DNN (IRM)
• Fully convolutional neural network (FCN)
• Long short term memory recurrent neural network (LSTM)

Experimental Setup
Pri-NN Post-NN
𝜏 =4
3 hidden layers={1024, 1024, 1024}
8 convolutional layers={(256, 5, 1), (128, 5, 1), (64, 5, 1), (32, 5,
1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (1, 5, 1)}
TSN
Pri-NN Post-NN
𝜏 =4
3 hidden layers={512, 512, 512}
4 convolutional layers={(256, 5, 1), (128, 5, 1), (64, 5, 1), (1, 5,
1)}
Compact TSN (cTSN)
Model size comparison (in million)
DNN sDNN IRM FCN LSTM TSN cTSN
11.03 11.03 11.03 0.28 13.24 4.13 1.87
Evaluation metrics
• The perceptual evaluation of speech quality (PESQ).
• The short-time objective intelligibility (STOI, in %).
• The segmental signal-to-noise ratio (SSNR).
• The adopted log-spectral distortion (LSD).

Experimental Results – Investigation of the TSN framework
• TSN-1 was trained with Vt in which Xt
∗ is substituted to Xt
pred
to observe the influence of Xt
CI.
• TSN-2 was trained with Vt, from which Yt is removed to investigate the effect of implicitly modeling
the masking method.
• TSN-3, our proposed method, was trained with Vt.
Using 𝐕𝑡 is effective!
Well training the Pri-NN is more
important than post-NN.
22
ˆ ,prior
loss post pri t t t t F
t
J J J       x x X X
Boosting is important for
performance improvement.

Experimental Results – Performance Evaluation –D1
PESQ babble buccanner1 buccanner2 destroyer
destroyero
ps
f16 factory1 factory2 hfchannel leopard m109
machinegu
n
pink volvo white Average
Noisy 1.898 2.038 1.847 2.064 2.036 2.094 1.980 2.402 2.174 2.752 2.511 2.913 1.997 3.510 1.997 2.281
FNN 2.223 2.272 2.099 2.266 2.371 2.345 2.319 2.645 2.318 2.842 2.699 3.000 2.314 3.571 2.270 2.504
SDNN1 2.205 2.236 2.117 2.253 2.363 2.332 2.306 2.615 2.267 2.785 2.657 2.945 2.297 3.739 2.260 2.492
LSTM 2.233 2.145 1.985 2.152 2.384 2.241 2.293 2.604 1.997 2.666 2.695 2.944 2.220 3.647 2.301 2.434
FCN 2.075 1.979 1.986 2.248 2.068 2.118 2.019 2.159 2.316 2.377 2.201 2.349 1.997 2.603 2.146 2.176
IRM 2.198 2.220 2.110 2.248 2.358 2.322 2.309 2.610 2.271 2.781 2.648 2.932 2.300 3.731 2.266 2.487
cTSN 2.273 2.335 2.242 2.351 2.423 2.462 2.378 2.721 2.506 3.016 2.783 3.066 2.357 3.566 2.316 2.586
TSN 2.264 2.407 2.245 2.379 2.439 2.442 2.400 2.765 2.485 3.016 2.826 3.117 2.431 3.653 2.378 2.616
PESQ STOI SSNR LSD
SNR −5 dB 0 dB 5 dB 10 dB Avr. −5 dB 0 dB 5 dB 10 dB Avr −5 dB 0 dB 5 dB 10 dB Avr −5 dB 0 dB 5 dB 10 dB Avr
Noisy 1.627 1.926 2.247 2.572 2.093 60.62 70.5 79.86 87.77 74.69 −7.374 −5.632 −3.241 −0.259 −4.127 2.239 2.067 1.828 1.545 1.919
DNN 2.187 2.525 2.827 3.104 2.661 67.57 77.57 84.95 90.01 80.03 −1.463 0.167 1.903 3.54 1.037 1.539 1.369 1.21 1.077 1.299
sDNN1 2.158 2.504 2.813 3.096 2.643 66.89 77.03 84.66 90.12 79.67 −1.641 0.123 2.017 3.865 1.091 1.567 1.397 1.242 1.11 1.329
IRM 2.155 2.499 2.807 3.089 2.638 66.92 77.02 84.67 90.14 79.69 −1.641 0.118 2.035 3.873 1.096 1.566 1.393 1.235 1.102 1.324
FCN 1.9 2.178 2.451 2.703 2.308 60.51 69.32 76.56 82.1 72.12 −0.412 0.719 1.615 2.321 1.061 1.903 2.02 2.048 2.02 1.998
LSTM 2.073 2.445 2.783 3.085 2.597 66.15 77.04 85.15 90.62 79.74 −2.420 −0.267 1.757 3.545 0.654 1.693 1.449 1.252 1.099 1.373
cTSN 2.255 2.597 2.907 3.191 2.738 69.51 79.41 86.63 91.68 81.81 −0.619 1.145 3.024 4.825 2.094 1.479 1.319 1.173 1.048 1.255
TSN 2.281 2.629 2.939 3.225 2.769 70.47 80.38 87.45 92.31 82.66 −0.820 1.037 3.02 4.911 2.037 1.478 1.307 1.156 1.027 1.242
DNN SDNN1 LSTM FCN IRM cTSN TSN
13.91 14.15 63.98 30.09 13.54 13.10 23.87
The averaging process times per second of speech (ms) conducted on an Intel® Core™ i7-6700k workstation with an Nvidia GTX 1080 Ti .
Noisy (White 5dB) DNN cTSN

Experimental Results – Preference Test–D2
The number of participants: 20
Test: For each pair (noisy, DNN, TSN), choose the best one. (In the aspect of noise reduction and speech distortion.)
 20 pairs were used.
Result: TSN 85%, DNN 15%,
18
10p 

Noisy TSN DNN

Deep Learning Based Voice Activity Detection and Speech Enhancement

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Deep Learning Based Voice Activity Detection and Speech Enhancement

Semelhante a Deep Learning Based Voice Activity Detection and Speech Enhancement (20)

Mais de NAVER Engineering

Mais de NAVER Engineering (20)

Último

Último (20)

Deep Learning Based Voice Activity Detection and Speech Enhancement