Speech Recognition Front-End:
Voice Activity Detection & Speech
Enhancement
Juntae Kim, Ph.D. Candidate
School of Electrical Engineering
KAIST
For NAVER
Overview
[Pipeline diagram] Local device (smart speaker, robot): Voice Activity Detection → End Point Detection → Speech Enhancement → Server: Speech Recognition. Example utterance: "She had your dark suit in greasy wash water all year."
Today's topics: Voice Activity Detection and Speech Enhancement
Voice Activity Detection Using an Adaptive
Context Attention Model
Kim, Juntae, and Minsoo Hahn. "Voice Activity Detection Using an Adaptive Context Attention Model." IEEE Signal Processing Letters (2018).
The most famous VAD repository on GitHub
Voice activity detection (VAD)
Objective: detect only the speech segments in the incoming signal.
Important Points for VAD:
① Robustness to various real-world noise environments.
② Robustness to distance variation.
③ Low computational cost and low latency.
Conventional methods:
① Statistical signal processing based approaches → model the DFT coefficients of the speech and noise signals as Gaussian random variables and make the decision by computing the likelihood ratio (see the sketch below).
② Feature engineering based approaches → harmonicity, energy, zero-crossing rate, entropy, etc.
③ Traditional machine learning based approaches → SVM, LDA, k-NN, etc.
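For illustration, a minimal sketch of approach ① (a Sohn-style Gaussian likelihood-ratio VAD; the function name, the fixed a priori SNR, and the threshold are hypothetical simplifications, not from the slides):

```python
import numpy as np

def likelihood_ratio_vad(frame_fft, noise_psd, prior_snr, threshold=0.5):
    """Return True if the frame is classified as speech (minimal sketch)."""
    gamma = np.abs(frame_fft) ** 2 / noise_psd          # a posteriori SNR per bin
    xi = prior_snr                                      # a priori SNR estimate
    # Per-bin log-likelihood ratio of "speech present" vs "noise only"
    # under the complex-Gaussian model of the DFT coefficients.
    log_lr = gamma * xi / (1.0 + xi) - np.log(1.0 + xi)
    return log_lr.mean() > threshold                    # geometric-mean decision rule
```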
Deep learning based VAD
Research branches for deep-learning based VAD
Pipeline: acoustic feature extraction → neural network (DNN, CNN, LSTM) → decision.
Which acoustic features are useful for VAD? → These approaches show outstanding performance; however, they generally need multiple features, so the computation cost can increase.
Can we feed the raw waveform directly to the neural network? → Raw-waveform approaches generally incur high computational cost even though the performance improvement is slight.
Which neural network architecture makes VAD robust to various noise environments? → Many researchers apply state-of-the-art architectures from other fields, such as LSTM, CNN, and DNN, but these architectures show trade-offs across noise types.
Which neural network architecture can effectively use the context information of the speech signal for VAD?
Deep learning based VAD
Boosted deep neural network (bDNN)
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
[Diagram: input frames around x_n, subsampled out to x_{n±W} with skip u]
Inputs: frames subsampled with W = 19, u = 9 are used as input.
Outputs: extended outputs through time are used as the output and combined by averaging (see the sketch below).
Loss: mean squared error.
bDNN shows outstanding performance by adopting the boosting strategy. → However, it can only use fixed context information.
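A minimal sketch of the boosted averaging step (an assumed simplification of bDNN's combination rule; the offsets and array shapes are hypothetical):

```python
import numpy as np

def boosted_average(window_preds, offsets):
    """window_preds[t, j]: prediction for frame t + offsets[j] made by the
    window centered at frame t. Returns one averaged score per frame."""
    T = window_preds.shape[0]
    scores, counts = np.zeros(T), np.zeros(T)
    for j, o in enumerate(offsets):
        targets = np.arange(T) + o                  # frames this output aims at
        valid = (targets >= 0) & (targets < T)      # drop out-of-range targets
        scores[targets[valid]] += window_preds[valid, j]
        counts[targets[valid]] += 1
    return scores / np.maximum(counts, 1)           # average all votes per frame

# Usage: speech = boosted_average(preds, offsets=[-2, -1, 0, 1, 2]) > 0.5
```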
Deep learning based VAD
Boosted deep neural network (bDNN)
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
bDNN can only use fixed context information. To solve this problem, Zhang et al. built a multi-stacking model from several bDNNs with different input context sizes. This structure shows state-of-the-art performance, but its computation cost was 11 times that of a single bDNN. Still, the result implies that if we use the context information adaptively, we can obtain a performance improvement.
Deep learning based VAD
Can a single model adaptively use the context information (CI) according to noise types and SNRs?
For noisy acoustic features, there is no ground truth for the proper usage of CI.
→ Let's use a reinforcement-learning-like method:
From the input acoustic features, repeatedly search for the proper CI usage.
If the chosen CI usage makes the classification correct, give the model a reward.
Motivation
Deep learning based VAD
Adaptive context attention model (ACAM)
• Decoder: determines which context is important (attention).
• Core network: given the previous hidden state (h_{m,t−1}) and the input information with the noise environment (g_{m,t}), proposes the next action to the succeeding module.
• Encoder: aggregates the information from the results of the current action.
[Diagram: recurrent loop (Start → attention → core network → … → Stop). At step t, the decoder produces attention weights ρ_{m,t} from h_{m,t−1}; the attended input is encoded into g_{m,t}; the core network updates h_{m,t} = f_core(h_{m,t−1}, g_{m,t}); the decision for frame m is made from the final hidden state h_{m,T}. Example attention weights: ρ_{m,t} = [0.05, 0.1, 0.1, 0.5, 0.1, 0.1, 0.05] → ρ_{m,t+1} = [0.05, 0.1, 0.05, 0.6, 0.05, 0.1, 0.05].]
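A minimal sketch of the ACAM attend-encode-update loop described above; f_dec, f_enc, f_core, and f_cls stand in for small neural networks and are hypothetical placeholders, not the paper's exact parameterization:

```python
def acam_decide(x_m, h0, steps, f_dec, f_enc, f_core, f_cls):
    """x_m: context window of acoustic features around frame m (array)."""
    h, y = h0, None
    for _ in range(steps):
        rho = f_dec(h)          # decoder: attention weights over the context
        g = f_enc(rho * x_m)    # encoder: aggregate the attended observation
        h = f_core(h, g)        # core network: recurrent state update
        y = f_cls(h)            # speech / non-speech score at this step
    return y                    # decision taken from the final step
```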
Experimental Setup
Training Phase (20 h)
• Speech dataset: 4,620 utterances from the training set in the TIMIT corpus.
• Noise dataset: 20,000 noise types from a sound-effects library.
• Noise addition was conducted with a randomly selected SNR between −10 and 12 dB.
Test phase
D1 dataset (2.4 h)
• Speech dataset: 192 utterances from the test set in the TIMIT corpus.
• Noise dataset: 15 types of noise from NOISEX-92 corpus. (jet cockpit1, jet cockpit2, destroyer engine,
destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank,
machine gun, pink, Volvo, speech babble, and white).
• Noise addition was conducted with SNRs −5, 0, 5, 10 dB.
D2 dataset (2 h)
• Real-world dataset recorded with a Galaxy S8.
D3 dataset (72 h)
• YouTube dataset recorded in the real world.
Experimental Setup
Zhang, Xiao-Lei, and DeLiang Wang. "Boosting contextual information for deep neural network based voice activity detection." IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP) 24.2 (2016): 252-264.
Acoustic features: Multi-resolution cochleagram features (MRCG)
Experimental Results
ACAM 1: fix the attention.
ACAM 2: train the model with J_sv only.
ACAM 3: train the model with J.
D1: TIMIT + NOISEX-92 (2.4 h), D2: recorded dataset (2 h), D3: YouTube dataset (72 h)
Performance measure: area under the ROC curve (AUC), in %.
Number of parameters for each model:
HFCL: n/a | MSFI: ~2 k | DNN: ~3015 k | bDNN: ~3018 k | LSTM: ~2097 k | ACAM: ~953 k
Computation time for each model (average processing time per second of signal, in ms; the number in brackets is the MRCG extraction time):
HFCL: 9.04 | MSFI: 67.73 | DNN: 0.88 + (206) | bDNN: 9.24 + (206) | LSTM: 31.61 + (206) | ACAM: 10.12 + (206)
End-to-End Speech Enhancement Using
Boosting-Based Two-Step Neural Network
Kim, Juntae, and Minsoo Hahn. "End-to-End Speech Enhancement Using Boosting-Based Two-Step Neural Network," submitted to IEEE Signal Processing Letters (2018).
Speech enhancement (SE)
Objective: remove the noise from the incoming noisy speech signal while preserving the speech: y(t) = x(t) + n(t) → x̂(t).
Important Points for SE:
① Perceptual quality (related to speech distortion).
② Noise reduction.
③ Computational cost.
Conventional methods:
[Diagram: y(t) → H(w) → ŷ(t)]
H(w) = S_x(w) / (S_x(w) + S_n(w)) = 1 / (1 + S_n(w)/S_x(w)),
where S_x(w) = E[|F{x(t)}|^2] and S_n(w) = E[|F{n(t)}|^2].
* Assumption: x(t) and n(t) are independent.
S_n(w) is estimated from the silence region (by conducting VAD).
How can we find H(w)? → Minimum mean squared error (MMSE) estimation: minimize E[(H(w)Y(w) − X(w))^2] (see the sketch below).
Note: noise reduction and perceptual quality (① and ②) are in a trade-off.
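A minimal Wiener-gain sketch under the assumptions above (x(t) and n(t) independent; S_n(w) estimated from silence regions); the spectral-subtraction estimate of S_x(w) is a crude, hypothetical stand-in:

```python
import numpy as np

def wiener_enhance_frame(noisy_fft, noise_psd):
    """Apply a per-bin Wiener gain to one STFT frame (minimal sketch)."""
    speech_psd = np.maximum(np.abs(noisy_fft) ** 2 - noise_psd, 1e-10)  # crude S_x
    gain = speech_psd / (speech_psd + noise_psd)   # H(w) = S_x / (S_x + S_n)
    return gain * noisy_fft                        # keeps the noisy phase
```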
Deep learning based SE
Method 1: directly map the noisy log-power spectral features (LPS) to clean LPS:
x̂_t = f(Y_t | θ),
where x̂_t ∈ ℝ^N is the enhanced LPS vector, y_t ∈ ℝ^N is the noisy LPS vector, t is the frame index, N is the feature dimension, and f(· | θ) denotes the neural-network-based function (a sketch follows the architecture list below).
Xu, Yong, et al. "A regression approach to speech enhancement based on deep neural networks." IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23.1 (2015): 7-19.
1. Fully Convolutional Neural Network
2. Deep Neural Network
3. Etc.
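As an illustration of Method 1, a minimal sketch of direct LPS-to-LPS regression with a DNN; the layer sizes, context width, and all names are hypothetical, not the cited paper's configuration:

```python
import torch
import torch.nn as nn

class LpsMapper(nn.Module):
    """Maps a stacked noisy-LPS context window to one clean LPS frame."""
    def __init__(self, n_freq, context=7, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq * context, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq),   # enhanced LPS for the center frame
        )

    def forward(self, noisy_context):    # (batch, n_freq * context)
        return self.net(noisy_context)

# Training would minimize the MSE between the output and the clean LPS frame.
```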
Deep learning based SE
Method 2: SE is carried out by adopting the masking method:
x̂_t(k) = f^mask(y_t(k), m_t(k)) = 2 log( m_t(k) · e^{y_t(k)/2} ),
m_t^{IRM}(k) = e^{(x_t(k) − y_t(k))/2},
where x̂_t(k), x_t(k), and y_t(k) are the kth elements of x̂_t, x_t, and y_t, respectively; x_t ∈ ℝ^N is a clean LPS vector, k is the frequency bin, and m_t(k) is the mask value for the kth element. The IRM in the second line is the most widely used mask.
[Diagram: Training stage: clean/noisy samples x(t), y(t) → feature extraction → X, Y → mask extraction m(t) → DNN training. Enhancement stage: noisy samples y(t) → feature extraction → Y → DNN → m̂(t) → reconstruction → x̂(t).]
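A minimal sketch of the masking pipeline, using the LPS-domain IRM as reconstructed above (all names are hypothetical; the exponent form follows the equations on this slide):

```python
import numpy as np

def irm_target(clean_lps, noisy_lps):
    """Training target: m_t(k) = exp((x_t(k) - y_t(k)) / 2),
    i.e. the clean-to-noisy magnitude ratio in the LPS domain."""
    return np.exp((clean_lps - noisy_lps) / 2.0)

def apply_mask(noisy_lps, mask_pred):
    # x_hat_t(k) = 2*log(m_t(k) * exp(y_t(k)/2)) = y_t(k) + 2*log m_t(k)
    return noisy_lps + 2.0 * np.log(np.maximum(mask_pred, 1e-10))
```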
Deep learning based SE – Summary
Conventional approaches: MMSE, Wiener filtering, optimally modified log-spectral amplitude (OM-LSA), minimum variance distortionless response (MVDR).
Method 1: LSTM, fully convolutional network (FCN), GAN, multitask learning, convolutional LSTM, DNN with skip connection.
Method 2: ideal ratio mask (IRM), complex ideal ratio mask (cIRM), multitask learning.
Proposed: two-step network (TSN).
Conventional Approaches
Pros:
① Good performance, but only in specific noise environments (stationary noise).
② Small model size (only the impulse response H(w) has to be stored).
Cons:
① Vulnerable in non-stationary noise environments.
② Relatively high computation cost (matrix inversion operation).
Method 1
Pros:
① Good performance compared to conventional approaches, but some muffled sound (a smoothing effect in the spectrogram) is observed.
② Strong performance in unseen noise environments.
③ Computation cost is low if we use simple acoustic features such as log-power spectra and a simple architecture such as a DNN.
Cons:
① GAN-based SE methods were proposed to reduce the speech distortion, but their training is quite hard because of over-regularization.
② Depending on the architecture used, there is a trade-off across noise types.
③ Model size is relatively big (many parameters).
④ Some neural network types (e.g., bi-directional RNNs) prevent online enhancement.
⑤ Phase reconstruction seems to be hard in this framework.
Method 2 compared to Method 1
Pros:
① Phase reconstruction becomes easier than in Method 1.
② There is some performance improvement over the Method 1 framework (some papers dispute this).
Cons:
① An additional masking operation is needed.
② Even though a neural network can model any arbitrary function, there is no solid reason why the masking method should outperform Method 1; depending on the architecture used, there is a trade-off across noise types.
Two-Step Network
[Diagram: Training stage: clean/noisy samples x(t), y(t) → feature extraction → X, Y → Prior Net training → X*, Y → Post Net training. Enhancement stage: noisy samples y(t) → feature extraction → Y → Prior Net enhancing → Post Net enhancing → reconstruction → x̂(t).]
x̂_t = f^post(X_t*, Y_t), implicitly modeling f^mask(x_t, y_t).
Proposed method: SE is carried out in an end-to-end manner while implicitly considering the masking method (our model directly maps noisy features to clean features).
Why end-to-end?
① We can share acoustic features with the following modules (the speech recognition system).
② We can save computation cost (no additional masking operation).
③ We can fully exploit the potential modeling power of the neural network.
Two-Step Network
[Diagram: the pri-NN g^pri(·) maps each noisy context window Y_{t+k} to a block of clean-frame estimates x̃_{t+k}; the estimates aimed at frame t form X_t^pred, the neighboring-frame estimates form X_t^CI, and together with the noisy features Y_t they are stacked into V_t and fed to the post-NN, which outputs x̂_t.]
X_t* = { g^pri(Y_{t+k}) }_k, with X_t* = [X_t^pred, X_t^CI] and X_t^pred = { x̃_t^{(m)} }_m.
Why multiple outputs from the pri-NN? → We can get multiple predictions X_t^pred for x_t, and from multiple predictions we can adopt the boosting method.
Simplest boosting method: x̂_t = (1/M) Σ_m x̃_t^{(m)}.
Two-Step Network
[Diagram repeated from the previous slide; x̃_t is predicted from three different input windows: (y_{t−2}, y_{t−1}, y_t), (y_{t−1}, y_t, y_{t+1}), and (y_t, y_{t+1}, y_{t+2}).]
Drawback 1: x̃_t^{(1)}, x̃_t^{(2)}, and x̃_t^{(3)} come from different contexts. → Their data distributions across the frequency dimension can differ from each other, because the input contexts are different.
Two-Step Network
[Diagram repeated from the previous slide.]
Drawback 2: We cannot use X_t^CI, which is highly related to x_t, because it corresponds to neighboring frames.
Drawback 3: We cannot use Y_t. → We cannot implicitly model the masking method, x̂_t = f^post(X_t*, Y_t) ≈ f^mask(x_t, y_t). Also, depending on the noise type, Y_t can contain intact clean speech features in some frequency bands.
Two-Step Network
[Diagram repeated from the previous slide.]
Idea: Use all the information related to x_t: X_t^pred, X_t^CI, and Y_t, stacked as V_t = concat(X_t*, Y_t).
However, we cannot use the simple averaging method because of X_t^CI and Y_t. → Use convolution-based boosting to filter out noisy information while aggregating target-related information from each feature vector (see the sketch below).
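A minimal sketch of the convolution-based boosting idea: V_t is treated as a stack of feature maps, and 1-D convolutions along the frequency axis aggregate them into one enhanced frame. This is a hypothetical stub, not the paper's exact post-NN configuration:

```python
import torch
import torch.nn as nn

class ConvBooster(nn.Module):
    """Aggregates stacked predictions and noisy features into one frame."""
    def __init__(self, in_maps, hidden=32, kernel=5):
        super().__init__()
        pad = kernel // 2
        self.net = nn.Sequential(
            nn.Conv1d(in_maps, hidden, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel, padding=pad),  # single output map
        )

    def forward(self, v_t):                 # v_t: (batch, in_maps, N)
        # in_maps = |X_t*| + 1 (one map per stacked vector, plus Y_t)
        return self.net(v_t).squeeze(1)     # (batch, N): enhanced LPS x̂_t
```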
Two-Step Network
[Diagram repeated from the previous slide.]
Convolution-based boosting: each post-NN layer computes H_t^l = f(K^l ∗ H_t^{l−1}), a 1-D convolution along the frequency axis followed by a nonlinearity f, where H_t^l ∈ ℝ^{N×O^l} and K^l ∈ ℝ^{S^l×I^l×O^l} are the output feature maps and the convolutional kernel, respectively; S^l, I^l, and O^l denote the size of the convolutional kernel and the numbers of input and output feature maps of the l-th convolutional layer.
Two-Step Network
[Diagram repeated from the previous slide.]
Loss function:
J_loss = J_post + λ·J_pri = Σ_t ( ‖x̂_t − x_t‖₂² + λ ‖X_t^pred − X_t^prior‖_F² ),
where X_t^prior = [x_t, x_t, …, x_t] (the clean frame x_t repeated).
If λ is set to 0, we cannot assume that X_t* contains X_t^pred, because there is no evidence that X_t^pred includes predictions of x_t; the pri-NN would then lose its character as a set of weak predictors, and the boosting effect of the TSN from multiple predictions would be negated.
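A minimal sketch of this loss, assuming batched tensors; the λ value and all names are hypothetical:

```python
import torch

def tsn_loss(x_hat, x_clean, x_pred, lam=0.1):
    """J = J_post + lambda * J_pri: J_pri ties every pri-NN prediction for
    frame t to the clean frame x_t, keeping them weak predictors of x_t."""
    # x_hat: (batch, N) post-NN output; x_pred: (batch, M, N) pri-NN outputs.
    j_post = ((x_hat - x_clean) ** 2).sum(dim=1).mean()
    x_prior = x_clean.unsqueeze(1).expand_as(x_pred)        # x_t repeated M times
    j_pri = ((x_pred - x_prior) ** 2).sum(dim=(1, 2)).mean()  # squared Frobenius norm
    return j_post + lam * j_pri
```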
Experimental Setup
Training Phase
• Speech dataset: 4,620 utterances from the training set in the TIMIT corpus.
• Noise dataset: 100 types of noise from HU dataset, 50 types from Sound effect library.
• Noise addition was conducted with a randomly selected SNR between −5 and 20 dB.
• We repeat this procedure until the length of the entire training dataset is approximately 50 h.
• 90% of the training dataset is used in training the model and the remaining 10% in validation.
Test phase
D1 dataset
• Speech dataset: 192 utterances from the test set in the TIMIT corpus.
• Noise dataset: 15 types of noise from NOISEX-92 corpus. (jet cockpit1, jet cockpit2, destroyer engine,
destroyer operations, F-16 cockpit, factory1, factory2, HF channel, military vehicle, M109 tank,
machine gun, pink, Volvo, speech babble, and white).
• Noise addition was conducted with SNRs −5, 0, 5, 10 dB.
D2 dataset
• Real-world dataset recorded with a Galaxy S8.
Experimental Setup
Additional Information
• Sampling rate: 8 kHz.
• Window shift and length: 10 and 25 ms.
• Log-power spectra (LPS) were used as acoustic features.
• Z-score normalization was conducted across the frequency dimension for the LPS.
• When reconstructing the waveform, the noisy phase information was used.
Baseline Methods
• Deep neural network (DNN)
• DNN with skip connection (sDNN)
• Ideal ratio mask with DNN (IRM)
• Fully convolutional neural network (FCN)
• Long short-term memory recurrent neural network (LSTM)
Experimental Setup
TSN:
• Pri-NN: τ = 4; 3 hidden layers = {1024, 1024, 1024}.
• Post-NN: 8 convolutional layers = {(256, 5, 1), (128, 5, 1), (64, 5, 1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (32, 5, 1), (1, 5, 1)}.
Compact TSN (cTSN):
• Pri-NN: τ = 4; 3 hidden layers = {512, 512, 512}.
• Post-NN: 4 convolutional layers = {(256, 5, 1), (128, 5, 1), (64, 5, 1), (1, 5, 1)}.
Model size comparison (in millions of parameters):
DNN: 11.03 | sDNN: 11.03 | IRM: 11.03 | FCN: 0.28 | LSTM: 13.24 | TSN: 4.13 | cTSN: 1.87
Evaluation metrics
• The perceptual evaluation of speech quality (PESQ).
• The short-time objective intelligibility (STOI, in %).
• The segmental signal-to-noise ratio (SSNR).
• Log-spectral distortion (LSD).
Experimental Results – Investigation of the TSN framework
• TSN-1 was trained with V_t in which X_t* is replaced by X_t^pred, to observe the influence of X_t^CI.
• TSN-2 was trained with V_t from which Y_t is removed, to investigate the effect of implicitly modeling the masking method.
• TSN-3, our proposed method, was trained with the full V_t.
Using V_t is effective!
Training the pri-NN well is more important than training the post-NN (the λ term in J_loss = J_post + λ·J_pri).
Boosting is important for the performance improvement.
Experimental Results – Performance Evaluation – D1
PESQ babble buccaneer1 buccaneer2 destroyerengine destroyerops f16 factory1 factory2 hfchannel leopard m109 machinegun pink volvo white Average
Noisy 1.898 2.038 1.847 2.064 2.036 2.094 1.980 2.402 2.174 2.752 2.511 2.913 1.997 3.510 1.997 2.281
FNN 2.223 2.272 2.099 2.266 2.371 2.345 2.319 2.645 2.318 2.842 2.699 3.000 2.314 3.571 2.270 2.504
SDNN1 2.205 2.236 2.117 2.253 2.363 2.332 2.306 2.615 2.267 2.785 2.657 2.945 2.297 3.739 2.260 2.492
LSTM 2.233 2.145 1.985 2.152 2.384 2.241 2.293 2.604 1.997 2.666 2.695 2.944 2.220 3.647 2.301 2.434
FCN 2.075 1.979 1.986 2.248 2.068 2.118 2.019 2.159 2.316 2.377 2.201 2.349 1.997 2.603 2.146 2.176
IRM 2.198 2.220 2.110 2.248 2.358 2.322 2.309 2.610 2.271 2.781 2.648 2.932 2.300 3.731 2.266 2.487
cTSN 2.273 2.335 2.242 2.351 2.423 2.462 2.378 2.721 2.506 3.016 2.783 3.066 2.357 3.566 2.316 2.586
TSN 2.264 2.407 2.245 2.379 2.439 2.442 2.400 2.765 2.485 3.016 2.826 3.117 2.431 3.653 2.378 2.616
PESQ STOI SSNR LSD
SNR −5 dB 0 dB 5 dB 10 dB Avr. −5 dB 0 dB 5 dB 10 dB Avr −5 dB 0 dB 5 dB 10 dB Avr −5 dB 0 dB 5 dB 10 dB Avr
Noisy 1.627 1.926 2.247 2.572 2.093 60.62 70.5 79.86 87.77 74.69 −7.374 −5.632 −3.241 −0.259 −4.127 2.239 2.067 1.828 1.545 1.919
DNN 2.187 2.525 2.827 3.104 2.661 67.57 77.57 84.95 90.01 80.03 −1.463 0.167 1.903 3.54 1.037 1.539 1.369 1.21 1.077 1.299
sDNN1 2.158 2.504 2.813 3.096 2.643 66.89 77.03 84.66 90.12 79.67 −1.641 0.123 2.017 3.865 1.091 1.567 1.397 1.242 1.11 1.329
IRM 2.155 2.499 2.807 3.089 2.638 66.92 77.02 84.67 90.14 79.69 −1.641 0.118 2.035 3.873 1.096 1.566 1.393 1.235 1.102 1.324
FCN 1.9 2.178 2.451 2.703 2.308 60.51 69.32 76.56 82.1 72.12 −0.412 0.719 1.615 2.321 1.061 1.903 2.02 2.048 2.02 1.998
LSTM 2.073 2.445 2.783 3.085 2.597 66.15 77.04 85.15 90.62 79.74 −2.420 −0.267 1.757 3.545 0.654 1.693 1.449 1.252 1.099 1.373
cTSN 2.255 2.597 2.907 3.191 2.738 69.51 79.41 86.63 91.68 81.81 −0.619 1.145 3.024 4.825 2.094 1.479 1.319 1.173 1.048 1.255
TSN 2.281 2.629 2.939 3.225 2.769 70.47 80.38 87.45 92.31 82.66 −0.820 1.037 3.02 4.911 2.037 1.478 1.307 1.156 1.027 1.242
Average processing time per second of speech (ms), measured on an Intel Core i7-6700K workstation with an Nvidia GTX 1080 Ti:
DNN: 13.91 | SDNN1: 14.15 | LSTM: 63.98 | FCN: 30.09 | IRM: 13.54 | cTSN: 13.10 | TSN: 23.87
[Spectrograms: noisy (white noise, 5 dB), DNN, cTSN]
Experimental Results – Preference Test – D2
Number of participants: 20.
Test: for each set (noisy, DNN, TSN), choose the best one with respect to noise reduction and speech distortion. → 20 sets were used.
Result: TSN 85%, DNN 15% (p < 10^{−18}).
[Pie chart: preference among noisy, TSN, and DNN.]
Thank you!