Introduction
Feature Compensation based on Stereo Data
Feature Compensation based on a Masking Model
Temporal Modelling and Uncertainty Decoding
Conclusions
Noise Robust Speech Recognition of
Missing or Uncertain Data
José Andrés González López
Advisors: Dr. Antonio M. Peinado Herreros
Dr. Ángel M. Gómez García
Dpt. Signal Theory, Telecommunications and Networking
University of Granada
Ph.D. Defence
February 25th, 2013
1 / 49 José A. González Noise Robust Speech Recognition of Missing or Uncertain Data
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
Robust ASR
The performance of Automatic Speech Recognition (ASR)
systems degrades when training and testing conditions differ.
This mismatch can be due to different factors:
Language complexity: grammar, vocabulary, spontaneous
speech, ...
Speaker variability: accent, age, gender, ...
Environmental factors: background noise, channel distortion,
room acoustics, ...
In this work, we focus on the environmental factors,
especially on background noise and channel distortion.
Effect of noise on speech: noise modifies the speech
distributions and causes loss of information.
Approaches for Noise Robustness
Different approaches to achieve noise robustness: robust
feature extraction, model adaptation and feature modification.
Feature compensation enhances the noisy features used for
speech recognition.
$y_t$ and $\hat{x}_t$ are, respectively, the feature vectors for noisy
speech and estimated clean speech at time $t$.
Uncertainty: information about the reliability of $\hat{x}_t$.
Objectives
Development of a set of compensation techniques for speech
feature enhancement.
To do this, a Bayesian estimation framework is adopted here.
Two different approaches for estimating clean speech will be
explored:
Feature compensation based on stereo-data: clean and
noisy recordings are used to derive a set of transformations
applied to noisy speech.
Feature compensation based on a masking model:
parametric models of speech degradation are used to estimate
clean speech.
Finally, an uncertainty decoding approach and temporal
modelling of speech will also be investigated.
2 Feature Compensation based on Stereo Data
Introduction
Stereo data: simultaneous recordings of clean and noisy
speech signals,
\[ (X, Y) = \left( \langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_T, y_T \rangle \right) \]
The stereo data is used to learn the statistical relationship
between the clean and noisy feature spaces.
As a result, a set of transformations is derived to enhance
speech in a certain acoustic environment.
Acoustic environment: combination of additive and
convolutional noises at a given SNR.
MMSE Estimation (I)
MMSE estimation is chosen to obtain suitable estimates for
the clean feature vectors,
\[ \hat{x} = E[x \mid y] = \int x \, p(x \mid y) \, dx \]
Problem: $p(x \mid y)$ must be expressed in a convenient form.
Solution: clean and noisy feature spaces are represented by
VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.
MMSE Estimation (II)
Using these codebooks, the MMSE estimation can be expressed as,
\[ \hat{x} = \sum_{k_x=1}^{M_x} P(k_x \mid k_y^*) \, \hat{x}^{(k_x)} \]
$P(k_x \mid k_y^*)$: mapping between the clean and noisy cells for a
certain environment. Estimated using stereo data.
$\hat{x}^{(k_x)} = E[x \mid y, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE,
S-VQMMSE and W-VQMMSE).
Computation of $\hat{x}^{(k_x)}$
Q-VQMMSE
Assumes that both spaces are quantized.
Also, this approach assumes that the spaces are independent.
Then,
\[ \hat{x}^{(k_x)} = \mu_x^{(k_x)} \]
S-VQMMSE
A correction is applied to $y$,
\[ \hat{x}^{(k_x)} = y - \left( \mu_y^{(k_y^*)} - \mu_x^{(k_x)} \right) = \mu_x^{(k_x)} + \underbrace{y - \mu_y^{(k_y^*)}}_{\Delta} \]
$\Delta$: quantization error
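The Q-VQMMSE and S-VQMMSE estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $k_y^*$ is taken to be the index of the noisy centroid nearest to $y$ (the slides do not spell this out), and the names nearest_cell, vq_mmse and p_x_given_y are hypothetical.

```python
def nearest_cell(y, codebook):
    """Index of the centroid closest to y (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(y, codebook[k])))


def vq_mmse(y, mu_x, mu_y, p_x_given_y, variant="S"):
    """VQ-based MMSE estimate of the clean vector for a noisy vector y.

    mu_x, mu_y  : clean and noisy centroids (lists of vectors)
    p_x_given_y : p_x_given_y[ky][kx] = P(kx | ky), learned from stereo data
    variant     : "Q" uses the clean centroid only,
                  "S" adds the quantization error of y to the clean centroid
    """
    ky = nearest_cell(y, mu_y)                    # assumed definition of k_y^*
    delta = [a - b for a, b in zip(y, mu_y[ky])]  # quantization error
    x_hat = [0.0] * len(y)
    for kx, prob in enumerate(p_x_given_y[ky]):
        if variant == "Q":
            partial = mu_x[kx]
        else:
            partial = [m + d for m, d in zip(mu_x[kx], delta)]
        x_hat = [acc + prob * p for acc, p in zip(x_hat, partial)]
    return x_hat
```

Because the quantization error does not depend on $k_x$, the S-VQMMSE estimate in this sketch equals the Q-VQMMSE estimate plus $\Delta$.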
Improving the Mapping Accuracy
Subregion modelling
$C_y^{(k_x,k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding
clean vectors belong to $k_x$.
Similarly, $C_x^{(k_x,k_y)}$ is the subset of $k_x$ whose corresponding
noisy vectors are $C_y^{(k_x,k_y)}$.
Whitening-transformation based VQMMSE
W-VQMMSE assumes that the subregions of both feature
spaces are Gaussian distributed, e.g.
\[ C_x^{(k_x,k_y)} \sim \mathcal{N}\left( \mu_x^{(k_x,k_y)}, \Sigma_x^{(k_x,k_y)} \right) \]
Computation of $E[x \mid y, k_x, k_y]$: the following whitening
transformation is applied
\[ E[x \mid y, k_x, k_y] = \mu_x^{(k_x,k_y)} + \left( \Sigma_x^{(k_x,k_y)} \right)^{1/2} \left( \Sigma_y^{(k_x,k_y)} \right)^{-1/2} \left( y - \mu_y^{(k_x,k_y)} \right) \]
After some manipulations the MMSE estimation becomes,
\[ \hat{x} = A^{(k_y^*)} y + b^{(k_y^*)} \]
where the parameters of the affine transformation can be
precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.
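One way to arrive at the affine form is to substitute the whitening expression into the weighted sum over clean cells, which gives $A^{(k_y)} = \sum_{k_x} P(k_x \mid k_y) (\Sigma_x^{(k_x,k_y)})^{1/2} (\Sigma_y^{(k_x,k_y)})^{-1/2}$ and a matching offset $b^{(k_y)}$. The sketch below works this out for the diagonal-covariance case only, where the matrix square roots reduce to element-wise operations. The slides only say "after some manipulations", so this derivation and the function names are assumptions.

```python
import math


def affine_params(p_kx, mu_x, var_x, mu_y, var_y):
    """Precompute the diagonal of A and the offset b for one noisy cell ky.

    p_kx[kx]            : P(kx | ky)
    mu_x[kx], var_x[kx] : mean and variance vectors of the clean subregion (kx, ky)
    mu_y[kx], var_y[kx] : mean and variance vectors of the noisy subregion (kx, ky)
    """
    dim = len(mu_x[0])
    a = [0.0] * dim
    b = [0.0] * dim
    for kx, prob in enumerate(p_kx):
        for d in range(dim):
            # Diagonal case: Sigma_x^{1/2} Sigma_y^{-1/2} is a per-dimension gain.
            gain = math.sqrt(var_x[kx][d] / var_y[kx][d])
            a[d] += prob * gain
            b[d] += prob * (mu_x[kx][d] - gain * mu_y[kx][d])
    return a, b


def compensate(y, a, b):
    """Apply x_hat = A y + b with a diagonal A."""
    return [ai * yi + bi for ai, yi, bi in zip(a, y, b)]
```

At run time only the nearest noisy cell has to be found and one multiply-add per dimension applied, which is consistent with the efficiency argument made for the VQ-based approach.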
Experimental Setup
Recognition task: based on the Aurora2 noisy digits
database.
Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15,
10, 5, 0, and -5 dB).
Speech features: ETSI FE Standard (13 MFCCs + $\Delta$ + $\Delta^2$).
Front-end speech models: codebooks with 256 components.
SPLICE and MEMLIN are also evaluated (i.e. GMM-based
MMSE estimation).
A priori knowledge on the acoustic environment is assumed.
FE Results
System       Clean   20 dB   15 dB   10 dB    5 dB    0 dB   -5 dB    Avg.
Baseline     99.02   90.79   75.53   50.70   25.86   11.27    6.18   50.83
Matched      99.02   98.66   98.29   97.02   92.16   75.78   34.88   92.38
SPLICE       99.02   98.09   95.87   88.88   70.62   39.04   15.99   78.50
MEMLIN       99.02   98.36   97.01   92.43   78.26   47.03   18.76   82.62
Q-VQMMSE     96.19   93.72   90.21   81.24   61.82   31.33   14.39   71.66
S-VQMMSE     99.02   97.93   96.28   90.57   74.70   43.02   18.57   80.50
iW-VQMMSE    99.02   98.23   96.79   91.60   76.82   46.60   20.02   82.01
dW-VQMMSE    99.02   98.33   97.06   92.43   78.70   48.88   20.26   83.08
fW-VQMMSE    99.02   98.37   97.15   92.88   79.61   50.04   20.89   83.61
Avg.: mean over the 20 dB to 0 dB conditions.
Matched: HMMs trained under the same conditions as in testing.
iW-, dW-, fW-: identity, diagonal and full covariance matrices.
MEMLIN and iW-VQMMSE behave almost identically, but our proposal
is more efficient.
When the dynamic features are also processed, MEMLIN and
fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.
AFE Results
System       Clean   20 dB   15 dB   10 dB    5 dB    0 dB   -5 dB    Avg.
Baseline     99.02   90.79   75.53   50.70   25.86   11.27    6.18   50.83
AFE          99.22   98.24   96.95   93.68   84.37   62.46   29.53   87.14
Matched      99.02   98.66   98.29   97.02   92.16   75.78   34.88   92.38
Q-VQMMSE     95.60   93.56   91.28   85.25   70.23   39.20   12.84   75.90
S-VQMMSE     99.22   98.32   97.39   94.71   86.30   63.07   27.46   87.96
iW-VQMMSE    99.22   98.61   97.93   95.89   89.19   69.46   32.62   90.22
dW-VQMMSE    99.22   98.70   98.05   96.19   89.93   71.47   34.94   90.87
fW-VQMMSE    99.22   98.65   97.99   96.10   89.92   72.29   36.57   90.99
Avg.: mean over the 20 dB to 0 dB conditions.
AFE: ETSI Advanced Front-End.
The proposed techniques are applied to the features extracted
by AFE.
The combined systems AFE+VQMMSE increase the
robustness of AFE against noise.
3 Feature Compensation based on a Masking Model
Introduction
Speech degradation model: an analytical model that relates
y with x and n (the additive noise vector).
Model-based compensation: the degradation model is used
to derive the MMSE estimator.
✓ No stereo data is required.
✓ Thus, unknown distortions can be mitigated.
× MMSE estimation only tackles the distortions considered in
the degradation model, e.g. additive and convolutional noises.
× The noise needs to be estimated.
We will only consider the robustness to additive noise here.
Speech Masking Model
In the log-Mel domain, the degradation model is approximated by
\[ y = \log(e^x + e^n) \]
This model can be rewritten as,
\[ y = \max(x, n) + \varepsilon(x, n) \]
Disregarding $\varepsilon(x, n)$, the speech masking model is
\[ y \approx \max(x, n) \]
[Figure: distribution of $\varepsilon(x, n)$ in Aurora2 (horizontal axis: $\varepsilon(x, n)$; vertical axis: probability).]
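Under the approximated model $y = \log(e^x + e^n)$, the residual has the closed form $\varepsilon(x, n) = \log(1 + e^{-|x - n|})$, so it lies in $(0, \log 2]$ and vanishes quickly as speech and noise energies move apart. The following sketch checks this numerically; the function names are illustrative, and the empirical distribution measured on real noisy data (as in the Aurora2 figure) need not satisfy the same bound.

```python
import math


def log_sum(x, n):
    """Approximated degradation model in the log-Mel domain: log(e^x + e^n)."""
    # Factor out the maximum for numerical stability.
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))


def masking_error(x, n):
    """Residual eps(x, n) = log(e^x + e^n) - max(x, n)."""
    return log_sum(x, n) - max(x, n)
```

For equal speech and noise energy the residual reaches its maximum, $\log 2 \approx 0.69$; a difference of 10 in the log domain already brings it below $10^{-4}$.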
Spectral Reconstruction: Problems
According to the speech masking model, the observation can
be rearranged into $y = (y_r, y_u)$.
Reliable features ($x_r \approx y_r$): speech is dominant.
Unreliable features ($-\infty \le x_u \le y_u$): speech is masked by
noise.
Thus, feature compensation can be reformulated as two
different problems:
1 Segregation of the noisy spectra into speech and noise.
This yields a mask where the reliable and unreliable features
are identified.
2 Spectral reconstruction, i.e. estimating the speech energy in
the unreliable features.
Two alternative techniques are proposed here:
TGI only deals with problem 2.
MMSR addresses both 1 & 2.
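As an illustration of problem 1, the sketch below builds an oracle binary mask by comparing clean speech and noise energies channel by channel, then splits the observation into its reliable and unreliable parts. The 0 dB decision threshold (speech strictly larger than noise) and the function names are assumptions; the slides do not state which threshold is used.

```python
def oracle_mask(x, n):
    """1 where speech dominates (reliable), 0 where noise masks the speech.

    x, n : clean-speech and noise log-Mel vectors for one frame
           (only available together in an oracle setting).
    """
    return [1 if xi > ni else 0 for xi, ni in zip(x, n)]


def split_observation(y, mask):
    """Rearrange the observation y into (y_r, y_u) according to the mask."""
    y_r = [yi for yi, m in zip(y, mask) if m == 1]
    y_u = [yi for yi, m in zip(y, mask) if m == 0]
    return y_r, y_u
```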
Truncated-Gaussian based Imputation
TGI estimates the speech energy in the unreliable regions of
the observed spectrogram.
To do this, the correlation between features is exploited.
Prerequisites: the segregation binary mask is known in
advance.
After spectral reconstruction, MFCC features can be
computed and used for recognition.
MMSE Estimation of the Unreliable Features
MMSE estimation is used again to reconstruct the unreliable
features,
\[ \hat{x}_u = E[x_u \mid x_r = y_r, -\infty \le x_u \le y_u] \]
Speech model: $p(x)$ is modelled as a Gaussian Mixture
Model (GMM),
\[ p(x) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left( x; \mu^{(k)}, \Sigma^{(k)} \right) \]
Applying this model, the MMSE estimation is given by,
\[ \hat{x}_u = \sum_{k=1}^{M} P(k \mid y_r, y_u) \, \hat{x}_u^{(k)} \]
Problem: computation of $P(k \mid y_r, y_u)$ and $\hat{x}_u^{(k)}$.
Posterior Computation
After applying Bayes' rule, the posterior can be expressed as,
\[ P(k \mid y_r, y_u) = \frac{p(y_r, y_u \mid k) P(k)}{\sum_{k'=1}^{M} p(y_r, y_u \mid k') P(k')} \]
$p(y_r, y_u \mid k)$ is factorized as the following product,
\[ p(y_r, y_u \mid k) = p(y_r \mid k) \int_{-\infty}^{y_u} p(x_u \mid y_r, k) \, dx_u \]
$p(y_r \mid k) = \mathcal{N}(y_r; \mu_r^{(k)}, \Sigma_r^{(k)})$: marginal PDF of the reliable
features.
$p(x_u \mid y_r, k) = \mathcal{N}(x_u; \mu_{u|r}^{(k)}, \Sigma_{u|r}^{(k)})$: conditional PDF of the
unreliable features given the reliable ones.
Partial Estimates
According to the speech masking model $x_u \in (-\infty, y_u]$. Thus,
\[ \hat{x}_u^{(k)} = \int_{-\infty}^{y_u} x_u \, p(x_u \mid y_r, k) \, dx_u \]
Independence is assumed to solve the integral.
The partial estimate $\hat{x}_u^{(k)} = \tilde{\mu}^{(k)}(y)$ corresponds
to the mean of a right-truncated Gaussian PDF.
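The posterior and the partial estimates can be combined into a small TGI sketch. It assumes diagonal covariances, so that each unreliable feature is independent of the reliable ones within a component and every integral becomes a one-dimensional Gaussian CDF. The mean of a Gaussian truncated to $(-\infty, y_u]$ is computed with the standard normalised formula $\mu - \sigma \, \phi(\beta) / \Phi(\beta)$, with $\beta = (y_u - \mu) / \sigma$. Function names and the data layout are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_mean(mu, sigma, upper):
    """Mean of N(mu, sigma^2) restricted to (-inf, upper]."""
    beta = (upper - mu) / sigma
    return mu - sigma * norm_pdf(beta) / norm_cdf(beta)


def tgi(y, mask, priors, means, stds):
    """Reconstruct the unreliable features (mask == 0) of one frame.

    priors[k], means[k][d], stds[k][d] describe a diagonal-covariance GMM.
    """
    weights = []
    for pk, mu, sd in zip(priors, means, stds):
        lik = pk
        for yi, mi, m, s in zip(y, mask, mu, sd):
            z = (yi - m) / s
            # Reliable feature: Gaussian density; unreliable: mass below y_u.
            lik *= norm_pdf(z) / s if mi == 1 else norm_cdf(z)
        weights.append(lik)
    total = sum(weights)
    post = [w / total for w in weights]           # P(k | y_r, y_u)
    x_hat = list(y)                               # reliable features are kept
    for d, mi in enumerate(mask):
        if mi == 0:
            x_hat[d] = sum(p * truncated_mean(mu[d], sd[d], y[d])
                           for p, mu, sd in zip(post, means, stds))
    return x_hat
```

A practical implementation would work in the log domain, since the CDF underflows to zero when $y_u$ lies far below a component mean.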
Example
[Figure: Mel-channel spectrograms (Mel channel vs. time in seconds) of the utterance "eight six zero one one six two". Panels: Clean, Noisy (0 dB), Oracle mask, TGI reconstruction.]
Experimental Setup
Databases: Aurora2 & Aurora4.
The 3 test sets (A, B and C) of Aurora2 are considered.
Aurora4: 5000-word recognition task based on the Wall
Street Journal corpus. Two testing conditions:
Test 01-07 includes utterances with artificially added acoustic
noise (random SNR between 10 dB and 20 dB).
Test 08-14: acoustic noise + different microphones.
TGI is evaluated using either oracle (OR) or estimated (EST)
binary masks.
Noise estimation: linear interpolation of the first and last
frames of the utterance.
Front-end speech model: GMM with 256 components.
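The noise estimation used in this setup (linear interpolation of the first and last frames of the utterance) might look like the sketch below. It assumes that the utterance starts and ends with noise-only frames; the number of edge frames averaged, n_edge, is a hypothetical parameter that the slides do not specify.

```python
def interpolate_noise(frames, n_edge=1):
    """Per-frame noise estimate by linear interpolation between utterance edges.

    frames : list of T feature vectors (e.g. log-Mel frames)
    n_edge : number of frames averaged at each edge (hypothetical parameter)
    """
    num_frames = len(frames)
    dim = len(frames[0])
    head = [sum(f[d] for f in frames[:n_edge]) / n_edge for d in range(dim)]
    tail = [sum(f[d] for f in frames[-n_edge:]) / n_edge for d in range(dim)]
    estimates = []
    for t in range(num_frames):
        # alpha goes from 0 at the first frame to 1 at the last frame.
        alpha = t / (num_frames - 1) if num_frames > 1 else 0.0
        estimates.append([(1.0 - alpha) * h + alpha * l
                          for h, l in zip(head, tail)])
    return estimates
```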
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST.]
CBR: Cluster-Based Reconstruction (Raj et al., 2004).
TGI outperforms CBR when oracle masks are used.
The difference is small when the masks are estimated.
Large margin for improvement between OR and EST ⇒ a
more robust approach for speech/noise segregation is required.
Masking-Model based Spectral Reconstruction
As we have seen, TGI achieves excellent results when oracle
masks are used.
However, its performance diminishes when the masks are
estimated ⇒ the noise estimation errors can be magnified by
the hard decision implemented by the binary masks.
MMSR uses the noise estimates directly in the MMSE
estimation.
Advantages with respect to TGI
No a priori segregation mask is required now.
Therefore, the feature reliability and the speech energy in the
unreliable regions are jointly estimated.
MMSR: Diagram
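$\mathcal{M}_x$: GMM with $M_x$ Gaussians.
$\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively a noise estimate
$n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame).
MMSE estimation
\[ \hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \, \hat{x}^{(k_x,k_n)} \]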
Posterior Computation
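Applying Bayes' rule, $P(k_x, k_n \mid \mathbf{y}) \propto p(\mathbf{y} \mid k_x, k_n) P(k_x) P(k_n)$.
Independence assumption: $p(\mathbf{y} \mid k_x, k_n)$ is expressed as the
product of $p(y \mid k_x, k_n)$ over every observed feature $y$ of the vector $\mathbf{y}$.
According to the masking model, $p(y \mid k_x, k_n)$ is computed as,
\[ p(y \mid k_x, k_n) = p(x = y, n \le y \mid k_x, k_n) + p(n = y, x < y \mid k_x, k_n) \]
\[ = \underbrace{p_x(y \mid k_x) \, P_n(n \le y \mid k_n)}_{\text{probability that speech is dominant}} + \underbrace{p_n(y \mid k_n) \, P_x(x < y \mid k_x)}_{\text{probability that noise is dominant}} \]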
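For a single feature and a single pair of Gaussians, the two terms of the masking-model likelihood can be evaluated as in the following sketch. It assumes scalar Gaussian components for speech and noise (one log-Mel channel); the function names are illustrative.

```python
import math


def gauss_pdf(v, mu, sd):
    """Density of N(mu, sd^2) at v."""
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gauss_cdf(v, mu, sd):
    """Probability that a N(mu, sd^2) variable is below v."""
    return 0.5 * (1.0 + math.erf((v - mu) / (sd * math.sqrt(2.0))))


def masking_likelihood(y, mu_x, sd_x, mu_n, sd_n):
    """Return (p(y | kx, kn), speech-dominant term, noise-dominant term)."""
    speech = gauss_pdf(y, mu_x, sd_x) * gauss_cdf(y, mu_n, sd_n)  # x = y, n <= y
    noise = gauss_pdf(y, mu_n, sd_n) * gauss_cdf(y, mu_x, sd_x)   # n = y, x < y
    return speech + noise, speech, noise
```

The sum of both terms is the density of the maximum of two independent Gaussian variables, which is exactly what the masking model $y \approx \max(x, n)$ implies.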
Partial Estimates
Contrary to TGI, the reliability of the observed feature $y$ is
unknown in MMSR.
Hence, both the reliable and unreliable cases are taken into
account,
\[ \hat{x}^{(k_x,k_n)} = w^{(k_x,k_n)} \underbrace{y}_{\text{estimate for high SNRs}} + \left( 1 - w^{(k_x,k_n)} \right) \underbrace{\tilde{\mu}_x^{(k_x)}}_{\text{estimate for masked speech}} \]
The estimate for masked speech, $\tilde{\mu}_x^{(k_x)}$, is the mean of a truncated PDF.
$w^{(k_x,k_n)} = P(x = y, n < y \mid k_x, k_n)$ is the normalized speech
presence probability.
MMSR: Mask Estimation
MMSR can also be considered as a robust method for speech
segregation.
To see this, we reproduce here the final expression for the
MMSE estimator,
\[ \hat{x} = \underbrace{\left[ \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \, w^{(k_x,k_n)} \right]}_{m} y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \left( 1 - w^{(k_x,k_n)} \right) \tilde{\mu}_x^{(k_x)} \]
$m \in [0, 1]$ acts as a soft-mask: $m \approx 1$ for the reliable features
and $m \approx 0$ for the unreliable ones.
Advantages regarding other methods:
Parameter free.
Mask estimation is fully integrated within the reconstruction.
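The estimator and its soft-mask can be sketched for a single log-Mel channel as follows. Two assumptions are made here: the normalized speech presence probability $w^{(k_x,k_n)}$ is taken as the speech-dominant term of the likelihood divided by the sum of both terms, and the posterior is computed from one channel only (the full method multiplies the per-feature likelihoods over the whole vector). The function and variable names are hypothetical.

```python
import math


def gauss_pdf(v, mu, sd):
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gauss_cdf(v, mu, sd):
    return 0.5 * (1.0 + math.erf((v - mu) / (sd * math.sqrt(2.0))))


def mmsr(y, speech_gmm, noise_gmm):
    """MMSR estimate for one channel.

    speech_gmm, noise_gmm : lists of (prior, mean, std) tuples
    Returns (x_hat, m), where m is the soft-mask value in [0, 1].
    """
    total = 0.0
    mask_acc = 0.0
    masked_acc = 0.0
    for px, mx, sx in speech_gmm:
        # Mean of the speech Gaussian truncated to (-inf, y].
        mu_trunc = mx - sx * sx * gauss_pdf(y, mx, sx) / gauss_cdf(y, mx, sx)
        for pn, mn, sn in noise_gmm:
            speech = gauss_pdf(y, mx, sx) * gauss_cdf(y, mn, sn)
            noise = gauss_pdf(y, mn, sn) * gauss_cdf(y, mx, sx)
            joint = px * pn * (speech + noise)   # proportional to P(kx, kn | y)
            w = speech / (speech + noise)        # assumed normalisation of w
            total += joint
            mask_acc += joint * w
            masked_acc += joint * (1.0 - w) * mu_trunc
    m = mask_acc / total
    return m * y + masked_acc / total, m
```

When the observation sits on the speech model and far above the noise model, m approaches 1 and the observation is passed through; in the opposite case m approaches 0 and the output falls back to the truncated speech mean.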
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS.]
VTS: well-known model-based compensation technique
(Moreno, 1996).
MMSR outperforms TGI-EST and is upper-bounded by
TGI-OR.
VTS is slightly better than MMSR ⇒ more accurate noise
models can reduce the gap.
EM-based Noise Model Estimation
Objective: estimate the noise model used in MMSR.
Noise model: GMM with $M_n$ Gaussians,
\[ \mathcal{M}_n = \left\{ \pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)}, \ldots, \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)} \right\} \]
where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors.
Maximum Likelihood estimation
\[ \hat{\mathcal{M}}_n = \operatorname*{argmax}_{\mathcal{M}_n} \; p(y_1, \ldots, y_T \mid \mathcal{M}_n, \mathcal{M}_x) \]
Direct optimization of this expression is infeasible ⇒ an
iterative EM approach is used.
Overview
Problems
The oracle mask is unknown ⇒ the soft-mask estimated by
MMSR is used.
Treatment of the speech-dominated regions: the noise in
these regions can be estimated using the model obtained in
the previous iteration.
Experimental Results
[Figure: WAcc (%) vs. number of components of the noise GMM on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model.]
A small but consistent performance improvement is achieved
when using GMM noise models in MMSR.
GMMs perform worse than the estimated noise in 2 cases:
1-Gaussian GMMs: unable to properly model non-stationary noises.
Complex GMMs: not enough data to robustly estimate the GMM
parameters.
4 Temporal Modelling and Uncertainty Decoding
Temporal Modelling
More accurate MMSE estimates are obtained with better
speech models.
Here, the temporal correlation of speech is considered.
Two alternative approaches:
Patch-based modelling: short segments of speech are
modelled instead of single frames.
HMM modelling: the previous speech models (GMMs or VQ
codebooks) are augmented with transition probabilities. Then,
\[ \hat{x}_t = \sum_{k=1}^{M} P(k \mid y_1, \ldots, y_t, \ldots, y_T) \, E[x \mid y_t, k] \]
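The state posteriors $P(k \mid y_1, \ldots, y_T)$ in the HMM-based estimator are typically obtained with the forward-backward algorithm. The sketch below is a generic scaled implementation, not necessarily the one used in the thesis; lik, trans and init are hypothetical inputs, and the partial estimates are scalars to keep the example short.

```python
def state_posteriors(lik, trans, init):
    """Forward-backward: gamma[t][k] = P(k | y_1, ..., y_T).

    lik[t][k]   : p(y_t | k)
    trans[j][k] : P(state k at t+1 | state j at t)
    init[k]     : P(state k at the first frame)
    """
    T, K = len(lik), len(init)
    # Forward pass, normalised at every frame to avoid underflow.
    alpha = []
    cur = [init[k] * lik[0][k] for k in range(K)]
    s = sum(cur)
    alpha.append([c / s for c in cur])
    for t in range(1, T):
        cur = [lik[t][k] * sum(alpha[-1][j] * trans[j][k] for j in range(K))
               for k in range(K)]
        s = sum(cur)
        alpha.append([c / s for c in cur])
    # Backward pass, also normalised.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        cur = [sum(trans[j][k] * lik[t + 1][k] * beta[t + 1][k] for k in range(K))
               for j in range(K)]
        s = sum(cur)
        beta[t] = [c / s for c in cur]
    # Combine and renormalise.
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(K)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma


def hmm_mmse(partials, gamma):
    """x_hat_t = sum_k gamma[t][k] * E[x | y_t, k] (scalar partial estimates)."""
    return [sum(g * p for g, p in zip(gt, pt)) for gt, pt in zip(gamma, partials)]
```

With a frame-independent model, an ambiguous frame keeps a flat posterior; with transition probabilities, evidence from neighbouring frames sharpens it, which is how temporal redundancy helps the reconstruction.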
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST.]
The PATCH and HMM approaches are applied in combination
with TGI.
Spectral reconstruction benefits from temporal redundancy,
especially at low SNRs.
The HMM-based modelling achieves the best results.
Uncertainty Decoding (I)
The accuracy of MMSE estimation depends on many factors,
such as the SNR of the signal, stationarity of the noise, etc.
Inaccurate $\hat{x}_t$ can degrade the performance of ASR.
Two objectives:
1 Estimate the uncertainty/reliability of $\hat{x}_t$.
2 Account for this information in the recognizer.
Uncertainty Decoding (II)
Uncertainty of $\hat{x}$
Depends on the $p(x \mid y)$ that appears in the MMSE estimator.
If $p(x \mid y) = \delta_y(x)$, then we will consider that $\hat{x}$ is fully reliable.
If $p(x \mid y)$ is uniformly distributed, then $\hat{x}$ is badly estimated.
How to measure the uncertainty of $\hat{x}$?
Entropy of $p(x \mid y)$.
Variance of the MMSE estimate: $\Sigma_{\hat{x}}$.
Exploitation in the recognizer
Soft-data decoding: $\Sigma_{\hat{x}}$ increases the variance of the
Gaussians in the acoustic model.
Weighted Viterbi Algorithm: the exponential factor
$\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is
obtained after applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\Sigma_{\hat{x}})$.
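Both ways of exploiting the uncertainty can be sketched for a diagonal-covariance Gaussian of the acoustic model. The slides do not give the sigmoid's parameters or its direction, so the decreasing sigmoid (larger MSE gives a smaller weight) and the slope and midpoint arguments below are assumptions made for illustration.

```python
import math


def soft_data_loglik(x_hat, var_hat, mu, var):
    """Gaussian log-likelihood with the model variance inflated by var_hat.

    var_hat is the diagonal of Sigma_x_hat; passing zeros gives the usual
    log-likelihood of the acoustic-model Gaussian.
    """
    ll = 0.0
    for x, vh, m, v in zip(x_hat, var_hat, mu, var):
        tot = v + vh
        ll += -0.5 * (math.log(2.0 * math.pi * tot) + (x - m) ** 2 / tot)
    return ll


def viterbi_weight(var_hat, slope=1.0, midpoint=1.0):
    """Map MSE = tr(Sigma_x_hat) to rho in [0, 1] with a decreasing sigmoid.

    slope and midpoint are hypothetical tuning parameters.
    """
    mse = sum(var_hat)
    return 1.0 / (1.0 + math.exp(slope * (mse - midpoint)))


def weighted_loglik(x_hat, var_hat, mu, var):
    """rho * log b(x_hat): the observation probability raised to the power rho."""
    zeros = [0.0] * len(x_hat)
    return viterbi_weight(var_hat) * soft_data_loglik(x_hat, zeros, mu, var)
```

In soft-data decoding an unreliable estimate flattens the Gaussian it is scored against, whereas in the weighted Viterbi algorithm it shrinks the contribution of that frame to the path score.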
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST.]
UD: TGI + Weighted Viterbi Algorithm.
OR vs. EST: oracle masks and oracle uncertainties vs.
estimated masks and uncertainties.
The recognition performance is improved after accounting for
the uncertainty, especially in Aurora4.
5 Conclusions
Conclusions (I)
The performance of ASR is severely affected by noise.
To improve the robustness of ASR to noise, a feature
compensation approach has been adopted in this thesis.
Stereo-data based compensation: stereo recordings are used
to estimate a set of transformations that are later applied to
noisy speech.
Excellent results for the environments seen during training.
Efficient implementation without a significant performance
degradation when VQ codebooks are used.
The proposed techniques can be used to reduce the residual
noise of other robust techniques.
Conclusions (II)
Model-based compensation: a model that considers the
distortion of speech as a masking problem is used to derive
two reconstruction techniques.
TGI estimates the masked regions in the noisy spectra. Good
results if the masking pattern is perfectly known, otherwise its
performance is significantly affected.
MMSR uses clean speech and noise models to enhance noisy
speech. Unlike TGI, mask estimation is an integrated part of
the reconstruction algorithm.
An EM-based iterative algorithm has been proposed to
estimate the noise models used by MMSR.
Finally, several approaches to account for temporal
correlations and to decode uncertain speech evidence were
also investigated.
Future Work
Speech masking model vs. perceptual masking.
EM algorithm: joint estimation of additive and convolutional
noises.
Using more information in MMSR. E.g. pitch, onset/offset
position, etc.
Joint speaker and noise compensation.
Thank you!
Thesis

  • 1. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Noise Robust Speech Recognition of Missing or Uncertain Data Jos´e ´Andres Gonz´alez L´opez Advisors: Dr. Antonio M. Peinado Herreros Dr. ´Angel M. G´omez Garc´ıa Dpt. Signal Theory, Telecommunications and Networking University of Granada Ph.D. Defence February 25th, 2013 1 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 2. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Outline 1 Introduction 2 Feature Compensation based on Stereo Data 3 Feature Compensation based on a Masking Model 4 Temporal Modelling and Uncertainty Decoding 5 Conclusions 2 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 3. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Outline 1 Introduction 2 Feature Compensation based on Stereo Data 3 Feature Compensation based on a Masking Model 4 Temporal Modelling and Uncertainty Decoding 5 Conclusions 3 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 4. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Robust ASR The performance of ASR (Automatic Speech recognition) systems degrades when training and testing conditions differ. This mismatch can be due to different factors Language complexity: grammar, vocabulary, spontaneous speech, ... Speaker variability: accent, age, gender, ... Environmental factors: background noise, channel distortion, room acoustics, ... In this work, we will focus on the environmental factors, especially on the background noise and the channel distortion. Effect of noise on speech: noise modifies the speech distributions and causes loss of information. 4 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 5. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Approaches for Noise Robustness Different approaches to achieve noise robustness: robust feature extraction, model adaptation and feature modification. Feature compensation enhances the noisy features used for speech recognition. yt and ˆxt are, respectively, the feature vectors for noisy speech and estimated clean speech at time t. uncertainty: information about the reliability of ˆxt. 5 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 6. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Objectives Development of a set of compensation techniques for speech feature enhancement. To do this, a Bayesian estimation framework is adopted here. Two different approaches for estimating clean speech will be explored Feature compensation based on stereo-data: clean and noisy recordings are used to derive a set of transformations applied to noisy speech. Feature compensation based on a masking model: parametric models of speech degradation are used to estimate clean speech. Finally, an uncertainty decoding approach and temporal modelling of speech will be also investigated. 6 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 7. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results Outline 1 Introduction 2 Feature Compensation based on Stereo Data 3 Feature Compensation based on a Masking Model 4 Temporal Modelling and Uncertainty Decoding 5 Conclusions 7 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 8. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results Introduction Stereo data: simultaneous recordings of clean and noisy speech signals, (X, Y) = ( x1, y1 , x2, y2 , . . . , xT , yT ) The stereo data is used to learn the statistical relationship between the clean and noisy feature spaces. As a result, a set of transformations is derived to enhance speech in a certain acoustic environment. Acoustic environment: combination of additive and convolutional noises at a given SNR. 8 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 9. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results MMSE Estimation (I) MMSE estimation is chosen to obtain suitable estimates for the clean feature vectors, ˆx = E[x|y] = xp(x|y)dx Problem: p(x|y) must be expressed in a convenient form. Solution: clean and noisy feature spaces are represented by VQ codebooks Mx and My , respectively. 9 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 10. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results MMSE Estimation (II) Using these codebooks, the MMSE estimation can be expressed as, ˆx = Mx kx =1 P(kx |k∗ y ) ˆx(kx ) P(kx |k∗ y ): mapping between the clean and noisy cells for a certain environment. Estimated using stereo data. ˆx(kx ) = E[x|y, kx , k∗ y ]: 3 alternatives (Q-VQMMSE, S-VQMMSE and W-VQMMSE). 10 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 11. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results Computation of ˆx(kx ) Q-VQMMSE Assumes that both spaces are quantized. Also, this approach assumes that the spaces are independent. Then, ˆx(kx ) = µ (kx ) x . S-VQMMSE A correction is applied to y, ˆx(kx ) = y − µ (k∗ y ) y − µ(kx ) x = µ(kx ) x + y − µ (k∗ y ) y ∆: quantization error 11 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 12. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results Improving the Mapping Accuracy Subregion modelling C (kx ,ky ) y is the subset of the noisy cell ky whose corresponding clean vectors belong to kx . Similarly, C (kx ,ky ) x is the subset of kx whose corresponding noisy vectors are C (kx ,ky ) y . 12 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 13. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results Whitening-transformation based VQMMSE W-VQMMSE assumes that the subregions of both feature spaces are Gaussian distributed, e.g. C (kx ,ky ) x ∼ N µ (kx ,ky ) x , Σ (kx ,ky ) x Computation of E[x|y, kx , ky ]: the following whitening transformation is applied E[x|y, kx , ky ] = µ (kx ,ky ) x + Σ (kx ,ky ) x 1/2 Σ (kx ,ky ) y −1/2 y − µ (kx ,ky ) y After some manipulations the MMSE estimation becomes, ˆx = A(k∗ y ) y + b(k∗ y ) where the parameters of the affine transformation can be precomputed offline for each noisy cell ky = 1, . . . , My . 13 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 14. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results Experimental Setup Recognition task: based on the Aurora2 noisy digits database. Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15, 10, 5, 0, and -5 dB). Speech features: ETSI FE Standard (13 MFCCs + ∆ + ∆2). Front-end speech models: codebooks with 256 components. SPLICE and MEMLIN are also evaluated (i.e. GMM-based MMSE estimation). A priori knowledge on the acoustic environment is assumed. 14 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 15. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results FE Results System Clean 20 dB 15 dB 10 dB 5 dB 0 dB -5 dB Avg. Baseline 99.02 90.79 75.53 50.70 25.86 11.27 6.18 50.83 Matched 99.02 98.66 98.29 97.02 92.16 75.78 34.88 92.38 SPLICE 99.02 98.09 95.87 88.88 70.62 39.04 15.99 78.50 MEMLIN 99.02 98.36 97.01 92.43 78.26 47.03 18.76 82.62 Q-VQMMSE 96.19 93.72 90.21 81.24 61.82 31.33 14.39 71.66 S-VQMMSE 99.02 97.93 96.28 90.57 74.70 43.02 18.57 80.50 iW-VQMMSE 99.02 98.23 96.79 91.60 76.82 46.60 20.02 82.01 dW-VQMMSE 99.02 98.33 97.06 92.43 78.70 48.88 20.26 83.08 fW-VQMMSE 99.02 98.37 97.15 92.88 79.61 50.04 20.89 83.61 Matched: HMMs trained under the same conditions that in testing. iW-, dW-, fW-: identity, diagonal and full covariance matrices. MEMLIN and iW-VQMMSE behave almost identically, but our proposal is more efficient. When the dynamic features are also processed, MEMLIN and fW-VQMMSE achieves similar results: 87.67 % vs. 87.31 %. 15 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 16. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Introduction MMSE Estimation Experimental Results AFE Results System Clean 20 dB 15 dB 10 dB 5 dB 0 dB -5 dB Avg. Baseline 99.02 90.79 75.53 50.70 25.86 11.27 6.18 50.83 AFE 99.22 98.24 96.95 93.68 84.37 62.46 29.53 87.14 Matched 99.02 98.66 98.29 97.02 92.16 75.78 34.88 92.38 Q-VQMMSE 95.60 93.56 91.28 85.25 70.23 39.20 12.84 75.90 S-VQMMSE 99.22 98.32 97.39 94.71 86.30 63.07 27.46 87.96 iW-VQMMSE 99.22 98.61 97.93 95.89 89.19 69.46 32.62 90.22 dW-VQMMSE 99.22 98.70 98.05 96.19 89.93 71.47 34.94 90.87 fW-VQMMSE 99.22 98.65 97.99 96.10 89.92 72.29 36.57 90.99 AFE: ETSI Advanced Front-End. The proposed techniques are applied to the features extracted by AFE. The combined systems AFE+VQMMSE increase the robustness of AFE against noise. 16 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 17. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Outline 1 Introduction 2 Feature Compensation based on Stereo Data 3 Feature Compensation based on a Masking Model 4 Temporal Modelling and Uncertainty Decoding 5 Conclusions 17 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 18. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Introduction Speech degradation model: an analytical model that relates y with x and n (the additive noise vector). Model-based compensation: the degradation model is used to derive the MMSE estimator. No stereo data is required. Thus, unknown distortions can be mitigated. × MMSE estimation only tackles the distortions considered in the degradation model. E.g. additive and convolutional noises. × Noise need to be estimated. We will only consider the robustness to additive noise here. 18 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 19. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Speech Masking Model In the log-Mel domain, the degradation model is approximated by y = log(ex + en ) This model can be rewritten as, y = max(x, n) + ε(x, n) Disregarding ε(x, n), the speech masking model is y ≈ max(x, n) 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 -0.4 -0.2 0 0.2 0.4 0.6 Probability ε(x, n) Distribution of ε(x, n) in Aurora2 19 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 20. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Spectral Reconstruction: Problems According to the speech masking model, the observation can be rearranged into y = (yr , yu). Reliable features (xr ≈ yr ), i.e. speech is dominant. Unreliable features (−∞ ≤ xu ≤ yu): speech is masked by noise. Thus, feature compensation can be reformulated as different two problems 1 Segregation of the noisy spectra into speech and noise. This yields a mask where the reliable and unreliable features are identified. 2 Spectral reconstruction, i.e. estimate the speech energy in the unreliable features. Two alternative techniques are proposed here: TGI only deals with problem 2. MMSR addresses both 1 & 2. 20 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 21. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Truncated-Gaussian based Imputation TGI estimates the speech energy in the unreliable regions of the observed spectrogram. To do this, the correlation between features is exploited. Prerequisites: the segregation binary mask is known in advance. After spectral reconstruction, MFCC features can be computed and used for recognition. 21 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 22. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation MMSE Estimation of the Unreliable Features MMSE estimation is used again to reconstruct the unreliable features, ˆxu = E[xu|xr = yr , −∞ ≤ xu ≤ yu] Speech model: p(x) is modelled as a Gaussian Mixture Model (GMM), p(x) = M k=1 P(k)N x; µ(k) , Σ(k) Applying this model, the MMSE estimation is given by, ˆxu = M k=1 P(k|yr , yu) ˆx (k) u Problem: computation of P(k|yr , yu) and ˆx (k) u . 22 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 23. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Posterior Computation After applying Bayes’ rule, the posterior can be expressed as, P(k|yr , yu) = p(yr , yu|k)P(k) M k =1 p(yr , yu|k )P(k ) p(yr , yu|k) is factorized as the following product, p(yr , yu|k) = p(yr |k) yu −∞ p(xu|yr , k)dxu p(yr |k) = N(yr ; µ (k) r , Σ (k) r ): marginal PDF of the reliable features. p(xu|yr , k) = N(xu; µ (k) u|r , Σ (k) u|r ): conditional PDF of the unreliable features given the reliable ones. 23 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 24. Introduction Feature Compensation based on Stereo Data Feature Compensation based on a Masking Model Temporal Modelling and Uncertainty Decoding Conclusions Speech Masking Model TGI MMSR Noise Model Estimation Partial Estimates According to the speech masking model xu ∈ (−∞, yu]. Thus, ˆx (k) u = yu −∞ xup(xu|yr , k)dxu Independence is assumed to solve the integral. The partial estimate ˆx (k) u = ˜µ(k)(y) corresponds to the mean of a right-truncated Gaussian PDF. 24 / 49 Jos´e A. Gonz´alez Noise Robust Speech Recognition of Missing or Uncertain Data
  • 25. Example. [Figure: spectrograms (Mel channel vs. time in s) of the utterance "eight six zero one one six two": clean, noisy (0 dB), oracle mask, and TGI reconstruction.]
  • 26. Experimental Setup. Databases: Aurora2 and Aurora4. The three test sets (A, B and C) of Aurora2 are considered. Aurora4 is a 5000-word recognition task based on the Wall Street Journal corpus, with two testing conditions: Tests 01-07 include utterances with artificially added acoustic noise (random SNR between 10 dB and 20 dB); Tests 08-14 combine acoustic noise with different microphones. TGI is evaluated using both oracle (OR) and estimated (EST) binary masks. Noise estimation: linear interpolation of the first and last frames of the utterance. Front-end speech model: GMM with 256 components.
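The noise estimation step can be pictured as follows. The slide only states that the first and last frames are linearly interpolated; averaging `n_edge` frames at each end is an assumption made here for illustration, as are the function name and the list-of-lists feature layout.

```python
def interpolate_noise(frames, n_edge=10):
    """Per-frame noise estimate by linear interpolation across the utterance.

    frames : list of T feature vectors (lists of floats), e.g. log-Mel energies
    n_edge : number of leading/trailing frames averaged at each end
             (hypothetical parameter; the slides do not specify it)
    """
    n_frames = len(frames)
    n_edge = min(n_edge, n_frames)
    dim = len(frames[0])
    # Average the leading and trailing frames, assumed to contain noise only.
    head = [sum(f[d] for f in frames[:n_edge]) / n_edge for d in range(dim)]
    tail = [sum(f[d] for f in frames[-n_edge:]) / n_edge for d in range(dim)]
    noise = []
    for t in range(n_frames):
        # alpha runs from 0 at the first frame to 1 at the last frame.
        alpha = t / (n_frames - 1) if n_frames > 1 else 0.0
        noise.append([(1.0 - alpha) * h + alpha * l for h, l in zip(head, tail)])
    return noise
```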
  • 27. Experimental Results. [Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST.] CBR: Cluster-Based Reconstruction (Raj et al., 2004). TGI outperforms CBR when oracle masks are used. The difference is small when the masks are estimated. Large margin for improvement between OR and EST ⇒ a more robust approach for speech/noise segregation is required.
  • 28. Masking-Model based Spectral Reconstruction (MMSR). As we have seen, TGI achieves excellent results when oracle masks are used. However, its performance drops when the masks are estimated, since noise estimation errors can be magnified by the hard decision implemented by the binary masks. MMSR uses the noise estimates directly in the MMSE estimation. Advantage with respect to TGI: no a priori segregation mask is required. Instead, the feature reliability and the speech energy in the unreliable regions are jointly estimated.
  • 29. MMSR: Diagram. $\mathcal{M}_x$: speech GMM with $M_x$ Gaussians. $\mathcal{M}_n$: noise GMM with $M_n$ Gaussians (alternatively, a noise estimate $n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame). MMSE estimation: $\hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y)\, \hat{x}^{(k_x,k_n)}$.
  • 30. Posterior Computation. Applying Bayes' rule, $P(k_x, k_n \mid y) \propto p(y \mid k_x, k_n)\, P(k_x)\, P(k_n)$. Independence assumption: the likelihood of the whole feature vector is expressed as the product of $p(y \mid k_x, k_n)$ over every observed feature $y$. According to the masking model, $p(y \mid k_x, k_n) = p(x = y, n \le y \mid k_x, k_n) + p(n = y, x < y \mid k_x, k_n) = p_x(y \mid k_x)\, P_n(n \le y \mid k_n) + p_n(y \mid k_n)\, P_x(x < y \mid k_x)$. The first term is the probability that speech is dominant; the second is the probability that noise is dominant.
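The two-term likelihood is the density of the maximum of two independent Gaussian variables, which makes it easy to sketch and to sanity-check numerically. The code below is an illustration under the slide's independence assumption with diagonal (per-feature) Gaussians; the function names and the `(prior, means, stds)` tuple layout are hypothetical.

```python
import math


def gauss_pdf(v, mu, sd):
    """Univariate Gaussian density N(v; mu, sd^2)."""
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gauss_cdf(v, mu, sd):
    """Univariate Gaussian CDF."""
    return 0.5 * math.erfc(-(v - mu) / (sd * math.sqrt(2.0)))


def masking_terms(y, mu_x, sd_x, mu_n, sd_n):
    """Return (speech-dominant, noise-dominant) terms of p(y | kx, kn)."""
    speech_dominant = gauss_pdf(y, mu_x, sd_x) * gauss_cdf(y, mu_n, sd_n)
    noise_dominant = gauss_pdf(y, mu_n, sd_n) * gauss_cdf(y, mu_x, sd_x)
    return speech_dominant, noise_dominant


def joint_posteriors(y, speech, noise):
    """P(kx, kn | y) for a feature vector y.

    speech, noise : lists of (prior, means, stds) tuples, one per component.
    Returns a dict mapping (kx, kn) to its posterior probability.
    """
    scores = {}
    for kx, (p_x, mu_x, sd_x) in enumerate(speech):
        for kn, (p_n, mu_n, sd_n) in enumerate(noise):
            score = math.log(p_x) + math.log(p_n)
            for d, y_d in enumerate(y):
                lik = sum(masking_terms(y_d, mu_x[d], sd_x[d], mu_n[d], sd_n[d]))
                score += math.log(max(lik, 1e-300))
            scores[(kx, kn)] = score
    # Normalise in the log domain to avoid underflow.
    peak = max(scores.values())
    unnorm = {k: math.exp(v - peak) for k, v in scores.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}
```

Because the sum of the two terms is a proper density in `y`, integrating it numerically over a wide range should give a value close to one, which is a convenient check when implementing the model.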
  • 31. Partial Estimates. Contrary to TGI, the reliability of the observed feature $y$ is unknown in MMSR. Hence, both the reliable and unreliable cases are taken into account: $\hat{x}^{(k_x,k_n)} = w^{(k_x,k_n)}\, y + (1 - w^{(k_x,k_n)})\, \tilde{\mu}_x^{(k_x)}$. The first term is the estimate for high SNRs; the second is the estimate for masked speech (i.e. the truncated PDF mean). $w^{(k_x,k_n)} = P(x = y, n \le y \mid k_x, k_n)$ is the normalized speech presence probability.
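A per-feature sketch of this partial estimate is given below. It reads "normalized" as dividing the speech-dominant term of the masking likelihood by the sum of the speech-dominant and noise-dominant terms; that reading, the function names and the scalar interface are assumptions made for illustration.

```python
import math


def gauss_pdf(v, mu, sd):
    """Univariate Gaussian density N(v; mu, sd^2)."""
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gauss_cdf(v, mu, sd):
    """Univariate Gaussian CDF."""
    return 0.5 * math.erfc(-(v - mu) / (sd * math.sqrt(2.0)))


def truncated_mean(mu, sd, upper):
    """Mean of N(mu, sd^2) restricted to (-inf, upper]."""
    cdf = gauss_cdf(upper, mu, sd)
    if cdf < 1e-300:
        return upper
    return mu - sd * sd * gauss_pdf(upper, mu, sd) / cdf


def partial_estimate(y, mu_x, sd_x, mu_n, sd_n):
    """Return (w, x_hat) for one feature and one (kx, kn) component pair."""
    speech_dominant = gauss_pdf(y, mu_x, sd_x) * gauss_cdf(y, mu_n, sd_n)
    noise_dominant = gauss_pdf(y, mu_n, sd_n) * gauss_cdf(y, mu_x, sd_x)
    total = speech_dominant + noise_dominant
    # Fall back to an uninformative weight if both terms underflow.
    w = speech_dominant / total if total > 0.0 else 0.5
    # Blend the observation (reliable case) with the truncated mean (masked case).
    x_hat = w * y + (1.0 - w) * truncated_mean(mu_x, sd_x, y)
    return w, x_hat
```

When the speech component explains the observation well, `w` approaches 1 and the observation is passed through; when the noise component does, `w` approaches 0 and the estimate falls back to the truncated mean of the speech Gaussian.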
  • 32. MMSR: Mask Estimation. MMSR can also be considered a robust method for speech segregation. To see this, we reproduce here the final expression for the MMSE estimator: $\hat{x} = \underbrace{\left( \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y)\, w^{(k_x,k_n)} \right)}_{m}\, y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y)\, (1 - w^{(k_x,k_n)})\, \tilde{\mu}_x^{(k_x)}$. $m \in [0, 1]$ acts as a soft mask: $m \approx 1$ for the reliable features and $m \approx 0$ for the unreliable ones. Advantages over other methods: it is parameter free, and mask estimation is fully integrated within the reconstruction.
  • 33. Experimental Results. [Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS.] VTS (Vector Taylor Series): well-known model-based compensation technique (Moreno, 1996). MMSR outperforms TGI-EST and is upper-bounded by TGI-OR. VTS is slightly better than MMSR ⇒ more accurate noise models can reduce the gap.
  • 34. MMSR: Diagram
  • 35. MMSR: Diagram
  • 36. EM-based Noise Model Estimation. Objective: estimate the noise model used in MMSR. Noise model: GMM with $M_n$ Gaussians, $\mathcal{M}_n = \{\pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)}, \ldots, \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)}\}$, where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors. Maximum likelihood estimation: $\hat{\mathcal{M}}_n = \arg\max_{\mathcal{M}_n} p(y_1, \ldots, y_T \mid \mathcal{M}_n, \mathcal{M}_x)$. Direct optimization of this expression is infeasible ⇒ an iterative EM approach is used.
  • 37. Overview. Problems: (1) the oracle mask is unknown ⇒ the soft mask estimated by MMSR is used; (2) treatment of the speech-dominated regions: the noise in these regions can be estimated using the model obtained in the previous iteration.
  • 38. Experimental Results. [Figure: WAcc (%) vs. number of noise GMM components on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model.] A small but consistent performance improvement is achieved when using GMM noise models in MMSR. GMMs perform worse than the estimated noise in two cases: single-Gaussian GMMs, which are unable to properly model non-stationary noises, and complex GMMs, for which there is not enough data to robustly estimate the GMM parameters.
  • 39. Outline: 1 Introduction; 2 Feature Compensation based on Stereo Data; 3 Feature Compensation based on a Masking Model; 4 Temporal Modelling and Uncertainty Decoding; 5 Conclusions.
  • 40. Temporal Modelling. More accurate MMSE estimates are obtained with better speech models. Here, the temporal correlation of speech is considered. Two alternative approaches: patch-based modelling, where short segments of speech are modelled instead of single frames; and HMM modelling, where the previous speech models (GMMs or VQ codebooks) are augmented with transition probabilities. Then, $\hat{x}_t = \sum_{k=1}^{M} P(k \mid y_1, \ldots, y_t, \ldots, y_T)\, E[x \mid y_t, k]$.
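In the HMM variant, the state posteriors $P(k \mid y_1, \ldots, y_T)$ condition on the whole utterance. A standard way to obtain such posteriors is the forward-backward algorithm; the slides do not say which procedure the thesis uses, so the sketch below is only one plausible realisation. The per-frame observation likelihoods `obs_lik[t][k]` are assumed to be computed elsewhere (for example with the masking-model likelihood), and all names are hypothetical.

```python
def state_posteriors(obs_lik, trans, init):
    """Smoothed posteriors gamma[t][k] = P(k at frame t | y_1..y_T).

    obs_lik : obs_lik[t][k] = p(y_t | k), one row per frame
    trans   : trans[i][j] = P(state j at t | state i at t-1)
    init    : init[k] = P(state k at the first frame)
    """
    n_frames, n_states = len(obs_lik), len(init)
    states = range(n_states)

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass, rescaled at every frame to avoid underflow.
    alpha = [normalise([init[k] * obs_lik[0][k] for k in states])]
    for t in range(1, n_frames):
        alpha.append(normalise([
            obs_lik[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in states)
            for j in states
        ]))

    # Backward pass; per-frame rescaling cancels in the final normalisation.
    beta = [[1.0] * n_states for _ in range(n_frames)]
    for t in range(n_frames - 2, -1, -1):
        beta[t] = normalise([
            sum(trans[i][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in states)
            for i in states
        ])

    return [normalise([alpha[t][k] * beta[t][k] for k in states])
            for t in range(n_frames)]


def smoothed_estimate(gamma_t, partial_estimates_t):
    """x_hat_t = sum_k gamma[t][k] * E[x | y_t, k] for a scalar feature."""
    return sum(g * x for g, x in zip(gamma_t, partial_estimates_t))
```

With sticky transitions, a frame whose own likelihoods are ambiguous inherits the state favoured by its neighbours, which is the kind of temporal redundancy the slide refers to.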
  • 41. Experimental Results. [Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST.] The PATCH and HMM approaches are applied in combination with TGI. Spectral reconstruction benefits from temporal redundancy, especially at low SNRs. The HMM-based modelling achieves the best results.
  • 42. Uncertainty Decoding (I). The accuracy of MMSE estimation depends on many factors, such as the SNR of the signal and the stationarity of the noise. An inaccurate $\hat{x}_t$ can degrade ASR performance. Two objectives: (1) estimate the uncertainty/reliability of $\hat{x}_t$; (2) account for this information in the recognizer.
  • 43. Uncertainty Decoding (II). Uncertainty of $\hat{x}$: it depends on the posterior $p(x \mid y)$ that appears in the MMSE estimator. If $p(x \mid y) = \delta_y(x)$, $\hat{x}$ is considered fully reliable; if $p(x \mid y)$ is uniformly distributed, $\hat{x}$ is poorly estimated. How to measure the uncertainty of $\hat{x}$? Either by the entropy of $p(x \mid y)$ or by the variance of the MMSE estimate, $\Sigma_{\hat{x}}$. Exploitation in the recognizer: in soft-data decoding, $\Sigma_{\hat{x}}$ increases the variance of the Gaussians in the acoustic model; in the Weighted Viterbi Algorithm, the exponential factor $\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is obtained by applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\Sigma_{\hat{x}})$.
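Both ways of exploiting the uncertainty can be sketched for a diagonal-covariance Gaussian. The slide does not give the exact sigmoid, so the `slope` and `midpoint` parameters below, and the choice of a decreasing sigmoid so that larger errors yield smaller weights, are assumptions; the function names are hypothetical as well.

```python
import math


def soft_data_loglik(x_hat, var_x_hat, mean, var):
    """Log-likelihood of x_hat under a diagonal Gaussian whose variances
    are inflated by the per-dimension uncertainty of the estimate."""
    total = 0.0
    for x, u, m, v in zip(x_hat, var_x_hat, mean, var):
        inflated = v + u
        total += -0.5 * (math.log(2.0 * math.pi * inflated) + (x - m) ** 2 / inflated)
    return total


def viterbi_weight(mse, slope=1.0, midpoint=1.0):
    """Map MSE = tr(Sigma) to a weight rho in [0, 1] with a decreasing sigmoid.

    slope and midpoint are hypothetical tuning parameters.
    """
    # Clamp the exponent so math.exp cannot overflow.
    z = max(-50.0, min(50.0, slope * (mse - midpoint)))
    return 1.0 / (1.0 + math.exp(z))


def weighted_loglik(x_hat, var_x_hat, mean, var, slope=1.0, midpoint=1.0):
    """Observation log-probability scaled by rho, i.e. b(x_hat) ** rho."""
    rho = viterbi_weight(sum(var_x_hat), slope, midpoint)
    plain = soft_data_loglik(x_hat, [0.0] * len(x_hat), mean, var)
    return rho * plain
```

In the weighted variant, a frame with a large estimated error contributes less to the path score, so the decoder leans more on neighbouring frames and on the transition model.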
  • 44. Experimental Results. [Figure: bar chart of word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST.] UD: TGI + Weighted Viterbi Algorithm. OR vs. EST: oracle masks and oracle uncertainties vs. estimated masks and uncertainties. The recognition performance is improved after accounting for the uncertainty, especially in Aurora4.
  • 45. Outline: 1 Introduction; 2 Feature Compensation based on Stereo Data; 3 Feature Compensation based on a Masking Model; 4 Temporal Modelling and Uncertainty Decoding; 5 Conclusions.
  • 46. Conclusions (I). The performance of ASR is severely affected by noise. To improve the robustness of ASR to noise, a feature compensation approach has been adopted in this thesis. Stereo-data based compensation: stereo recordings are used to estimate a set of transformations that are later applied to noisy speech. It gives excellent results for the environments seen during training. It admits an efficient implementation without significant performance degradation when VQ codebooks are used. The proposed techniques can be used to reduce the residual noise of other robust techniques.
  • 47. Conclusions (II). Model-based compensation: a model that treats the distortion of speech as a masking problem is used to derive two reconstruction techniques. TGI estimates the masked regions in the noisy spectra; it gives good results if the masking pattern is perfectly known, but otherwise its performance is significantly affected. MMSR uses clean speech and noise models to enhance noisy speech; unlike TGI, mask estimation is an integrated part of the reconstruction algorithm. An EM-based iterative algorithm has been proposed to estimate the noise models used by MMSR. Finally, several approaches to account for temporal correlations and to decode uncertain speech evidence were also investigated.
  • 48. Future Work. Speech masking model vs. perceptual masking. EM algorithm: joint estimation of additive and convolutional noises. Using more information in MMSR, e.g. pitch, onset/offset position, etc. Joint speaker and noise compensation.
  • 49. Thank you!