Thesis
1. Introduction
Feature Compensation based on Stereo Data
Feature Compensation based on a Masking Model
Temporal Modelling and Uncertainty Decoding
Conclusions
Noise Robust Speech Recognition of
Missing or Uncertain Data
José Andrés González López
Advisors: Dr. Antonio M. Peinado Herreros
Dr. Ángel M. Gómez García
Dpt. Signal Theory, Telecommunications and Networking
University of Granada
Ph.D. Defence
February 25th, 2013
1 / 49 José A. González Noise Robust Speech Recognition of Missing or Uncertain Data
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
1 Introduction
Robust ASR
The performance of automatic speech recognition (ASR)
systems degrades when training and testing conditions differ.
This mismatch can arise from several factors:
Language complexity: grammar, vocabulary, spontaneous
speech, ...
Speaker variability: accent, age, gender, ...
Environmental factors: background noise, channel distortion,
room acoustics, ...
In this work, we will focus on the environmental factors,
especially on the background noise and the channel distortion.
Effect of noise on speech: noise modifies the speech
distributions and causes loss of information.
Approaches for Noise Robustness
Different approaches to achieve noise robustness: robust
feature extraction, model adaptation and feature modification.
Feature compensation enhances the noisy features used for
speech recognition.
$y_t$ and $\hat{x}_t$ are, respectively, the feature vectors for noisy
speech and estimated clean speech at time $t$.
Uncertainty: information about the reliability of $\hat{x}_t$.
Objectives
Development of a set of compensation techniques for speech
feature enhancement.
To do this, a Bayesian estimation framework is adopted here.
Two different approaches for estimating clean speech will be
explored:
Feature compensation based on stereo-data: clean and
noisy recordings are used to derive a set of transformations
applied to noisy speech.
Feature compensation based on a masking model:
parametric models of speech degradation are used to estimate
clean speech.
Finally, an uncertainty decoding approach and temporal
modelling of speech will also be investigated.
2 Feature Compensation based on Stereo Data
Introduction
Stereo data: simultaneous recordings of clean and noisy
speech signals,
\[ (X, Y) = \big( (x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T) \big) \]
The stereo data is used to learn the statistical relationship
between the clean and noisy feature spaces.
As a result, a set of transformations is derived to enhance
speech in a certain acoustic environment.
Acoustic environment: combination of additive and
convolutional noises at a given SNR.
MMSE Estimation (I)
MMSE estimation is chosen to obtain suitable estimates for
the clean feature vectors,
\[ \hat{x} = E[x \mid y] = \int x \, p(x \mid y) \, dx \]
Problem: $p(x \mid y)$ must be expressed in a convenient form.
Solution: clean and noisy feature spaces are represented by
VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.
MMSE Estimation (II)
Using these codebooks, the MMSE estimation can be expressed as,
\[ \hat{x} = \sum_{k_x=1}^{M_x} P(k_x \mid k_y^*) \, \hat{x}^{(k_x)} \]
$P(k_x \mid k_y^*)$: mapping between the clean and noisy cells for a
certain environment. Estimated using stereo data.
$\hat{x}^{(k_x)} = E[x \mid y, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE,
S-VQMMSE and W-VQMMSE).
Computation of $\hat{x}^{(k_x)}$
Q-VQMMSE
Assumes that both spaces are quantized.
Also, this approach assumes that the spaces are independent.
Then, $\hat{x}^{(k_x)} = \mu_x^{(k_x)}$.
S-VQMMSE
A correction is applied to $y$,
\[ \hat{x}^{(k_x)} = y - \left( \mu_y^{(k_y^*)} - \mu_x^{(k_x)} \right) = \mu_x^{(k_x)} + \underbrace{y - \mu_y^{(k_y^*)}}_{\Delta} \]
$\Delta$: quantization error
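The two estimators above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, $k_y^*$ is taken to be the nearest noisy codeword under Euclidean distance, and $P(k_x \mid k_y)$ is estimated by simple co-occurrence counting over the stereo pairs, which the slides do not spell out.

```python
import numpy as np

def nearest_cell(v, codebook):
    """Index of the codeword closest to v (Euclidean distance, an assumption)."""
    return int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))

def train_mapping(clean, noisy, cb_x, cb_y):
    """Estimate P(kx | ky) by counting cell co-occurrences in stereo data."""
    counts = np.zeros((len(cb_y), len(cb_x)))
    for x, y in zip(clean, noisy):
        counts[nearest_cell(y, cb_y), nearest_cell(x, cb_x)] += 1
    counts += 1e-10  # keeps rows of unseen noisy cells well defined
    return counts / counts.sum(axis=1, keepdims=True)

def q_vqmmse(y, cb_x, cb_y, p_kx_given_ky):
    """Q-VQMMSE: weighted sum of clean centroids, x_hat^(kx) = mu_x^(kx)."""
    ky = nearest_cell(y, cb_y)
    return p_kx_given_ky[ky] @ cb_x

def s_vqmmse(y, cb_x, cb_y, p_kx_given_ky):
    """S-VQMMSE: same weighted sum plus the quantization error y - mu_y^(ky*)."""
    ky = nearest_cell(y, cb_y)
    delta = y - cb_y[ky]
    # The weights sum to one, so delta can be added once outside the sum.
    return p_kx_given_ky[ky] @ cb_x + delta
```

Because the mapping weights sum to one, S-VQMMSE differs from Q-VQMMSE only by the additive term $\Delta$, so both share the same table lookup at run time.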
Improving the Mapping Accuracy
Subregion modelling
$C_y^{(k_x,k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding
clean vectors belong to $k_x$.
Similarly, $C_x^{(k_x,k_y)}$ is the subset of $k_x$ whose corresponding
noisy vectors are $C_y^{(k_x,k_y)}$.
Whitening-transformation based VQMMSE
W-VQMMSE assumes that the subregions of both feature
spaces are Gaussian distributed, e.g.
\[ C_x^{(k_x,k_y)} \sim \mathcal{N}\left( \mu_x^{(k_x,k_y)}, \Sigma_x^{(k_x,k_y)} \right) \]
Computation of $E[x \mid y, k_x, k_y]$: the following whitening
transformation is applied
\[ E[x \mid y, k_x, k_y] = \mu_x^{(k_x,k_y)} + \left( \Sigma_x^{(k_x,k_y)} \right)^{1/2} \left( \Sigma_y^{(k_x,k_y)} \right)^{-1/2} \left( y - \mu_y^{(k_x,k_y)} \right) \]
After some manipulations the MMSE estimation becomes,
\[ \hat{x} = A^{(k_y^*)} y + b^{(k_y^*)} \]
where the parameters of the affine transformation can be
precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.
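As a rough illustration of the whitening transformation for a single subregion pair, the sketch below computes matrix powers through an eigendecomposition. The helper names are hypothetical, the covariances are assumed symmetric positive definite, and the step that folds all subregions into one affine pair $(A, b)$ per noisy cell is not shown.

```python
import numpy as np

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** p) @ vecs.T

def whitening_estimate(y, mu_x, cov_x, mu_y, cov_y):
    """E[x | y, kx, ky] = mu_x + cov_x^(1/2) cov_y^(-1/2) (y - mu_y)."""
    A = sym_power(cov_x, 0.5) @ sym_power(cov_y, -0.5)
    return mu_x + A @ (y - mu_y)
```

The matrix `A` depends only on trained statistics, so in practice it would be computed once offline rather than per frame.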
Experimental Setup
Recognition task: based on the Aurora2 noisy digits
database.
Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15,
10, 5, 0, and -5 dB).
Speech features: ETSI FE Standard (13 MFCCs + $\Delta$ + $\Delta^2$).
Front-end speech models: codebooks with 256 components.
SPLICE and MEMLIN are also evaluated (i.e. GMM-based
MMSE estimation).
A priori knowledge on the acoustic environment is assumed.
FE Results
System Clean 20 dB 15 dB 10 dB 5 dB 0 dB -5 dB Avg.
Baseline 99.02 90.79 75.53 50.70 25.86 11.27 6.18 50.83
Matched 99.02 98.66 98.29 97.02 92.16 75.78 34.88 92.38
SPLICE 99.02 98.09 95.87 88.88 70.62 39.04 15.99 78.50
MEMLIN 99.02 98.36 97.01 92.43 78.26 47.03 18.76 82.62
Q-VQMMSE 96.19 93.72 90.21 81.24 61.82 31.33 14.39 71.66
S-VQMMSE 99.02 97.93 96.28 90.57 74.70 43.02 18.57 80.50
iW-VQMMSE 99.02 98.23 96.79 91.60 76.82 46.60 20.02 82.01
dW-VQMMSE 99.02 98.33 97.06 92.43 78.70 48.88 20.26 83.08
fW-VQMMSE 99.02 98.37 97.15 92.88 79.61 50.04 20.89 83.61
Matched: HMMs trained under the same conditions as in testing.
iW-, dW-, fW-: identity, diagonal and full covariance matrices.
MEMLIN and iW-VQMMSE behave almost identically, but our proposal
is more efficient.
When the dynamic features are also processed, MEMLIN and
fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.
AFE Results
System Clean 20 dB 15 dB 10 dB 5 dB 0 dB -5 dB Avg.
Baseline 99.02 90.79 75.53 50.70 25.86 11.27 6.18 50.83
AFE 99.22 98.24 96.95 93.68 84.37 62.46 29.53 87.14
Matched 99.02 98.66 98.29 97.02 92.16 75.78 34.88 92.38
Q-VQMMSE 95.60 93.56 91.28 85.25 70.23 39.20 12.84 75.90
S-VQMMSE 99.22 98.32 97.39 94.71 86.30 63.07 27.46 87.96
iW-VQMMSE 99.22 98.61 97.93 95.89 89.19 69.46 32.62 90.22
dW-VQMMSE 99.22 98.70 98.05 96.19 89.93 71.47 34.94 90.87
fW-VQMMSE 99.22 98.65 97.99 96.10 89.92 72.29 36.57 90.99
AFE: ETSI Advanced Front-End.
The proposed techniques are applied to the features extracted
by AFE.
The combined systems AFE+VQMMSE increase the
robustness of AFE against noise.
3 Feature Compensation based on a Masking Model
Introduction
Speech degradation model: an analytical model that relates
y with x and n (the additive noise vector).
Model-based compensation: the degradation model is used
to derive the MMSE estimator.
Advantages: no stereo data is required, so unknown distortions can be mitigated.
Drawbacks: MMSE estimation only tackles the distortions considered in
the degradation model (e.g. additive and convolutional noises), and the
noise needs to be estimated.
We will only consider the robustness to additive noise here.
Speech Masking Model
In the log-Mel domain, the degradation model is approximated by
\[ y = \log\left( e^{x} + e^{n} \right) \]
This model can be rewritten as,
\[ y = \max(x, n) + \varepsilon(x, n) \]
Disregarding $\varepsilon(x, n)$, the speech masking model is
\[ y \approx \max(x, n) \]
[Figure: distribution of $\varepsilon(x, n)$ in Aurora2 (horizontal axis: $\varepsilon(x, n)$; vertical axis: probability)]
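The residual term can be written out explicitly. As a sketch that assumes the approximate log-sum model holds exactly (residuals measured on real data need not follow this closed form), factoring out the larger of the two exponentials gives:

```latex
\begin{aligned}
y &= \log\left(e^{x} + e^{n}\right) \\
  &= \max(x, n) + \log\left(1 + e^{-|x - n|}\right), \\
\varepsilon(x, n) &= \log\left(1 + e^{-|x - n|}\right),
  \qquad 0 < \varepsilon(x, n) \le \log 2 .
\end{aligned}
```

Under this assumption the residual is largest ($\log 2 \approx 0.69$) when $x = n$ and vanishes as $|x - n|$ grows, which is why dropping it is reasonable whenever one source clearly dominates.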
Spectral Reconstruction: Problems
According to the speech masking model, the observation can
be rearranged into $y = (y_r, y_u)$.
Reliable features ($x_r \approx y_r$): speech is dominant.
Unreliable features ($-\infty \le x_u \le y_u$): speech is masked by
noise.
Thus, feature compensation can be reformulated as two
different problems:
1 Segregation of the noisy spectra into speech and noise.
This yields a mask where the reliable and unreliable features
are identified.
2 Spectral reconstruction, i.e. estimate the speech energy in
the unreliable features.
Two alternative techniques are proposed here:
TGI only deals with problem 2.
MMSR addresses both 1 & 2.
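To make problem 1 concrete, the toy example below builds a binary mask when both clean speech and noise are available, which is one common way to define an oracle mask. The rule "reliable when x > n" and all variable names are assumptions for illustration; the slides do not state how the masks were obtained.

```python
import numpy as np

def oracle_mask(x, n):
    """True where speech dominates noise (reliable), False where it is masked."""
    return x > n

# Toy log-Mel values for three channels (made-up numbers).
x = np.array([5.0, 1.0, 3.0])   # clean speech
n = np.array([2.0, 4.0, 3.5])   # noise
y = np.logaddexp(x, n)          # noisy observation, y = log(e^x + e^n)
mask = oracle_mask(x, n)
```

In the reliable channel the observation stays close to the clean value, while in the unreliable channels it only provides an upper bound on the speech energy.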
Truncated-Gaussian based Imputation
TGI estimates the speech energy in the unreliable regions of
the observed spectrogram.
To do this, the correlation between features is exploited.
Prerequisites: the segregation binary mask is known in
advance.
After spectral reconstruction, MFCC features can be
computed and used for recognition.
MMSE Estimation of the Unreliable Features
MMSE estimation is used again to reconstruct the unreliable
features,
\[ \hat{x}_u = E[x_u \mid x_r = y_r, -\infty \le x_u \le y_u] \]
Speech model: $p(x)$ is modelled as a Gaussian Mixture
Model (GMM),
\[ p(x) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left( x; \mu^{(k)}, \Sigma^{(k)} \right) \]
Applying this model, the MMSE estimation is given by,
\[ \hat{x}_u = \sum_{k=1}^{M} P(k \mid y_r, y_u) \, \hat{x}_u^{(k)} \]
Problem: computation of $P(k \mid y_r, y_u)$ and $\hat{x}_u^{(k)}$.
Posterior Computation
After applying Bayes' rule, the posterior can be expressed as,
\[ P(k \mid y_r, y_u) = \frac{p(y_r, y_u \mid k) \, P(k)}{\sum_{k'=1}^{M} p(y_r, y_u \mid k') \, P(k')} \]
$p(y_r, y_u \mid k)$ is factorized as the following product,
\[ p(y_r, y_u \mid k) = p(y_r \mid k) \int_{-\infty}^{y_u} p(x_u \mid y_r, k) \, dx_u \]
$p(y_r \mid k) = \mathcal{N}(y_r; \mu_r^{(k)}, \Sigma_r^{(k)})$: marginal PDF of the reliable
features.
$p(x_u \mid y_r, k) = \mathcal{N}(x_u; \mu_{u|r}^{(k)}, \Sigma_{u|r}^{(k)})$: conditional PDF of the
unreliable features given the reliable ones.
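A minimal sketch of this posterior is given below for the simplified case of diagonal covariances, which is an assumption made here for brevity: with diagonal covariances the conditional PDF of the unreliable features reduces to their marginal, and the integral becomes a product of one-dimensional Gaussian CDFs. Function and argument names are hypothetical.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc, otypes=[float])

def norm_logpdf(v, mu, var):
    """Element-wise log N(v; mu, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mu) ** 2 / var)

def norm_logcdf(v, mu, var):
    """Element-wise log of the Gaussian CDF, i.e. log P(x <= v)."""
    cdf = 0.5 * _erfc(-(v - mu) / np.sqrt(2.0 * var))
    return np.log(np.maximum(cdf, 1e-300))  # floor avoids log(0)

def tgi_posteriors(y, mask, priors, means, variances):
    """P(k | y_r, y_u) for a diagonal-covariance GMM.

    y: (D,) observation; mask: (D,) bool, True = reliable;
    priors: (M,); means, variances: (M, D).
    """
    r, u = mask, ~mask
    log_post = np.log(np.asarray(priors, dtype=float))
    for k in range(len(log_post)):
        # Reliable features contribute a density, unreliable ones a CDF.
        log_post[k] += norm_logpdf(y[r], means[k, r], variances[k, r]).sum()
        log_post[k] += norm_logcdf(y[u], means[k, u], variances[k, u]).sum()
    post = np.exp(log_post - log_post.max())  # stabilised normalisation
    return post / post.sum()
```

Working in the log domain keeps the product over many features from underflowing before the final normalization.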
Partial Estimates
According to the speech masking model $x_u \in (-\infty, y_u]$. Thus,
\[ \hat{x}_u^{(k)} = \int_{-\infty}^{y_u} x_u \, p(x_u \mid y_r, k) \, dx_u \]
Independence is assumed to solve the integral.
The partial estimate $\hat{x}_u^{(k)} = \tilde{\mu}^{(k)}(y)$ corresponds
to the mean of a right-truncated Gaussian PDF.
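For a single feature, the mean of a Gaussian truncated on the right at $y$ has the standard closed form $\tilde{\mu} = \mu - \sigma \, \phi(\beta) / \Phi(\beta)$ with $\beta = (y - \mu)/\sigma$, where $\phi$ and $\Phi$ are the standard normal PDF and CDF. The sketch below implements that textbook formula; the function name and the fallback used when the CDF underflows are choices made here, not taken from the slides.

```python
import math

def truncated_mean(y, mu, sigma):
    """Mean of N(mu, sigma^2) restricted to the interval (-inf, y]."""
    beta = (y - mu) / sigma
    pdf = math.exp(-0.5 * beta * beta) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-beta / math.sqrt(2.0))
    if cdf == 0.0:
        # y lies extremely far below mu: the truncated mass hugs the bound.
        return y
    return mu - sigma * pdf / cdf
```

The estimate never exceeds the observation $y$, which matches the constraint $x_u \le y_u$ imposed by the masking model.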
Example
[Figure: log-Mel spectrograms (Mel channel vs. time in seconds) of the utterance "eight six zero one one six two": clean, noisy (0 dB), oracle mask and TGI reconstruction]
Experimental Setup
Databases: Aurora2 & Aurora4.
The 3 test sets (A, B and C) of Aurora2 are considered.
Aurora4: 5000-word recognition task based on the Wall
Street Journal corpus. Two testing conditions:
Test 01-07 includes utterances with artificially added acoustic
noise (random SNR between 10 dB and 20 dB).
Test 08-14: acoustic noise + different microphones.
TGI is evaluated using both oracle (OR) and estimated (EST)
binary masks.
Noise estimation: linear interpolation of the first and last
frames of the utterance.
Front-end speech model: GMM with 256 components.
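The noise estimation step in the setup above can be sketched as follows. Two details are assumptions added for illustration: the utterance edges are averaged over `n_edge` frames (a hypothetical parameter) to smooth the end points, and those frames are assumed to contain no speech.

```python
import numpy as np

def interpolate_noise(Y, n_edge=10):
    """Per-frame noise estimate from the edges of an utterance.

    Y: (T, D) log-Mel features. Returns a (T, D) array that moves linearly
    from the mean of the first n_edge frames to the mean of the last n_edge.
    """
    T = Y.shape[0]
    n_edge = min(n_edge, T)
    start = Y[:n_edge].mean(axis=0)
    end = Y[-n_edge:].mean(axis=0)
    w = np.linspace(0.0, 1.0, T)[:, None]  # interpolation weight per frame
    return (1.0 - w) * start + w * end
```

Such an estimate follows slow drifts in the noise level but cannot track changes that happen while speech is present.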
Experimental Results
[Figure: word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST]
CBR: Cluster-Based Reconstruction (Raj et al., 2004).
TGI outperforms CBR when oracle masks are used.
The difference is small when the masks are estimated.
Large margin for improvement between OR and EST ⇒ a
more robust approach for speech/noise segregation is required.
Masking-Model based Spectral Reconstruction
As we have seen, TGI achieves excellent results when oracle
masks are used.
However, its performance diminishes when the masks are
estimated ⇒ the noise estimation errors can be magnified by
the hard decision implemented by the binary masks.
MMSR uses the noise estimates directly in the MMSE
estimation.
Advantages with respect to TGI
No a priori segregation mask is required now.
Therefore, the feature reliability and the speech energy in the
unreliable regions are jointly estimated.
MMSR: Diagram
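$\mathcal{M}_x$: GMM with $M_x$ Gaussians.
$\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively a noise estimate
$n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame).
MMSE estimation
\[ \hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \, \hat{x}^{(k_x,k_n)} \]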
Posterior Computation
Applying Bayes' rule, $P(k_x, k_n \mid y) \propto p(y \mid k_x, k_n) P(k_x) P(k_n)$.
Independence assumption: the likelihood of the whole observation vector is
expressed as the product of $p(y \mid k_x, k_n)$ for every observed feature $y$.
According to the masking model, $p(y \mid k_x, k_n)$ is computed as,
\[ p(y \mid k_x, k_n) = p(x = y, n \le y \mid k_x, k_n) + p(n = y, x < y \mid k_x, k_n) \]
\[ p(y \mid k_x, k_n) = p_x(y \mid k_x) \, P_n(n \le y \mid k_n) + p_n(y \mid k_n) \, P_x(x < y \mid k_x) \]
First term: probability that speech is dominant.
Second term: probability that noise is dominant.
Partial Estimates
Contrary to TGI, the reliability of the observed feature $y$ is
unknown in MMSR.
Hence, both the reliable and unreliable cases are taken into
account,
\[ \hat{x}^{(k_x,k_n)} = w^{(k_x,k_n)} \, y + \left( 1 - w^{(k_x,k_n)} \right) \tilde{\mu}_x^{(k_x)} \]
First term: estimate for high SNRs.
Second term: estimate for masked speech (i.e. truncated PDF mean).
$w^{(k_x,k_n)} = P(x = y, n < y \mid k_x, k_n)$ is the normalized speech
presence probability.
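For one feature and one pair of Gaussians, the masking-model likelihood, the speech presence weight and the partial estimate can be sketched as below. This is a scalar illustration with hypothetical names; it takes the weight to be the speech-dominant term divided by the total likelihood (one reading of "normalized") and does not guard against the likelihood underflowing to zero.

```python
import math

def gauss_pdf(v, mu, sigma):
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gauss_cdf(v, mu, sigma):
    return 0.5 * math.erfc(-(v - mu) / (sigma * math.sqrt(2.0)))

def mmsr_partial(y, mu_x, sig_x, mu_n, sig_n):
    """Return (likelihood, w, x_hat) for one feature and one (kx, kn) pair."""
    speech = gauss_pdf(y, mu_x, sig_x) * gauss_cdf(y, mu_n, sig_n)  # x = y, n <= y
    noise = gauss_pdf(y, mu_n, sig_n) * gauss_cdf(y, mu_x, sig_x)   # n = y, x < y
    likelihood = speech + noise
    w = speech / likelihood
    # Mean of the speech Gaussian truncated on the right at y.
    mu_trunc = mu_x - sig_x ** 2 * gauss_pdf(y, mu_x, sig_x) / gauss_cdf(y, mu_x, sig_x)
    x_hat = w * y + (1.0 - w) * mu_trunc
    return likelihood, w, x_hat
```

When the speech Gaussian explains the observation and the noise Gaussian sits well below it, `w` approaches one and the estimate stays at `y`; in the opposite case the estimate falls back to the truncated mean.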
MMSR: Mask Estimation
MMSR can also be considered as a robust method for speech
segregation.
To see this, we reproduce here the final expression for the
MMSE estimator,
\[ \hat{x} = \underbrace{\left( \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \, w^{(k_x,k_n)} \right)}_{m} y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid y) \left( 1 - w^{(k_x,k_n)} \right) \tilde{\mu}_x^{(k_x)} \]
$m \in [0, 1]$ acts as a soft-mask: $m \approx 1$ for the reliable features
and $m \approx 0$ for the unreliable ones.
Advantages regarding other methods:
Parameter free.
Mask estimation is fully integrated within the reconstruction.
Experimental Results
[Figure: word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS]
VTS: well-known model-based compensation technique
(Moreno, 1996).
MMSR outperforms TGI-EST and is upper-bounded by
TGI-OR.
VTS is slightly better than MMSR ⇒ more accurate noise
models can reduce the gap.
EM-based Noise Model Estimation
Objective: estimate the noise model used in MMSR.
Noise model: GMM with $M_n$ Gaussians,
\[ \mathcal{M}_n = \left\{ \pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)}, \ldots, \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)} \right\} \]
where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors.
Maximum Likelihood estimation
\[ \hat{\mathcal{M}}_n = \operatorname*{argmax}_{\mathcal{M}_n} \; p(y_1, \ldots, y_T \mid \mathcal{M}_n, \mathcal{M}_x) \]
Direct optimization of this expression is unfeasible ⇒ an
iterative EM approach is used.
Overview
Problems
The oracle mask is unknown ⇒ the soft-mask estimated by
MMSR is used.
Treatment of the speech-dominated regions: the noise in
these regions can be estimated using the model obtained in
the previous iteration.
Experimental Results
[Figure: WAcc (%) versus number of noise GMM components on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model]
A small but consistent performance improvement is achieved
when using GMM noise models in MMSR.
GMMs perform worse than the estimated noise in 2 cases:
1-gauss GMMs: unable to properly model non-stationary noises.
Complex GMMs: not enough data to robustly estimate the GMM
parameters.
4 Temporal Modelling and Uncertainty Decoding
Temporal Modelling
More accurate MMSE estimates are obtained with better
speech models.
Here, the temporal correlation of speech is considered.
Two alternative approaches
Patch-based modelling: short segments of speech are
modelled instead of single frames.
HMM modelling: the previous speech models (GMMs or VQ
codebooks) are augmented with transition probabilities. Then,
\[ \hat{x}_t = \sum_{k=1}^{M} P(k \mid y_1, \ldots, y_t, \ldots, y_T) \, E[x \mid y_t, k] \]
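The state posteriors that condition on the whole utterance can be obtained with the standard scaled forward-backward recursion. The sketch below is a generic implementation of that recursion, not the thesis code; the array layouts and function names are assumptions, and the per-frame likelihoods and partial estimates are taken as given inputs.

```python
import numpy as np

def state_posteriors(lik, trans, init):
    """gamma[t, k] = P(k | y_1, ..., y_T) via scaled forward-backward.

    lik: (T, M) with p(y_t | k); trans: (M, M) with trans[i, j] = P(j | i);
    init: (M,) initial state probabilities.
    """
    T, M = lik.shape
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    a = init * lik[0]
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] = a / a.sum()          # rescale to avoid underflow
    for t in range(T - 2, -1, -1):
        b = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] = b / b.sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def hmm_mmse(lik, trans, init, partial):
    """x_hat[t] = sum_k gamma[t, k] * partial[t, k], partial: (T, M, D)."""
    gamma = state_posteriors(lik, trans, init)
    return np.einsum('tk,tkd->td', gamma, partial)
```

With a transition matrix whose rows are all identical the frames decouple, and the posteriors reduce to the frame-wise ones used by the GMM-based estimators.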
Experimental Results
[Figure: word accuracy, WAcc (%), on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST]
The PATCH and HMM approaches are applied in combination
with TGI.
Spectral reconstruction benefits from temporal redundancy,
especially at low SNRs.
The HMM-based modelling achieves the best results.
Uncertainty Decoding (I)
The accuracy of MMSE estimation depends on many factors,
such as the SNR of the signal, stationarity of the noise, etc.
Inaccurate $\hat{x}_t$ can degrade the performance of ASR.
Two objectives
1 Estimate the uncertainty/reliability of $\hat{x}_t$.
2 Account for this information in the recognizer.
Uncertainty Decoding (II)
Uncertainty of $\hat{x}$
Depends on $p(x \mid y)$ that appears in the MMSE estimator
If $p(x \mid y) = \delta_y(x)$, then we will consider that $\hat{x}$ is fully reliable.
If $p(x \mid y)$ is uniformly distributed, then $\hat{x}$ is badly estimated.
How to measure the uncertainty of $\hat{x}$?
Entropy of $p(x \mid y)$.
Variance of the MMSE estimate: $\Sigma_{\hat{x}}$.
Exploitation in the recognizer
Soft-data decoding: $\Sigma_{\hat{x}}$ increases the variance of the
Gaussians in the acoustic model.
Weighted Viterbi Algorithm: the exponential factor
$\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is
obtained after applying a sigmoid function to $\mathrm{MSE} = \operatorname{tr}(\Sigma_{\hat{x}})$.
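Both ways of exploiting the uncertainty can be illustrated for a diagonal-covariance Gaussian. The sketch below makes several assumptions that the slides do not specify: the sigmoid is decreasing in the MSE so that less reliable frames get smaller weights, its `slope` and `midpoint` are hypothetical tuning parameters, and the weight is applied as an exponent on the likelihood, that is, as a factor on the log-likelihood.

```python
import numpy as np

def soft_data_loglik(x_hat, var_hat, mu, var):
    """Log-likelihood of x_hat under a Gaussian whose (diagonal) variance
    is inflated by the variance of the estimate."""
    total = var + var_hat
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * total) + (x_hat - mu) ** 2 / total))

def wva_weight(var_hat, slope=1.0, midpoint=1.0):
    """rho in [0, 1] from a decreasing sigmoid of MSE = trace of the covariance."""
    mse = float(np.sum(var_hat))  # trace of a diagonal covariance
    return 1.0 / (1.0 + np.exp(slope * (mse - midpoint)))

def weighted_loglik(loglik, rho):
    """Raising a likelihood to the power rho scales its logarithm by rho."""
    return rho * loglik
```

In the first case an unreliable frame is penalized less for lying far from a model mean; in the second its contribution to the Viterbi path score is shrunk towards zero.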
Experimental Results
[Figure: word accuracy, WAcc (%), on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST]
UD: TGI + Weighted Viterbi Algorithm.
OR vs. EST: oracle masks and oracle uncertainties vs.
estimated masks and uncertainties.
The recognition performance is improved after accounting for
the uncertainty, especially in Aurora4.
5 Conclusions
Conclusions (I)
The performance of ASR is severely affected by noise.
To improve the robustness of ASR to noise, a feature
compensation approach has been adopted in this thesis.
Stereo-data based compensation: stereo recordings are used
to estimate a set of transformations that are later applied to
noisy speech.
Excellent results for the environments seen during training.
Efficient implementation without a significant performance
degradation when VQ codebooks are used.
The proposed techniques can be used to reduce the residual
noise of other robust techniques.
Conclusions (II)
Model-based compensation: a model that considers the
distortion of speech as a masking problem is used to derive
two reconstruction techniques.
TGI estimates the masked regions in the noisy spectra. Good
results if the masking pattern is perfectly known, otherwise its
performance is significantly affected.
MMSR uses clean speech and noise models to enhance noisy
speech. Unlike TGI, mask estimation is an integrated part of
the reconstruction algorithm.
An EM-based iterative algorithm has been proposed to
estimate the noise models used by MMSR.
Finally, several approaches to account for temporal
correlations and to decode uncertain speech evidence were
also investigated.
Future Work
Speech masking model vs. perceptual masking.
EM algorithm: joint estimation of additive and convolutional
noises.
Using more information in MMSR. E.g. pitch, onset/offset
position, etc.
Joint speaker and noise compensation.
Thank you!