Thesis

Introduction
Feature Compensation based on Stereo Data
Feature Compensation based on a Masking Model
Temporal Modelling and Uncertainty Decoding
Conclusions
Noise Robust Speech Recognition of
Missing or Uncertain Data
José Andrés González López
Advisors: Dr. Antonio M. Peinado Herreros
Dr. Ángel M. Gómez García
Dpt. Signal Theory, Telecommunications and Networking
University of Granada
Ph.D. Defence
February 25th, 2013

Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions

1 Introduction

Robust ASR
The performance of ASR (Automatic Speech Recognition) systems degrades when training and testing conditions differ.
This mismatch can be due to several factors:
Language complexity: grammar, vocabulary, spontaneous speech, ...
Speaker variability: accent, age, gender, ...
Environmental factors: background noise, channel distortion, room acoustics, ...
In this work, we focus on the environmental factors, especially background noise and channel distortion.
Effect of noise on speech: noise modifies the speech
distributions and causes loss of information.

Approaches for Noise Robustness
Different approaches to achieve noise robustness: robust feature extraction, model adaptation and feature modification.
Feature compensation enhances the noisy features used for speech recognition.
$y_t$ and $\hat{x}_t$ are, respectively, the feature vectors for noisy speech and estimated clean speech at time $t$.
Uncertainty: information about the reliability of $\hat{x}_t$.

Objectives
Development of a set of compensation techniques for speech
feature enhancement.
To do this, a Bayesian estimation framework is adopted.
Two different approaches for estimating clean speech are explored:
Feature compensation based on stereo data: clean and noisy recordings are used to derive a set of transformations that are applied to noisy speech.
Feature compensation based on a masking model: parametric models of speech degradation are used to estimate clean speech.
Finally, an uncertainty decoding approach and temporal modelling of speech are also investigated.

2 Feature Compensation based on Stereo Data

Introduction
Stereo data: simultaneous recordings of clean and noisy speech signals,
$(X, Y) = (\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_T, y_T \rangle)$
The stereo data is used to learn the statistical relationship
between the clean and noisy feature spaces.
As a result, a set of transformations is derived to enhance
speech in a certain acoustic environment.
Acoustic environment: combination of additive and
convolutional noises at a given SNR.

MMSE Estimation (I)
MMSE estimation is chosen to obtain suitable estimates for the clean feature vectors,
$\hat{x} = E[x|y] = \int x \, p(x|y) \, dx$
Problem: $p(x|y)$ must be expressed in a convenient form.
Solution: clean and noisy feature spaces are represented by VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.

MMSE Estimation (II)
Using these codebooks, the MMSE estimation can be expressed as,
$\hat{x} = \sum_{k_x=1}^{M_x} P(k_x | k_y^*) \, \hat{x}^{(k_x)}$
$P(k_x | k_y^*)$: mapping between the clean and noisy cells for a certain environment. Estimated using stereo data.
$\hat{x}^{(k_x)} = E[x | y, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE, S-VQMMSE and W-VQMMSE).

Computation of $\hat{x}^{(k_x)}$
Q-VQMMSE
Assumes that both spaces are quantized.
Also, this approach assumes that the spaces are independent.
Then, $\hat{x}^{(k_x)} = \mu_x^{(k_x)}$.
S-VQMMSE
A correction is applied to $y$,
$\hat{x}^{(k_x)} = y - \left( \mu_y^{(k_y^*)} - \mu_x^{(k_x)} \right) = \mu_x^{(k_x)} + \left( y - \mu_y^{(k_y^*)} \right)$
$\Delta = y - \mu_y^{(k_y^*)}$: quantization error
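
The Q-VQMMSE and S-VQMMSE estimators can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: $k_y^*$ is taken to be the noisy centroid nearest to $y$ (Euclidean distance), $P(k_x | k_y)$ is estimated by counting how often stereo pairs fall into each pair of cells, and noisy cells never seen in training fall back to a uniform distribution. All function names are hypothetical.

```python
def nearest(codebook, v):
    """Index of the centroid closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - a) ** 2 for c, a in zip(codebook[k], v)))


def train_mapping(clean_cb, noisy_cb, stereo_pairs):
    """Estimate P(kx | ky) by counting cell co-occurrences in stereo data."""
    counts = [[0.0] * len(clean_cb) for _ in noisy_cb]
    for x, y in stereo_pairs:
        counts[nearest(noisy_cb, y)][nearest(clean_cb, x)] += 1.0
    probs = []
    for row in counts:
        total = sum(row)
        if total > 0:
            probs.append([c / total for c in row])
        else:  # assumption: uniform fallback for noisy cells never observed
            probs.append([1.0 / len(row)] * len(row))
    return probs


def q_vqmmse(y, clean_cb, noisy_cb, probs):
    """Q-VQMMSE: weighted sum of clean centroids, x_hat^(kx) = mu_x^(kx)."""
    ky = nearest(noisy_cb, y)
    return [sum(p * mu[d] for p, mu in zip(probs[ky], clean_cb))
            for d in range(len(y))]


def s_vqmmse(y, clean_cb, noisy_cb, probs):
    """S-VQMMSE: Q-VQMMSE estimate plus the quantization error y - mu_y^(ky*)."""
    ky = nearest(noisy_cb, y)
    base = q_vqmmse(y, clean_cb, noisy_cb, probs)
    return [b + (yd - md) for b, yd, md in zip(base, y, noisy_cb[ky])]
```

Because the probabilities $P(k_x | k_y^*)$ sum to one, adding the quantization error once after the weighted sum is equivalent to adding it inside every partial estimate.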

Improving the Mapping Accuracy
Subregion modelling
$C_y^{(k_x,k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding clean vectors belong to $k_x$.
Similarly, $C_x^{(k_x,k_y)}$ is the subset of $k_x$ whose corresponding noisy vectors are $C_y^{(k_x,k_y)}$.

Whitening-transformation based VQMMSE
W-VQMMSE assumes that the subregions of both feature spaces are Gaussian distributed, e.g.
$C_x^{(k_x,k_y)} \sim \mathcal{N}\left( \mu_x^{(k_x,k_y)}, \Sigma_x^{(k_x,k_y)} \right)$
Computation of $E[x | y, k_x, k_y]$: the following whitening transformation is applied
$E[x | y, k_x, k_y] = \mu_x^{(k_x,k_y)} + \left( \Sigma_x^{(k_x,k_y)} \right)^{1/2} \left( \Sigma_y^{(k_x,k_y)} \right)^{-1/2} \left( y - \mu_y^{(k_x,k_y)} \right)$
After some manipulations the MMSE estimation becomes,
$\hat{x} = A^{(k_y^*)} y + b^{(k_y^*)}$
where the parameters of the affine transformation can be precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.
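
A sketch of how the affine parameters could be precomputed for one noisy cell is given below. It assumes diagonal covariances (so the matrix square roots reduce to per-dimension ratios of standard deviations) and obtains $A$ and $b$ by substituting the whitening transformation into the weighted sum over $k_x$; the slides only say "after some manipulations", so this expansion and the data layout are assumptions, and the function names are hypothetical.

```python
import math


def affine_for_noisy_cell(subregions):
    """Precompute per-dimension (A, b) for one noisy cell k_y.

    Each subregion is a dict with keys:
      'p'              : P(kx | ky)
      'mu_x', 'var_x'  : mean and variance of the clean subregion (lists)
      'mu_y', 'var_y'  : mean and variance of the noisy subregion (lists)
    Diagonal covariances are assumed, so A is stored as a vector.
    """
    dim = len(subregions[0]['mu_x'])
    A = [0.0] * dim
    b = [0.0] * dim
    for s in subregions:
        for d in range(dim):
            # Whitening in y followed by colouring in x: (var_x / var_y)^(1/2)
            w = math.sqrt(s['var_x'][d] / s['var_y'][d])
            A[d] += s['p'] * w
            b[d] += s['p'] * (s['mu_x'][d] - w * s['mu_y'][d])
    return A, b


def w_vqmmse(y, A, b):
    """Apply the precomputed affine transform x_hat = A y + b."""
    return [a * yd + bd for a, yd, bd in zip(A, y, b)]
```

At test time only the nearest noisy cell has to be found and one multiply-add per dimension applied, which is consistent with the efficiency argument made for the VQ-based approach.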

Experimental Setup
Recognition task: based on the Aurora2 noisy digits database.
Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15, 10, 5, 0, and -5 dB).
Speech features: ETSI FE Standard (13 MFCCs + $\Delta$ + $\Delta^2$).
Front-end speech models: codebooks with 256 components.
SPLICE and MEMLIN are also evaluated (i.e. GMM-based MMSE estimation).
A priori knowledge of the acoustic environment is assumed.

FE Results (WAcc, %)
System       Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB   Avg.
Baseline     99.02  90.79  75.53  50.70  25.86  11.27   6.18  50.83
Matched      99.02  98.66  98.29  97.02  92.16  75.78  34.88  92.38
SPLICE       99.02  98.09  95.87  88.88  70.62  39.04  15.99  78.50
MEMLIN       99.02  98.36  97.01  92.43  78.26  47.03  18.76  82.62
Q-VQMMSE     96.19  93.72  90.21  81.24  61.82  31.33  14.39  71.66
S-VQMMSE     99.02  97.93  96.28  90.57  74.70  43.02  18.57  80.50
iW-VQMMSE    99.02  98.23  96.79  91.60  76.82  46.60  20.02  82.01
dW-VQMMSE    99.02  98.33  97.06  92.43  78.70  48.88  20.26  83.08
fW-VQMMSE    99.02  98.37  97.15  92.88  79.61  50.04  20.89  83.61
Avg.: mean over the 20 dB to 0 dB conditions.
Matched: HMMs trained under the same conditions as in testing.
iW-, dW-, fW-: identity, diagonal and full covariance matrices.
MEMLIN and iW-VQMMSE behave almost identically, but our proposal is more efficient.
When the dynamic features are also processed, MEMLIN and fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.

AFE Results (WAcc, %)
System       Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB   Avg.
Baseline     99.02  90.79  75.53  50.70  25.86  11.27   6.18  50.83
AFE          99.22  98.24  96.95  93.68  84.37  62.46  29.53  87.14
Matched      99.02  98.66  98.29  97.02  92.16  75.78  34.88  92.38
Q-VQMMSE     95.60  93.56  91.28  85.25  70.23  39.20  12.84  75.90
S-VQMMSE     99.22  98.32  97.39  94.71  86.30  63.07  27.46  87.96
iW-VQMMSE    99.22  98.61  97.93  95.89  89.19  69.46  32.62  90.22
dW-VQMMSE    99.22  98.70  98.05  96.19  89.93  71.47  34.94  90.87
fW-VQMMSE    99.22  98.65  97.99  96.10  89.92  72.29  36.57  90.99
Avg.: mean over the 20 dB to 0 dB conditions.
AFE: ETSI Advanced Front-End.
The proposed techniques are applied to the features extracted by AFE.
The combined systems AFE+VQMMSE increase the robustness of AFE against noise.

3 Feature Compensation based on a Masking Model

Introduction
Speech degradation model: an analytical model that relates $y$ with $x$ and $n$ (the additive noise vector).
Model-based compensation: the degradation model is used to derive the MMSE estimator.
✓ No stereo data is required.
✓ Thus, unknown distortions can be mitigated.
× MMSE estimation only tackles the distortions considered in the degradation model, e.g. additive and convolutional noises.
× The noise needs to be estimated.
We will only consider the robustness to additive noise here.

In the log-Mel domain, the degradation model is approximated by
$y = \log(e^x + e^n)$
This model can be rewritten as,
$y = \max(x, n) + \varepsilon(x, n)$
Disregarding $\varepsilon(x, n)$, the speech masking model is
$y \approx \max(x, n)$
[Figure: Distribution of $\varepsilon(x, n)$ in Aurora2; x-axis: $\varepsilon(x, n)$, y-axis: probability]
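
For the idealised model $y = \log(e^x + e^n)$, the residual has the closed form $\varepsilon(x, n) = \log(1 + e^{-|x - n|})$, which is largest ($\log 2 \approx 0.69$) when speech and noise have equal energy and decays quickly as one of them dominates. The short sketch below evaluates it; it only illustrates the formula and does not reproduce the Aurora2 histogram, which is measured on real data. Function names are hypothetical.

```python
import math


def residual(x, n):
    """epsilon(x, n) = log(e^x + e^n) - max(x, n) = log(1 + e^{-|x - n|})."""
    return math.log1p(math.exp(-abs(x - n)))


def noisy_logmel(x, n):
    """Approximate degradation model y = log(e^x + e^n), computed stably."""
    return max(x, n) + residual(x, n)


# The residual shrinks as the gap between speech and noise energy grows.
for gap in (0.0, 1.0, 3.0, 6.0):
    print(gap, residual(gap, 0.0))
```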

Spectral Reconstruction: Problems
According to the speech masking model, the observation can be rearranged into $y = (y_r, y_u)$.
Reliable features ($x_r \approx y_r$): speech is dominant.
Unreliable features ($-\infty \le x_u \le y_u$): speech is masked by noise.
Thus, feature compensation can be reformulated as two different problems:
1 Segregation of the noisy spectra into speech and noise. This yields a mask where the reliable and unreliable features are identified.
2 Spectral reconstruction, i.e. estimating the speech energy in the unreliable features.
Two alternative techniques are proposed here:
TGI only deals with problem 2.
MMSR addresses both 1 & 2.

Truncated-Gaussian based Imputation
TGI estimates the speech energy in the unreliable regions of
the observed spectrogram.
To do this, the correlation between features is exploited.
Prerequisites: the segregation binary mask is known in
advance.
After spectral reconstruction, MFCC features can be
computed and used for recognition.

MMSE Estimation of the Unreliable Features
MMSE estimation is used again to reconstruct the unreliable features,
$\hat{x}_u = E[x_u | x_r = y_r, -\infty \le x_u \le y_u]$
Speech model: $p(x)$ is modelled as a Gaussian Mixture Model (GMM),
$p(x) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left( x; \mu^{(k)}, \Sigma^{(k)} \right)$
Applying this model, the MMSE estimation is given by,
$\hat{x}_u = \sum_{k=1}^{M} P(k | y_r, y_u) \, \hat{x}_u^{(k)}$
Problem: computation of $P(k | y_r, y_u)$ and $\hat{x}_u^{(k)}$.

Partial Estimates
According to the speech masking model $x_u \in (-\infty, y_u]$. Thus,
$\hat{x}_u^{(k)} = \int_{-\infty}^{y_u} x_u \, p(x_u | y_r, k) \, dx_u$
Independence is assumed to solve the integral.
The partial estimate $\hat{x}_u^{(k)} = \tilde{\mu}^{(k)}(y)$ corresponds to the mean of a right-truncated Gaussian PDF.
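
The mean of a Gaussian $\mathcal{N}(\mu, \sigma^2)$ truncated to $(-\infty, y_u]$ is $\mu - \sigma \, \phi(\alpha) / \Phi(\alpha)$ with $\alpha = (y_u - \mu) / \sigma$, where $\phi$ and $\Phi$ are the standard normal PDF and CDF. The sketch below uses this to reconstruct one frame given a binary mask. It is a simplified illustration, not the thesis implementation: it assumes diagonal covariances throughout, so the posterior is taken as $P(k | y_r, y_u) \propto P(k) \, p(y_r | k) \, \Phi(\cdot)$ evaluated channel by channel, and it works in the linear domain without the log-domain safeguards a real system would need. Function names are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def truncated_mean(mu, sd, upper):
    """Mean of N(mu, sd^2) restricted to (-inf, upper]."""
    alpha = (upper - mu) / sd
    cdf = norm_cdf(alpha)
    if cdf == 0.0:  # bound far below the mean: the mass piles up at the bound
        return upper
    return mu - sd * norm_pdf(alpha) / cdf


def tgi_reconstruct(y, reliable, gmm):
    """Reconstruct one frame; gmm is a list of (prior, means, stds)."""
    weights, partials = [], []
    for prior, mu, sd in gmm:
        like, est = prior, []
        for d, yd in enumerate(y):
            z = (yd - mu[d]) / sd[d]
            if reliable[d]:
                like *= norm_pdf(z) / sd[d]   # p(y_r | k)
                est.append(yd)                # x_r is taken as y_r
            else:
                like *= norm_cdf(z)           # P(x_u <= y_u | k)
                est.append(truncated_mean(mu[d], sd[d], yd))
        weights.append(like)
        partials.append(est)
    total = sum(weights)  # assumed non-zero in this sketch
    return [sum(w * p[d] for w, p in zip(weights, partials)) / total
            for d in range(len(y))]
```

By construction every reconstructed unreliable feature stays at or below the observed value $y_u$, as the masking model requires.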

Example
[Figure: four spectrogram panels (Clean, Noisy (0 dB), Oracle mask, TGI reconstruction) for the utterance "eight six zero one one six two"; x-axis: Time (s), y-axis: Mel channel]

Experimental Setup
Databases: Aurora2 & Aurora4.
The 3 test sets (A, B and C) of Aurora2 are considered.
Aurora4: 5000-word recognition task based on the Wall
Street Journal corpus. Two testing conditions:
Test 01-07 includes utterances with artificially added acoustic
noise (random SNR between 10 dB and 20 dB).
Test 08-14: acoustic noise + different microphones.
TGI is evaluated using either oracle (OR) or estimated (EST) binary masks.
Noise estimation: linear interpolation of the first and last
frames of the utterance.
Front-end speech model: GMM with 256 components.
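
The noise estimation step in the setup above can be sketched as follows. The slides only state that the first and last frames are linearly interpolated; averaging `n_edge` frames at each end, and the assumption that those frames contain no speech, are additions made here for illustration. The function name is hypothetical.

```python
def interpolate_noise(frames, n_edge=1):
    """Per-frame noise estimate for one utterance.

    frames is a list of feature vectors (e.g. log-Mel). The noise at the
    start and end is taken as the mean of the first and last n_edge frames
    (assumed to be speech-free), and is linearly interpolated in between.
    """
    T, dim = len(frames), len(frames[0])
    start = [sum(f[d] for f in frames[:n_edge]) / n_edge for d in range(dim)]
    end = [sum(f[d] for f in frames[-n_edge:]) / n_edge for d in range(dim)]
    noise = []
    for t in range(T):
        a = t / (T - 1) if T > 1 else 0.0  # 0 at the first frame, 1 at the last
        noise.append([(1.0 - a) * s + a * e for s, e in zip(start, end)])
    return noise
```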

[Figure: bar chart of WAcc (%) on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST]
CBR: Cluster-Based Reconstruction (Raj et al., 2004).
TGI outperforms CBR when oracle masks are used.
The difference is small when the masks are estimated.
Large margin for improvement between OR and EST ⇒ a more robust approach for speech/noise segregation is required.

Masking-Model based Spectral Reconstruction
As we have seen, TGI achieves excellent results when oracle
masks are used.
However, its performance diminishes when the masks are estimated ⇒ the noise estimation errors can be magnified by the hard decision implemented by the binary masks.
MMSR uses the noise estimates directly in the MMSE
estimation.
Advantages with respect to TGI
No a priori segregation mask is required now.
Therefore, the feature reliability and the speech energy in the
unreliable regions are jointly estimated.

MMSR: Diagram
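$\mathcal{M}_x$: GMM with $M_x$ Gaussians.
$\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively a noise estimate $n_t \sim \mathcal{N}_n(\hat{n}_t, \Sigma_{n,t})$ for each frame).
MMSE estimation
$\hat{x} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \, \hat{x}^{(k_x,k_n)}$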

Partial Estimates
Contrary to TGI, the reliability of the observed feature $y$ is unknown in MMSR.
Hence, both the reliable and unreliable cases are taken into account,
$\hat{x}^{(k_x,k_n)} = w^{(k_x,k_n)} y + \left( 1 - w^{(k_x,k_n)} \right) \tilde{\mu}_x^{(k_x)}$
First term ($y$): estimate for high SNRs.
Second term ($\tilde{\mu}_x^{(k_x)}$): estimate for masked speech (i.e. truncated PDF mean).
$w^{(k_x,k_n)} = P(x = y, n < y | k_x, k_n)$ is the normalized speech presence probability.
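
For a single channel and a single pair of Gaussians, the partial estimate can be sketched as below. The sketch assumes the max model $y = \max(x, n)$ with independent scalar Gaussians for speech and noise, so that the speech-dominant case has likelihood $p_x(y) \, \Phi_n(y)$, the noise-dominant case has likelihood $p_n(y) \, \Phi_x(y)$, and $w$ is the first normalised by their sum. Treating $w$ per channel is an assumption of this illustration, and the function names are hypothetical.

```python
import math


def norm_pdf(v, mu, sd):
    """Density of N(mu, sd^2) at v."""
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def norm_cdf(v, mu, sd):
    """CDF of N(mu, sd^2) at v (erfc form, accurate in the lower tail)."""
    return 0.5 * math.erfc(-(v - mu) / (sd * math.sqrt(2.0)))


def truncated_mean(mu, sd, upper):
    """Mean of N(mu, sd^2) restricted to (-inf, upper]."""
    cdf = norm_cdf(upper, mu, sd)
    if cdf == 0.0:
        return upper
    return mu - sd * sd * norm_pdf(upper, mu, sd) / cdf


def partial_estimate(y, mu_x, sd_x, mu_n, sd_n):
    """Return (x_hat, w) for one channel and one Gaussian pair (kx, kn)."""
    speech_dominant = norm_pdf(y, mu_x, sd_x) * norm_cdf(y, mu_n, sd_n)  # x = y, n < y
    noise_dominant = norm_pdf(y, mu_n, sd_n) * norm_cdf(y, mu_x, sd_x)   # n = y, x < y
    w = speech_dominant / (speech_dominant + noise_dominant)  # sum assumed non-zero
    x_hat = w * y + (1.0 - w) * truncated_mean(mu_x, sd_x, y)
    return x_hat, w
```

When the speech Gaussian explains the observation far better than the noise Gaussian, $w$ approaches 1 and the observation is passed through; in the opposite case the estimate falls back to the truncated mean.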

MMSR: Mask Estimation
MMSR can also be considered a robust method for speech segregation.
To see this, we reproduce here the final expression for the MMSE estimator,
$\hat{x} = \underbrace{\left[ \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \, w^{(k_x,k_n)} \right]}_{m} y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n | y) \left( 1 - w^{(k_x,k_n)} \right) \tilde{\mu}_x^{(k_x)}$
$m \in [0, 1]$ acts as a soft mask: $m \approx 1$ for the reliable features and $m \approx 0$ for the unreliable ones.
Advantages regarding other methods:
Parameter free.
Mask estimation is fully integrated within the reconstruction.
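
The complete estimator, including the soft mask $m$, can be sketched for one frame as follows. As before this is an illustration under assumptions rather than the thesis implementation: both GMMs have diagonal covariances, the frame likelihood under the max model factorises over channels, and everything is computed in the linear domain, so extreme outliers could underflow. Function names are hypothetical.

```python
import math


def norm_pdf(v, mu, sd):
    z = (v - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def norm_cdf(v, mu, sd):
    return 0.5 * math.erfc(-(v - mu) / (sd * math.sqrt(2.0)))


def truncated_mean(mu, sd, upper):
    cdf = norm_cdf(upper, mu, sd)
    if cdf == 0.0:
        return upper
    return mu - sd * sd * norm_pdf(upper, mu, sd) / cdf


def mmsr(y, speech_gmm, noise_gmm):
    """Return (x_hat, soft_mask) for one log-Mel frame y.

    Each GMM is a list of (prior, means, stds) with diagonal covariances.
    """
    dim = len(y)
    joint, per_pair = [], []
    for px, mu_x, sd_x in speech_gmm:
        for pn, mu_n, sd_n in noise_gmm:
            like, w, mu_t = px * pn, [], []
            for d in range(dim):
                s = norm_pdf(y[d], mu_x[d], sd_x[d]) * norm_cdf(y[d], mu_n[d], sd_n[d])
                n = norm_pdf(y[d], mu_n[d], sd_n[d]) * norm_cdf(y[d], mu_x[d], sd_x[d])
                like *= s + n              # p(y_d | kx, kn) under y = max(x, n)
                w.append(s / (s + n))      # speech presence probability
                mu_t.append(truncated_mean(mu_x[d], sd_x[d], y[d]))
            joint.append(like)
            per_pair.append((w, mu_t))
    total = sum(joint)                     # assumed non-zero in this sketch
    post = [j / total for j in joint]      # P(kx, kn | y)
    mask = [sum(p * w[d] for p, (w, _) in zip(post, per_pair))
            for d in range(dim)]
    x_hat = [mask[d] * y[d]
             + sum(p * (1.0 - w[d]) * mu_t[d] for p, (w, mu_t) in zip(post, per_pair))
             for d in range(dim)]
    return x_hat, mask
```

No threshold or tuning constant appears anywhere in the computation, which is the sense in which the mask estimation is parameter free.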

[Figure: bar chart of WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS]
VTS: well-known model-based compensation technique (Moreno, 1996).
MMSR outperforms TGI-EST and is upper-bounded by TGI-OR.
VTS is slightly better than MMSR ⇒ more accurate noise models can reduce the gap.

Introduction
Conclusions
TGI
MMSR
MMSR: Diagram

EM-based Noise Model Estimation
Objective: estimate the noise model used in MMSR.
Noise model: GMM with $M_n$ Gaussians,
$\mathcal{M}_n = \left\{ \pi_n^{(1)}, \mu_n^{(1)}, \Sigma_n^{(1)}, \ldots, \pi_n^{(M_n)}, \mu_n^{(M_n)}, \Sigma_n^{(M_n)} \right\}$
where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors.
Maximum Likelihood estimation
$\hat{\mathcal{M}}_n = \arg\max_{\mathcal{M}_n} p(y_1, \ldots, y_T | \mathcal{M}_n, \mathcal{M}_x)$
Direct optimization of this expression is infeasible ⇒ an iterative EM approach is used.

Overview
Problems
The oracle mask is unknown ⇒ the soft-mask estimated by
MMSR is used.
Treatment of the speech-dominated regions: the noise in
these regions can be estimated using the model obtained in
the previous iteration.

[Figure: WAcc (%) vs. number of GMM components on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model]
A small but consistent performance improvement is achieved when using GMM noise models in MMSR.
GMMs perform worse than the estimated noise in two cases:
Single-Gaussian GMMs: unable to properly model non-stationary noises.
Complex GMMs: not enough data to robustly estimate the GMM parameters.

4 Temporal Modelling and Uncertainty Decoding

Temporal Modelling
More accurate MMSE estimates are obtained with better
speech models.
Here, the temporal correlation of speech is considered.
Two alternative approaches:
Patch-based modelling: short segments of speech are modelled instead of single frames.
HMM modelling: the previous speech models (GMMs or VQ codebooks) are augmented with transition probabilities. Then,
$\hat{x}_t = \sum_{k=1}^{M} P(k | y_1, \ldots, y_t, \ldots, y_T) \, E[x | y_t, k]$
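
The posterior $P(k | y_1, \ldots, y_T)$ in the HMM-based estimator can be obtained with the standard forward-backward recursion. The sketch below assumes that the per-frame observation likelihoods $p(y_t | k)$ and the partial estimates $E[x | y_t, k]$ have already been computed by one of the frame-level techniques; the per-frame rescaling is a common numerical device and not something stated in the slides. Function names are hypothetical.

```python
def forward_backward(prior, trans, obs_lik):
    """State posteriors gamma[t][k] = P(k | y_1..y_T).

    prior[k]      : initial state probabilities
    trans[j][k]   : transition probability from state j to state k
    obs_lik[t][k] : p(y_t | k)
    """
    T, M = len(obs_lik), len(prior)
    alpha = []
    for t in range(T):
        if t == 0:
            a = [prior[k] * obs_lik[0][k] for k in range(M)]
        else:
            a = [obs_lik[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in range(M))
                 for k in range(M)]
        s = sum(a)
        alpha.append([v / s for v in a])      # rescale to avoid underflow
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(trans[k][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(M))
             for k in range(M)]
        s = sum(b)
        beta[t] = [v / s for v in b]
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(M)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma


def hmm_mmse(gamma, partial):
    """x_hat_t = sum_k gamma[t][k] * partial[t][k], with partial[t][k] a vector."""
    x_hat = []
    for g_t, p_t in zip(gamma, partial):
        dim = len(p_t[0])
        x_hat.append([sum(g * p[d] for g, p in zip(g_t, p_t)) for d in range(dim)])
    return x_hat
```

If every row of the transition matrix equals the prior, the posteriors reduce to the frame-by-frame GMM posteriors, so the HMM only changes the result when the transitions carry temporal information.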

[Figure: bar chart of WAcc (%) on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST]
The PATCH and HMM approaches are applied in combination with TGI.
Spectral reconstruction benefits from temporal redundancy, especially at low SNRs.
The HMM-based modelling achieves the best results.

Uncertainty Decoding (I)
The accuracy of MMSE estimation depends on many factors,
such as the SNR of the signal, stationarity of the noise, etc.
Inaccurate $\hat{x}_t$ can degrade the performance of ASR.
Two objectives
1 Estimate the uncertainty/reliability of $\hat{x}_t$.
2 Account for this information in the recognizer.

Uncertainty Decoding (II)
Uncertainty of $\hat{x}$
Depends on $p(x|y)$, which appears in the MMSE estimator.
If $p(x|y) = \delta_y(x)$, then we consider $\hat{x}$ fully reliable.
If $p(x|y)$ is uniformly distributed, then $\hat{x}$ is badly estimated.
How to measure the uncertainty of $\hat{x}$?
Entropy of $p(x|y)$.
Variance of the MMSE estimate: $\Sigma_{\hat{x}}$.
Exploitation in the recognizer
Soft-data decoding: $\Sigma_{\hat{x}}$ increases the variance of the Gaussians in the acoustic model.
Weighted Viterbi Algorithm: the exponential factor $\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{x}$ is obtained by applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\Sigma_{\hat{x}})$.
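
Both ways of exploiting the uncertainty can be sketched for a single diagonal-covariance Gaussian of the acoustic model. The sigmoid's `slope` and `midpoint` are hypothetical parameters (the slides do not give its form), and it is assumed here that $\rho$ decreases as the MSE grows, so that unreliable frames contribute less to the decoding score. Function names are hypothetical.

```python
import math


def viterbi_weight(sigma_diag, slope=1.0, midpoint=1.0):
    """rho in [0, 1] from MSE = tr(Sigma_xhat); slope and midpoint are assumptions."""
    mse = sum(sigma_diag)
    z = min(slope * (mse - midpoint), 700.0)  # clamp to keep exp() finite
    return 1.0 / (1.0 + math.exp(z))


def log_gauss(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mu, var))


def weighted_log_obs(x_hat, mu, var, rho):
    """Weighted Viterbi: log of b(x_hat)^rho."""
    return rho * log_gauss(x_hat, mu, var)


def soft_data_log_obs(x_hat, sigma_diag, mu, var):
    """Soft-data decoding: the estimate variance is added to the model variance."""
    return log_gauss(x_hat, mu, [v + s for v, s in zip(var, sigma_diag)])
```

With zero uncertainty the soft-data score equals the ordinary Gaussian log-likelihood, and with $\rho = 0$ a frame is effectively ignored by the weighted Viterbi score.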

Experimental Results
[Figure: bar chart of WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST]
UD: TGI + Weighted Viterbi Algorithm.
OR vs. EST: oracle masks and oracle uncertainties vs. estimated masks and uncertainties.
The recognition performance is improved after accounting for the uncertainty, especially in Aurora4.

5 Conclusions

Conclusions (I)
The performance of ASR is severely affected by noise.
To improve the robustness of ASR to noise, a feature
compensation approach has been adopted in this thesis.
Stereo-data based compensation: stereo recordings are used
to estimate a set of transformations that are later applied to
noisy speech.
Excellent results for the environments seen during training.
Efficient implementation without a significant performance
degradation when VQ codebooks are used.
The proposed techniques can be used to reduce the residual
noise of other robust techniques.

Conclusions (II)
Model-based compensation: a model that considers the
distortion of speech as a masking problem is used to derive
two reconstruction techniques.
TGI estimates the masked regions in the noisy spectra. Good results if the masking pattern is perfectly known, otherwise its performance is significantly affected.
MMSR uses clean speech and noise models to enhance noisy
speech. Unlike TGI, mask estimation is an integrated part of
the reconstruction algorithm.
An EM-based iterative algorithm has been proposed to
estimate the noise models used by MMSR.
Finally, several approaches to account for temporal
correlations and to decode uncertain speech evidence were
also investigated.

Future Work
Speech masking model vs. perceptual masking.
EM algorithm: joint estimation of additive and convolutional
noises.
Using more information in MMSR, e.g. pitch, onset/offset position, etc.
Joint speaker and noise compensation.

Thank you!
