4. Frontend
• MFCC-39 features: (12 Cepstra + Energy) + Delta + DeltaDelta
• Mean & variance normalization at sentence level
• Posterior probabilities from a GMM background model
• L2-normalization
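The last three stages of this frontend can be sketched as follows. This is only a minimal illustration, assuming a diagonal-covariance GMM passed as plain lists; the function names (`mvn`, `gmm_posteriors`, `l2_normalize`) and the toy parameters are hypothetical and do not come from the original system.

```python
import math

def mvn(frames):
    """Mean and variance normalization over one sentence (a list of feature vectors)."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # A zero standard deviation is replaced by 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dim)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]

def gmm_posteriors(x, weights, means, variances):
    """Posterior probability of each diagonal Gaussian given one frame x."""
    log_p = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_p.append(ll)
    top = max(log_p)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

# Toy usage: two 2-dimensional frames and a 2-component background model.
frames = mvn([[1.0, 2.0], [3.0, 6.0]])
post = gmm_posteriors(frames[0], [0.5, 0.5],
                      [[-1.0, -1.0], [1.0, 1.0]],
                      [[1.0, 1.0], [1.0, 1.0]])
feat = l2_normalize(post)
```

In the real system the input would be the 39-dimensional MFCC vectors and the background model would have 128 Gaussians; the toy sizes here only keep the example readable.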
5. Background model training
• Iterative Gaussian splitting (128 Gaussians)
• EM-ML GMM training
• K-means assignment
[1] "Speaker Independent discriminant feature extraction for acoustic pattern matching", Xavier Anguera, ICASSP 2012
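One common way to grow a GMM by splitting is sketched below. The slide does not say how the split is performed, so everything here is an assumption: the doubling schedule, the perturbation of each mean by a fraction `eps` of its standard deviation, and the hypothetical helper name `split_gaussians`. The EM re-estimation between splits is omitted.

```python
import math

def split_gaussians(weights, means, variances, eps=0.2):
    """Replace every diagonal Gaussian by two copies with shifted means.

    Assumption: means move by +/- eps standard deviations and weights are halved.
    """
    new_w, new_m, new_v = [], [], []
    for w, mu, var in zip(weights, means, variances):
        offset = [eps * math.sqrt(v) for v in var]
        for sign in (1.0, -1.0):
            new_w.append(w / 2.0)
            new_m.append([m + sign * o for m, o in zip(mu, offset)])
            new_v.append(list(var))
    return new_w, new_m, new_v

# Start from a single Gaussian and double until 128 components are reached.
weights, means, variances = [1.0], [[0.0, 0.0]], [[1.0, 4.0]]
while len(weights) < 128:
    weights, means, variances = split_gaussians(weights, means, variances)
    # EM re-estimation of the enlarged model would run here (not shown).
```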
6. Silence modeling
• 10% lowest-energy frames
• Silence/speech model: 1 Gaussian for noise and 4 Gaussians for speech
• Perform 10 iterations of GMM training while the % variation is high
• Decode the data
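The first step, picking the 10% lowest-energy frames, can be sketched as below. The slide does not state how these frames are used; the sketch assumes they seed the noise model before GMM training. The function name `lowest_energy_frames` and the energy values are hypothetical.

```python
def lowest_energy_frames(energies, fraction=0.10):
    """Return the sorted indices of the lowest-energy frames (at least one)."""
    count = max(1, int(len(energies) * fraction))
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    return sorted(order[:count])

# Toy usage: 20 frames, so 10% selects the 2 quietest ones.
energies = [5.0, 0.1, 4.2, 3.9, 0.3, 6.1, 5.5, 4.8, 0.2, 5.9,
            6.0, 4.4, 5.1, 0.4, 4.9, 5.3, 6.2, 5.7, 4.1, 5.8]
silence_idx = lowest_energy_frames(energies)
```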
8. Overlap postprocessing
• We compute the percentage of overlap between all matching paths

$$
\mathrm{Ovl} = \frac{\min(\mathrm{End}_1, \mathrm{End}_2) - \max(\mathrm{Start}_1, \mathrm{Start}_2)}{\min(\mathrm{End}_1 - \mathrm{Start}_1,\; \mathrm{End}_2 - \mathrm{Start}_2)}
$$

• For pairs with > 0.5 overlap
– Select the match with highest score
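The overlap measure and the pruning rule can be sketched as follows. The greedy, score-sorted pruning order and the names `overlap` and `prune_overlaps` are assumptions made for illustration; the slide only states that the highest-scoring match of an overlapping pair is kept. Matches are assumed to have `end > start`.

```python
def overlap(start1, end1, start2, end2):
    """Shared duration divided by the duration of the shorter match."""
    shared = min(end1, end2) - max(start1, start2)
    shortest = min(end1 - start1, end2 - start2)
    return shared / shortest

def prune_overlaps(matches, threshold=0.5):
    """matches: list of (start, end, score). Keep the best match of each overlapping pair."""
    kept = []
    # Visit matches from best to worst score (assumes a higher score is better).
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        if all(overlap(m[0], m[1], k[0], k[1]) <= threshold for k in kept):
            kept.append(m)
    return kept

matches = [(0, 10, 0.9), (4, 12, 0.7), (20, 30, 0.5)]
kept = prune_overlaps(matches)
```

Here the second match overlaps the first by 6 / 8 = 0.75, so it is dropped in favor of the higher-scoring one; the third match does not overlap anything and is kept.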
10. S-DTW submission
• Based on last year's submission but with the system improvements above
11. DTW local constraints
• No global constraints are applied, in order to allow for matching of any segment among both sequences
• Local constraints are set to allow warping up to 2X

$$
D(m,n) = \min
\begin{cases}
\dfrac{D(m-2,\,n) + d(x_m, y_n)}{\mathrm{jumps}(m-2,\,n) + 3} \\[2ex]
\dfrac{D(m,\,n-2) + d(x_m, y_n)}{\mathrm{jumps}(m,\,n-2) + 3} \\[2ex]
\dfrac{D(m-2,\,n-2) + d(x_m, y_n)}{\mathrm{jumps}(m-2,\,n-2) + 4}
\end{cases}
$$

[Diagram: grid points labeled (m, n), (m-2, n-1), (m-1, n-2), (m-1, n-1)]

• Posteriorgram features distance:

$$
d(x_m, y_n) = -\log\left( \sum_{i=0}^{N-1} x_m[i] \cdot y_n[i] \right)
$$
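The posteriorgram distance is the negative log of the inner product of the two posterior vectors. A minimal sketch is given below; the function name `posteriorgram_distance` and the small floor that guards against `log(0)` are assumptions, not part of the slide.

```python
import math

def posteriorgram_distance(x, y):
    """d(x, y) = -log(sum_i x[i] * y[i]) for two posterior vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    # Floor the inner product so orthogonal vectors give a large finite distance
    # (the floor value is an assumption).
    return -math.log(max(dot, 1e-300))

d_same = posteriorgram_distance([1.0, 0.0], [1.0, 0.0])  # identical one-hot posteriors
d_flat = posteriorgram_distance([0.5, 0.5], [0.5, 0.5])  # equals log(2)
d_orth = posteriorgram_distance([1.0, 0.0], [0.0, 1.0])  # no shared mass
```

The distance is 0 only when both frames put all their mass on the same Gaussian, and it grows as the two posterior distributions share less mass.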
14. IR-DTW
• Total rework of last year's system
• Aim at keeping the same accuracy, but with:
– Much less memory usage
– Faster retrieval
• IR (Information Retrieval) because we index the reference features for fast nearest-neighbor retrieval
16. DEV-DEV results
[DET plot: miss probability (in %) vs. false alarm probability (in %), with a random-performance reference curve]
• IR-DTW: MTWV=0.390, Scr=0.387
• S-DTW: MTWV=0.375, Scr=0.695
17. EVAL-EVAL Results
[DET plot: miss probability (in %) vs. false alarm probability (in %), with a random-performance reference curve]
• IR-DTW: MTWV=0.342
• S-DTW: MTWV=0.311
18. DEV-EVAL results
[DET plot: miss probability (in %) vs. false alarm probability (in %), with a random-performance reference curve]
• IR-DTW: MTWV=0.314
• S-DTW: MTWV=0.300
19. EVAL-DEV results
[DET plot: miss probability (in %) vs. false alarm probability (in %), with a random-performance reference curve]
• IR-DTW: MTWV=0.498
• S-DTW: MTWV=0.472
20. Summary
Xavier Anguera, xanguera@tid.es
• We propose 2 systems, both sharing the same framework
• Some improvements were incorporated into the framework: speech/silence classification, new overlap detection, modified background model.
• IR-DTW is a total reimplementation of S-DTW, using information retrieval concepts