1. Machine Learning for Speech & Language Processing
Mark Gales
28 June 2004
Cambridge University Engineering Department
Foresight Cognitive Systems Workshop
2. Overview
• Machine learning.
• Feature extraction:
– Gaussianisation for speaker normalisation.
• Dynamic Bayesian networks:
– multiple data stream models
– switching linear dynamical systems for ASR.
• SVMs and kernel methods:
– rational kernels for text classification.
• Reinforcement learning and Markov decision processes:
– spoken dialogue system policy optimisation.
3. Machine Learning
• One definition (Mitchell):
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”
alternatively
“Systems built by analysing data sets rather than by using the intuition of experts”
• Several dedicated conferences:
– {International, European} Conference on Machine Learning;
– Neural Information Processing Systems;
– International Conference on Pattern Recognition, etc.;
• as well as sessions in other conferences:
– ICASSP - machine learning for signal processing.
4. “Machine Learning” Community
“You should come to NIPS. They have lots of ideas.
The Speech Community has lots of data.”
• Some categories from Neural Information Processing Systems:
– clustering;
– dimensionality reduction and manifolds;
– graphical models;
– kernels, margins, boosting;
– Monte Carlo methods;
– neural networks;
– ...
– speech and signal processing.
• Speech and language processing is just an application area.
5. Too Much of a Good Thing?
“You should come to NIPS. They have lots of ideas.
Unfortunately, the Speech Community has lots of data.”
• Text data: used to train the ASR language model:
– large news corpora available;
– systems built on > 1 billion words of data.
• Acoustic data: used to train the ASR acoustic models:
– > 2000 hours speech data
(∼ 20 million words, ∼ 720 million frames of data);
– rapid transcriptions/closed caption data.
• Solutions are required to be scalable:
– this heavily influences (limits!) the machine learning approaches used;
– additional data masks many problems!
7. Gaussianisation for Speaker Normalisation
1. Linear projection and “decorrelation of the data” (heteroscedastic LDA)
2. Gaussianise the data for each speaker:
[Figure: per-dimension Gaussianisation - Source PDF → Source CDF → Target CDF → Target PDF]
(a) construct a Gaussian mixture model for each dimension;
(b) non-linearly transform using cumulative density functions.
• May be viewed as a higher-moment version of mean and variance normalisation:
– a single-component-per-dimension GMM equals CMN plus CVN.
• Gives performance gains on state-of-the-art tasks.
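A minimal sketch of the idea in Python, using a rank-based empirical CDF in place of the per-dimension GMM the slide describes (the function name and interface are illustrative):

```python
from statistics import NormalDist

def gaussianise(values):
    """Map one feature dimension towards N(0,1) by matching the source CDF
    (here an empirical CDF; the slide smooths it with a per-dimension GMM)
    to the standard normal CDF."""
    n = len(values)
    std_normal = NormalDist()
    # rank-based empirical CDF, kept strictly inside (0, 1)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n            # source CDF value for this sample
        out[i] = std_normal.inv_cdf(u)  # corresponding Gaussian quantile
    return out
```

With a single Gaussian in place of the empirical CDF this reduces to mean and variance normalisation, which is the equivalence noted above.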
8. Bayesian Networks
• Bayesian networks are a method to show conditional independence:
Cloudy
Sprinkler Rain
Wet Grass
– whether the grass is wet, W, depends on whether the sprinkler was used, S, and whether it has rained, R;
– whether the sprinkler was used (or it rained) depends on whether it is cloudy, C.
• W is conditionally independent of C given S and R.
• Dynamic Bayesian networks handle variable length data.
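The factorisation this network encodes, P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S, R), can be sketched as follows (the probability-table values are made up purely for illustration):

```python
from itertools import product

# Illustrative conditional probability tables (values are invented)
p_c = {True: 0.5, False: 0.5}                      # P(C)
p_s_given_c = {True: 0.1, False: 0.5}              # P(S=T | C)
p_r_given_c = {True: 0.8, False: 0.2}              # P(R=T | C)
p_w_given_sr = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}  # P(W=T | S, R)

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R) - the network factorisation."""
    ps = p_s_given_c[c] if s else 1 - p_s_given_c[c]
    pr = p_r_given_c[c] if r else 1 - p_r_given_c[c]
    pw = p_w_given_sr[(s, r)] if w else 1 - p_w_given_sr[(s, r)]
    return p_c[c] * ps * pr * pw

# P(W=T): sum out C, S and R
p_wet = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
```

The conditional-independence statement above is visible in the code: once s and r are fixed, the factor for W never consults c.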
9. Hidden Markov Model - A Dynamic Bayesian Network
[Figure: (a) standard HMM phone topology - states 1–5, transitions a12, a23, a34, a45, self-loops a22, a33, a44, output distributions b2(), b3(), b4() generating o1 ... oT; (b) HMM as a dynamic Bayesian network - discrete state qt → qt+1, observation ot depends only on qt]
• Notation for DBNs:
circles - continuous variables squares - discrete variables
shaded - observed variables non-shaded - unobserved variables
• Observations are conditionally independent of other observations given the state.
• States are conditionally independent of earlier states given the previous state.
• A poor model of the speech process - piecewise-constant state-space.
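These two conditional-independence assumptions are exactly what make exact inference cheap; a minimal forward-algorithm sketch for a discrete-output HMM (the interface is assumed for illustration, not taken from any toolkit):

```python
def forward(obs, pi, A, B):
    """Forward algorithm: alpha[j] accumulates p(o_1..o_t, q_t = j).
    Relies on the two assumptions above: states are first-order Markov,
    and each observation depends only on the current state."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)   # p(o_1 .. o_T)
```

Because of the factorisation, the cost is O(T N²) rather than exponential in the sequence length T.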
10. Alternative Dynamic Bayesian Networks
Switching linear dynamical system:

[Figure: DBN - discrete state qt → qt+1, continuous state xt → xt+1, observation ot depends on both]

• discrete and continuous state-spaces;
• observations conditionally independent given the continuous and discrete states;
• exponential growth in the number of paths, O(Ns^T) ⇒ approximate inference required.

Multiple data stream DBN:

[Figure: DBN - two discrete state streams qt(1) and qt(2), observation ot depends on both]

• e.g. factorial HMM/mixed memory model;
• asynchronous data common:
– speech and video/noise;
– speech and brain activation patterns.
• observation depends on the state of both streams.
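Under the usual formulation (not spelled out on the slide), the switching linear dynamical system can be written as:

```latex
\begin{aligned}
x_t &= A_{q_t} x_{t-1} + w_t, & w_t &\sim \mathcal{N}(0, Q_{q_t})\\
o_t &= C_{q_t} x_t + v_t,     & v_t &\sim \mathcal{N}(0, R_{q_t})
\end{aligned}
```

so the continuous state xt evolves linearly within a regime selected by the discrete state qt, removing the HMM's piecewise-constant state-space.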
11. SLDS Trajectory Modelling
[Figure: MFCC c1 against frame index for frames from the phrase “SHOW THE GRIDLEY’S ...” (phone sequence: sh ow dh ax g r ih dd l iy z); true trajectory versus HMM and SLDS predictions]
• Unfortunately, the SLDS does not currently classify better than an HMM!
12. Support Vector Machines
[Figure: maximum-margin decision boundary, with the support vectors and the margin width marked]
• SVMs are maximum-margin, binary classifiers:
– related to minimising generalisation error;
– unique solution (compare to neural networks);
– may be kernelised - training/classification are functions of the dot-product (xi · xj).
• Successfully applied to many tasks - but how to apply them to speech and language?
13. The “Kernel Trick”
[Figure: two 2-D plots of the same two-class data - class-one/class-two points with support vectors, class margins and decision boundaries]
• The SVM decision boundary is linear in the feature-space:
– it may be made non-linear using a non-linear mapping φ(), e.g.

φ([x1, x2]) = [x1², √2 x1x2, x2²],   K(xi, xj) = φ(xi) · φ(xj)

• Efficiently implemented using a kernel: K(xi, xj) = (xi · xj)²
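The identity on this slide can be checked numerically; a small sketch (function names are illustrative):

```python
import math

def phi(x):
    """Explicit feature map for the homogeneous 2nd-order polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k_explicit(x, z):
    """Dot-product in the explicit feature space."""
    return sum(a * b for a, b in zip(phi(x), phi(z)))

def k_kernel(x, z):
    """Kernel trick: the same value without ever forming phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2
```

The kernel form never materialises the feature vectors, which is what makes high (or infinite) dimensional feature spaces tractable.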
14. String Kernel
• For speech and text processing input space has variable dimension:
– use a kernel to map from variable to a fixed length;
– Fisher kernels are one example for acoustic modelling;
– String kernels are an example for text.
• Consider the words cat, cart, bar and a character string kernel
          c-a   c-t   c-r   a-r   r-t   b-a   b-r
φ(cat)     1     λ     0     0     0     0     0
φ(cart)    1     λ²    λ     1     1     0     0
φ(bar)     0     0     0     1     0     1     λ

K(cat, cart) = 1 + λ³,   K(cat, bar) = 0,   K(cart, bar) = 1
• Successfully applied to various text classification tasks:
– how to make the process efficient (and more general)?
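One common reading of the gappy-bigram feature map weights each ordered character pair by λ raised to the number of skipped characters; a sketch under that assumption (function names are illustrative, and note the slide's table lists only a selection of the pairs):

```python
def gappy_bigram_features(word, lam):
    """Feature value for each ordered character pair 'a-b': sum over all
    occurrences of b after a of lam ** (number of skipped characters),
    so phi('cat')['c-a'] = 1 and phi('cat')['c-t'] = lam."""
    feats = {}
    for i in range(len(word)):
        for j in range(i + 1, len(word)):
            pair = word[i] + '-' + word[j]
            feats[pair] = feats.get(pair, 0.0) + lam ** (j - i - 1)
    return feats

def string_kernel(u, v, lam):
    """Dot-product of the (sparse) gappy-bigram feature vectors."""
    fu, fv = gappy_bigram_features(u, lam), gappy_bigram_features(v, lam)
    return sum(fu[p] * fv[p] for p in fu if p in fv)
```

Computing the features explicitly is exponential in the subsequence length for longer gappy N-grams, which is the efficiency problem the transducer formulation on the next slides addresses.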
15. Weighted Finite-State Transducers
• A weighted finite-state transducer is a weighted directed graph:
– transitions labelled with an input symbol, output symbol, weight.
• An example transducer, T , for calculating binary numbers: a=0, b=1
[Figure: transducer T - state 1 with self-loops a:a/1 and b:b/1, an arc b:b/1 from state 1 to final state 2/1, and self-loops a:a/2 and b:b/2 on state 2]

  Input   State Seq.   Output   Weight
  bab     1 1 2        bab      1
          2 2 2        bab      4

For this sequence the output weight is: w[bab ∘ T] = 5
• Standard (highly efficient) algorithms exist for various operations:
– composing transducers, T1 ∘ T2;
– inversion, T⁻¹: swap the input and output symbols in the transducer.
• May be used for efficient implementation of string kernels.
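The sum-over-paths weight w[x ∘ T] can be sketched directly for the example above (the arc encoding below is an assumption for illustration, not the FSM toolkit's API):

```python
def weight(arcs, finals, string, start=1):
    """Sum of path weights for `string` through a weighted automaton.
    arcs: {(state, symbol): [(next_state, weight), ...]}
    finals: {state: final_weight}."""
    front = {start: 1.0}                    # states reachable so far, with weights
    for sym in string:
        nxt = {}
        for st, w in front.items():
            for st2, w2 in arcs.get((st, sym), []):
                nxt[st2] = nxt.get(st2, 0.0) + w * w2   # multiply along, sum across paths
        front = nxt
    return sum(w * finals.get(st, 0.0) for st, w in front.items())
```

With the binary-number transducer of this slide, the two accepting paths for `bab` contribute weights 1 and 4, giving w[bab ∘ T] = 5 (i.e. binary 101).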
16. Rational Kernels
• A transducer, T , for the string kernel (gappy bigram) (vocab {a, b})
[Figure: gappy-bigram transducer T - states 1, 2, 3/1; self-loops a:ε/1 and b:ε/1 on states 1 and 3, self-loops a:ε/λ and b:ε/λ on state 2, matching arcs a:a/1 and b:b/1 from 1 to 2 and from 2 to 3]

The kernel is: K(Oi, Oj) = w[Oi ∘ (T ∘ T⁻¹) ∘ Oj]
• This form can also handle uncertainty in decoding:
– lattices can be used rather than the 1-best output (Oi).
• This form encompasses various standard feature-spaces and kernels:
– bag-of-words and N-gram counts, gappy N-grams (string kernel).
• Successfully applied to a multi-class call classification task.
17. Reinforcement Learning
• Reinforcement learning is a class of training methods:
– problem defined by payoffs;
– aims to learn the policy that maximises the payoff;
– no need for a mathematical model of the environment.
[Figure: (left) reinforcement learning - a control system with sensors and actuators interacting with an environment under a payoff function; (right) spoken dialogue system - a dialogue policy with ASR/SU input and TTS output interacting with a user under a payoff function]
• Dialogue policy learning fits nicely within this framework.
18. Example Dialogue
S1: Welcome to NJFun. How may I help you?
U1: I’d like to find um winetasting in Lambertville in the morning.
[ ASR: I’d like to find out wineries in the Lambertville in the
morning ]
S2: Did you say you are interested in Lambertville?
U2: Yes.
S3: Did you want to go in the morning?
U3: Yes.
S4: I found a winery near Lambertville that is open in the morning.
It is Poor Richard’s Winery in Lambertville.
• Variety of action choices available:
– mixed versus system initiative;
– explicit versus implicit confirmation.
19. Markov Decision Process
• The SDS is modelled as an MDP:
– system state and action at time t: St and At;
– transition function: user and ASR/SU model, P (St|St−1, At−1).
[Figure: MDP view of an SDS - the dialogue policy π(St) selects action At; the user + ASR/SU model P(St | St−1, At−1) returns the next state St and reward Rt, fed back through a one-turn delay z⁻¹]
• Select policy to maximise expected total reward:
– total reward Rt: the sum of instantaneous rewards from t to the end of the dialogue;
– value function (expected reward) for policy π in state S: V π (S).
20. Q-Learning
• In reinforcement learning use the Q-function, Q^π(S, A):
– the expected reward from taking action A in state S and then following policy π.
• The best action in state St is obtained from:

π̂(St) = argmax_A Q^π(St, A)
• Transition function not normally known - one-step Q-learning algorithm:
– learn Qπ (S, A) rather than transition function;
– estimate using difference between actual and estimated values.
• How to specify the reward: the simplest form assigns it to the final state:
– positive value for task success;
– negative value for task failure.
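A sketch of the one-step (tabular) Q-learning update described above; the environment interface `step(s, a) → (next_state, reward, done)` and the function names are assumptions for illustration:

```python
import random

def q_learning_episode(q, start, step, actions, alpha=0.1, gamma=1.0, eps=0.1):
    """One-step Q-learning: move Q(s, a) towards the observed reward plus the
    best estimated value of the next state - no transition model needed."""
    s, done = start, False
    while not done:
        # epsilon-greedy action selection under the current Q estimates
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: q.get((s, a_), 0.0))
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(q.get((s2, a_), 0.0) for a_ in actions)
        # update using the difference between actual and estimated values
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
        s = s2
    return q
```

With a terminal reward only (as suggested above), the value propagates backwards through the dialogue one episode at a time.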
21. Partially Observed MDP
• The state-space is required to encapsulate all information needed to make a decision:
– it can become very large, e.g. the transcript of the dialogue to date;
– it must be compressed - usually an application-specific choice;
– if the state-space is too small, an MDP is not appropriate.
• Also, the user's beliefs cannot be observed:
– decisions required on incomplete information (POMDP);
– use of a belief state B - the value function becomes

V^π(B) = Σ_S B(S) V^π(S)

where B(S) gives the belief in state S.
• Major problem: how to obtain sufficient training data?
– build prototype system and then refine;
– build a user model to simulate user interaction.
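The belief-state value function above, together with the Bayes-filter belief update it relies on, can be sketched as follows (the model tables and function names are assumptions for illustration):

```python
def belief_value(belief, v):
    """V^pi(B) = sum_S B(S) V^pi(S): expected state value under the belief."""
    return sum(p * v[s] for s, p in belief.items())

def belief_update(belief, trans, obs_prob, o):
    """Bayes filter: B'(S') is proportional to P(o|S') * sum_S P(S'|S) B(S).
    trans[s][s2] = P(S'=s2 | S=s); obs_prob[s2][o] = P(o | S'=s2)."""
    new = {s2: obs_prob[s2][o] * sum(trans[s][s2] * p for s, p in belief.items())
           for s2 in obs_prob}
    z = sum(new.values())                       # normalising constant
    return {s2: p / z for s2, p in new.items()}
```

Planning over such belief distributions, rather than single states, is what makes exact POMDP solution so much harder than the MDP case.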
22. Machine Learning for Speech & Language Processing
Only a few examples have been briefly described; other relevant techniques include:
• Markov chain Monte-Carlo techniques:
– Rao-Blackwellised Gibbs sampling for the SLDS is one example.
• Discriminative training criteria:
– use criteria more closely related to WER (MMI, MPE, MCE).
• Latent variable models for language modelling:
– Latent semantic analysis (LSA) and Probabilistic LSA.
• Boosting style schemes:
– generate multiple complementary classifiers and combine them.
• Minimum Description Length & evidence framework:
– automatically determine numbers of model parameters and configuration.
23. Some Standard Toolkits
• Hidden Markov model toolkit (HTK)
– building state-of-the-art HMM-based systems
– http://htk.eng.cam.ac.uk/
• Graphical model toolkit (GMTK)
– training and inference for graphical models
– http://ssli.ee.washington.edu/~bilmes/gmtk/
• Finite state transducer toolkit (FSM)
– building, combining, optimising weighted finite state transducers
– http://www.research.att.com/sw/tools/fsm/
• Support vector machine toolkit (SVMlight)
– training and classifying with SVMs
– http://svmlight.joachims.org/
24. (Random) References
• Machine Learning, T. Mitchell, McGraw Hill, 1997.
• Nonlinear dimensionality reduction by locally linear embedding, S. Roweis and L. Saul, Science, v.290 no.5500, Dec. 22, 2000, pp. 2323–2326.
• Gaussianisation, S. Chen and R. Gopinath, NIPS 2000.
• An Introduction to Bayesian Network Theory and Usage, T.A. Stephenson, IDIAP-RR 03, 2000.
• A Tutorial on Support Vector Machines for Pattern Recognition, C.J.C. Burges, Knowledge Discovery and Data Mining, 2(2), 1998.
• Switching Linear Dynamical Systems for Speech Recognition, A-V.I. Rosti and M.J.F. Gales, Technical Report CUED/F-INFENG/TR.461, December 2003.
• Factorial hidden Markov models, Z. Ghahramani and M.I. Jordan, Machine Learning, 29:245–275, 1997.
• The Nature of Statistical Learning Theory, V. Vapnik, Springer, 2000.
• Finite-State Transducers in Language and Speech Processing, M. Mohri, Computational Linguistics, 23:2, 1997.
• Weighted Automata Kernels - General Framework and Algorithms, C. Cortes, P. Haffner and M. Mohri, EuroSpeech 2003.
• An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email, M. Walker, Journal of Artificial Intelligence Research (JAIR), Vol. 12, pp. 387–416, 2000.