Machine Learning for Speech & Language Processing

Mark Gales

28 June 2004

Cambridge University Engineering Department
Foresight Cognitive Systems Workshop
Overview

• Machine learning.

• Feature extraction:
    – Gaussianisation for speaker normalisation.

• Dynamic Bayesian networks:
    – multiple data stream models;
    – switching linear dynamical systems for ASR.

• SVMs and kernel methods:
    – rational kernels for text classification.

• Reinforcement learning and Markov decision processes:
    – spoken dialogue system policy optimisation.

Machine Learning
• One definition is (Mitchell):
        “A computer program is said to learn from experience (E) with some class
        of tasks (T) and a performance measure (P) if its performance at tasks
        in T as measured by P improves with E”
    alternatively
        “Systems built by analysing data sets rather than by using the intuition
        of experts”
• Multiple specific conferences:
    – {International,European} Conference on Machine Learning;
    – Neural Information Processing Systems;
    – International Conference on Pattern Recognition, etc.;
• as well as sessions in other conferences:
    – ICASSP - machine learning for signal processing.

“Machine Learning” Community

                      “You should come to NIPS. They have lots of ideas.
                          The Speech Community has lots of data.”

• Some categories from Neural Information Processing Systems:
    –   clustering;
    –   dimensionality reduction and manifolds;
    –   graphical models;
    –   kernels, margins, boosting;
    –   Monte Carlo methods;
    –   neural networks;
    –   ...
    –   speech and signal processing.

• Speech and language processing is just an application.


Too Much of a Good Thing?
                    “You should come to NIPS. They have lots of ideas.
                   Unfortunately, the Speech Community has lots of data.”

• Text data: used to train the ASR language model:
    – large news corpora available;
    – systems built on > 1 billion words of data.

• Acoustic data: used to train the ASR acoustic models:
    – > 2000 hours speech data
      (∼ 20 million words, ∼ 720 million frames of data);
    – rapid transcriptions/closed caption data.

• Solutions required to be scalable:
    – heavily influences (limits!) machine learning approaches used;
    – additional data masks many problems!

Feature Extraction




   [Figure: low-dimensional non-linear projection of the data (example from LLE), followed by feature transformation (Gaussianisation)]
Gaussianisation for Speaker Normalisation
1. Linear projection and “decorrelation of the data” (heteroscedastic LDA)
2. Gaussianise the data for each speaker:
   [Figure: per-dimension transform - source PDF, source CDF, target CDF, target PDF]
 (a) construct a Gaussian mixture model for each dimension;
 (b) non-linearly transform using cumulative distribution functions (a small code sketch follows below).

• May be viewed as a higher-moment version of mean and variance normalisation:
         – a single-component, single-dimension GMM reduces to CMN plus CVN.
• Performance gains on state-of-the-art tasks.
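
A minimal sketch of the per-speaker, per-dimension transform in steps (a) and (b), assuming scikit-learn and SciPy are available (variable names and the number of mixture components are illustrative, not taken from the slides):

import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gaussianise(x, n_components=4):
    """Map each dimension of x (n_frames x n_dims, one speaker) towards a standard normal."""
    y = np.empty_like(x, dtype=float)
    for d in range(x.shape[1]):
        gmm = GaussianMixture(n_components=n_components).fit(x[:, [d]])
        w = gmm.weights_
        mu = gmm.means_.ravel()
        sd = np.sqrt(gmm.covariances_.ravel())
        # GMM cumulative distribution function evaluated at every frame
        cdf = np.sum(w * norm.cdf((x[:, [d]] - mu) / sd), axis=1)
        cdf = np.clip(cdf, 1e-6, 1 - 1e-6)   # keep away from 0/1 so the inverse CDF stays finite
        y[:, d] = norm.ppf(cdf)               # inverse standard-normal CDF
    return y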

Bayesian Networks
• Bayesian networks are a method to show conditional independence:
   [Diagram: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass]

    – whether the grass is wet, W, depends on whether the sprinkler was used, S, and whether it has rained, R;
    – whether the sprinkler was used (or it rained) depends on whether it is cloudy, C.
• W is conditionally independent of C given S and R (the factorisation is written out below).
• Dynamic Bayesian networks handle variable length data.
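
As a concrete illustration of the conditional-independence structure above (standard notation, not taken from the slide), the joint distribution of the sprinkler network factorises as:

        P(C, S, R, W) = P(C) P(S | C) P(R | C) P(W | S, R)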

Hidden Markov Model - A Dynamic Bayesian Network
   [Figure: (a) standard HMM phone topology - states 1-5 with transition probabilities aij and output distributions bj(); (b) the HMM as a dynamic Bayesian network - state qt generating observation ot]

• Notation for DBNs:
       circles - continuous variables; squares - discrete variables;
       shaded - observed variables; non-shaded - unobserved variables.
• Observations conditionally independent of other observations given the state.
• States conditionally independent of all earlier states given the previous state
  (these two assumptions give the factorisation written out below).
• Poor model of the speech process - piecewise constant state-space.
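
In standard notation (not shown on the slide), the two conditional-independence assumptions above give the usual HMM factorisation of the joint likelihood of an observation sequence o1,...,oT and state sequence q1,...,qT:

        p(o1,...,oT, q1,...,qT) = Π_t P(qt | qt−1) p(ot | qt)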

Alternative Dynamic Bayesian networks

Switching linear dynamical system:
• discrete and continuous state-spaces;
• observations conditionally independent given the continuous and discrete states;
• exponential growth of paths, O(Ns^T) ⇒ approximate inference required
  (one common parameterisation is written out below).

   [Diagram: SLDS dynamic Bayesian network with discrete state qt, continuous state xt and observation ot]

Multiple data stream DBN:
• e.g. factorial HMM/mixed memory model;
• asynchronous data common:
    – speech and video/noise;
    – speech and brain activation patterns.
• observation depends on the state of both streams.

   [Diagram: two-stream dynamic Bayesian network with state streams q(1), q(2) and a shared observation ot]

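One common SLDS parameterisation (notation mine, not taken from the slide): given the discrete state qt, the continuous state and observation evolve as a linear-Gaussian system,

        xt = A(qt) xt−1 + wt,      wt ~ N(0, Q(qt))
        ot = C(qt) xt  + vt,       vt ~ N(0, R(qt))

so each discrete state selects a different set of linear dynamics, rather than the piecewise-constant state-space of the HMM.
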
SLDS Trajectory Modelling
   [Figure: MFCC c1 trajectory over frames from the phrase "SHOW THE GRIDLEY’S ..." (phones sh ow dh ax g r ih dd l iy z), comparing the true trajectory with the HMM and SLDS reconstructions]
• Unfortunately the SLDS doesn't currently classify better than an HMM!

Support Vector Machines
   [Figure: maximum-margin decision boundary with the margin width and support vectors marked]
• SVMs are a maximum margin, binary, classifier:
    – related to minimising generalisation error;
    – unique solution (compare to neural networks);
    – may be kernelised - training/classification a function of the dot-product (xi.xj).
• Successfully applied to many tasks - how to apply to speech and language?
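
One way an arbitrary kernel (for example, a string or rational kernel computed over word sequences) can be plugged into an off-the-shelf SVM is via a precomputed Gram matrix. A minimal sketch, assuming scikit-learn is available (function and variable names are illustrative):

import numpy as np
from sklearn.svm import SVC

def train_with_kernel(items, labels, kernel_fn):
    """Train a binary SVM from an arbitrary kernel function K(item_i, item_j)."""
    K = np.array([[kernel_fn(a, b) for b in items] for a in items])
    clf = SVC(kernel="precomputed").fit(K, labels)
    return clf

def predict_with_kernel(clf, train_items, test_items, kernel_fn):
    """Classification needs kernel values between test items and the training items."""
    K_test = np.array([[kernel_fn(t, a) for a in train_items] for t in test_items])
    return clf.predict(K_test)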

The “Kernel Trick”
   [Figure: two-class SVM example showing the support vectors, class margins and decision boundary in the original space (left) and with a non-linear kernel (right)]

• SVM decision boundary is linear in the feature-space:
    – may be made non-linear using a non-linear mapping φ(), e.g. for x = (x1, x2):

          φ(x) = ( x1², √2 x1x2, x2² ),      K(xi, xj) = φ(xi).φ(xj)

• Efficiently implemented using a kernel: K(xi, xj) = (xi.xj)²
  (a quick numerical check of this identity follows below).
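
A minimal numeric check of the quadratic-kernel identity above (illustrative only; assumes NumPy):

import numpy as np

def phi(x):
    """Explicit feature map for the 2-D quadratic kernel."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)   # K(x, y) = (x.y)^2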

String Kernel
• For speech and text processing input space has variable dimension:
    – use a kernel to map from variable to a fixed length;
    – Fisher kernels are one example for acoustic modelling;
    – String kernels are an example for text.
• Consider the words cat, cart, bar and a character string kernel

                        c-a    c-t    c-r    a-r    r-t    b-a    b-r
          φ(cat)         1      λ      0      0      0      0      0
          φ(cart)        1      λ²     λ      1      1      0      0
          φ(bar)         0      0      0      1      0      1      λ

           K(cat, cart) = 1 + λ³,     K(cat, bar) = 0,     K(cart, bar) = 1

• Successfully applied to various text classification tasks:
    – how to make process efficient (and more general)?
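
A small sketch of the gappy-bigram feature map above, restricted to the character pairs shown in the table (lam plays the role of λ, the gap-decay weight; purely illustrative):

from itertools import combinations

PAIRS = [("c","a"), ("c","t"), ("c","r"), ("a","r"), ("r","t"), ("b","a"), ("b","r")]

def phi(word, lam):
    """Feature vector over PAIRS; each match contributes lam ** (number of skipped characters)."""
    feats = dict.fromkeys(PAIRS, 0.0)
    for i, j in combinations(range(len(word)), 2):
        if (word[i], word[j]) in feats:
            feats[(word[i], word[j])] += lam ** (j - i - 1)
    return feats

def kernel(u, v, lam=0.5):
    fu, fv = phi(u, lam), phi(v, lam)
    return sum(fu[p] * fv[p] for p in PAIRS)

lam = 0.5
# Reproduces the table: 1 + lam**3, 0 and 1 respectively.
print(kernel("cat", "cart", lam), kernel("cat", "bar", lam), kernel("cart", "bar", lam))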

Weighted Finite-State Transducers
• A weighted finite-state transducer is a weighted directed graph:
    – transitions labelled with an input symbol, output symbol, weight.
• An example transducer, T , for calculating binary numbers: a=0, b=1

   [Diagram: two-state transducer over {a, b} with arcs labelled a:a/1, b:b/1, a:a/2, b:b/2 and b:b/1, initial state 1 and final state 2 with final weight 1]

                   Input     State Seq.     Output     Weight
                   bab          112          bab          1
                   bab          211          bab          4

    For this sequence the output weight is w[bab ◦ T ] = 5
• Standard (highly efficient) algorithms exist for various operations:
    – combining transducer, T1 ◦ T2;
    – inverse, T −1, swaps the input and output symbols in the transducer.
• May be used for efficient implementation of string kernels.

Rational Kernels
• A transducer, T , for the string kernel (gappy bigram) (vocab {a, b})

   [Diagram: three-state gappy-bigram transducer over {a, b}: self-loops a:ε/1, b:ε/1 on states 1 and 3, self-loops a:ε/λ, b:ε/λ on state 2, matching arcs a:a/1 and b:b/1 from state 1 to 2 and from 2 to 3, final weight 1 on state 3]
    The kernel is: K(Oi, Oj) = w[Oi ◦ (T ◦ T −1) ◦ Oj]
• This form can also handle uncertainty in decoding:
    – lattices can be used rather than the 1-best output (Oi).
• This form encompasses various standard feature-spaces and kernels:
    – bag-of-words and N-gram counts, gappy N-grams (string kernel).
• Successfully applied to a multi-class call classification task.

Reinforcement Learning
• Reinforcement learning is a class of training methods:
    – problem defined by payoffs;
    – aims to learn the policy that maximises the payoff;
    – no need for a mathematical model of the environment.

   [Diagram: reinforcement-learning loop (payoff function, control system, sensors/actuators, environment) alongside the spoken dialogue system analogue (payoff function, dialogue policy, ASR/SU and TTS, user)]

• Dialogue policy learning fits nicely within this framework.

Example Dialogue
S1: Welcome to NJFun. How may I help you?
U1: I’d like to find um winetasting in Lambertville in the morning.
[ ASR: I’d like to find out wineries in the Lambertville in the
       morning ]
S2: Did you say you are interested in Lambertville?
U2: Yes.
S3: Did you want to go in the morning?
U3: Yes.
S4: I found a winery near Lambertville that is open in the morning.
    It is Poor Richard’s Winery in Lambertville.

• Variety of action choices available:
    – mixed versus system initiative;
    – explicit versus implicit confirmation.



Markov Decision Process
• SDS modelled as a MDP:
    – system state and action at time t: St and At;
    – transition function: user and ASR/SU model, P(St | St−1, At−1).

   [Diagram: the dialogue policy π(St) selects action At; the user + ASR/SU model P(St | St−1, At−1) yields the next state St and reward Rt, with a one-turn delay z−1 closing the loop]

• Select the policy to maximise the expected total reward:
    – total reward Rt: the sum of instantaneous rewards from t to the end of the dialogue;
    – value function (expected reward) for policy π in state S: V π (S).

Q-Learning

• In reinforcement learning the Q-function, Qπ (S, A), is used:
    – the expected reward from taking action A in state S and then following policy π.
• Best policy using π given state St is obtained from

                                        π̂(St) = argmax_A Qπ (St, A)


• Transition function not normally known - one-step Q-learning algorithm:
    – learn Qπ (S, A) rather than transition function;
    – estimate using difference between actual and estimated values.
• How to specify the reward: the simplest form assigns it to the final state:
    – positive value for task success;
    – negative value for task failure.
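
A minimal sketch of the tabular one-step Q-learning update described above, assuming discrete state and action encodings (the dialogue-specific state features and reward definition are not shown here):

import random
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> estimated return

def q_update(s, a, reward, s_next, actions, alpha=0.1, gamma=1.0, terminal=False):
    """One-step Q-learning: move Q(s, a) towards reward + gamma * max_a' Q(s', a')."""
    target = reward if terminal else reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(s, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the current best action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])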

Partially Observed MDP
• The state-space is required to encapsulate all the information needed to make a decision:
    – the state-space can become very large, e.g. the transcript of the dialogue to date;
    – it must therefore be compressed - usually an application-specific choice;
    – if the state-space is too small an MDP is not appropriate.
• Also, the user's beliefs cannot be observed:
    – decisions required on incomplete information (POMDP);
    – use of a belief state - value function becomes

                                        V π (B) = Σ_S B(S) V π (S)

         where B(S) gives the belief in state S.
• Major problem: how to obtain sufficient training data?
    – build prototype system and then refine;
    – build a user model to simulate user interaction.

Machine Learning for Speech & Language Processing
Only a few examples have been briefly described:

• Markov chain Monte-Carlo techniques:
    – Rao-Blackwellised Gibbs sampling for SLDS one example.
• Discriminative training criteria:
    – use criteria more closely related to WER (MMI, MPE, MCE).
• Latent variable models for language modelling:
    – Latent semantic analysis (LSA) and Probabilistic LSA.
• Boosting style schemes:
    – generate multiple complementary classifiers and combine them.
• Minimum Description Length & evidence framework:
    – automatically determine numbers of model parameters and configuration.

Some Standard Toolkits

• Hidden Markov model toolkit (HTK)
    – building state-of-the-art HMM-based systems
    – http://htk.eng.cam.ac.uk/
• Graphical model toolkit (GMTK)
    – training and inference for graphical models
    – http://ssli.ee.washington.edu/∼bilmes/gmtk/
• Finite state transducer toolkit (FSM)
    – building, combining, optimising weighted finite state transducers
    – http://www.research.att.com/sw/tools/fsm/
• Support vector machine toolkit (SVMlight)
    – training and classifying with SVMs
    – http://svmlight.joachims.org/

(Random) References
• Machine Learning, T. Mitchell, McGraw Hill, 1997.
• Nonlinear dimensionality reduction by locally linear embedding, S. Roweis and L. Saul, Science, vol. 290, no. 5500, Dec. 22, 2000, pp. 2323–2326.
• Gaussianisation, S. Chen and R. Gopinath, NIPS 2000.
• An Introduction to Bayesian Network Theory and Usage, T. A. Stephenson, IDIAP-RR03, 2000.
• A Tutorial on Support Vector Machines for Pattern Recognition, C. J. C. Burges, Knowledge Discovery and Data Mining, 2(2), 1998.
• Switching Linear Dynamical Systems for Speech Recognition, A-V. I. Rosti and M. J. F. Gales, Technical Report CUED/F-INFENG/TR.461, December 2003.
• Factorial hidden Markov models, Z. Ghahramani and M. I. Jordan, Machine Learning, 29:245–275, 1997.
• The Nature of Statistical Learning Theory, V. Vapnik, Springer, 2000.
• Finite-State Transducers in Language and Speech Processing, M. Mohri, Computational Linguistics, 23:2, 1997.
• Weighted Automata Kernels - General Framework and Algorithms, C. Cortes, P. Haffner and M. Mohri, EuroSpeech 2003.
• An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email, M. Walker, Journal of Artificial Intelligence Research (JAIR), vol. 12, pp. 387-416, 2000.
