Machine Learning
K-means, E.M. and Mixture models
                VU Pham
           phvu@fit.hcmus.edu.vn

       Department of Computer Science

             November 22, 2010




                Machine Learning
Remind: Three Main Problems in ML

• Three main problems in ML:
    – Regression: Linear Regression, Neural net...
    – Classification: Decision Tree, kNN, Bayesian Classifier...
    – Density Estimation: Gaussian Naive Density Estimator,...

• Today, we will learn:
    – K-means: a trivial unsupervised clustering algorithm.
    – Expectation Maximization: a general algorithm for density estimation.
      ∗ We will see how to use EM in the general case and in the specific case of GMM.
    – GMM: a tool for modelling data in the wild (a density estimator)
      ∗ We will also learn how to use GMM in a Bayesian Classifier




Machine Learning                                                                 1
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             2
Unsupervised Learning
• So far, we have considered supervised learning techniques:
  – Label of each sample is included in the training set
                                 Sample     Label
                                   x1        y1
                                   ...       ...
                                   xn        yn

• Unsupervised learning:
  – Training set contains the samples only
                                 Sample     Label
                                   x1
                                   ...
                                   xn




Machine Learning                                               3
Unsupervised Learning

                   [Figure 1: Unsupervised vs. Supervised Learning. (a) Supervised learning; (b) Unsupervised learning.]




Machine Learning                                                                                         4
What is unsupervised learning useful for?

• Collecting and labeling a large training set can be very expensive.

• It can help find features which are useful for categorization.

• Gain insight into the natural structure of the data.




Machine Learning                                                        5
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             6
K-means clustering
• Clustering algorithms aim to find groups of “similar” data points among the
  input data.

• K-means is an effective algorithm to extract a given number of clusters from a
  training set.

• Once done, the cluster locations can be used to classify data into distinct
  classes.




Machine Learning                                                               7
K-means clustering

• Given:
    – The dataset: \{x_n\}_{n=1}^{N} = \{x_1, x_2, ..., x_N\}
    – Number of clusters: K (K < N)

• Goal: find a partition S = \{S_k\}_{k=1}^{K} that minimizes the objective function

                              J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2                 (1)

    where r_{nk} = 1 if x_n is assigned to cluster S_k, and r_{nj} = 0 for j \neq k.

i.e. find values for the \{r_{nk}\} and the \{\mu_k\} to minimize (1).




Machine Learning                                                                 8
K-means clustering
                              J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Select some initial values for the \mu_k.

• Expectation: keep the \mu_k fixed, minimize J with respect to the r_{nk}.

• Maximization: keep the r_{nk} fixed, minimize J with respect to the \mu_k.

• Loop until there is no change in the partitions (or the maximum number of iterations is
  exceeded).




Machine Learning                                                            9
K-means clustering
                              J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Expectation: J is a linear function of r_{nk}

                      r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• Maximization: setting the derivative of J with respect to \mu_k to zero gives:

                                      \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}


    Convergence of K-means is assured [why?], but it may converge to a local minimum of J
    [8]


Machine Learning                                                                 10
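As a companion to the update rules above, here is a minimal NumPy sketch of the two alternating steps; the function and variable names are illustrative, not taken from the slides:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Plain K-means: X is (N, d); returns cluster means and hard assignments."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]          # initial means
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Expectation: assign each x_n to its closest mean (the r_nk of the slides)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        labels = d2.argmin(axis=1)
        # Maximization: re-estimate each mean as the average of its members
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                            # no change -> converged
            break
        mu = new_mu
    return mu, labels
```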
K-means clustering: How to understand?
                              J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

• Expectation: minimize J with respect to r_{nk}
    – For each x_n, find the “closest” cluster mean \mu_k and put x_n into cluster S_k.

• Maximization: minimize J with respect to \mu_k
    – For each cluster S_k, re-estimate the cluster mean \mu_k as the average value
      of all samples in S_k.

• Loop until there is no change in the partitions (or the maximum number of iterations is
  exceeded).




Machine Learning                                                                    11
K-means clustering: Demonstration




Machine Learning                                       12
K-means clustering: some variations

• Initial cluster centroids:
    – Randomly selected
    – Iterative procedure: k-means++ [2]

• Number of clusters K:
    – Empirically/experimentally: 2 ∼ \sqrt{n}
    – Learning K [6]

• Objective function:
    – General dissimilarity measure: k-medoids algorithm.

• Speeding up:
    – kd-trees for pre-processing [7]
    – Triangle inequality for distance calculation [4]

Machine Learning                                            13
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             14
Expectation Maximization




                   E.M.
Machine Learning                              15
Expectation Maximization

• A general-purpose algorithm for MLE in a wide range of situations.

• First formally stated by Dempster, Laird and Rubin in 1977 [1]
    – There are even several books devoted entirely to EM and its variations!

• An excellent way of solving our unsupervised learning problem, as we will see
    – EM is also widely used in other domains.




Machine Learning                                                                16
EM: a solution for MLE

• Given a statistical model with:
    – a set X of observed data,
    – a set Z of unobserved latent data,
    – a vector of unknown parameters \theta,
    – a likelihood function L(\theta; X, Z) = p(X, Z | \theta)

• Roughly speaking, the aim of MLE is to determine \theta = \arg\max_\theta L(\theta; X, Z)
    – We know the old trick: set the partial derivatives of the log likelihood to zero...
    – But it is not always tractable [e.g.]
    – Other solutions are available.




Machine Learning                                                              17
EM: General Case

                                     L(\theta; X, Z) = p(X, Z | \theta)

• EM is just an iterative procedure for finding the MLE

• Expectation step: keep the current estimate \theta^{(t)} fixed, calculate the expected
  value of the log likelihood function (the expectation is taken over Z given X and \theta^{(t)})

                   Q(\theta | \theta^{(t)}) = E_{Z | X, \theta^{(t)}} [\log L(\theta; X, Z)] = E_{Z | X, \theta^{(t)}} [\log p(X, Z | \theta)]

• Maximization step: find the parameter that maximizes this quantity

                                      \theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)})




Machine Learning                                                                    18
EM: Motivation

• If we know the value of the parameters θ, we can find the value of latent variables
  Z by maximizing the log likelihood over all possible values of Z
    – Searching on the value space of Z.

• If we know Z, we can find an estimate of θ
    – Typically by grouping the observed data points according to the value of asso-
      ciated latent variable,
    – then averaging the values (or some functions of the values) of the points in
      each group.

To understand this motivation, let’s take K-means as a trivial example...




Machine Learning                                                                  19
EM: informal description
     When both \theta and Z are unknown, EM is an iterative algorithm:

1. Initialize the parameters θ to some random values.

2. Compute the best values of Z given these parameter values.

3. Use the just-computed values of Z to find better estimates for θ.

4. Iterate until convergence.




Machine Learning                                                      20
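The four steps above can be written as a generic loop skeleton. This is only a structural sketch: `e_step` and `m_step` are placeholder callables (e.g. closures over the observed data X) that must be supplied for a concrete model, and the parameters are assumed to be a flat array:

```python
import numpy as np

def em(theta0, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop: alternate expected-Z computation and parameter update."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        expected_z = e_step(theta)                     # step 2: expected values of Z given theta
        new_theta = np.asarray(m_step(expected_z), dtype=float)  # step 3: better estimate of theta
        if np.max(np.abs(new_theta - theta)) < tol:    # step 4: stop when theta stops moving
            return new_theta
        theta = new_theta
    return theta
```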
EM Convergence

• E.M. Convergence: Yes
    – After each iteration, the likelihood p (X | θ) must increase or remain the same   [NOT OBVIOUS]
    – But it can not exceed 1 [OBVIOUS]
    – Hence it must converge [OBVIOUS]

• Bad news: E.M. converges to a local optimum.
    – Whether the algorithm converges to the global optimum depends on the ini-
      tialization.

• Let’s take K-means as an example, again...

• Details can be found in [9].




Machine Learning                                                                   21
Regularized EM (REM)

• EM tries to infer the latent (missing) data Z from the observations X
    – We want to choose the missing data that has a strong probabilistic relation
      to the observations, i.e. we assume that the observations contain lots of
      information about the missing data.
    – But E.M. does not have any control over the relationship between the missing
      data and the observations!

• Regularized EM (REM) [5] tries to optimize the penalized likelihood

                    \tilde{L}(\theta; X, Z) = L(\theta; X, Z) - \gamma H(Z | X, \theta)

    where H(Y) is Shannon’s entropy of the random variable Y:

                              H(Y) = - \sum_{y} p(y) \log p(y)

    and the positive value \gamma is the regularization parameter. [When \gamma = 0?]

Machine Learning                                                               22
Regularized EM (REM)

• E-step: unchanged

• M-step: Find the parameter that maximizes the penalized quantity

                            \theta^{(t+1)} = \arg\max_{\theta} \tilde{Q}(\theta | \theta^{(t)})

    where
                    \tilde{Q}(\theta | \theta^{(t)}) = Q(\theta | \theta^{(t)}) - \gamma H(Z | X, \theta)

• REM is expected to converge faster than EM (and it does!)

• So, to apply REM, we just need to determine the H (·) part...




Machine Learning                                                              23
Model Selection

• Consider a parametric model:
    – When estimating model parameters using MLE, it is possible to increase the
      likelihood by adding parameters
    – But this may result in over-fitting.

• e.g. K-means with different values of K...

• Need a criterion for model selection, e.g. to “judge” which model configuration is
  better, how many parameters are sufficient...
    – Cross Validation
    – Akaike Information Criterion (AIC)
    – Bayes Factor
      ∗ Bayesian Information Criterion (BIC)
      ∗ Deviance Information Criterion
    – ...

Machine Learning                                                                24
Bayesian Information Criterion
                   BIC = - \log p(\text{data} | \hat{\theta}) + \frac{\#\text{ of params}}{2} \log n

• Where:
    – \hat{\theta}: the estimated parameters.
    – p(\text{data} | \hat{\theta}): the maximized value of the likelihood function for the estimated
      model.
    – n: the number of data points.
    – Note that there are other ways to write the BIC expression, but they are all
      equivalent.

• Given any two estimated models, the model with the lower value of BIC is
  preferred.




Machine Learning                                                                 25
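A direct transcription of the formula above into a helper function; this is only a sketch and the names are illustrative:

```python
import numpy as np

def bic(log_likelihood, num_params, n):
    """BIC = -log p(data | theta_hat) + (#params / 2) * log n; lower is preferred."""
    return -log_likelihood + 0.5 * num_params * np.log(n)
```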
Bayesian Score

• BIC is an asymptotic (large n) approximation to the better (and harder to evaluate)
  Bayesian score

                      \text{Bayesian score} = \int_{\theta} p(\theta) \, p(\text{data} | \theta) \, d\theta

• Given two models, model selection is based on the Bayes factor

                           K = \frac{\int_{\theta_1} p(\theta_1) \, p(\text{data} | \theta_1) \, d\theta_1}{\int_{\theta_2} p(\theta_2) \, p(\text{data} | \theta_2) \, d\theta_2}




Machine Learning                                                             26
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             27
Remind: Bayes Classifier

                    p(y = i | x) = \frac{p(x | y = i) \, p(y = i)}{p(x)}




Machine Learning                                                       28
Remind: Bayes Classifier

     In the case of a Gaussian Bayes Classifier:

                    p(y = i | x) = \frac{\frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right] p_i}{p(x)}

     How can we deal with the denominator p (x)?

Machine Learning                                                                                  29
Remind: The Single Gaussian Distribution

• Multivariate Gaussian

                   N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

• For maximum likelihood

                                 0 = \frac{\partial \ln N(x_1, x_2, ..., x_N; \mu, \Sigma)}{\partial \mu}

• and the solution is

                                          \mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i

                             \Sigma_{ML} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{ML})(x_i - \mu_{ML})^T



Machine Learning                                                                30
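The ML solution above is just the sample mean and the sample covariance; a small NumPy sketch under that reading (illustrative names):

```python
import numpy as np

def fit_gaussian(X):
    """ML fit of a single multivariate Gaussian: X is (N, d)."""
    mu = X.mean(axis=0)                 # mu_ML = (1/N) sum_i x_i
    diff = X - mu
    sigma = diff.T @ diff / len(X)      # Sigma_ML = (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, sigma
```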
The GMM assumption
• There are k components: \{c_i\}_{i=1}^{k}

• Component c_i has an associated mean vector \mu_i

  [Figure: three component means \mu_1, \mu_2, \mu_3 in the plane.]




Machine Learning                                   31
The GMM assumption
• There are k components: \{c_i\}_{i=1}^{k}

• Component c_i has an associated mean vector \mu_i

• Each component generates data from a Gaussian with mean \mu_i and covariance
  matrix \Sigma_i

• Each sample is generated according to the following guidelines:




Machine Learning                                     32
The GMM assumption
• There are k components: \{c_i\}_{i=1}^{k}

• Component c_i has an associated mean vector \mu_i

• Each component generates data from a Gaussian with mean \mu_i and covariance
  matrix \Sigma_i

• Each sample is generated according to the following guidelines:
    – Randomly select component c_i with probability P(c_i) = w_i, s.t. \sum_{i=1}^{k} w_i = 1




Machine Learning                                   33
The GMM assumption
• There are k components: \{c_i\}_{i=1}^{k}

• Component c_i has an associated mean vector \mu_i

• Each component generates data from a Gaussian with mean \mu_i and covariance
  matrix \Sigma_i

• Each sample is generated according to the following guidelines:
    – Randomly select component c_i with probability P(c_i) = w_i, s.t. \sum_{i=1}^{k} w_i = 1
    – Sample x \sim N(\mu_i, \Sigma_i)


Machine Learning                                      34
Probability density function of GMM
            “Linear combination” of Gaussians:

                                                f(x) = \sum_{i=1}^{k} w_i N(x; \mu_i, \Sigma_i), \quad \text{where } \sum_{i=1}^{k} w_i = 1

    [Figure 2: Probability density function of some GMMs. (a) The pdf of a 1D GMM with 3 components,
    i.e. the weighted sum w_1 N(\mu_1, \sigma_1^2) + w_2 N(\mu_2, \sigma_2^2) + w_3 N(\mu_3, \sigma_3^2); (b) the pdf of a 2D GMM with 3 components.]


Machine Learning                                                                                                                          35
GMM: Problem definition
                    f(x) = \sum_{i=1}^{k} w_i N(x; \mu_i, \Sigma_i), \quad \text{where } \sum_{i=1}^{k} w_i = 1

     Given a training set, how can we model these data points using a GMM?

• Given:
    – The training set: \{x_i\}_{i=1}^{N}
    – Number of clusters: k

• Goal: model this data using a mixture of Gaussians
    – Weights: w_1, w_2, ..., w_k
    – Means and covariances: \mu_1, \mu_2, ..., \mu_k; \Sigma_1, \Sigma_2, ..., \Sigma_k




Machine Learning                                                            36
Computing likelihoods in unsupervised case
                        f(x) = \sum_{i=1}^{k} w_i N(x; \mu_i, \Sigma_i), \quad \text{where } \sum_{i=1}^{k} w_i = 1

• Given a mixture of Gaussians, denoted by G, for any x we can define the likelihood:

                         P(x | G) = P(x | w_1, \mu_1, \Sigma_1, ..., w_k, \mu_k, \Sigma_k)
                                  = \sum_{i=1}^{k} P(x | c_i) P(c_i)
                                  = \sum_{i=1}^{k} w_i N(x; \mu_i, \Sigma_i)

• So we can define the likelihood for the whole training set [Why?]

                        P(x_1, x_2, ..., x_N | G) = \prod_{i=1}^{N} P(x_i | G)
                                                  = \prod_{i=1}^{N} \sum_{j=1}^{k} w_j N(x_i; \mu_j, \Sigma_j)



Machine Learning                                                                           37
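A hedged sketch of evaluating the mixture density and the dataset log-likelihood defined above, assuming SciPy's `multivariate_normal` for the component densities; the function names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """f(x) = sum_j w_j N(x; mu_j, Sigma_j) for a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

def gmm_log_likelihood(X, weights, means, covs):
    """log P(x_1..x_N | G) = sum_i log sum_j w_j N(x_i; mu_j, Sigma_j)."""
    return float(sum(np.log(gmm_pdf(x, weights, means, covs)) for x in X))
```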
Estimating GMM parameters

• We know this: Maximum Likelihood Estimation

                       \ln P(X | G) = \sum_{i=1}^{N} \ln \left( \sum_{j=1}^{k} w_j N(x_i; \mu_j, \Sigma_j) \right)

    – For the maximum likelihood:
                                        0 = \frac{\partial \ln P(X | G)}{\partial \mu_j}
    – This leads to non-linear, non-analytically-solvable equations!

• Use gradient descent
    – Slow but doable

• A much cuter and recently popular method...



Machine Learning                                                               38
E.M. for GMM

• Remember:
    – We have the training set \{x_i\}_{i=1}^{N} and the number of components k.
    – Assume we know p(c_1) = w_1, p(c_2) = w_2, ..., p(c_k) = w_k
    – We don’t know \mu_1, \mu_2, ..., \mu_k
      (for simplicity, assume every component has the same spherical covariance \sigma^2 I)

The likelihood:

            p(\text{data} | \mu_1, \mu_2, ..., \mu_k) = p(x_1, x_2, ..., x_N | \mu_1, \mu_2, ..., \mu_k)
                                        = \prod_{i=1}^{N} p(x_i | \mu_1, \mu_2, ..., \mu_k)
                                        = \prod_{i=1}^{N} \sum_{j=1}^{k} p(x_i | c_j, \mu_1, \mu_2, ..., \mu_k) \, p(c_j)
                                        = \prod_{i=1}^{N} \sum_{j=1}^{k} K \exp\left( -\frac{1}{2\sigma^2} \| x_i - \mu_j \|^2 \right) w_j

    where K is the Gaussian normalizing constant.


Machine Learning                                                                           39
E.M. for GMM

• For Max. Likelihood, we know \frac{\partial}{\partial \mu_j} \log p(\text{data} | \mu_1, \mu_2, ..., \mu_k) = 0

• Some wild algebra turns this into: for Maximum Likelihood, for each j:

                         \mu_j = \frac{\sum_{i=1}^{N} p(c_j | x_i, \mu_1, \mu_2, ..., \mu_k) \, x_i}{\sum_{i=1}^{N} p(c_j | x_i, \mu_1, \mu_2, ..., \mu_k)}

  This is a set of k coupled non-linear equations in the \mu_j’s.
• So:
  – If, for each xi, we know p (cj | xi, µ1, µ2, ..., µk ), then we could easily compute
    µj ,
  – If we know each µj , we could compute p (cj | xi, µ1, µ2, ..., µk ) for each xi
    and cj .




Machine Learning                                                                      40
E.M. for GMM

• E.M. is coming: on the t’th iteration, let our estimates be

                                 \lambda_t = \{\mu_1(t), \mu_2(t), ..., \mu_k(t)\}

• E-step: compute the expected classes of all data points for each class

       p(c_j | x_i, \lambda_t) = \frac{p(x_i | c_j, \lambda_t) \, p(c_j | \lambda_t)}{p(x_i | \lambda_t)} = \frac{p(x_i | c_j, \mu_j(t), \sigma_j I) \, p(c_j)}{\sum_{m=1}^{k} p(x_i | c_m, \mu_m(t), \sigma_m I) \, p(c_m)}

• M-step: compute \mu given our data’s class membership distributions

                                 \mu_j(t + 1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t) \, x_i}{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t)}



Machine Learning                                                                                   41
E.M. for General GMM: E-step

• On the t’th iteration, let our estimates be

     \lambda_t = \{\mu_1(t), \mu_2(t), ..., \mu_k(t), \Sigma_1(t), \Sigma_2(t), ..., \Sigma_k(t), w_1(t), w_2(t), ..., w_k(t)\}

• E-step: compute the expected classes of all data points for each class

                   \tau_{ij}(t) \equiv p(c_j | x_i, \lambda_t) = \frac{p(x_i | c_j, \lambda_t) \, p(c_j | \lambda_t)}{p(x_i | \lambda_t)}
                                            = \frac{p(x_i | c_j, \mu_j(t), \Sigma_j(t)) \, w_j(t)}{\sum_{m=1}^{k} p(x_i | c_m, \mu_m(t), \Sigma_m(t)) \, w_m(t)}




Machine Learning                                                                                  42
E.M. for General GMM: M-step

• M-step: re-estimate the parameters given our data’s class membership distributions

       w_j(t + 1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t)}{N} = \frac{1}{N} \sum_{i=1}^{N} \tau_{ij}(t)

       \mu_j(t + 1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t) \, x_i}{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t)} = \frac{1}{N w_j(t + 1)} \sum_{i=1}^{N} \tau_{ij}(t) \, x_i

       \Sigma_j(t + 1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t) \, [x_i - \mu_j(t + 1)][x_i - \mu_j(t + 1)]^T}{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t)}
                       = \frac{1}{N w_j(t + 1)} \sum_{i=1}^{N} \tau_{ij}(t) \, [x_i - \mu_j(t + 1)][x_i - \mu_j(t + 1)]^T


Machine Learning                                                                                     43
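Putting the E-step and M-step formulas above together, here is a compact sketch of one full-covariance EM iteration. The names are illustrative; the small diagonal ridge added for numerical stability is this example's choice, not part of the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_gmm(X, weights, means, covs, ridge=1e-6):
    """One EM iteration for a GMM. X is (N, d); returns updated (weights, means, covs)."""
    N, d = X.shape
    k = len(weights)
    # E-step: responsibilities tau_ij = p(c_j | x_i, lambda_t)
    tau = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=S)
                           for w, m, S in zip(weights, means, covs)])
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and covariances
    Nk = tau.sum(axis=0)                               # effective counts per component
    new_weights = Nk / N
    new_means = (tau.T @ X) / Nk[:, None]
    new_covs = []
    for j in range(k):
        diff = X - new_means[j]
        cov_j = (tau[:, j, None] * diff).T @ diff / Nk[j]
        new_covs.append(cov_j + ridge * np.eye(d))     # small ridge for stability
    return new_weights, new_means, new_covs
```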
E.M. for General GMM: Initialization

• wj = 1/k, j = 1, 2, ..., k

• Each µj is set to a randomly selected point
    – Or use K-means for this initialization.

• Each \Sigma_j is computed using the equation on the previous slide...




Machine Learning                                                44
Regularized E.M. for GMM

• In the case of REM, the entropy H(\cdot) is

                   H(C | X; \lambda_t) = - \sum_{i=1}^{N} \sum_{j=1}^{k} p(c_j | x_i; \lambda_t) \log p(c_j | x_i; \lambda_t)
                                       = - \sum_{i=1}^{N} \sum_{j=1}^{k} \tau_{ij}(t) \log \tau_{ij}(t)

    and the penalized likelihood is

                         \tilde{L}(\lambda_t; X, C) = L(\lambda_t; X, C) - \gamma H(C | X; \lambda_t)
                                      = \sum_{i=1}^{N} \log \sum_{j=1}^{k} w_j \, p(x_i | c_j, \lambda_t)
                                        + \gamma \sum_{i=1}^{N} \sum_{j=1}^{k} \tau_{ij}(t) \log \tau_{ij}(t)




Machine Learning                                                                          45
Regularized E.M. for GMM

• Some algebra [5] turns this into:

                   w_j(t + 1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t) (1 + \gamma \log p(c_j | x_i, \lambda_t))}{N}
                              = \frac{1}{N} \sum_{i=1}^{N} \tau_{ij}(t) (1 + \gamma \log \tau_{ij}(t))

                   \mu_j(t + 1) = \frac{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t) \, x_i \, (1 + \gamma \log p(c_j | x_i, \lambda_t))}{\sum_{i=1}^{N} p(c_j | x_i, \lambda_t) (1 + \gamma \log p(c_j | x_i, \lambda_t))}
                                = \frac{1}{N w_j(t + 1)} \sum_{i=1}^{N} \tau_{ij}(t) \, x_i \, (1 + \gamma \log \tau_{ij}(t))



Machine Learning                                                                         46
Regularized E.M. for GMM

• Some algebra [5] turns this into (cont.):

                    \Sigma_j(t + 1) = \frac{1}{N w_j(t + 1)} \sum_{i=1}^{N} \tau_{ij}(t) (1 + \gamma \log \tau_{ij}(t)) \, d_{ij}(t + 1)

    where
                         d_{ij}(t + 1) = [x_i - \mu_j(t + 1)][x_i - \mu_j(t + 1)]^T




Machine Learning                                                                             47
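Under the updates quoted above from [5], the regularized M-step only reweights each responsibility \tau_{ij} by (1 + \gamma \log \tau_{ij}). A hedged sketch of that M-step, reusing the plain E-step responsibilities; names are illustrative and no guard is included for large \gamma, where the reweighted responsibilities can become negative:

```python
import numpy as np

def rem_m_step(X, tau, gamma, eps=1e-12):
    """Regularized M-step: responsibilities reweighted by (1 + gamma * log tau)."""
    N, d = X.shape
    r = tau * (1.0 + gamma * np.log(tau + eps))    # reweighted responsibilities
    new_weights = r.sum(axis=0) / N                # w_j(t+1)
    new_means = (r.T @ X) / r.sum(axis=0)[:, None] # mu_j(t+1)
    new_covs = []
    for j in range(r.shape[1]):
        diff = X - new_means[j]
        new_covs.append((r[:, j, None] * diff).T @ diff / (N * new_weights[j]))
    return new_weights, new_means, new_covs
```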
Demonstration

• EM for GMM

• REM for GMM




Machine Learning                   48
Local optimum solution

• E.M. is guaranteed to find a locally optimal solution, since it monotonically increases
  the log-likelihood

• Whether it converges to the globally optimal solution depends on the initialization






Machine Learning                                                                   49
GMM: Selecting the number of components

• We can run the E.M. algorithm with different numbers of components.
        – Need a criterion for selecting the “best” number of components

Machine Learning                                                                               50
GMM: Model Selection

• Empirically/Experimentally [Sure!]

• Cross-Validation [How?]

• BIC

• ...




Machine Learning                               51
GMM: Model Selection

• Empirically/Experimentally
    – Typically 3-5 components

• Cross-Validation: K-fold, leave-one-out...
    – Omit each point x_i in turn, estimate the parameters \theta^{-i} on the basis of the
      remaining points, then evaluate
                                    \sum_{i=1}^{N} \log p(x_i | \theta^{-i})

• BIC: find k (the number of components) that minimizes the BIC

                        BIC = - \log p(\text{data} | \hat{\theta}_k) + \frac{d_k}{2} \log n

    where dk is the number of (effective) parameters in the k-component mixture.

Machine Learning                                                                  52
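One way to apply the BIC rule above in practice is to fit a GMM for each candidate k and keep the model with the lowest BIC. The sketch below uses scikit-learn's `GaussianMixture`, which implements EM and exposes a `bic` method; the library choice and names are this example's assumption, not something used in the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_candidates=range(1, 11), seed=0):
    """Fit a GMM for each candidate k and return the k and model with the lowest BIC."""
    best_k, best_bic, best_model = None, np.inf, None
    for k in k_candidates:
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              random_state=seed).fit(X)
        b = gmm.bic(X)                 # lower BIC is preferred
        if b < best_bic:
            best_k, best_bic, best_model = k, b, gmm
    return best_k, best_model
```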
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             53
Gaussian mixtures for classification
                       p(y = i | x) = \frac{p(x | y = i) \, p(y = i)}{p(x)}

• To build a Bayesian classifier based on GMM, we can use GMM to model data in
  each class
    – So each class is modeled by one k-component GMM.

• For example:
  Class 0: p(y = 0), p(x | \theta_0)   (a 3-component mixture)
  Class 1: p(y = 1), p(x | \theta_1)   (a 3-component mixture)
  Class 2: p(y = 2), p(x | \theta_2)   (a 3-component mixture)
  ...




Machine Learning                                                           54
GMM for Classification

• As previous, each class is modeled by a k-component GMM.

• A new test sample x is classified according to

                           c = \arg\max_i \; p(y = i) \, p(x | \theta_i)

    where \theta_i collects the parameters of class i’s mixture and

                          p(x | \theta_i) = \sum_{j=1}^{k} w_{ij} N(x; \mu_{ij}, \Sigma_{ij})


• Simple, quick (and is actually used!)




Machine Learning                                                  55
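A sketch of this classification rule using scikit-learn's `GaussianMixture` to model each class (the library choice and names are this example's assumption, not the slides'): fit one k-component GMM per class, estimate the priors from class frequencies, and pick the class with the largest log prior plus log likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_classifier(X, y, k=3, seed=0):
    """Fit one k-component GMM per class plus the class priors p(y = i)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    models = {c: GaussianMixture(n_components=k, random_state=seed).fit(X[y == c])
              for c in classes}
    return classes, priors, models

def predict(X, classes, priors, models):
    """c = argmax_i  log p(y = i) + log p(x | theta_i)."""
    scores = np.column_stack([np.log(priors[c]) + models[c].score_samples(X)
                              for c in classes])
    return classes[scores.argmax(axis=1)]
```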
Contents

• Unsupervised Learning

• K-means clustering

• Expectation Maximization (E.M.)
    – Regularized EM
    – Model Selection

• Gaussian mixtures as a Density Estimator
    – Gaussian mixtures
    – EM for mixtures

• Gaussian mixtures for classification

• Case studies

Machine Learning                             56
Case studies

• Background subtraction
    – GMM for each pixel

• Speech recognition
    – GMM for the underlying distribution of feature vectors of each phone

• Many, many others...




Machine Learning                                                             57
What you should know by now

• K-means as a trivial classifier

• E.M. - an algorithm for solving many MLE problems

• GMM - a tool for modeling data
    – Note 1: We can have a mixture model of many different types of distributions,
      not only Gaussians
    – Note 2: Computing the sum of Gaussians may be expensive; some approximations
      are available [3]

• Model selection:
    – Bayesian Information Criterion




Machine Learning                                                               58
Q&A




Machine Learning         59
References

[1] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood from incomplete data
    via the EM algorithm. Journal of the Royal Statistical Society, Series B (Method-
    ological), 39(1):1–38, 1977.

[2] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful
    Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on
    Discrete algorithms, volume 8, pages 1027–1035, 2007.

[3] C. Yang, R. Duraiswami, N. Gumerov and L. Davis. Improved fast Gauss transform
    and efficient kernel density estimation. In IEEE International Conference on
    Computer Vision, pages 464–471, 2003.

[4] Charles Elkan. Using the Triangle Inequality to Accelerate k-Means. In Proceedings
    of the Twentieth International Conference on Machine Learning (ICML), 2003.

Machine Learning                                                                   60

[5] Haifeng Li, Keshu Zhang and Tao Jiang. The regularized EM algorithm. In
    Proceedings of the 20th National Conference on Artificial Intelligence, pages
    807–812, Pittsburgh, PA, 2005.

[6] Greg Hamerly and Charles Elkan. Learning the k in k-means. In In Neural
    Information Processing Systems. MIT Press, 2003.

[7] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth
    Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: anal-
    ysis and implementation. IEEE Transactions on Pattern Analysis and Machine
    Intelligence, 24(7):881–892, July 2002.

[8] J MacQueen. Some methods for classification and analysis of multivariate obser-
    vations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics
    and Probability, volume 233, pages 281–297. University of California Press, 1967.

Machine Learning                                                                   61
[9] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of
    Statistics, 11(1):95–103, 1983.




Machine Learning                                                           62

Q4 English4 Week3 PPT Melcnmg-based.pptxnelietumpap1
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 

Último (20)

Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Q4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptxQ4 English4 Week3 PPT Melcnmg-based.pptx
Q4 English4 Week3 PPT Melcnmg-based.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 

K-MEANS EM AND MIXTURE MODELS

  • 8. K-means clustering • Clustering algorithms aim to find groups of “similar” data points among the input data. • K-means is an effective algorithm to extract a given number of clusters from a training set. • Once done, the cluster locations can be used to classify data into distinct classes. [Figure: scatter plot of the example data.] Machine Learning 7
  • 9. K-means clustering • Given: – The dataset: {x_n}_{n=1}^N = {x_1, x_2, ..., x_N} – Number of clusters: K (K < N) • Goal: find a partition S = {S_k}_{k=1}^K that minimizes the objective function
      J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖²   (1)
    where r_nk = 1 if x_n is assigned to cluster S_k, and r_nj = 0 for j ≠ k, i.e. find values for the {r_nk} and the {µ_k} that minimize (1). Machine Learning 8
  • 10. K-means clustering
      J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖²
    • Select some initial values for the µ_k. • Expectation: keep the µ_k fixed, minimize J with respect to r_nk. • Maximization: keep the r_nk fixed, minimize J with respect to the µ_k. • Loop until there is no change in the partitions (or the maximum number of iterations is exceeded). Machine Learning 9
  • 11. K-means clustering
      J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖²
    • Expectation: J is a linear function of r_nk, so
      r_nk = 1 if k = arg min_j ‖x_n − µ_j‖², and r_nk = 0 otherwise.
    • Maximization: setting the derivative of J with respect to µ_k to zero gives
      µ_k = (∑_n r_nk x_n) / (∑_n r_nk)
    Convergence of K-means: assured [why?], but it may lead to a local minimum of J [8]. Machine Learning 10
  • 12. K-means clustering: How to understand?
      J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖²
    • Expectation: minimize J with respect to r_nk – For each x_n, find the “closest” cluster mean µ_k and put x_n into cluster S_k. • Maximization: minimize J with respect to µ_k – For each cluster S_k, re-estimate the cluster mean µ_k as the average of all samples in S_k. • Loop until there is no change in the partitions (or the maximum number of iterations is exceeded). Machine Learning 11
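The E/M alternation of slides 10-12 maps directly onto code. Below is a minimal NumPy sketch of the two steps; the function name kmeans, the data matrix X and the other identifiers are illustrative, not from the slides:

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """Minimal K-means sketch: X is an (N, d) array; returns (assignments, means)."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # initial means: K random points
        r = None
        for _ in range(max_iter):
            # Expectation: assign each x_n to the closest mean (minimizes J w.r.t. r_nk).
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
            r_new = d2.argmin(axis=1)
            if r is not None and np.array_equal(r_new, r):
                break  # no change in the partition: converged
            r = r_new
            # Maximization: each mean becomes the average of the points assigned to it.
            for k in range(K):
                if np.any(r == k):
                    mu[k] = X[r == k].mean(axis=0)
        return r, mu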
  • 14. K-means clustering: some variations • Initial cluster centroids: – Randomly selected – Iterative procedure: k-means++ [2] (sketched below) • Number of clusters K: – Empirically/experimentally: 2 ∼ √n – Learning K from the data [6] • Objective function: – General dissimilarity measure: k-medoids algorithm. • Speeding up: – kd-trees for pre-processing [7] – Triangle inequality for distance calculations [4] Machine Learning 13
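For reference, a sketch of the k-means++ seeding rule of [2]: the first centre is chosen uniformly at random, and each further centre is drawn with probability proportional to the squared distance to the nearest centre already chosen. Function and variable names are mine:

    import numpy as np

    def kmeanspp_init(X, K, seed=0):
        """k-means++ seeding [2]: returns K initial centres chosen from the rows of X."""
        rng = np.random.default_rng(seed)
        centres = [X[rng.integers(len(X))]]          # first centre: uniform at random
        for _ in range(K - 1):
            # Squared distance of every point to its nearest centre chosen so far.
            diff = X[:, None, :] - np.array(centres)[None, :, :]
            d2 = np.min((diff ** 2).sum(axis=2), axis=1)
            probs = d2 / d2.sum()                    # D(x)^2 weighting
            centres.append(X[rng.choice(len(X), p=probs)])
        return np.array(centres, dtype=float)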
  • 15. Contents • Unsupervised Learning • K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 14
  • 16. Expectation Maximization E.M. Machine Learning 15
  • 17. Expectation Maximization • A general-purpose algorithm for MLE in a wide range of situations. • First formally stated by Dempster, Laird and Rubin in 1977 [1] – There are even entire books devoted to EM and its variations! • An excellent way of attacking our unsupervised learning problem, as we will see – EM is also widely used in other domains. Machine Learning 16
  • 18. EM: a solution for MLE • Given a statistical model with: – a set X of observed data, – a set Z of unobserved latent data, – a vector of unknown parameters θ, – a likelihood function L(θ; X, Z) = p(X, Z | θ) • Roughly speaking, the aim of MLE is to determine θ = arg max_θ L(θ; X, Z) – We know the old trick: set the partial derivatives of the log-likelihood to zero... – But it is not always tractable [e.g.] – Other solutions are available. Machine Learning 17
  • 19. EM: General Case. L(θ; X, Z) = p(X, Z | θ) • EM is just an iterative procedure for finding the MLE • Expectation step: keep the current estimate θ^(t) fixed, calculate the expected value of the log-likelihood function
      Q(θ | θ^(t)) = E[log L(θ; X, Z)] = E[log p(X, Z | θ)]
    where the expectation is over Z given X and θ^(t). • Maximization step: find the parameter that maximizes this quantity
      θ^(t+1) = arg max_θ Q(θ | θ^(t))
    Machine Learning 18
  • 20. EM: Motivation • If we know the value of the parameters θ, we can find the value of the latent variables Z by maximizing the log-likelihood over all possible values of Z – Searching over the value space of Z. • If we know Z, we can find an estimate of θ – Typically by grouping the observed data points according to the value of the associated latent variable, – then averaging the values (or some functions of the values) of the points in each group. To understand this motivation, let’s take K-means as a trivial example... Machine Learning 19
  • 21. EM: informal description Both θ and Z are unknown, EM is an iterative algorithm: 1. Initialize the parameters θ to some random values. 2. Compute the best values of Z given these parameter values. 3. Use the just-computed values of Z to find better estimates for θ. 4. Iterate until convergence. Machine Learning 20
  • 22. EM Convergence • E.M. Convergence: Yes – After each iteration, the observed-data likelihood p(X | θ) must increase or remain the same [NOT OBVIOUS] – But it cannot exceed 1 [OBVIOUS] – Hence it must converge [OBVIOUS] • Bad news: E.M. converges to a local optimum. – Whether the algorithm converges to the global optimum depends on the initialization. • Let’s take K-means as an example, again... • Details can be found in [9]. Machine Learning 21
  • 23. Regularized EM (REM) • EM tries to infer the latent (missing) data Z from the observations X – We want to choose missing data that has a strong probabilistic relation to the observations, i.e. we assume that the observations contain lots of information about the missing data. – But E.M. does not have any control over the relationship between the missing data and the observations! • Regularized EM (REM) [5] tries to optimize the penalized likelihood
      L̃(θ; X, Z) = L(θ; X, Z) − γ H(Z | X, θ)
    where H(Y) is Shannon’s entropy of the random variable Y:
      H(Y) = −∑_y p(y) log p(y)
    and the positive value γ is the regularization parameter. [What happens when γ = 0?] Machine Learning 22
  • 24. Regularized EM (REM) • E-step: unchanged • M-step: find the parameter that maximizes this quantity
      θ^(t+1) = arg max_θ Q̃(θ | θ^(t))
    where
      Q̃(θ | θ^(t)) = Q(θ | θ^(t)) − γ H(Z | X, θ)
    • REM is expected to converge faster than EM (and it does!) • So, to apply REM, we just need to determine the H(·) part... Machine Learning 23
  • 25. Model Selection • Considering a parametric model: – When estimating model parameters using MLE, it is possible to increase the likelihood by adding parameters – But this may result in over-fitting. • e.g. K-means with different values of K... • We need a criterion for model selection, e.g. to “judge” which model configuration is better, or how many parameters are sufficient... – Cross Validation – Akaike Information Criterion (AIC) – Bayes Factor ∗ Bayesian Information Criterion (BIC) ∗ Deviance Information Criterion – ... Machine Learning 24
  • 26. Bayesian Information Criterion
      BIC = −log p(data | θ̂) + (# of params / 2) · log n
    • Where: – θ̂: the estimated parameters. – p(data | θ̂): the maximized value of the likelihood function for the estimated model. – n: the number of data points. – Note that there are other ways to write the BIC expression, but they are all equivalent. • Given any two estimated models, the model with the lower value of BIC is preferred. Machine Learning 25
  • 27. Bayesian Score • BIC is an asymptotic (large n) approximation to the better (and harder to evaluate) Bayesian score
      Bayesian score = ∫_θ p(θ) p(data | θ) dθ
    • Given two models, model selection is based on the Bayes factor
      K = ∫_θ1 p(θ1) p(data | θ1) dθ1 / ∫_θ2 p(θ2) p(data | θ2) dθ2
    Machine Learning 26
  • 28. Contents • Unsupervised Learning • K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 27
  • 29. Remind: Bayes Classifier. [Figure: two-class data in 2-D.]
      p(y = i | x) = p(x | y = i) p(y = i) / p(x)
    Machine Learning 28
  • 30. Remind: Bayes Classifier. In the case of a Gaussian Bayes Classifier:
      p(y = i | x) = p_i · (2π)^{−d/2} ‖Σ_i‖^{−1/2} exp(−(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i)) / p(x)
    How can we deal with the denominator p(x)? Machine Learning 29
  • 31. Remind: The Single Gaussian Distribution • Multivariate Gaussian
      N(x; µ, Σ) = (2π)^{−d/2} ‖Σ‖^{−1/2} exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
    • For maximum likelihood, set
      0 = ∂ ln N(x_1, x_2, ..., x_N; µ, Σ) / ∂µ
    • and the solution is
      µ_ML = (1/N) ∑_{i=1}^N x_i
      Σ_ML = (1/N) ∑_{i=1}^N (x_i − µ_ML)(x_i − µ_ML)^T
    Machine Learning 30
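The two closed-form ML estimates above translate into a few lines of NumPy; a small sketch (function and variable names are mine):

    import numpy as np

    def gaussian_mle(X):
        """ML estimates for a single multivariate Gaussian fitted to the rows of X."""
        mu = X.mean(axis=0)                # mu_ML = (1/N) sum_i x_i
        diff = X - mu
        sigma = diff.T @ diff / len(X)     # Sigma_ML = (1/N) sum_i (x_i - mu)(x_i - mu)^T
        return mu, sigma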
  • 32.-35. The GMM assumption • There are k components: {c_i}_{i=1}^k • Component c_i has an associated mean vector µ_i • Each component generates data from a Gaussian with mean µ_i and covariance matrix Σ_i • Each sample is generated according to the following guidelines: – Randomly select component c_i with probability P(c_i) = w_i, s.t. ∑_{i=1}^k w_i = 1 – Sample x ~ N(µ_i, Σ_i) Machine Learning 31-34
  • 36. Probability density function of GMM. “Linear combination” of Gaussians:
      f(x) = ∑_{i=1}^k w_i N(x; µ_i, Σ_i),  where ∑_{i=1}^k w_i = 1
    Figure 2: Probability density function of some GMMs. (a) The pdf of a 1-D GMM with 3 components. (b) The pdf of a 2-D GMM with 3 components. Machine Learning 35
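Since f(x) is literally a weighted sum of Gaussian densities, it is easy to evaluate numerically; a small sketch using scipy.stats (the example weights, means and variances are arbitrary illustrative values, not taken from the figure):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_pdf(x, weights, means, covs):
        """f(x) = sum_i w_i N(x; mu_i, Sigma_i) for a single point x."""
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
                   for w, m, S in zip(weights, means, covs))

    # Example: a 1-D mixture with 3 components (arbitrary illustrative parameters).
    weights = [0.5, 0.3, 0.2]
    means   = [0.0, 5.0, 10.0]
    covs    = [1.0, 2.0, 0.5]
    print(gmm_pdf(4.0, weights, means, covs))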
  • 37. GMM: Problem definition
      f(x) = ∑_{i=1}^k w_i N(x; µ_i, Σ_i),  where ∑_{i=1}^k w_i = 1
    Given a training set, how do we model these data points using a GMM? • Given: – The training set: {x_i}_{i=1}^N – Number of clusters: k • Goal: model this data using a mixture of Gaussians – Weights: w_1, w_2, ..., w_k – Means and covariances: µ_1, µ_2, ..., µ_k; Σ_1, Σ_2, ..., Σ_k Machine Learning 36
  • 38. Computing likelihoods in the unsupervised case
      f(x) = ∑_{i=1}^k w_i N(x; µ_i, Σ_i),  where ∑_{i=1}^k w_i = 1
    • Given a mixture of Gaussians, denoted by G, for any x we can define the likelihood:
      P(x | G) = P(x | w_1, µ_1, Σ_1, ..., w_k, µ_k, Σ_k) = ∑_{i=1}^k P(x | c_i) P(c_i) = ∑_{i=1}^k w_i N(x; µ_i, Σ_i)
    • So we can define the likelihood of the whole training set [Why?]
      P(x_1, x_2, ..., x_N | G) = ∏_{i=1}^N P(x_i | G) = ∏_{i=1}^N ∑_{j=1}^k w_j N(x_i; µ_j, Σ_j)
    Machine Learning 37
  • 39. Estimating GMM parameters • We know this: Maximum Likelihood Estimation
      ln P(X | G) = ∑_{i=1}^N ln ( ∑_{j=1}^k w_j N(x_i; µ_j, Σ_j) )
    – For the maximum likelihood: 0 = ∂ ln P(X | G) / ∂µ_j – This leads to non-linear, non-analytically-solvable equations! • Use gradient descent – Slow but doable • A much cuter and recently popular method... Machine Learning 38
  • 40. E.M. for GMM • Remember: – We have the training set {x_i}_{i=1}^N and the number of components k. – Assume we know p(c_1) = w_1, p(c_2) = w_2, ..., p(c_k) = w_k – We don’t know µ_1, µ_2, ..., µ_k. The likelihood:
      p(data | µ_1, µ_2, ..., µ_k) = p(x_1, x_2, ..., x_N | µ_1, µ_2, ..., µ_k)
        = ∏_{i=1}^N p(x_i | µ_1, µ_2, ..., µ_k)
        = ∏_{i=1}^N ∑_{j=1}^k p(x_i | c_j, µ_1, µ_2, ..., µ_k) p(c_j)
        = ∏_{i=1}^N ∑_{j=1}^k K exp(−(1/(2σ²)) ‖x_i − µ_j‖²) w_j
    (spherical Gaussians with a common, known σ; K is the normalizing constant). Machine Learning 39
  • 41. E.M. for GMM • For Max. Likelihood, we know ∂ log p(data | µ_1, µ_2, ..., µ_k) / ∂µ_j = 0 • Some wild algebra turns this into: for Maximum Likelihood, for each j:
      µ_j = ∑_{i=1}^N p(c_j | x_i, µ_1, µ_2, ..., µ_k) x_i / ∑_{i=1}^N p(c_j | x_i, µ_1, µ_2, ..., µ_k)
    This is a set of non-linear equations in the µ_j’s. • So: – If, for each x_i, we knew p(c_j | x_i, µ_1, µ_2, ..., µ_k), then we could easily compute µ_j, – If we knew each µ_j, we could compute p(c_j | x_i, µ_1, µ_2, ..., µ_k) for each x_i and c_j. Machine Learning 40
  • 42. E.M. for GMM • E.M. is coming: on the t’th iteration, let our estimates be λ_t = {µ_1(t), µ_2(t), ..., µ_k(t)} • E-step: compute the expected classes of all data points for each class
      p(c_j | x_i, λ_t) = p(x_i | c_j, λ_t) p(c_j | λ_t) / p(x_i | λ_t)
                        = p(x_i | c_j, µ_j(t), σ_j I) p(c_j) / ∑_{m=1}^k p(x_i | c_m, µ_m(t), σ_m I) p(c_m)
    • M-step: compute µ given our data’s class membership distributions
      µ_j(t+1) = ∑_{i=1}^N p(c_j | x_i, λ_t) x_i / ∑_{i=1}^N p(c_j | x_i, λ_t)
    Machine Learning 41
  • 43. E.M. for General GMM: E-step • On the t’th iteration, let our estimates be λ_t = {µ_1(t), µ_2(t), ..., µ_k(t), Σ_1(t), Σ_2(t), ..., Σ_k(t), w_1(t), w_2(t), ..., w_k(t)} • E-step: compute the expected classes of all data points for each class
      τ_ij(t) ≡ p(c_j | x_i, λ_t) = p(x_i | c_j, λ_t) p(c_j | λ_t) / p(x_i | λ_t)
                                  = p(x_i | c_j, µ_j(t), Σ_j(t)) w_j(t) / ∑_{m=1}^k p(x_i | c_m, µ_m(t), Σ_m(t)) w_m(t)
    Machine Learning 42
  • 44. E.M. for General GMM: M-step • M-step: compute the new parameters given our data’s class membership distributions
      w_j(t+1) = ∑_{i=1}^N p(c_j | x_i, λ_t) / N = (1/N) ∑_{i=1}^N τ_ij(t)
      µ_j(t+1) = ∑_{i=1}^N p(c_j | x_i, λ_t) x_i / ∑_{i=1}^N p(c_j | x_i, λ_t) = (1/(N w_j(t+1))) ∑_{i=1}^N τ_ij(t) x_i
      Σ_j(t+1) = ∑_{i=1}^N p(c_j | x_i, λ_t) [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T / ∑_{i=1}^N p(c_j | x_i, λ_t)
               = (1/(N w_j(t+1))) ∑_{i=1}^N τ_ij(t) [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T
    Machine Learning 43
  • 45. E.M. for General GMM: Initialization • w_j = 1/k, j = 1, 2, ..., k • Each µ_j is set to a randomly selected data point – Or use K-means for this initialization. • Each Σ_j is computed using the equation on the previous slide... (A compact code sketch of the whole procedure follows.) Machine Learning 44
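Putting slides 43-45 together, here is a compact NumPy/SciPy sketch of EM for a general GMM. It is only a sketch (densities are not computed in log-space, and a small ridge is added to the covariances for numerical stability); all names are mine:

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_em(X, k, n_iter=100, seed=0):
        """EM for a k-component GMM on the (N, d) data X. Returns (w, mu, Sigma, tau)."""
        N, d = X.shape
        rng = np.random.default_rng(seed)
        w = np.full(k, 1.0 / k)                                   # w_j = 1/k
        mu = X[rng.choice(N, size=k, replace=False)].astype(float)  # random data points as means
        Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)    # start from the global covariance
        for _ in range(n_iter):
            # E-step: tau_ij = p(c_j | x_i, lambda_t); in practice use log-space to avoid underflow.
            dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                                    for j in range(k)])           # (N, k)
            tau = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and covariances.
            Nj = tau.sum(axis=0)                                  # effective counts per component
            w = Nj / N
            mu = (tau.T @ X) / Nj[:, None]
            for j in range(k):
                diff = X - mu[j]
                Sigma[j] = (tau[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        return w, mu, Sigma, tau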
  • 46. Regularized E.M. for GMM • In the case of REM, the entropy H(·) is
      H(C | X; λ_t) = −∑_{i=1}^N ∑_{j=1}^k p(c_j | x_i; λ_t) log p(c_j | x_i; λ_t) = −∑_{i=1}^N ∑_{j=1}^k τ_ij(t) log τ_ij(t)
    and the penalized likelihood will be
      L̃(λ_t; X, C) = L(λ_t; X, C) − γ H(C | X; λ_t)
                    = ∑_{i=1}^N log ∑_{j=1}^k w_j p(x_i | c_j, λ_t) + γ ∑_{i=1}^N ∑_{j=1}^k τ_ij(t) log τ_ij(t)
    Machine Learning 45
  • 47. Regularized E.M. for GMM • Some algebra [5] turns this into:
      w_j(t+1) = (1/N) ∑_{i=1}^N p(c_j | x_i, λ_t) (1 + γ log p(c_j | x_i, λ_t)) = (1/N) ∑_{i=1}^N τ_ij(t) (1 + γ log τ_ij(t))
      µ_j(t+1) = ∑_{i=1}^N p(c_j | x_i, λ_t) x_i (1 + γ log p(c_j | x_i, λ_t)) / ∑_{i=1}^N p(c_j | x_i, λ_t) (1 + γ log p(c_j | x_i, λ_t))
               = (1/(N w_j(t+1))) ∑_{i=1}^N τ_ij(t) x_i (1 + γ log τ_ij(t))
    Machine Learning 46
  • 48. Regularized E.M. for GMM • Some algebra [5] turns this into (cont.):
      Σ_j(t+1) = (1/(N w_j(t+1))) ∑_{i=1}^N τ_ij(t) (1 + γ log τ_ij(t)) d_ij(t+1)
    where
      d_ij(t+1) = [x_i − µ_j(t+1)][x_i − µ_j(t+1)]^T
    Machine Learning 47
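For reference, the modified M-step of slides 47-48 can be coded directly from these update equations. A sketch that assumes the responsibilities τ_ij from the (unchanged) E-step are given; when γ = 0 it reduces to the standard GMM M-step. All names are mine:

    import numpy as np

    def rem_m_step(X, tau, gamma, eps=1e-12):
        """Regularized-EM M-step following the slide equations from [5]:
        responsibilities are reweighted by (1 + gamma * log tau_ij)."""
        N, d = X.shape
        a = tau * (1.0 + gamma * np.log(tau + eps))     # reweighted responsibilities
        w = a.sum(axis=0) / N                           # w_j(t+1)
        mu = (a.T @ X) / (N * w)[:, None]               # mu_j(t+1)
        Sigma = np.empty((tau.shape[1], d, d))
        for j in range(tau.shape[1]):
            diff = X - mu[j]
            Sigma[j] = (a[:, j, None] * diff).T @ diff / (N * w[j])  # Sigma_j(t+1)
        return w, mu, Sigma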
  • 49. Demonstration • EM for GMM • REM for GMM Machine Learning 48
  • 50. Local optimum solution • E.M. is guaranteed to find a locally optimal solution by monotonically increasing the log-likelihood • Whether it converges to the globally optimal solution depends on the initialization. [Figure: two GMM fits obtained from different initializations.] Machine Learning 49
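Because the attained optimum depends on the initialization, a common workaround is to run EM several times from random starts and keep the run with the highest log-likelihood. A sketch reusing the gmm_em function from the earlier EM block (helper names and the restart count are illustrative):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_log_likelihood(X, w, mu, Sigma):
        """log p(X | G) = sum_i log sum_j w_j N(x_i; mu_j, Sigma_j)."""
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                                for j in range(len(w))])
        return np.log(dens.sum(axis=1)).sum()

    def gmm_em_restarts(X, k, n_restarts=10):
        best, best_ll = None, -np.inf
        for seed in range(n_restarts):
            w, mu, Sigma, _ = gmm_em(X, k, seed=seed)   # gmm_em: sketch given after slide 45
            ll = gmm_log_likelihood(X, w, mu, Sigma)
            if ll > best_ll:
                best, best_ll = (w, mu, Sigma), ll
        return best, best_ll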
  • 51. GMM: Selecting the number of components • We can run the E.M. algorithm with different numbers of components. – We need a criterion for selecting the “best” number of components. [Figure: GMM fits with different numbers of components.] Machine Learning 50
  • 52. GMM: Model Selection • Empirically/Experimentally [Sure!] • Cross-Validation [How?] • BIC • ... Machine Learning 51
  • 53. GMM: Model Selection • Empirically/Experimentally – Typically 3-5 components • Cross-Validation: K-fold, leave-one-out... – Omit each point x_i in turn, estimate the parameters θ^{−i} on the basis of the remaining points, then evaluate
      ∑_{i=1}^N log p(x_i | θ^{−i})
    • BIC: find k (the number of components) that minimizes the BIC
      BIC = −log p(data | θ̂_k) + (d_k / 2) log n
    where d_k is the number of (effective) parameters in the k-component mixture. Machine Learning 52
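The BIC scan over k can be written directly on top of the earlier gmm_em and gmm_log_likelihood sketches. The parameter count below assumes full covariance matrices, and all names are mine:

    import numpy as np

    def gmm_bic(X, w, mu, Sigma):
        """BIC = -log p(data | theta_hat) + (d_k / 2) * log n; smaller is better."""
        n, d = X.shape
        k = len(w)
        d_k = (k - 1) + k * d + k * d * (d + 1) // 2   # weights + means + full covariances
        return -gmm_log_likelihood(X, w, mu, Sigma) + 0.5 * d_k * np.log(n)

    def select_k_by_bic(X, k_values=range(1, 8)):
        scores = {}
        for k in k_values:
            w, mu, Sigma, _ = gmm_em(X, k)             # gmm_em: sketch given after slide 45
            scores[k] = gmm_bic(X, w, mu, Sigma)
        return min(scores, key=scores.get), scores     # k with the lowest BIC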
  • 54. Contents • Unsupervised Learning • K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 53
  • 55. Gaussian mixtures for classification
      p(y = i | x) = p(x | y = i) p(y = i) / p(x)
    • To build a Bayesian classifier based on GMMs, we can use a GMM to model the data in each class – So each class is modeled by one k-component GMM. • For example: Class 0: p(y = 0), p(x | θ_0) (a 3-component mixture); Class 1: p(y = 1), p(x | θ_1) (a 3-component mixture); Class 2: p(y = 2), p(x | θ_2) (a 3-component mixture); ... Machine Learning 54
  • 56. GMM for Classification • As before, each class is modeled by a k-component GMM. • A new test sample x is classified according to
      c = arg max_i p(y = i) p(x | θ_i)
    where
      p(x | θ_i) = ∑_{j=1}^k w_{ij} N(x; µ_ij, Σ_ij)
    • Simple, quick (and actually used in practice!) Machine Learning 55
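The resulting classifier is simply one GMM per class plus the class prior. A minimal sketch built on the earlier gmm_em function (structure and names are mine, not from the slides):

    import numpy as np
    from scipy.stats import multivariate_normal

    def train_gmm_classifier(X, y, k=3):
        """Fit one k-component GMM per class, together with the class prior p(y = c)."""
        models = {}
        for c in np.unique(y):
            Xc = X[y == c]
            prior = len(Xc) / len(X)                   # p(y = c)
            models[c] = (prior, gmm_em(Xc, k))         # gmm_em: sketch given after slide 45
        return models

    def classify(x, models):
        """c = argmax_i p(y = i) * p(x | theta_i)."""
        def score(c):
            prior, (w, mu, Sigma, _) = models[c]
            return prior * sum(w[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
                               for j in range(len(w)))
        return max(models, key=score)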
  • 57. Contents • Unsupervised Learning • K-means clustering • Expectation Maximization (E.M.) – Regularized EM – Model Selection • Gaussian mixtures as a Density Estimator – Gaussian mixtures – EM for mixtures • Gaussian mixtures for classification • Case studies Machine Learning 56
  • 58. Case studies • Background subtraction – GMM for each pixel • Speech recognition – GMM for the underlying distribution of feature vectors of each phone • Many, many others... Machine Learning 57
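As an aside on the background-subtraction case study: OpenCV ships a per-pixel Gaussian-mixture background model (MOG2). The snippet below only illustrates that idea with an off-the-shelf implementation, not the exact method discussed in the lecture; "video.mp4" is a placeholder path:

    import cv2

    # MOG2 maintains a small Gaussian mixture per pixel and flags pixels that the
    # mixture explains poorly as foreground.
    cap = cv2.VideoCapture("video.mp4")
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)       # foreground mask for this frame
        cv2.imshow("foreground", mask)
        if cv2.waitKey(30) & 0xFF == 27:     # press Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()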
  • 59. What you should already know • K-means as a trivial classifier • E.M. - an algorithm for solving many MLE problems • GMM - a tool for modeling data – Note 1: We can have a mixture model of many different types of distributions, not only Gaussians – Note 2: Computing the sum of Gaussians may be expensive; some approximations are available [3] • Model selection: – Bayesian Information Criterion Machine Learning 58
  • 61.-63. References
    [1] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
    [2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.
    [3] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In IEEE International Conference on Computer Vision, pages 464-471, 2003.
    [4] Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.
    [5] Haifeng Li, Keshu Zhang, and Tao Jiang. The regularized EM algorithm. In Proceedings of the 20th National Conference on Artificial Intelligence, pages 807-812, Pittsburgh, PA, 2005.
    [6] Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003.
    [7] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881-892, July 2002.
    [8] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297. University of California Press, 1967.
    [9] C. F. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11:95-103, 1983.
    Machine Learning 62