MCMC and likelihood-free methods Part/day I: Markov chain methods




                       MCMC and likelihood-free methods
                       Part/day I: Markov chain methods

                                            Christian P. Robert

                                 Université Paris-Dauphine, IUF, & CREST


                            Monash University, EBS, July 18, 2012
  Computational issues in Bayesian statistics




Motivations and leading example



   Computational issues in Bayesian statistics

   The Metropolis-Hastings Algorithm

   The Gibbs Sampler

   Population Monte Carlo
     abc of Bayesian perspective


What is Bayesian statistics?

      Statistical model defined by a likelihood function

      $$f(x_1, \ldots, x_n \mid \theta) = L(\theta \mid x_1, \ldots, x_n)$$

                                                                     [inversion of what varies]
   Bayesian approach turns the likelihood into a conditional density:

   $$\pi(\theta \mid x_1, \ldots, x_n) \propto \pi(\theta)\, L(\theta \mid x_1, \ldots, x_n)$$

   using a reference measure (or a prior) $\pi(\theta)$
                                                                    [Thomas Bayes, 1701–1761]


New perspective



              Uncertainty on the parameters θ of a model modeled through
              a probability distribution π on Θ, called prior distribution
              Inference processed through distribution of θ conditional on x,
              π(θ|x), called posterior distribution

              $$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}.$$
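
The posterior formula can be illustrated numerically. The sketch below is not from the slides: it assumes a hypothetical model, x_i ∼ N(θ, 1) with a N(0, 10²) prior on θ, and replaces the integral in the denominator with a Riemann sum over a grid. The function names `normal_pdf` and `grid_posterior` and the data values are illustrative only.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def grid_posterior(data, grid, prior_mean=0.0, prior_sd=10.0):
    """Posterior density values on an equally spaced grid.

    Assumed model (illustration only): x_i ~ N(theta, 1),
    theta ~ N(prior_mean, prior_sd^2).
    """
    step = grid[1] - grid[0]
    unnorm = []
    for theta in grid:
        log_lik = sum(math.log(normal_pdf(x, theta, 1.0)) for x in data)
        # numerator f(x|theta) * pi(theta)
        unnorm.append(math.exp(log_lik) * normal_pdf(theta, prior_mean, prior_sd))
    # Riemann sum standing in for the integral of f(x|theta) pi(theta) dtheta
    evidence = sum(unnorm) * step
    return [u / evidence for u in unnorm]

data = [1.2, 0.7, 1.9, 1.4]                      # made-up observations
grid = [-5.0 + 0.01 * i for i in range(1001)]    # theta in [-5, 5]
step = grid[1] - grid[0]
post = grid_posterior(data, grid)
post_mean = sum(t * p for t, p in zip(grid, post)) * step
```

In this conjugate toy case the posterior mean is known in closed form (sum of the data divided by n + 1/100), which gives a check on the grid approximation. Grids stop being practical beyond a few dimensions, which is one motivation for the simulation methods of the later sections.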


Justifications



              Semantic drift from unknown to random
              Updating of the information on θ by extracting the
              information on θ contained in the observation x
              Allows incorporation of imperfect information in the decision
              process
              Unique mathematical way to condition upon the observations
              (conditional perspective)
              Penalization factor


Posterior distribution


                              π(θ|x) central to Bayesian inference
              Operates conditional upon the observations
              Incorporates the requirement of the Likelihood Principle
              Avoids averaging over the unobserved values of x
              Coherent updating of the information available on θ,
              independent of the order in which i.i.d. observations are
              collected
              Provides a complete inferential scope
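
The order-independence of coherent updating can be checked on a small example. The sketch below assumes a Bernoulli model with a Beta(1, 1) prior, which is not taken from the slides. The function names `update_beta` and `sequential_posterior` are hypothetical.

```python
import random

def update_beta(a, b, x):
    """One Bayesian update of a Beta(a, b) prior after a Bernoulli observation x in {0, 1}."""
    return a + x, b + (1 - x)

def sequential_posterior(observations, a=1.0, b=1.0):
    """Process observations one at a time, each posterior serving as the next prior."""
    for x in observations:
        a, b = update_beta(a, b, x)
    return a, b

obs = [1, 0, 0, 1, 1, 1, 0, 1]        # made-up i.i.d. Bernoulli observations
shuffled = obs[:]
random.Random(0).shuffle(shuffled)    # same data, different collection order

post_original = sequential_posterior(obs)
post_shuffled = sequential_posterior(shuffled)
```

Both orderings lead to the same Beta(6, 4) posterior, since the update only depends on the counts of ones and zeros.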
     Latent variables


Latent structures make life harder!



      Even simple models may lead to computational complications, as
      in latent variable models

      $$f(x \mid \theta) = \int f(x, x^\star \mid \theta)\,\mathrm{d}x^\star$$
              If $(x, x^\star)$ observed, fine!
              If only x observed, trouble!


example: mixture models


      Models of mixtures of distributions:

      $$X \sim f_j \text{ with probability } p_j,$$

      for $j = 1, 2, \ldots, k$, with overall density

      $$X \sim p_1 f_1(x) + \cdots + p_k f_k(x).$$
      For a sample of independent random variables $(X_1, \ldots, X_n)$, sample density

      $$\prod_{i=1}^{n} \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}.$$
      Expanding this product of sums into a sum of products involves $k^n$
      elementary terms: prohibitive to compute in large samples.
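
The count of elementary terms can be verified by brute force on a tiny sample. The sketch below assumes normal components with unit variance and made-up data. The function names `likelihood_product` and `likelihood_expanded` are hypothetical.

```python
import itertools
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood_product(xs, weights, means):
    """Product over observations of the mixture density: n * k evaluations."""
    total = 1.0
    for x in xs:
        total *= sum(p * normal_pdf(x, m) for p, m in zip(weights, means))
    return total

def likelihood_expanded(xs, weights, means):
    """Sum of products: one term per allocation of the n observations to the k components."""
    k = len(weights)
    total, n_terms = 0.0, 0
    for alloc in itertools.product(range(k), repeat=len(xs)):
        term = 1.0
        for x, j in zip(xs, alloc):
            term *= weights[j] * normal_pdf(x, means[j])
        total += term
        n_terms += 1
    return total, n_terms

xs = [-0.3, 0.1, 2.2, 2.9, 0.4, 1.8]         # made-up sample, n = 6
weights, means = [0.3, 0.7], [0.0, 2.5]      # k = 2 components (illustrative values)
direct = likelihood_product(xs, weights, means)
expanded, n_terms = likelihood_expanded(xs, weights, means)
```

With n = 6 and k = 2 the expansion already has 64 terms, and each extra observation multiplies that count by k. The two evaluations agree numerically; the expanded form is the one that appears once each term needs its own posterior update.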


Simple mixture (1)
      [Figure: plot with µ1 on the horizontal axis and µ2 on the vertical axis. Case of the 0.3N(µ1, 1) + 0.7N(µ2, 1) likelihood]


Simple mixture (2)


      For mixture of two normal distributions,

      $$0.3\,\mathcal{N}(\mu_1, 1) + 0.7\,\mathcal{N}(\mu_2, 1),$$

      likelihood proportional to

      $$\prod_{i=1}^{n} \left[0.3\,\varphi(x_i - \mu_1) + 0.7\,\varphi(x_i - \mu_2)\right]$$

      containing $2^n$ terms.
Complex maximisation

      Standard maximization techniques often fail to find the global
      maximum because of multimodality or undesirable behavior
      (usually at the frontier of the domain) of the likelihood function.

      Example
      In the special case

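      $$f(x \mid \mu, \sigma) = (1 - \epsilon)\exp\{(-1/2)x^2\} + \frac{\epsilon}{\sigma}\exp\{(-1/2\sigma^2)(x - \mu)^2\}$$

      with $\epsilon > 0$ known, whatever $n$, the likelihood is unbounded:

      $$\lim_{\sigma \to 0} L(x_1, \ldots, x_n \mid \mu = x_1, \sigma) = \infty$$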
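
The divergence can be observed numerically. The sketch below evaluates the log-likelihood of the two-component density above at µ = x_1 for decreasing σ. The weight ε = 0.3 and the data are assumptions made for illustration, and the common 1/√(2π) factor is dropped.

```python
import math

def log_likelihood(xs, mu, sigma, eps=0.3):
    """Log-likelihood of (1-eps) exp(-x^2/2) + (eps/sigma) exp(-(x-mu)^2 / (2 sigma^2)),
    up to an additive constant."""
    total = 0.0
    for x in xs:
        first = (1.0 - eps) * math.exp(-0.5 * x * x)
        second = (eps / sigma) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        total += math.log(first + second)
    return total

xs = [0.8, -1.1, 0.3, 2.4, -0.5]             # made-up observations
sigmas = [1.0, 1e-1, 1e-2, 1e-4, 1e-8]
# Fix mu at the first observation and let sigma shrink towards zero.
values = [log_likelihood(xs, mu=xs[0], sigma=s) for s in sigmas]
```

The term attached to x_1 contributes roughly log(ε/σ), which grows without bound, while the other terms stay bounded below by the first component. For small enough σ the log-likelihood therefore keeps increasing, even though it may dip for moderate values of σ.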


Unbounded likelihood

      [Figure: six panels, for n = 3, 6, 12, 24, 48 and 96, with µ on the horizontal axis and σ on the vertical axis. Case of the 0.3N(0, 1) + 0.7N(µ, σ) likelihood]


Mixture once again
      Observations from

      $$x_1, \ldots, x_n \sim f(x \mid \theta) = p\,\varphi(x; \mu_1, \sigma_1) + (1 - p)\,\varphi(x; \mu_2, \sigma_2)$$
      Prior

      $$\mu_i \mid \sigma_i \sim \mathcal{N}(\xi_i, \sigma_i^2/n_i), \qquad \sigma_i^2 \sim \mathcal{IG}(\nu_i/2, s_i^2/2), \qquad p \sim \mathcal{B}e(\alpha, \beta)$$
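      Posterior

      $$\begin{aligned}
      \pi(\theta \mid x_1, \ldots, x_n) &\propto \prod_{j=1}^{n} \{p\,\varphi(x_j; \mu_1, \sigma_1) + (1 - p)\,\varphi(x_j; \mu_2, \sigma_2)\}\,\pi(\theta) \\
      &= \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\,\pi(\theta \mid (k_t))
      \end{aligned}$$

                                                                                               [$O(2^n)$]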


Mixture once again (cont’d)



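      For a given permutation $(k_t)$, conditional posterior distribution

      $$\begin{aligned}
      \pi(\theta \mid (k_t)) ={}& \mathcal{N}\!\left(\xi_1(k_t), \frac{\sigma_1^2}{n_1 + \ell}\right) \times \mathcal{IG}\big((\nu_1 + \ell)/2,\ s_1(k_t)/2\big) \\
      &\times \mathcal{N}\!\left(\xi_2(k_t), \frac{\sigma_2^2}{n_2 + n - \ell}\right) \times \mathcal{IG}\big((\nu_2 + n - \ell)/2,\ s_2(k_t)/2\big) \\
      &\times \mathcal{B}e(\alpha + \ell,\ \beta + n - \ell)
      \end{aligned}$$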
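      where

      $$\begin{aligned}
      \bar{x}_1(k_t) &= \frac{1}{\ell}\sum_{t=1}^{\ell} x_{k_t}, & \hat{s}_1(k_t) &= \sum_{t=1}^{\ell} (x_{k_t} - \bar{x}_1(k_t))^2, \\
      \bar{x}_2(k_t) &= \frac{1}{n-\ell}\sum_{t=\ell+1}^{n} x_{k_t}, & \hat{s}_2(k_t) &= \sum_{t=\ell+1}^{n} (x_{k_t} - \bar{x}_2(k_t))^2
      \end{aligned}$$

      and

      $$\begin{aligned}
      \xi_1(k_t) &= \frac{n_1\xi_1 + \ell\,\bar{x}_1(k_t)}{n_1 + \ell}, \qquad \xi_2(k_t) = \frac{n_2\xi_2 + (n-\ell)\,\bar{x}_2(k_t)}{n_2 + n - \ell}, \\
      s_1(k_t) &= s_1^2 + \hat{s}_1^2(k_t) + \frac{n_1 \ell}{n_1 + \ell}\,(\xi_1 - \bar{x}_1(k_t))^2, \\
      s_2(k_t) &= s_2^2 + \hat{s}_2^2(k_t) + \frac{n_2 (n-\ell)}{n_2 + n - \ell}\,(\xi_2 - \bar{x}_2(k_t))^2,
      \end{aligned}$$

      posterior updates of the hyperparameters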


Mixture once again




      Bayes estimator of θ:

      $$\delta^\pi(x_1, \ldots, x_n) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\,\mathbb{E}^\pi[\theta \mid x, (k_t)]$$

                                            Too costly: $2^n$ terms
     The AR(p) model


AR(p) model

      Auto-regressive representation of a time series,

      $$x_t \mid x_{t-1}, \ldots \sim \mathcal{N}\!\left(\mu + \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu),\ \sigma^2\right)$$
              Generalisation of AR(1)
              Among the most commonly used models in dynamic settings
              More challenging than the static models (stationarity
              constraints)
              Different models depending on the processing of the starting
              value $x_0$


Unwieldy stationarity constraints

      Practical difficulty: for complex models, stationarity constraints get
      quite involved to the point of being unknown in some cases

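      Example (AR(1))
      Case of linear Markovian dependence on the last value

      $$x_t = \mu + \varrho\,(x_{t-1} - \mu) + \epsilon_t, \qquad \epsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$$

      If $|\varrho| < 1$, $(x_t)_{t \in \mathbb{Z}}$ can be written as

      $$x_t = \mu + \sum_{j=0}^{\infty} \varrho^j \epsilon_{t-j}$$

      and this is a stationary representation.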


Stationary but...



      If $|\varrho| > 1$, alternative stationary representation

      $$x_t = \mu - \sum_{j=1}^{\infty} \varrho^{-j} \epsilon_{t+j}.$$
      This stationary solution is criticized as artificial because $x_t$ is
      correlated with future white noises $(\epsilon_s)_{s>t}$, unlike the case when
      $|\varrho| < 1$.
      Non-causal representation...


Stationarity+causality


      Stationarity constraints in the prior as a restriction on the values of θ.

      Theorem
      AR(p) model second-order stationary and causal iff the roots of the
      polynomial

      $$\mathcal{P}(x) = 1 - \sum_{i=1}^{p} \varrho_i x^i$$

      are all outside the unit circle


Stationarity constraints


      Under stationarity constraints, complex parameter space: each
      value of $\varrho$ needs to be checked for roots of the corresponding
      polynomial with modulus less than 1
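   E.g., for an AR(2) process with autoregressive polynomial
   $\mathcal{P}(u) = 1 - \varrho_1 u - \varrho_2 u^2$, constraint is

   $$\varrho_1 + \varrho_2 < 1, \qquad \varrho_2 - \varrho_1 < 1 \qquad \text{and} \qquad |\varrho_2| < 1$$

   [Figure: plot with θ1 on the horizontal axis (from −2 to 2) and θ2 on the vertical axis (from −1 to 1)]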
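
The root condition of the theorem can be compared with a closed-form AR(2) constraint on random parameter values. The sketch below uses the standard AR(2) triangle, written here as ϱ1 + ϱ2 < 1, ϱ2 − ϱ1 < 1 and |ϱ2| < 1. The function names are hypothetical and the sampling box is an arbitrary choice.

```python
import cmath
import random

def ar2_roots(rho1, rho2):
    """Roots of P(u) = 1 - rho1*u - rho2*u^2 (assumes rho2 != 0)."""
    # Rewrite as rho2*u^2 + rho1*u - 1 = 0 and apply the quadratic formula.
    disc = cmath.sqrt(rho1 * rho1 + 4.0 * rho2)
    return ((-rho1 + disc) / (2.0 * rho2), (-rho1 - disc) / (2.0 * rho2))

def is_causal_by_roots(rho1, rho2):
    """Theorem: stationary and causal iff all roots lie outside the unit circle."""
    return all(abs(u) > 1.0 for u in ar2_roots(rho1, rho2))

def is_causal_by_triangle(rho1, rho2):
    """Closed-form constraint for p = 2."""
    return rho1 + rho2 < 1.0 and rho2 - rho1 < 1.0 and abs(rho2) < 1.0

rng = random.Random(1)
points = [(rng.uniform(-2.5, 2.5), rng.uniform(-1.5, 1.5)) for _ in range(2000)]
agree = all(is_causal_by_roots(a, b) == is_causal_by_triangle(a, b) for a, b in points)
```

For p = 2 the check is a one-liner, but no comparable closed form is available for general p. Each proposed value then requires finding the roots of the polynomial, as noted above.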
The MA(q) model


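      Alternative type of time series

      $$x_t = \mu + \epsilon_t - \sum_{j=1}^{q} \vartheta_j \epsilon_{t-j}, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$

      Stationary but, for identifiability considerations, the polynomial

      $$\mathcal{Q}(x) = 1 - \sum_{j=1}^{q} \vartheta_j x^j$$

      must have all its roots outside the unit circle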


Identifiability


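      Example
      For the MA(1) model, $x_t = \mu + \epsilon_t - \vartheta_1 \epsilon_{t-1}$, with

      $$\operatorname{var}(x_t) = (1 + \vartheta_1^2)\sigma^2,$$

      can also be written

      $$x_t = \mu + \tilde\epsilon_{t-1} - \frac{1}{\vartheta_1}\tilde\epsilon_t, \qquad \tilde\epsilon \sim \mathcal{N}(0, \vartheta_1^2\sigma^2).$$

      Both pairs $(\vartheta_1, \sigma)$ and $(1/\vartheta_1, \vartheta_1\sigma)$ lead to alternative
      representations of the same model.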
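
A quick numerical check of this identifiability issue: a Gaussian MA(1) series with a given mean is determined by its lag-0 and lag-1 autocovariances, so it is enough to compare those for the two parameter pairs. The function name `ma1_autocovariances` and the values of ϑ1 and σ are illustrative assumptions.

```python
import math

def ma1_autocovariances(theta, sigma):
    """(gamma_0, gamma_1) of x_t = mu + eps_t - theta * eps_{t-1}, eps_t ~ N(0, sigma^2)."""
    gamma0 = (1.0 + theta * theta) * sigma * sigma
    gamma1 = -theta * sigma * sigma
    return gamma0, gamma1

theta, sigma = 0.4, 1.5                                       # illustrative values
original = ma1_autocovariances(theta, sigma)                  # pair (theta_1, sigma)
flipped = ma1_autocovariances(1.0 / theta, theta * sigma)     # pair (1/theta_1, theta_1 * sigma)
same_model = all(math.isclose(a, b) for a, b in zip(original, flipped))
```

Both pairs give the same autocovariances, so the data cannot tell them apart. Requiring the roots of the MA polynomial to lie outside the unit circle keeps only one of the two.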


Properties of MA models




              Non-Markovian model (but special case of hidden Markov)
              Autocovariance $\gamma_x(s)$ is null for $|s| > q$


Representations

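      $x_{1:T}$ is a normal random variable with constant mean $\mu$ and
      covariance matrix

      $$\Sigma = \begin{pmatrix}
      \sigma^2 & \gamma_1 & \gamma_2 & \ldots & \gamma_q & 0 & \ldots & 0 & 0 \\
      \gamma_1 & \sigma^2 & \gamma_1 & \ldots & \gamma_{q-1} & \gamma_q & \ldots & 0 & 0 \\
       & & & \ddots & & & & & \\
      0 & 0 & 0 & \ldots & 0 & 0 & \ldots & \gamma_1 & \sigma^2
      \end{pmatrix},$$

      with ($|s| \le q$)

      $$\gamma_s = \sigma^2 \sum_{i=0}^{q-|s|} \vartheta_i \vartheta_{i+|s|}$$

      Not manageable in practice [large T's]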


Representations (contd.)

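      Conditional on past $(\epsilon_0, \ldots, \epsilon_{-q+1})$,

      $$L(\mu, \vartheta_1, \ldots, \vartheta_q, \sigma \mid x_{1:T}, \epsilon_0, \ldots, \epsilon_{-q+1}) \propto \sigma^{-T} \prod_{t=1}^{T} \exp\left\{ -\left( x_t - \mu + \sum_{j=1}^{q} \vartheta_j \hat\epsilon_{t-j} \right)^{2} \Big/ 2\sigma^2 \right\},$$

      where ($t > 0$)

      $$\hat\epsilon_t = x_t - \mu + \sum_{j=1}^{q} \vartheta_j \hat\epsilon_{t-j}, \qquad \hat\epsilon_0 = \epsilon_0, \ \ldots, \ \hat\epsilon_{1-q} = \epsilon_{1-q}$$

      Recursive definition of the likelihood, still costly $O(T \times q)$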
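
The recursion can be sketched as a single pass over the series that keeps the q most recent residuals, which gives the O(T × q) cost. The function name `ma_conditional_loglik`, the argument layout and the test values are assumptions made for illustration.

```python
import math

def ma_conditional_loglik(x, mu, thetas, sigma, past_eps):
    """Conditional log-likelihood of an MA(q) model, up to an additive constant.

    thetas   = [theta_1, ..., theta_q]
    past_eps = [eps_0, eps_{-1}, ..., eps_{1-q}] (the conditioning values)
    """
    q = len(thetas)
    recent = list(past_eps)            # recent[j-1] holds eps_hat_{t-j}, j = 1..q
    loglik = -len(x) * math.log(sigma)
    for xt in x:
        # eps_hat_t = x_t - mu + sum_j theta_j * eps_hat_{t-j}
        eps_hat = xt - mu + sum(th * e for th, e in zip(thetas, recent))
        loglik -= eps_hat * eps_hat / (2.0 * sigma * sigma)
        recent = [eps_hat] + recent[:q - 1]
    return loglik

# Check on a short MA(2) series built from known noise values.
mu, thetas, sigma = 1.0, [0.6, -0.3], 0.8
past_eps = [-0.2, 0.5]                 # eps_0, eps_{-1}
true_eps = [0.1, 0.9, -1.3, 0.4]       # eps_1, ..., eps_4
history, x = list(past_eps), []
for e in true_eps:
    x.append(mu + e - sum(th * h for th, h in zip(thetas, history)))
    history = [e] + history[:len(thetas) - 1]

loglik = ma_conditional_loglik(x, mu, thetas, sigma, past_eps)
expected = -len(x) * math.log(sigma) - sum(e * e for e in true_eps) / (2.0 * sigma ** 2)
```

When the conditioning values are the true past noises, the recursion recovers the noise sequence exactly, so the result matches the Gaussian log-density of the noises. In practice the past values are unknown and have to be fixed or simulated.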


Representations (contd.)


      Encompassing approach for general time series models
      State-space representation

      $$x_t = G y_t + \varepsilon_t, \tag{1}$$
      $$y_{t+1} = F y_t + \xi_t, \tag{2}$$

      (1) is the observation equation and (2) is the state equation
      Note
      This is a special case of hidden Markov model


MA(q) state-space representation

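      For the MA(q) model, take

      $$y_t = (\epsilon_{t-q}, \ldots, \epsilon_{t-1}, \epsilon_t)^\top$$

      and then

      $$\begin{aligned}
      y_{t+1} &= \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ & & & \ddots & \\ 0 & 0 & 0 & \ldots & 1 \\ 0 & 0 & 0 & \ldots & 0 \end{pmatrix} y_t + \epsilon_{t+1} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}, \\
      x_t &= \mu - \begin{pmatrix} \vartheta_q & \vartheta_{q-1} & \ldots & \vartheta_1 & -1 \end{pmatrix} y_t.
      \end{aligned}$$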
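
This state-space form can be checked against the direct MA(q) definition on a short noise sequence. In the sketch below the state is a plain list, the shift matrix is applied as a list shift, and the names `observe` and `step` as well as the numerical values are hypothetical.

```python
import math

def observe(mu, thetas, state):
    """Observation equation: x_t = mu - (theta_q, ..., theta_1, -1) . y_t"""
    coeffs = list(reversed(thetas)) + [-1.0]
    return mu - sum(c * s for c, s in zip(coeffs, state))

def step(state, new_eps):
    """State equation: shift y_t by one position and append the new noise eps_{t+1}."""
    return state[1:] + [new_eps]

mu, thetas = 2.0, [0.5, -0.4, 0.2]               # q = 3, illustrative values
eps = [0.3, -1.0, 0.7, 0.2, -0.6, 1.1, 0.05]     # made-up noise sequence
q = len(thetas)

# State-space path: start from the first q + 1 noises and iterate.
state = eps[: q + 1]
xs_state_space = []
for t in range(q, len(eps)):
    xs_state_space.append(observe(mu, thetas, state))
    if t + 1 < len(eps):
        state = step(state, eps[t + 1])

# Direct definition: x_t = mu + eps_t - sum_j theta_j * eps_{t-j}
xs_direct = [mu + eps[t] - sum(thetas[j - 1] * eps[t - j] for j in range(1, q + 1))
             for t in range(q, len(eps))]

match = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(xs_state_space, xs_direct))
```

The two constructions give the same series: the state only stores the last q + 1 noise values.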


MA(q) state-space representation (cont’d)


      Example
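      For the MA(1) model, observation equation

      \[ x_t = (1 \;\; 0)\, y_t \]

      with
      \[ y_t = (y_{1t} \;\; y_{2t})' \]
      directed by the state equation

      \[ y_{t+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} y_t + \epsilon_{t+1} \begin{pmatrix} 1 \\ \vartheta_1 \end{pmatrix} . \]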
A typology of Bayes computational problems

        (i). latent variable models in general;
       (ii). use of a complex parameter space, as for instance in
             constrained parameter sets like those resulting from imposing
             stationarity constraints in dynamic models;
      (iii). use of a complex sampling model with an intractable
             likelihood, as for instance in some graphical models;
       (iv). use of a huge dataset;
        (v). use of a complex prior distribution (which may be the
             posterior distribution associated with an earlier sample);
       (vi). use of a particular inferential procedure, as for instance Bayes
             factors
             \[ B^{\pi}_{01}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \bigg/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)} . \]
The Metropolis-Hastings Algorithm



General purpose



      A major computational issue in Bayesian statistics:

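      Given a density $\pi$ known up to a normalizing constant, $\pi \propto \tilde\pi$, and an
      integrable function $h$, compute

      \[ \Pi(h) = \int h(x)\,\pi(x)\,\mu(dx) = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)} \]

      when $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable.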
Monte Carlo 101

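      Generate an iid sample $x_1, \ldots, x_N$ from $\pi$ and estimate $\Pi(h)$ by

      \[ \hat\Pi^{MC}_N(h) = N^{-1} \sum_{i=1}^N h(x_i). \]

      LLN: $\hat\Pi^{MC}_N(h) \xrightarrow{\text{a.s.}} \Pi(h)$

      If $\Pi(h^2) = \int h^2(x)\,\pi(x)\,\mu(dx) < \infty$,

      CLT: \[ \sqrt{N}\left(\hat\Pi^{MC}_N(h) - \Pi(h)\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0, \Pi\{[h - \Pi(h)]^2\}\right). \]

      Caveat leading to MCMC

      Often impossible or inefficient to simulate directly from $\Pi$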
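
As a small illustration of the LLN and CLT statements, here is a sketch of the plain Monte Carlo estimator with a CLT-based standard error. The choices π = N(0, 1), h(x) = x² and N = 20000 are assumptions made for the example (so that Π(h) = 1 is known), and `mc_estimate` is a hypothetical helper name.

```python
import math
import random

def mc_estimate(h, sampler, n, rng):
    """Plain Monte Carlo estimate of Pi(h) and its estimated standard error."""
    vals = [h(sampler(rng)) for _ in range(n)]
    mean = sum(vals) / n
    # Empirical version of Pi{[h - Pi(h)]^2}, the CLT variance.
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(1)
est, se = mc_estimate(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 20000, rng)
print(f"estimate of E[X^2] = {est:.3f} +/- {1.96 * se:.3f}")
```

The printed half-width is the usual 95% interval read off the CLT; it shrinks at rate N^(-1/2).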
Importance Sampling

      For $Q$ a proposal distribution such that $Q(dx) = q(x)\mu(dx)$,
      alternative representation

      \[ \Pi(h) = \int h(x)\,\{\pi/q\}(x)\,q(x)\,\mu(dx). \]

      Principle of importance (!)
      Generate an iid sample $x_1, \ldots, x_N \sim Q$ and estimate $\Pi(h)$ by

      \[ \hat\Pi^{IS}_{Q,N}(h) = N^{-1} \sum_{i=1}^N h(x_i)\,\{\pi/q\}(x_i). \]
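
A minimal sketch of the importance sampling estimator, under assumed choices: target π = N(0, 1), h the indicator of {x > 3}, and proposal Q = N(3, 1), so that draws land where h is non-zero. Both densities are fully normalised here, as the estimator requires; the function names are illustrative.

```python
import math
import random

def norm_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_sampling(h, target_pdf, proposal_pdf, proposal_draw, n, rng):
    """N^{-1} sum_i h(x_i) {pi/q}(x_i), with x_i drawn iid from Q."""
    total = 0.0
    for _ in range(n):
        x = proposal_draw(rng)
        total += h(x) * target_pdf(x) / proposal_pdf(x)
    return total / n

rng = random.Random(2)
is_est = importance_sampling(
    h=lambda x: 1.0 if x > 3.0 else 0.0,
    target_pdf=norm_pdf,
    proposal_pdf=lambda x: norm_pdf(x, mean=3.0),
    proposal_draw=lambda r: r.gauss(3.0, 1.0),
    n=20000,
    rng=rng,
)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(X > 3) for X ~ N(0, 1)
print(f"IS estimate {is_est:.6f}, exact {exact:.6f}")
```

Sampling directly from N(0, 1) would see the event {x > 3} only about once in 740 draws, which is why shifting the proposal helps in this example.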
Properties of importance


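      Then
      LLN: $\hat\Pi^{IS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \Pi(h)$
      and if $Q((h\pi/q)^2) < \infty$,

      CLT: \[ \sqrt{N}\left(\hat\Pi^{IS}_{Q,N}(h) - \Pi(h)\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0, Q\{(h\pi/q - \Pi(h))^2\}\right). \]

      Caveat
      If the normalizing constant of $\pi$ is unknown, it is impossible to use $\hat\Pi^{IS}_{Q,N}$

      Generic problem in Bayesian Statistics: $\pi(\theta|x) \propto f(x|\theta)\pi(\theta)$.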
Self-Normalised Importance Sampling

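      Self-normalised version

      \[ \hat\Pi^{SNIS}_{Q,N}(h) = \left( \sum_{i=1}^N \{\pi/q\}(x_i) \right)^{-1} \sum_{i=1}^N h(x_i)\,\{\pi/q\}(x_i). \]

      LLN: $\hat\Pi^{SNIS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \Pi(h)$
      and if $\Pi((1 + h^2)(\pi/q)) < \infty$,

      CLT: \[ \sqrt{N}\left(\hat\Pi^{SNIS}_{Q,N}(h) - \Pi(h)\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0, \Pi\{(\pi/q)(h - \Pi(h))^2\}\right). \]

      The quality of the SNIS approximation depends on the
      choice of $Q$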
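
The self-normalised estimator only needs π and q up to constants, since these cancel in the ratio. The sketch below assumes an unnormalised N(1, 1) target, exp{−(x−1)²/2}, an unnormalised Cauchy proposal, 1/(1+x²), and h(x) = x, so the answer Π(h) = 1 is known; all names are hypothetical.

```python
import math
import random

def snis(h, target_unnorm, proposal_unnorm, proposal_draw, n, rng):
    """Self-normalised importance sampling: weighted mean of h(x_i)
    with weights proportional to {pi/q}(x_i)."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = proposal_draw(rng)
        w = target_unnorm(x) / proposal_unnorm(x)
        num += w * h(x)
        den += w
    return num / den

rng = random.Random(3)
snis_est = snis(
    h=lambda x: x,
    target_unnorm=lambda x: math.exp(-0.5 * (x - 1.0) ** 2),
    proposal_unnorm=lambda x: 1.0 / (1.0 + x * x),
    # Standard Cauchy draw by inversion of the cdf.
    proposal_draw=lambda r: math.tan(math.pi * (r.random() - 0.5)),
    n=20000,
    rng=rng,
)
print(f"SNIS estimate of the target mean: {snis_est:.3f}")
```

The heavy-tailed Cauchy proposal keeps the weights bounded here, which is one way to satisfy the moment condition of the CLT.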
Running Monte Carlo via Markov Chains (MCMC)



      It is not necessary to use a sample from the distribution $f$ to
      approximate the integral

      \[ I = \int h(x)\, f(x)\, dx , \]

                                                            [notation warning: $\pi$ turned to $f$!]
   We can obtain $X_1, \ldots, X_n \sim f$ (approximately) without directly
   simulating from $f$, using an ergodic Markov chain with stationary
   distribution $f$

                                                                       [Andreï Markov]
Running Monte Carlo via Markov Chains (2)
      Idea
      For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is
      generated using a transition kernel with stationary distribution $f$

        - an irreducible Markov chain with stationary distribution $f$ is
          ergodic with limiting distribution $f$ under weak conditions
        - hence convergence in distribution of $(X^{(t)})$ to a random
          variable from $f$
        - for a "large enough" $T_0$, $X^{(T_0)}$ is (approximately) distributed from $f$
        - the Markov sequence is a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$
          generated from $f$
        - Birkhoff's ergodic theorem extends the LLN, sufficient for most
          approximation purposes
      Problem: How can one build a Markov chain with a given
      stationary distribution?
The Metropolis–Hastings algorithm


      Basics
      The algorithm uses the objective (target) density

      \[ f \]

      and a conditional density

      \[ q(y|x) \]

      called the instrumental (or proposal) distribution

                                                                  [Nicholas Metropolis]
The MH algorithm

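      Algorithm (Metropolis–Hastings)
      Given $x^{(t)}$,
         1. Generate $Y_t \sim q(y|x^{(t)})$.
         2. Take

            \[ X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t), \end{cases} \]

            where

            \[ \rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\, \frac{q(x|y)}{q(y|x)},\, 1 \right\}. \]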
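
The two steps of the algorithm can be sketched as follows. This is an illustration under assumptions, not the author's code: the target is an unnormalised Gamma(3, 1) density x² e^{−x}, and the proposal is a multiplicative log-normal move, chosen because it is not symmetric, so the factor q(x|y)/q(y|x) = y/x actually matters. `SIGMA` and all function names are hypothetical.

```python
import math
import random

def metropolis_hastings(log_f, draw_q, log_q, x0, n_iter, rng):
    """Generic Metropolis-Hastings sampler.

    log_f(x)     : log target density, up to an additive constant
    draw_q(x, r) : draws Y ~ q(. | x) with random generator r
    log_q(y, x)  : log q(y | x), up to a constant not depending on x
    """
    x, chain, accepted = x0, [], 0
    for _ in range(n_iter):
        y = draw_q(x, rng)
        # log of f(y) q(x|y) / (f(x) q(y|x))
        log_ratio = log_f(y) + log_q(x, y) - log_f(x) - log_q(y, x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x, accepted = y, accepted + 1
        chain.append(x)  # a rejected proposal repeats the current value
    return chain, accepted / n_iter

SIGMA = 1.0  # assumed proposal scale on the log axis

def log_gamma3(x):
    return 2.0 * math.log(x) - x  # unnormalised Gamma(3, 1), x > 0

def draw_lognormal(x, r):
    return x * math.exp(SIGMA * r.gauss(0.0, 1.0))

def log_lognormal(y, x):
    return -math.log(y) - (math.log(y) - math.log(x)) ** 2 / (2.0 * SIGMA ** 2)

chain, acc_rate = metropolis_hastings(
    log_gamma3, draw_lognormal, log_lognormal, 1.0, 50000, random.Random(3))
mh_mean = sum(chain) / len(chain)
print(f"chain mean {mh_mean:.3f} (Gamma(3,1) mean is 3), acceptance {acc_rate:.2f}")
```

Working on the log scale avoids overflow in the ratio; neither `log_f` nor `log_q` needs its normalising constant.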
Features



        - Independent of normalizing constants for both $f$ and $q(\cdot|x)$
          (i.e., those constants independent of $x$)
        - Never moves to values with $f(y) = 0$
        - The chain $(x^{(t)})_t$ may take the same value several times in a
          row, even though $f$ is a density wrt Lebesgue measure
        - The sequence $(y_t)_t$ is usually not a Markov chain
Convergence properties


         1. The M-H Markov chain is reversible, with
            invariant/stationary density $f$ since it satisfies the detailed
            balance condition
            \[ f(y)\, K(y, x) = f(x)\, K(x, y) \]
         2. As $f$ is a probability measure, the chain is positive recurrent
         3. If
            \[ \Pr\left[ \frac{f(Y_t)\, q(X^{(t)}|Y_t)}{f(X^{(t)})\, q(Y_t|X^{(t)})} \ge 1 \right] < 1, \qquad (1) \]
            that is, the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain
            is aperiodic
Convergence properties (2)
         4. If
            \[ q(y|x) > 0 \text{ for every } (x, y), \qquad (2) \]
            the chain is irreducible
         5. For M-H, $f$-irreducibility implies Harris recurrence
         6. Thus, for M-H satisfying (1) and (2)
              (i) For $h$, with $E_f|h(X)| < \infty$,
                  \[ \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^T h(X^{(t)}) = \int h(x)\, df(x) \quad \text{a.e. } f. \]
             (ii) and
                  \[ \lim_{n\to\infty} \left\| \int K^n(x, \cdot)\, \mu(dx) - f \right\|_{TV} = 0 \]
                  for every initial distribution $\mu$, where $K^n(x, \cdot)$ denotes the
                  kernel for $n$ transitions.
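
On a finite state space these properties can be verified exactly. The sketch below builds the M-H transition matrix for an assumed five-state target and an assumed everywhere-positive proposal, then checks detailed balance and tracks the total variation distance of the n-step distribution to f. The weights, the proposal and the function names are all illustrative.

```python
def mh_kernel(f, q):
    """M-H transition matrix for target weights f (possibly unnormalised)
    and proposal matrix q[x][y] = q(y | x)."""
    m = len(f)
    K = [[0.0] * m for _ in range(m)]
    for x in range(m):
        for y in range(m):
            if y != x and q[x][y] > 0.0:
                rho = min(1.0, f[y] * q[y][x] / (f[x] * q[x][y]))
                K[x][y] = q[x][y] * rho
        K[x][x] = 1.0 - sum(K[x])  # rejected mass stays at x
    return K

def step(dist, K):
    """One transition: returns the row vector dist K."""
    m = len(K)
    return [sum(dist[x] * K[x][y] for x in range(m)) for y in range(m)]

def tv(p, r):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, r))

weights = [1.0, 4.0, 2.0, 6.0, 3.0]               # unnormalised target
f = [w / sum(weights) for w in weights]
q = [[(y + 1) / 15.0 for y in range(5)] for _ in range(5)]  # q(y|x) > 0
K = mh_kernel(weights, q)

dist = [1.0, 0.0, 0.0, 0.0, 0.0]                  # start from state 0
tv_path = []
for n in range(50):
    dist = step(dist, K)
    tv_path.append(tv(dist, f))
print(f"TV distance after 1 step: {tv_path[0]:.4f}, after 50 steps: {tv_path[-1]:.2e}")
```

Because the kernel is built from unnormalised weights, the check also illustrates that the normalising constant of f is never needed.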
Random walk Metropolis–Hastings



      Use of a local perturbation as proposal

      \[ Y_t = X^{(t)} + \varepsilon_t , \]

      where $\varepsilon_t \sim g$, independent of $X^{(t)}$.
      The instrumental density is of the form $g(y - x)$ and the Markov
      chain is a random walk if we take $g$ to be symmetric, $g(x) = g(-x)$
Random walk Metropolis–Hastings


      Algorithm (Random walk Metropolis)
      Given $x^{(t)}$
         1. Generate $Y_t \sim g(y - x^{(t)})$
         2. Take
            \[ X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{1, \dfrac{f(Y_t)}{f(x^{(t)})}\right\}, \\ x^{(t)} & \text{otherwise.} \end{cases} \]
The original example



      Example (Random walk and normal target)
      Generate $\mathcal{N}(0, 1)$ based on the uniform proposal on $[-\delta, \delta]$

      The probability of acceptance is then

      \[ \rho(x^{(t)}, y_t) = \exp\left\{ \left( (x^{(t)})^2 - y_t^2 \right) / 2 \right\} \wedge 1. \]

                                                                        [Hastings (1970)]
      Example (Random walk & normal (2))

                          Sample statistics
            δ            0.1       0.5       1.0
            mean         0.399    -0.111     0.10
            variance     0.698     1.11      1.06

      As δ ↑, we get better histograms and a faster exploration of the
      support of $f$.
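
The experiment can be reproduced in spirit with the sketch below: a random walk Metropolis chain on a N(0, 1) target with U[−δ, δ] increments, run for 15000 iterations for each δ. The seed and starting value are assumptions, so the printed figures will not match the table exactly; `rw_metropolis_normal` and `mean_var` are hypothetical names.

```python
import math
import random

def rw_metropolis_normal(delta, n_iter, rng, x0=0.0):
    """Random walk Metropolis for a N(0,1) target with U[-delta, delta] steps."""
    x, chain = x0, []
    for _ in range(n_iter):
        y = x + rng.uniform(-delta, delta)
        # acceptance probability exp{(x^2 - y^2)/2} ^ 1
        if rng.random() < math.exp(min(0.0, (x * x - y * y) / 2.0)):
            x = y
        chain.append(x)
    return chain

def mean_var(chain):
    m = sum(chain) / len(chain)
    return m, sum((c - m) ** 2 for c in chain) / (len(chain) - 1)

for delta in (0.1, 0.5, 1.0):
    m, v = mean_var(rw_metropolis_normal(delta, 15000, random.Random(5)))
    print(f"delta={delta}: mean={m:.3f}, variance={v:.3f}")
```

With δ = 0.1 nearly every move is accepted but each step is tiny, so the chain explores the support slowly; larger δ trades some rejections for faster exploration.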
      [Figure, panels (a), (b), (c): … samples based on U[−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 si…]
Mixtures by random walk MH

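      Example (Mixture models)

      \[ \pi(\theta|x) \propto \prod_{j=1}^n \left( \sum_{\ell=1}^k p_\ell\, f(x_j|\mu_\ell, \sigma_\ell) \right) \pi(\theta) \]

      Metropolis-Hastings proposal:

      \[ \theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega\varepsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)} \\ \theta^{(t)} & \text{otherwise} \end{cases} \]

      where

      \[ \rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega\varepsilon^{(t)}|x)}{\pi(\theta^{(t)}|x)} \wedge 1 \]

      and $\omega$ scaled for good acceptance rate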
      [Figure: random walk sampling (50000 iterations); panels with axes theta, p and tau. General case of a 3 component normal mixture]
                                                                [Celeux & al., 2000]
      [Figure: random walk MCMC output for $.7\,\mathcal{N}(\mu_1, 1) + .3\,\mathcal{N}(\mu_2, 1)$, plotted in the $(\mu_1, \mu_2)$ plane]
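
A random walk sampler for the means of this two-component mixture could look like the sketch below. Several ingredients are assumptions made for the illustration: the data are simulated with µ1 = 0 and µ2 = 2.5, each mean gets a N(0, 10²) prior, the scale ω = 0.3 was picked by hand, and the chain is started at the simulating values. None of these come from the slides.

```python
import math
import random

P1, P2 = 0.7, 0.3  # known mixture weights

def log_post(mu1, mu2, data, prior_sd=10.0):
    """Log posterior of (mu1, mu2) up to a constant, under an assumed
    N(0, prior_sd^2) prior on each mean."""
    lp = -(mu1 * mu1 + mu2 * mu2) / (2.0 * prior_sd ** 2)
    for x in data:
        a = math.log(P1) - 0.5 * (x - mu1) ** 2
        b = math.log(P2) - 0.5 * (x - mu2) ** 2
        m = max(a, b)  # log-sum-exp to avoid underflow
        lp += m + math.log(math.exp(a - m) + math.exp(b - m))
    return lp

def rw_mixture(data, start, omega, n_iter, rng):
    """Random walk Metropolis on (mu1, mu2) with N(0, omega^2) increments."""
    mu = list(start)
    cur = log_post(mu[0], mu[1], data)
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = [mu[0] + omega * rng.gauss(0.0, 1.0),
                mu[1] + omega * rng.gauss(0.0, 1.0)]
        new = log_post(prop[0], prop[1], data)
        if rng.random() < math.exp(min(0.0, new - cur)):
            mu, cur, accepted = prop, new, accepted + 1
        chain.append(tuple(mu))
    return chain, accepted / n_iter

rng = random.Random(4)
data = [rng.gauss(0.0, 1.0) if rng.random() < P1 else rng.gauss(2.5, 1.0)
        for _ in range(200)]
chain, acc = rw_mixture(data, start=(0.0, 2.5), omega=0.3, n_iter=4000, rng=rng)
post = chain[1000:]  # discard an assumed burn-in of 1000 iterations
mu1_hat = sum(c[0] for c in post) / len(post)
mu2_hat = sum(c[1] for c in post) / len(post)
print(f"acceptance {acc:.2f}, posterior means ({mu1_hat:.2f}, {mu2_hat:.2f})")
```

Starting the chain elsewhere may lead it to a different mode of the posterior, so the output should be read together with the starting point.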
Convergence properties



      Uniform ergodicity prohibited by random walk structure
      At best, geometric ergodicity:

      Theorem (Sufficient ergodicity)
      For a symmetric density f , log-concave in the tails, and a positive
      and symmetric density g, the chain (X (t) ) is geometrically ergodic.
                                           [Mengersen & Tweedie, 1996]
MCMC and likelihood-free methods Part/day I: Markov chain methods
  The Metropolis-Hastings Algorithm
     Random-walk Metropolis-Hastings algorithms


Illustration of the tail effect

   Example (Comparison of tails)
   Random-walk Metropolis-Hastings algorithms based on a $N(0, 1)$ instrumental for the
   generation of (left) a $N(0, 1)$ distribution and (right) a distribution with density
   $\psi(x) \propto (1 + |x|)^{-3}$.

   [Figure: panels (a) and (b); horizontal axis from 0 to 200, vertical axis from -1.5 to 1.5.
   90% confidence envelopes of the means, derived from 500 parallel independent chains.]
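The comparison can be reproduced along the following lines. This is a minimal sketch and not the code behind the figure: the helper `rwmh_means`, the seed and the starting value 0 are assumptions, while the N(0,1) proposal, the two targets and the 90% band follow the example.

```r
# Random-walk Metropolis-Hastings with N(0,1) increments.
# Returns the running means of one chain of length niter.
rwmh_means <- function(logf, niter = 200, x0 = 0) {
  x <- numeric(niter)
  cur <- x0
  for (t in 1:niter) {
    prop <- cur + rnorm(1)                        # N(0,1) instrumental
    if (log(runif(1)) < logf(prop) - logf(cur)) cur <- prop
    x[t] <- cur
  }
  cumsum(x) / seq_len(niter)
}

log_norm <- function(x) -x^2 / 2                  # N(0,1) target, up to a constant
log_psi  <- function(x) -3 * log(1 + abs(x))      # psi(x) proportional to (1 + |x|)^(-3)

set.seed(1)                                       # arbitrary seed (assumption)
nchains <- 500
means_norm <- replicate(nchains, rwmh_means(log_norm))   # 200 x 500 matrix
means_psi  <- replicate(nchains, rwmh_means(log_psi))

# Pointwise 90% envelopes of the running means (5% and 95% quantiles)
env_norm <- apply(means_norm, 1, quantile, probs = c(0.05, 0.95))
env_psi  <- apply(means_psi,  1, quantile, probs = c(0.05, 0.95))

# Width of the band at the last iteration, for each target
width_norm <- diff(env_norm[, 200])
width_psi  <- diff(env_psi[, 200])
```

Plotting the rows of `env_norm` and `env_psi` against the iteration index (for instance with `matplot`) gives the analogue of the two panels; the band for the heavy-tailed target would be expected to shrink more slowly.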

Further convergence properties


      Under assumptions

              (A1) $f$ is super-exponential, i.e. it is positive with continuous first derivative such that

              $\lim_{|x|\to\infty} n(x)'\nabla \log f(x) = -\infty$, where $n(x) := x/|x|$.

              In words: exponential decay of $f$ in every direction with rate
              tending to $\infty$
              (A2) $\limsup_{|x|\to\infty} n(x)' m(x) < 0$, where $m(x) = \nabla f(x)/|\nabla f(x)|$.

              In words: non-degeneracy of the contour manifold
              $C_{f(x)} = \{y : f(y) = f(x)\}$
      $Q$ is geometrically ergodic, and
      $V(x) \propto f(x)^{-1/2}$ verifies the drift condition
                                               [Jarner & Hansen, 2000]
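As a quick check of what (A1) and (A2) require (an illustration that is not part of the original slide), consider the standard normal target in dimension $p$, $f(x) \propto \exp(-|x|^2/2)$, with $n(x) = x/|x|$ and $m(x) = \nabla f(x)/|\nabla f(x)|$:

```latex
\begin{align*}
\nabla \log f(x) &= -x,
  & n(x)'\,\nabla \log f(x) &= -\frac{x'x}{|x|} = -|x| \longrightarrow -\infty
  \quad (|x| \to \infty),\\
\nabla f(x) &= -x\,f(x),
  & n(x)'\,m(x) &= \frac{x'}{|x|}\cdot\frac{-x}{|x|} = -1 < 0 .
\end{align*}
```

Both assumptions therefore hold for this target, so the geometric ergodicity result would apply to it, subject to the conditions the theorem places on the proposal.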

Further [further] convergence properties

      If $P$ is $\psi$-irreducible and aperiodic, for $r = (r(n))_{n\in\mathbb{N}}$ a real-valued
      non-decreasing sequence such that, for all $n, m \in \mathbb{N}$,

                                         $r(n + m) \le r(n)\,r(m)$,

      and $r(0) = 1$, for $C$ a small set, $\tau_C = \inf\{n \ge 1, X_n \in C\}$, and
      $h \ge 1$, assume

                 $\sup_{x\in C} E_x\left[\sum_{k=0}^{\tau_C-1} r(k)\,h(X_k)\right] < \infty$,
      then,

         $S(f, C, r) := \left\{x \in \mathcal{X},\ E_x\left[\sum_{k=0}^{\tau_C-1} r(k)\,h(X_k)\right] < \infty\right\}$

      is full and absorbing and, for $x \in S(f, C, r)$,

         $\lim_{n\to\infty} r(n)\,\|P^n(x, \cdot) - f\|_h = 0$.

                                                                    [Tuominen & Tweedie, 1994]
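To make the role of the rate sequence $r$ concrete, here are two standard families that satisfy the requirements (non-decreasing, $r(0) = 1$, $r(n+m) \le r(n)r(m)$). They are given as an illustration and are not part of the original slide.

```latex
\begin{align*}
\text{geometric rate:}\quad & r(n) = \kappa^{n},\ \kappa \ge 1:
  & r(n+m) &= \kappa^{n}\kappa^{m} = r(n)\,r(m),\\
\text{polynomial rate:}\quad & r(n) = (1+n)^{\beta},\ \beta \ge 0:
  & r(n+m) &= (1+n+m)^{\beta} \le (1+n+m+nm)^{\beta} = r(n)\,r(m).
\end{align*}
```

With the first family the limit statement gives convergence at a geometric rate, with the second at a polynomial rate.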

Comments

              [CLT, Rosenthal’s inequality...] $h$-ergodicity implies CLT
              for additive (possibly unbounded) functionals of the chain,
              Rosenthal’s inequality and so on...
              [Control of the moments of the return-time] The
              condition implies (because $h \ge 1$) that

                 $\sup_{x\in C} E_x[r^0(\tau_C)] \le \sup_{x\in C} E_x\left[\sum_{k=0}^{\tau_C-1} r(k)\,h(X_k)\right] < \infty$,

              where $r^0(n) = \sum_{l=0}^{n} r(l)$. Can be used to derive bounds for
              the coupling time, an essential step to determine computable
              bounds, using coupling inequalities
                  [Roberts & Tweedie, 1998; Fort & Moulines, 2000; Jones et al., 2002]

Alternative conditions



      The condition is not really easy to work with...
      [Possible alternative conditions]
      (a) [Tuominen & Tweedie, 1994] There exists a sequence
          $(V_n)_{n\in\mathbb{N}}$, $V_n \ge r(n)h$, such that
                (i) $\sup_C V_0 < \infty$,
               (ii) $\{V_0 = \infty\} \subset \{V_1 = \infty\}$ and
              (iii) $P V_{n+1} \le V_n - r(n)h + b\,r(n)\,I_C$.
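       (b) [Fort, 2000] $\exists\, V \ge f \ge 1$ and $b < \infty$, such that $\sup_C V < \infty$
           and

              $P V(x) + E_x\left[\sum_{k=0}^{\sigma_C} \Delta r(k)\,f(X_k)\right] \le V(x) + b\,I_C(x)$

              where $\sigma_C$ is the hitting time on $C$ and

              $\Delta r(k) = r(k) - r(k-1)$, $k \ge 1$, and $\Delta r(0) = r(0)$.

              Result: (a) $\Leftrightarrow$ (b) $\Leftrightarrow$ $\sup_{x\in C} E_x\left[\sum_{k=0}^{\tau_C-1} r(k)\,f(X_k)\right] < \infty$.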

Langevin Algorithms


      Proposal based on the Langevin diffusion $L_t$, defined by the
      stochastic differential equation

                 $dL_t = dB_t + \frac{1}{2}\nabla \log f(L_t)\,dt$,

      where $B_t$ is the standard Brownian motion
      Theorem
      The Langevin diffusion is the only non-explosive diffusion which is
      reversible with respect to f .
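A worked special case may help fix ideas (an illustration, not taken from the slides). For the standard normal target $f(x) \propto e^{-x^2/2}$ on the real line, $\nabla \log f(x) = -x$ and the Langevin diffusion is an Ornstein-Uhlenbeck process:

```latex
\begin{align*}
dL_t &= dB_t - \tfrac{1}{2} L_t\,dt,\\
L_t &= e^{-t/2} L_0 + \int_0^t e^{-(t-s)/2}\,dB_s,\\
L_t \mid L_0 &\sim N\!\left(e^{-t/2} L_0,\; 1 - e^{-t}\right)
  \;\longrightarrow\; N(0, 1) \quad (t \to \infty),
\end{align*}
```

so the diffusion forgets its starting point and settles on the target $f$.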

Discretization


      Instead, consider the sequence

         $x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2}\nabla \log f(x^{(t)}) + \sigma\varepsilon_t$,   $\varepsilon_t \sim N_p(0, I_p)$

      where $\sigma^2$ corresponds to the discretization step
      Unfortunately, the discretized chain may be transient, for
      instance when

         $\lim_{x\to\pm\infty} \left|\sigma^2\,\nabla \log f(x)\,|x|^{-1}\right| > 1$

MH correction


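      Accept the new value $Y_t$ with probability

         $\dfrac{f(Y_t)}{f(x^{(t)})} \cdot \dfrac{\exp\left\{-\left\|x^{(t)} - Y_t - \frac{\sigma^2}{2}\nabla\log f(Y_t)\right\|^2 \big/ 2\sigma^2\right\}}{\exp\left\{-\left\|Y_t - x^{(t)} - \frac{\sigma^2}{2}\nabla\log f(x^{(t)})\right\|^2 \big/ 2\sigma^2\right\}} \wedge 1$,

      that is, the Metropolis-Hastings ratio with the density of the reverse move
      $Y_t \to x^{(t)}$ in the numerator.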

      Choice of the scaling factor σ
      Should lead to an acceptance rate of 0.574 to achieve optimal
      convergence rates (when the components of x are uncorrelated)
              [Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]
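The Langevin proposal with its Metropolis-Hastings correction can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mala`, the standard normal target in dimension 10, the value of `sigma` and the seed are all choices made here. The acceptance step uses the usual ratio $f(Y)q(x|Y)/f(x)q(Y|x)$ on the log scale.

```r
# Metropolis-adjusted Langevin algorithm for a p-dimensional target.
# logf: log target density (up to a constant); grad: gradient of logf.
mala <- function(logf, grad, x0, sigma, niter) {
  p <- length(x0)
  x <- matrix(NA_real_, niter, p)
  cur <- x0
  acc <- 0
  # log proposal density of 'to' given 'from', up to a constant
  logq <- function(to, from) {
    -sum((to - from - sigma^2 / 2 * grad(from))^2) / (2 * sigma^2)
  }
  for (t in 1:niter) {
    prop <- cur + sigma^2 / 2 * grad(cur) + sigma * rnorm(p)
    logr <- logf(prop) - logf(cur) + logq(cur, prop) - logq(prop, cur)
    if (log(runif(1)) < logr) {
      cur <- prop
      acc <- acc + 1
    }
    x[t, ] <- cur
  }
  list(x = x, rate = acc / niter)
}

# Illustration on a standard normal target in dimension p = 10 (assumed setting)
set.seed(42)
p <- 10
out <- mala(logf = function(x) -sum(x^2) / 2,
            grad = function(x) -x,
            x0 = rep(0, p), sigma = 0.9, niter = 5000)
out$rate   # empirical acceptance rate, to be compared with the 0.574 guideline
```

In practice `sigma` would be adjusted over a few pilot runs until `out$rate` is in the vicinity of the recommended value.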

Optimizing the Acceptance Rate



      Problem of choice of the transition kernel from a practical point of
      view
      Most common alternatives:
       (a) a fully automated algorithm like ARMS;
       (b) an instrumental density g which approximates f , such that
           f /g is bounded for uniform ergodicity to apply;
       (c) a random walk
      In both cases (b) and (c), the choice of $g$ is critical.

Case of the random walk


      Different approach to acceptance rates
      A high acceptance rate does not necessarily indicate that the algorithm is
      moving correctly, since it may indicate that the random walk is moving
      too slowly on the surface of $f$.
      If $x^{(t)}$ and $y_t$ are close, i.e. $f(x^{(t)}) \simeq f(y_t)$, $y_t$ is accepted with
      probability

         $\min\left(\dfrac{f(y_t)}{f(x^{(t)})}, 1\right) \simeq 1$.

      For multimodal densities with well separated modes, the negative
      effect of limited moves on the surface of f clearly shows.

Case of the random walk (2)




      If the average acceptance rate is low, the successive values of $f(y_t)$
      tend to be small compared with $f(x^{(t)})$, which means that the
      random walk moves quickly on the surface of $f$ since it often
      reaches the “borders” of the support of $f$.
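The link between scale and acceptance rate can be checked numerically. The sketch below is an illustration under assumptions made here (a N(0,1) target, Gaussian increments with standard deviation `omega`, the helper name `rw_diag`, the grid of scales and the seed); it records the acceptance rate and the lag-1 autocorrelation for each scale.

```r
# Random-walk Metropolis on a N(0,1) target with N(0, omega^2) increments.
# Returns the scale, the acceptance rate and the lag-1 autocorrelation.
rw_diag <- function(omega, niter = 10000) {
  x <- numeric(niter)
  cur <- 0
  acc <- 0
  for (t in 1:niter) {
    prop <- cur + omega * rnorm(1)
    if (log(runif(1)) < (cur^2 - prop^2) / 2) {   # log f(prop) - log f(cur)
      cur <- prop
      acc <- acc + 1
    }
    x[t] <- cur
  }
  c(omega = omega, rate = acc / niter, acf1 = cor(x[-1], x[-niter]))
}

set.seed(7)
res <- t(sapply(c(0.1, 1, 2.5, 10, 50), rw_diag))   # one row per scale
res
```

The acceptance rate decreases as `omega` grows, while the autocorrelation is typically close to one at both ends of the grid: tiny accepted steps for a small scale, long runs of rejections for a large one.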

Rule of thumb




      In small dimensions, aim for an average acceptance rate of 50%. In
      large dimensions, aim for an average acceptance rate of 25%.
                                       [Gelman, Gilks and Roberts, 1995]

      This rule is to be taken with a pinch of salt!

Role of scale


      Example (Noisy AR(1))
      Hidden Markov chain from a regular AR(1) model,

         $x_{t+1} = \varphi x_t + \epsilon_{t+1}$,   $\epsilon_t \sim N(0, \tau^2)$

      and observables

         $y_t | x_t \sim N(x_t^2, \sigma^2)$

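      The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is proportional to

         $\exp\left\{\dfrac{-1}{2\tau^2}\left[(x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \dfrac{\tau^2}{\sigma^2}(y_t - x_t^2)^2\right]\right\}$.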

      Example (Noisy AR(1) continued)
      Since $y_t$ only informs on $x_t^2$, the conditional distribution of $x_t$ can have two modes.
      For a Gaussian random walk with scale $\omega$ small enough, the
      random walk never jumps to the other mode. But if the scale $\omega$ is
      sufficiently large, the Markov chain explores both modes and gives a
      satisfactory approximation of the target distribution.
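The effect of the scale on a two-mode conditional of this form can be sketched as below. All numerical values (neighbouring states set to 0, `y = 4`, `phi = 0.9`, `tau = 1`, `sigma = 0.5`, the two scales 0.1 and 3, the starting value and the seed) are hypothetical choices made for this illustration; they are not the settings behind the slides' figures, and the function names are introduced here.

```r
# Log full conditional of x_t given its neighbours and y_t, up to a constant.
# Default values are hypothetical and chosen to give two well-separated modes.
logcond <- function(x, xprev = 0, xnext = 0, y = 4,
                    phi = 0.9, tau = 1, sigma = 0.5) {
  -((x - phi * xprev)^2 + (xnext - phi * x)^2 +
      tau^2 / sigma^2 * (y - x^2)^2) / (2 * tau^2)
}

# Gaussian random-walk Metropolis-Hastings chain with scale omega
rw_chain <- function(omega, niter = 10000, x0 = 2) {
  x <- numeric(niter)
  cur <- x0
  for (t in 1:niter) {
    prop <- cur + omega * rnorm(1)
    if (log(runif(1)) < logcond(prop) - logcond(cur)) cur <- prop
    x[t] <- cur
  }
  x
}

set.seed(2012)
x_small <- rw_chain(omega = 0.1)   # small scale: stays near its starting mode
x_large <- rw_chain(omega = 3)     # larger scale: can jump between the modes

range(x_small)      # range of values visited with the small scale
mean(x_large > 0)   # share of time spent in the positive mode
```

With these values the two modes sit near plus and minus 2, so a chain that remains on one side gives a misleading picture of the target, whatever its acceptance rate.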

      [Figure: Markov chain based on a random walk with scale $\omega = 0.1$.]

      [Figure: Markov chain based on a random walk with scale $\omega = 0.5$.]

MA(2)


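      Since the constraints on $(\vartheta_1, \vartheta_2)$ are well-defined, use a flat
      prior over the triangle.
      Simple representation of the likelihood

      library(mnormt)  # provides dmnorm(), the multivariate normal density
      # Log-likelihood of the MA(2) model at theta = (theta1, theta2);
      # the observed series y is taken from the calling environment.
      ma2like <- function(theta) {
        n <- length(y)
        # Toeplitz covariance: lag-0, lag-1 and lag-2 autocovariances, zero beyond
        sigma <- toeplitz(c(1 + theta[1]^2 + theta[2]^2,
                            theta[1] + theta[1] * theta[2],
                            theta[2], rep(0, n - 3)))
        dmnorm(y, rep(0, n), sigma, log = TRUE)
      }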

Basic RWHM for MA(2)

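      Algorithm 1 RW-HM-MA(2) sampler
          set $\omega$ and $\vartheta^{(1)}$
          for $i = 2$ to $T$ do
            generate $\tilde\vartheta_j \sim U(\vartheta_j^{(i-1)} - \omega, \vartheta_j^{(i-1)} + \omega)$, $j = 1, 2$
            set $p = 0$ and $\vartheta^{(i)} = \vartheta^{(i-1)}$
            if $\tilde\vartheta$ within the triangle then
               $p = \exp(\text{ma2like}(\tilde\vartheta) - \text{ma2like}(\vartheta^{(i-1)}))$
            end if
            generate $U \sim U(0, 1)$
            if $U < p$ then
               $\vartheta^{(i)} = \tilde\vartheta$
            end if
          end for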
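The steps of Algorithm 1 can be sketched in base R as follows. Several elements are assumptions made for this illustration: the triangle is taken to be the usual MA(2) invertibility region ($\vartheta_1 + \vartheta_2 > -1$, $\vartheta_1 - \vartheta_2 < 1$, $\vartheta_2 < 1$), the data are simulated with hypothetical values $(0.6, 0.2)$, and the names `ma2loglik`, `in_triangle` and `rwmh_ma2` are introduced here. The log-likelihood uses the same Toeplitz covariance as `ma2like`, computed through a Cholesky factor so that no extra package is needed.

```r
# Gaussian MA(2) log-likelihood with unit innovation variance, in base R
ma2loglik <- function(theta, y) {
  n <- length(y)
  Sigma <- toeplitz(c(1 + theta[1]^2 + theta[2]^2,
                      theta[1] + theta[1] * theta[2],
                      theta[2], rep(0, n - 3)))
  R <- chol(Sigma)                          # Sigma = R'R, R upper triangular
  z <- backsolve(R, y, transpose = TRUE)    # solves R'z = y
  -sum(log(diag(R))) - sum(z^2) / 2 - n / 2 * log(2 * pi)
}

# Assumed constraint region (MA(2) invertibility triangle)
in_triangle <- function(theta) {
  theta[1] + theta[2] > -1 && theta[1] - theta[2] < 1 && theta[2] < 1
}

rwmh_ma2 <- function(y, omega, niter, theta0 = c(0, 0)) {
  theta <- matrix(NA_real_, niter, 2)
  theta[1, ] <- theta0
  cur_ll <- ma2loglik(theta0, y)
  for (i in 2:niter) {
    prop <- theta[i - 1, ] + runif(2, -omega, omega)   # uniform random walk
    theta[i, ] <- theta[i - 1, ]                       # default: stay put
    if (in_triangle(prop)) {
      prop_ll <- ma2loglik(prop, y)
      # flat prior on the triangle: the ratio reduces to the likelihood ratio
      if (log(runif(1)) < prop_ll - cur_ll) {
        theta[i, ] <- prop
        cur_ll <- prop_ll
      }
    }
  }
  theta
}

# Hypothetical data: n = 100 observations simulated with theta = (0.6, 0.2)
set.seed(123)
n <- 100
eps <- rnorm(n + 2)
y <- eps[3:(n + 2)] + 0.6 * eps[2:(n + 1)] + 0.2 * eps[1:n]

out <- rwmh_ma2(y, omega = 0.2, niter = 2000)
colMeans(out[-(1:500), ])   # posterior means after discarding 500 iterations
```

The comparison `log(runif(1)) < prop_ll - cur_ll` is the log-scale version of the test `U < p` in the algorithm, which avoids overflow in `exp` when the log-likelihood difference is large.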
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 

Último

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 

Último (20)

Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 

Monash University short course, part I

  • 1. MCMC and likelihood-free methods, Part/day I: Markov chain methods. Christian P. Robert, Université Paris-Dauphine, IUF, & CREST. Monash University, EBS, July 18, 2012
  • 2. Computational issues in Bayesian statistics. Motivations and leading example. Outline: Computational issues in Bayesian statistics; The Metropolis-Hastings Algorithm; The Gibbs Sampler; Population Monte Carlo
  • 3–5. abc of Bayesian perspective. What is Bayesian statistics? Statistical model defined by a likelihood function $f(x_1,\ldots,x_n\mid\theta) = L(\theta\mid x_1,\ldots,x_n)$ [inversion of what varies]. Bayesian approach turns the likelihood into a conditional density, $\pi(\theta\mid x_1,\ldots,x_n) \propto \pi(\theta)\,L(\theta\mid x_1,\ldots,x_n)$, using a reference measure (or a prior) $\pi(\theta)$ [Thomas Bayes, 1701–1761]
  • 6–7. New perspective. Uncertainty on the parameters $\theta$ of a model is modeled through a probability distribution $\pi$ on $\Theta$, called the prior distribution. Inference is processed through the distribution of $\theta$ conditional on $x$, $\pi(\theta\mid x)$, called the posterior distribution: $\pi(\theta\mid x) = \dfrac{f(x\mid\theta)\,\pi(\theta)}{\int f(x\mid\theta)\,\pi(\theta)\,d\theta}$.
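As a rough numerical illustration of the posterior formula, the sketch below approximates $\pi(\theta\mid x)$ on a grid by computing $f(x\mid\theta)\pi(\theta)$ and dividing by a Riemann-sum approximation of the integral. The model ($x_i \sim \mathcal N(\theta,1)$ with prior $\theta \sim \mathcal N(0, 10^2)$), the simulated data, and the names `grid_posterior` and `prior_sd` are assumptions made for the example, not taken from the slides.

```python
import numpy as np

def grid_posterior(x, theta_grid, prior_sd=10.0):
    """Approximate pi(theta | x) on a grid for the assumed model
    x_i ~ N(theta, 1), theta ~ N(0, prior_sd^2)."""
    x = np.asarray(x, dtype=float)
    # log f(x | theta) up to an additive constant, one value per grid point
    log_lik = -0.5 * ((x[:, None] - theta_grid[None, :]) ** 2).sum(axis=0)
    log_prior = -0.5 * (theta_grid / prior_sd) ** 2
    log_post = log_lik + log_prior
    w = np.exp(log_post - log_post.max())   # unnormalised f(x|theta) * pi(theta)
    dtheta = theta_grid[1] - theta_grid[0]
    return w / (w.sum() * dtheta)           # divide by the (approximate) integral

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=20)           # hypothetical data
grid = np.linspace(-5.0, 5.0, 2001)
post = grid_posterior(x, grid)
post_mean = (grid * post).sum() * (grid[1] - grid[0])
```

In this conjugate toy case the posterior mean is also available in closed form, $\sum_i x_i / (n + 1/100)$, which gives a check on the grid approximation. The grid approach only scales to very low dimensions, which is part of the motivation for the simulation methods that follow.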
  • 8. Justifications. Semantic drift from unknown to random; actualization of the information on $\theta$ by extracting the information on $\theta$ contained in the observation $x$; allows incorporation of imperfect information in the decision process; unique mathematical way to condition upon the observations (conditional perspective); penalization factor
  • 9–13. Posterior distribution. $\pi(\theta\mid x)$ is central to Bayesian inference: it operates conditional upon the observations; incorporates the requirement of the Likelihood Principle; avoids averaging over the unobserved values of $x$; provides coherent updating of the information available on $\theta$, independent of the order in which i.i.d. observations are collected; provides a complete inferential scope
  • 14–16. Latent variables. Latent structures make life harder! Even simple models may lead to computational complications, as in latent variable models $f(x\mid\theta) = \int f(x, x^\star\mid\theta)\,dx^\star$. If $(x, x^\star)$ observed, fine! If only $x$ observed, trouble!
  • 17–18. Example: mixture models. Models of mixtures of distributions: $X \sim f_j$ with probability $p_j$, for $j = 1, 2, \ldots, k$, with overall density $X \sim p_1 f_1(x) + \cdots + p_k f_k(x)$. For a sample of independent random variables $(X_1,\ldots,X_n)$, sample density $\prod_{i=1}^n \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}$.
  • 19. Example: mixture models (cont'd). Expanding the product of sums $\prod_{i=1}^n \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}$ into a sum of products involves $k^n$ elementary terms: prohibitive to compute in large samples.
  • 20. Simple mixture (1). [Figure: plot with axes $\mu_1$ and $\mu_2$.] Case of the $0.3\,\mathcal N(\mu_1, 1) + 0.7\,\mathcal N(\mu_2, 1)$ likelihood
  • 21. Simple mixture (2). For a mixture of two normal distributions, $0.3\,\mathcal N(\mu_1, 1) + 0.7\,\mathcal N(\mu_2, 1)$, likelihood proportional to $\prod_{i=1}^n [0.3\,\varphi(x_i - \mu_1) + 0.7\,\varphi(x_i - \mu_2)]$, containing $2^n$ terms when expanded.
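The $2^n$ count refers to the expanded sum of products; evaluating the product of sums at a single point $(\mu_1, \mu_2)$ only costs $O(n)$. The sketch below shows that pointwise evaluation on the log scale. The function names and the default weight `p=0.3` are choices made for this illustration.

```python
import numpy as np

def mixture_loglik(x, mu1, mu2, p=0.3):
    """Log-likelihood of p N(mu1, 1) + (1 - p) N(mu2, 1), kept as a product of sums."""
    x = np.asarray(x, dtype=float)
    log_phi1 = -0.5 * (x - mu1) ** 2 - 0.5 * np.log(2 * np.pi)
    log_phi2 = -0.5 * (x - mu2) ** 2 - 0.5 * np.log(2 * np.pi)
    # log{p phi(x_i - mu1) + (1 - p) phi(x_i - mu2)} for each i, computed stably
    terms = np.logaddexp(np.log(p) + log_phi1, np.log(1 - p) + log_phi2)
    return terms.sum()

def n_expanded_terms(n, k=2):
    """Number of elementary terms once the product of n sums of k terms is expanded."""
    return k ** n
```

For instance, `n_expanded_terms(100)` is already of order $10^{30}$, while `mixture_loglik` on 100 observations is a handful of vectorised operations. The difficulty is therefore not evaluating the likelihood but integrating or maximising it.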
  • 22–23. Complex maximisation. Standard maximization techniques often fail to find the global maximum because of multimodality or undesirable behavior (usually at the frontier of the domain) of the likelihood function. Example: in the special case $f(x\mid\mu,\sigma) = (1-\epsilon)\exp\{(-1/2)x^2\} + \dfrac{\epsilon}{\sigma}\exp\{(-1/2\sigma^2)(x-\mu)^2\}$ with $\epsilon > 0$ known, whatever $n$, the likelihood is unbounded: $\lim_{\sigma\to 0} L(x_1,\ldots,x_n\mid\mu = x_1,\sigma) = \infty$
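The unboundedness can be seen numerically: fixing $\mu = x_1$ and shrinking $\sigma$, the first observation contributes a term of order $\epsilon/\sigma$ while the other observations keep a strictly positive density through the first component. The sketch below uses a small made-up sample and $\epsilon = 0.3$; both are assumptions for illustration.

```python
import numpy as np

def loglik(x, mu, sigma, eps=0.3):
    """Log-likelihood for f(x|mu,sigma) = (1-eps) exp(-x^2/2) + (eps/sigma) exp(-(x-mu)^2/(2 sigma^2))."""
    x = np.asarray(x, dtype=float)
    dens = (1 - eps) * np.exp(-0.5 * x ** 2) \
        + (eps / sigma) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return np.log(dens).sum()

x = np.array([0.5, -1.2, 2.0, 0.1, -0.3])   # hypothetical observations
sigmas = [1e-2, 1e-4, 1e-8]
values = [loglik(x, mu=x[0], sigma=s) for s in sigmas]
```

Here `values` keeps increasing as `sigma` decreases, growing roughly like $-\log\sigma$, so a generic optimiser started near $\mu = x_1$ can drift towards this degenerate spike rather than a meaningful mode.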
  • 24. Unbounded likelihood. [Figure: six panels with axes $\mu$ and $\sigma$, for n = 3, 6, 12, 24, 48, 96.] Case of the $0.3\,\mathcal N(0, 1) + 0.7\,\mathcal N(\mu, \sigma)$ likelihood
  • 25–27. Mixture once again. Observations from $x_1,\ldots,x_n \sim f(x\mid\theta) = p\,\varphi(x;\mu_1,\sigma_1) + (1-p)\,\varphi(x;\mu_2,\sigma_2)$. Prior: $\mu_i\mid\sigma_i \sim \mathcal N(\xi_i, \sigma_i^2/n_i)$, $\sigma_i^2 \sim \mathcal{IG}(\nu_i/2, s_i^2/2)$, $p \sim \mathcal{B}e(\alpha,\beta)$. Posterior: $\pi(\theta\mid x_1,\ldots,x_n) \propto \prod_{j=1}^n \{p\,\varphi(x_j;\mu_1,\sigma_1) + (1-p)\,\varphi(x_j;\mu_2,\sigma_2)\}\,\pi(\theta) = \sum_{\ell=0}^n \sum_{(k_t)} \omega(k_t)\,\pi(\theta\mid(k_t))$ [$O(2^n)$]
  • 28. Mixture once again (cont'd). For a given permutation $(k_t)$, conditional posterior distribution $\pi(\theta\mid(k_t)) = \mathcal N\!\left(\xi_1(k_t), \dfrac{\sigma_1^2}{n_1+\ell}\right) \times \mathcal{IG}((\nu_1+\ell)/2, s_1(k_t)/2) \times \mathcal N\!\left(\xi_2(k_t), \dfrac{\sigma_2^2}{n_2+n-\ell}\right) \times \mathcal{IG}((\nu_2+n-\ell)/2, s_2(k_t)/2) \times \mathcal{B}e(\alpha+\ell, \beta+n-\ell)$
  • 29. Mixture once again (cont'd). Where $\bar x_1(k_t) = \frac{1}{\ell}\sum_{t=1}^{\ell} x_{k_t}$, $\hat s_1^2(k_t) = \sum_{t=1}^{\ell} (x_{k_t} - \bar x_1(k_t))^2$, $\bar x_2(k_t) = \frac{1}{n-\ell}\sum_{t=\ell+1}^{n} x_{k_t}$, $\hat s_2^2(k_t) = \sum_{t=\ell+1}^{n} (x_{k_t} - \bar x_2(k_t))^2$, and $\xi_1(k_t) = \dfrac{n_1\xi_1 + \ell\,\bar x_1(k_t)}{n_1+\ell}$, $\xi_2(k_t) = \dfrac{n_2\xi_2 + (n-\ell)\,\bar x_2(k_t)}{n_2+n-\ell}$, $s_1(k_t) = s_1^2 + \hat s_1^2(k_t) + \dfrac{n_1\ell}{n_1+\ell}(\xi_1 - \bar x_1(k_t))^2$, $s_2(k_t) = s_2^2 + \hat s_2^2(k_t) + \dfrac{n_2(n-\ell)}{n_2+n-\ell}(\xi_2 - \bar x_2(k_t))^2$ are the posterior updates of the hyperparameters
  • 30. Mixture once again. Bayes estimator of $\theta$: $\delta^\pi(x_1,\ldots,x_n) = \sum_{\ell=0}^n \sum_{(k_t)} \omega(k_t)\,\mathbb E^\pi[\theta\mid x,(k_t)]$. Too costly: $2^n$ terms
  • 31–32. The AR(p) model. Auto-regressive representation of a time series, $x_t\mid x_{t-1},\ldots \sim \mathcal N\!\left(\mu + \sum_{i=1}^p \varrho_i (x_{t-i}-\mu), \sigma^2\right)$. Generalisation of AR(1); among the most commonly used models in dynamic settings; more challenging than the static models (stationarity constraints); different models depending on the processing of the starting value $x_0$
  • 33. Unwieldy stationarity constraints. Practical difficulty: for complex models, stationarity constraints get quite involved, to the point of being unknown in some cases. Example (AR(1)): case of linear Markovian dependence on the last value, $x_t = \mu + \varrho(x_{t-1}-\mu) + \epsilon_t$, $\epsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal N(0,\sigma^2)$. If $|\varrho| < 1$, $(x_t)_{t\in\mathbb Z}$ can be written as $x_t = \mu + \sum_{j=0}^{\infty} \varrho^j \epsilon_{t-j}$ and this is a stationary representation.
  • 34–35. Stationary but... If $|\varrho| > 1$, alternative stationary representation $x_t = \mu - \sum_{j=1}^{\infty} \varrho^{-j}\epsilon_{t+j}$. This stationary solution is criticized as artificial because $x_t$ is correlated with future white noises $(\epsilon_s)_{s>t}$, unlike the case when $|\varrho| < 1$. Non-causal representation...
  • 36. Stationarity+causality. Stationarity constraints in the prior as a restriction on the values of $\theta$. Theorem: the AR(p) model is second-order stationary and causal iff the roots of the polynomial $P(x) = 1 - \sum_{i=1}^p \varrho_i x^i$ are all outside the unit circle
  • 37–38. Stationarity constraints. Under stationarity constraints, complex parameter space: each value of $\varrho$ needs to be checked for roots of the corresponding polynomial with modulus less than 1. E.g., for an AR(2) process with autoregressive polynomial $P(u) = 1 - \varrho_1 u - \varrho_2 u^2$, the constraint is $\varrho_1 + \varrho_2 < 1$, $\varrho_2 - \varrho_1 < 1$ and $|\varrho_2| < 1$. [Figure: plot with axes $\theta_1$ and $\theta_2$.]
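The root condition of the theorem can be checked numerically for any proposed coefficient vector. The sketch below does so with `numpy.roots` and, for AR(2), compares it with the standard closed-form triangle $\varrho_1+\varrho_2<1$, $\varrho_2-\varrho_1<1$, $|\varrho_2|<1$. The function names are illustrative choices, not part of the slides.

```python
import numpy as np

def is_causal_stationary(rho):
    """True if all roots of P(u) = 1 - rho_1 u - ... - rho_p u^p lie outside the unit circle."""
    rho = np.asarray(rho, dtype=float)
    # np.roots expects coefficients ordered from highest degree to lowest
    coeffs = np.concatenate((-rho[::-1], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

def ar2_triangle(rho1, rho2):
    """Closed-form AR(2) region: rho1 + rho2 < 1, rho2 - rho1 < 1, |rho2| < 1."""
    return (rho1 + rho2 < 1) and (rho2 - rho1 < 1) and (abs(rho2) < 1)
```

For example, `is_causal_stationary([1.5, -0.6])` and `ar2_triangle(1.5, -0.6)` both return `True`, while `(0.5, 0.6)` fails both checks. For $p > 2$ no comparably simple closed form is available, so a root check of this kind would have to be run for each proposed value within a sampler.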
  • 39. The MA(q) model. Alternative type of time series, $x_t = \mu + \epsilon_t - \sum_{j=1}^q \vartheta_j \epsilon_{t-j}$, $\epsilon_t \sim \mathcal N(0,\sigma^2)$. Stationary but, for identifiability considerations, the polynomial $Q(x) = 1 - \sum_{j=1}^q \vartheta_j x^j$ must have all its roots outside the unit circle
  • 40. Identifiability. Example: for the MA(1) model, $x_t = \mu + \epsilon_t - \vartheta_1\epsilon_{t-1}$, $\mathrm{var}(x_t) = (1+\vartheta_1^2)\sigma^2$, can also be written $x_t = \mu + \tilde\epsilon_{t-1} - \dfrac{1}{\vartheta_1}\tilde\epsilon_t$, $\tilde\epsilon \sim \mathcal N(0, \vartheta_1^2\sigma^2)$. Both pairs $(\vartheta_1, \sigma)$ & $(1/\vartheta_1, \vartheta_1\sigma)$ lead to alternative representations of the same model.
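Since a Gaussian MA(1) series is characterised by its mean and its autocovariances $\gamma_0$ and $\gamma_1$, the equivalence of the two parameterisations can be checked by comparing these two numbers. The sketch below does this for one arbitrary pair of values; `ma1_autocov` and the numerical values are assumptions for illustration.

```python
import numpy as np

def ma1_autocov(theta1, sigma):
    """(gamma_0, gamma_1) of x_t = mu + eps_t - theta1 * eps_{t-1}, eps_t ~ N(0, sigma^2)."""
    gamma0 = (1 + theta1 ** 2) * sigma ** 2
    gamma1 = -theta1 * sigma ** 2
    return gamma0, gamma1

theta1, sigma = 0.4, 1.5                      # hypothetical values
a = ma1_autocov(theta1, sigma)                # pair (theta1, sigma)
b = ma1_autocov(1 / theta1, theta1 * sigma)   # pair (1/theta1, theta1 * sigma)
```

Both `a` and `b` equal $(2.61, -0.9)$ up to rounding, so the data cannot distinguish the two pairs. Restricting $Q$ to have its roots outside the unit circle keeps only the pair with $|\vartheta_1| < 1$.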
  • 41. Properties of MA models. Non-Markovian model (but special case of hidden Markov); autocovariance $\gamma_x(s)$ is null for $|s| > q$
  • 42. Representations. $x_{1:T}$ is a normal random variable with constant mean $\mu$ and covariance matrix $\Sigma = \begin{pmatrix} \sigma^2 & \gamma_1 & \gamma_2 & \ldots & \gamma_q & 0 & \ldots & 0 & 0 \\ \gamma_1 & \sigma^2 & \gamma_1 & \ldots & \gamma_{q-1} & \gamma_q & \ldots & 0 & 0 \\ & & & \ddots & & & & & \\ 0 & 0 & 0 & \ldots & 0 & 0 & \ldots & \gamma_1 & \sigma^2 \end{pmatrix}$, with ($|s| \le q$) $\gamma_s = \sigma^2 \sum_{i=0}^{q-|s|} \vartheta_i \vartheta_{i+|s|}$. Not manageable in practice [large T's]
  • 43. Representations (contd.). Conditional on the past $(\epsilon_0,\ldots,\epsilon_{-q+1})$, $L(\mu,\vartheta_1,\ldots,\vartheta_q,\sigma\mid x_{1:T},\epsilon_0,\ldots,\epsilon_{-q+1}) \propto \sigma^{-T}\prod_{t=1}^T \exp\left\{-\Big(x_t - \mu + \sum_{j=1}^q \vartheta_j \hat\epsilon_{t-j}\Big)^2 \Big/ 2\sigma^2\right\}$, where ($t > 0$) $\hat\epsilon_t = x_t - \mu + \sum_{j=1}^q \vartheta_j \hat\epsilon_{t-j}$, $\hat\epsilon_0 = \epsilon_0, \ldots, \hat\epsilon_{1-q} = \epsilon_{1-q}$. Recursive definition of the likelihood, still costly: $O(T \times q)$
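The recursion for $\hat\epsilon_t$ translates directly into a loop. The following is a minimal sketch of the conditional log-likelihood, up to an additive constant, assuming the $q$ pre-sample innovations are passed in `eps_init` ordered as $(\epsilon_{1-q},\ldots,\epsilon_0)$ and default to zero. The function names and that ordering convention are choices made here, not taken from the slides.

```python
import numpy as np

def ma_residuals(x, mu, theta, eps_init=None):
    """Recursively rebuild eps_hat_1, ..., eps_hat_T given the q pre-sample innovations."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    q, T = len(theta), len(x)
    eps = np.zeros(T + q)                  # eps[:q] holds eps_{1-q}, ..., eps_0
    if eps_init is not None:
        eps[:q] = eps_init
    for t in range(T):
        past = eps[t:t + q][::-1]          # the q previous residuals, most recent first
        eps[t + q] = x[t] - mu + theta @ past
    return eps[q:]

def ma_conditional_loglik(x, mu, theta, sigma, eps_init=None):
    """Conditional log-likelihood up to an additive constant; O(T * q) operations."""
    resid = ma_residuals(x, mu, theta, eps_init)
    return -len(resid) * np.log(sigma) - 0.5 * np.sum(resid ** 2) / sigma ** 2
```

When the true pre-sample innovations are supplied, the recursion recovers the innovations exactly, which is a convenient sanity check on simulated data. Each likelihood evaluation costs $T \times q$ operations, and that cost is paid again at every iteration of a sampler.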
  • 44–45. Representations (contd.). Encompassing approach for general time series models. State-space representation: $x_t = G y_t + \varepsilon_t$ (1), $y_{t+1} = F y_t + \xi_t$ (2). (1) is the observation equation and (2) is the state equation. Note: this is a special case of hidden Markov model
  • 46. MA(q) state-space representation. For the MA(q) model, take $y_t = (\epsilon_{t-q},\ldots,\epsilon_{t-1},\epsilon_t)$ and then $y_{t+1} = \begin{pmatrix} 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ & & \ldots & & \\ 0 & 0 & 0 & \ldots & 1 \\ 0 & 0 & 0 & \ldots & 0 \end{pmatrix} y_t + \epsilon_{t+1}\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}$, $x_t = \mu - \begin{pmatrix} \vartheta_q & \vartheta_{q-1} & \ldots & \vartheta_1 & -1 \end{pmatrix} y_t$.
  • 47. MA(q) state-space representation (cont'd). Example: for the MA(1) model, observation equation $x_t = (1\ \ 0)\,y_t$ with $y_t = (y_{1t}\ \ y_{2t})'$ directed by the state equation $y_{t+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} y_t + \epsilon_{t+1}\begin{pmatrix} 1 \\ \vartheta_1 \end{pmatrix}$.
  • 48. MCMC and likelihood-free methods Part/day I: Markov chain methods Computational issues in Bayesian statistics Typology of problems c A typology of Bayes computational problems (i). latent variable models in general
  • 49. MCMC and likelihood-free methods Part/day I: Markov chain methods Computational issues in Bayesian statistics Typology of problems c A typology of Bayes computational problems (i). latent variable models in general (ii). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
  • 50. MCMC and likelihood-free methods Part/day I: Markov chain methods Computational issues in Bayesian statistics Typology of problems c A typology of Bayes computational problems (i). latent variable models in general (ii). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models; (iii). use of a complex sampling model with an intractable likelihood, as for instance in some graphical models;
  • 51. MCMC and likelihood-free methods Part/day I: Markov chain methods Computational issues in Bayesian statistics Typology of problems c A typology of Bayes computational problems (i). latent variable models in general (ii). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models; (iii). use of a complex sampling model with an intractable likelihood, as for instance in some graphical models; (iv). use of a huge dataset;
  • 52. MCMC and likelihood-free methods Part/day I: Markov chain methods Computational issues in Bayesian statistics Typology of problems c A typology of Bayes computational problems (i). latent variable models in general (ii). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models; (iii). use of a complex sampling model with an intractable likelihood, as for instance in some graphical models; (iv). use of a huge dataset; (v). use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
  • 53. Typology of problems: a typology of Bayes computational problems
      (i) latent variable models in general;
      (ii) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
      (iii) use of a complex sampling model with an intractable likelihood, as for instance in some graphical models;
      (iv) use of a huge dataset;
      (v) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
      (vi) use of a particular inferential procedure, as for instance Bayes factors
      $$B^{\pi}_{01}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \bigg/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}.$$
  • 54. The Metropolis-Hastings Algorithm. Outline: Computational issues in Bayesian statistics; The Metropolis-Hastings Algorithm; The Gibbs Sampler; Population Monte Carlo.
  • 55. Monte Carlo basics: General purpose. A major computational issue in Bayesian statistics: given a density $\pi$ known up to a normalizing constant ($\pi \propto \tilde\pi$) and an integrable function $h$, compute
      $$\Pi(h) = \int h(x)\,\pi(x)\,\mu(dx) = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)}$$
      when $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable.
  • 57. Monte Carlo basics: Monte Carlo 101. Generate an iid sample $x_1, \ldots, x_N$ from $\pi$ and estimate $\Pi(h)$ by
      $$\hat\Pi^{MC}_N(h) = N^{-1} \sum_{i=1}^{N} h(x_i).$$
      LLN: $\hat\Pi^{MC}_N(h) \xrightarrow{\text{a.s.}} \Pi(h)$.
      CLT: if $\Pi(h^2) = \int h^2(x)\,\pi(x)\,\mu(dx) < \infty$,
      $$\sqrt{N}\,\big(\hat\Pi^{MC}_N(h) - \Pi(h)\big) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, \Pi\{[h - \Pi(h)]^2\}\big).$$
      Caveat leading to MCMC: it is often impossible or inefficient to simulate directly from $\Pi$.
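The estimator and its CLT-based error on this slide can be illustrated with a minimal R sketch. The target ($\pi = \mathcal{N}(0,1)$) and the function ($h(x) = x^2$, so that $\Pi(h) = 1$) are assumptions chosen for illustration; they are not taken from the slides.

```r
# Plain Monte Carlo estimate of Pi(h), with a CLT-based standard error.
# Assumed toy setting (not from the slides): pi = N(0,1), h(x) = x^2, Pi(h) = 1.
set.seed(1)
N  <- 1e5
x  <- rnorm(N)                 # iid sample from pi
hx <- x^2                      # h evaluated on the sample
est <- mean(hx)                # N^{-1} * sum of h(x_i)
se  <- sd(hx) / sqrt(N)        # estimate of sqrt(Pi{[h - Pi(h)]^2} / N)
ci  <- est + c(-1, 1) * qnorm(0.975) * se   # approximate 95% interval
```

The interval `ci` relies on the finite-variance condition $\Pi(h^2) < \infty$ stated above; without it the standard error is not meaningful.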
  • 59. Importance Sampling. For a proposal distribution $Q$ such that $Q(dx) = q(x)\,\mu(dx)$, alternative representation
      $$\Pi(h) = \int h(x)\,\{\pi/q\}(x)\,q(x)\,\mu(dx).$$
      Principle of importance (!): generate an iid sample $x_1, \ldots, x_N \sim Q$ and estimate $\Pi(h)$ by
      $$\hat\Pi^{IS}_{Q,N}(h) = N^{-1} \sum_{i=1}^{N} h(x_i)\,\{\pi/q\}(x_i).$$
  • 61. Importance Sampling: Properties of importance. Then
      LLN: $\hat\Pi^{IS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \Pi(h)$,
      and, if $Q\big((h\pi/q)^2\big) < \infty$,
      CLT: $$\sqrt{N}\,\big(\hat\Pi^{IS}_{Q,N}(h) - \Pi(h)\big) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, Q\{(h\pi/q - \Pi(h))^2\}\big).$$
      Caveat: if the normalizing constant of $\pi$ is unknown, it is impossible to use $\hat\Pi^{IS}_{Q,N}$. This is a generic problem in Bayesian statistics: $\pi(\theta|x) \propto f(x|\theta)\,\pi(\theta)$.
  • 64. Importance Sampling: Self-Normalised Importance Sampling. Self-normalized version
      $$\hat\Pi^{SNIS}_{Q,N}(h) = \Big(\sum_{i=1}^{N} \{\pi/q\}(x_i)\Big)^{-1} \sum_{i=1}^{N} h(x_i)\,\{\pi/q\}(x_i).$$
      LLN: $\hat\Pi^{SNIS}_{Q,N}(h) \xrightarrow{\text{a.s.}} \Pi(h)$,
      and, if $\Pi\big((1 + h^2)(\pi/q)\big) < \infty$,
      CLT: $$\sqrt{N}\,\big(\hat\Pi^{SNIS}_{Q,N}(h) - \Pi(h)\big) \xrightarrow{\mathcal{L}} \mathcal{N}\big(0, \Pi\{(\pi/q)\,(h - \Pi(h))^2\}\big).$$
      The quality of the SNIS approximation depends on the choice of $Q$.
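A short R sketch of the self-normalised estimator follows. The setting is assumed for illustration only: the target is $\mathcal{N}(0,1)$ known only through $\tilde\pi(x) = \exp(-x^2/2)$, the proposal $Q$ is a Student $t_3$ (heavier tails, so that $\pi/q$ is bounded), and $h(x) = x^2$. The effective sample size `ess` is a common diagnostic that does not appear on the slides.

```r
# Self-normalised importance sampling with an unnormalised target.
# Assumptions (illustrative): pi~(x) = exp(-x^2/2), Q = Student t with 3 df, h(x) = x^2.
set.seed(2)
N <- 1e5
x <- rt(N, df = 3)                      # iid sample from Q
w <- exp(-x^2 / 2) / dt(x, df = 3)      # unnormalised weights pi~/q
snis <- sum(w * x^2) / sum(w)           # SNIS estimate of Pi(h); the constant cancels
ess  <- sum(w)^2 / sum(w^2)             # effective sample size (extra diagnostic)
```

Because the unknown normalizing constant appears in both sums, it cancels in the ratio, which is what makes the estimator usable for posteriors known up to a constant.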
  • 67. Monte Carlo Methods based on Markov Chains: Running Monte Carlo via Markov Chains (MCMC). It is not necessary to use a sample from the distribution $f$ to approximate the integral
      $$I = \int h(x)\,f(x)\,dx.$$
      [notation warning: $\pi$ turned to $f$!]
      We can obtain $X_1, \ldots, X_n \sim f$ (approx) without directly simulating from $f$, using an ergodic Markov chain with stationary distribution $f$.
      [Portrait: Andreï Markov]
  • 69. Monte Carlo Methods based on Markov Chains: Running Monte Carlo via Markov Chains (2). Idea: for an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is generated using a transition kernel with stationary distribution $f$.
      - An irreducible Markov chain with stationary distribution $f$ is ergodic with limiting distribution $f$ under weak conditions;
      - hence convergence in distribution of $(X^{(t)})$ to a random variable from $f$;
      - for $T_0$ "large enough", $X^{(T_0)}$ is (approximately) distributed from $f$;
      - the Markov sequence is a dependent sample $X^{(T_0)}, X^{(T_0+1)}, \ldots$ generated from $f$;
      - Birkhoff's ergodic theorem extends the LLN, which is sufficient for most approximation purposes.
  • 70. Monte Carlo Methods based on Markov Chains: Running Monte Carlo via Markov Chains (2). Problem: how can one build a Markov chain with a given stationary distribution?
  • 71. The Metropolis–Hastings algorithm: Basics. The algorithm uses the objective (target) density $f$ and a conditional density $q(y|x)$ called the instrumental (or proposal) distribution.
      [Portrait: Nicholas Metropolis]
  • 72. The Metropolis–Hastings algorithm: The MH algorithm. Algorithm (Metropolis–Hastings). Given $x^{(t)}$,
      1. Generate $Y_t \sim q(y|x^{(t)})$.
      2. Take
      $$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t), \end{cases}$$
      where
      $$\rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)},\; 1 \right\}.$$
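The two steps of the algorithm can be sketched in R as below. This is a minimal scalar version under stated assumptions, not the author's implementation: the function names `mh`, `logf`, `rq` and `logq` are hypothetical, the target is an unnormalised Gamma(3, 1) density, and the proposal is a multiplicative log-normal step, picked because it is not symmetric and therefore exercises the $q(x|y)/q(y|x)$ term.

```r
# Generic Metropolis-Hastings sketch (scalar state), working on the log scale.
# logf: log target up to a constant; rq(x): draw y ~ q(.|x); logq(y, x): log q(y|x).
mh <- function(logf, rq, logq, x0, niter) {
  x <- numeric(niter)
  x[1] <- x0
  for (t in 2:niter) {
    cur <- x[t - 1]
    y <- rq(cur)                                   # step 1: proposal
    # log rho = log f(y) - log f(x) + log q(x|y) - log q(y|x)
    logrho <- logf(y) - logf(cur) + logq(cur, y) - logq(y, cur)
    x[t] <- if (log(runif(1)) < logrho) y else cur # step 2: accept or stay
  }
  x
}

# Assumed toy target: f(x) proportional to x^2 exp(-x) on x > 0, i.e. Gamma(3, 1).
set.seed(3)
s <- 1                                             # log-normal proposal scale
chain <- mh(
  logf = function(x) 2 * log(x) - x,
  rq   = function(x) x * exp(s * rnorm(1)),
  logq = function(y, x) dlnorm(y, meanlog = log(x), sdlog = s, log = TRUE),
  x0 = 1, niter = 20000
)
```

The chain average `mean(chain)` should be close to the Gamma(3, 1) mean of 3. Only ratios of $f$ enter the acceptance step, so the missing normalizing constant of the target is never needed.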
  • 73. The Metropolis–Hastings algorithm: Features
      - Independent of normalizing constants for both $f$ and $q(\cdot|x)$ (i.e., those constants independent of $x$).
      - The chain never moves to values with $f(y) = 0$.
      - The chain $(x^{(t)})_t$ may take the same value several times in a row, even though $f$ is a density wrt Lebesgue measure.
      - The sequence $(y_t)_t$ is usually not a Markov chain.
  • 76. The Metropolis–Hastings algorithm: Convergence properties
      1. The M-H Markov chain is reversible, with invariant/stationary density $f$, since it satisfies the detailed balance condition
      $$f(y)\,K(y, x) = f(x)\,K(x, y).$$
      2. As $f$ is a probability measure, the chain is positive recurrent.
      3. If
      $$\Pr\left[ \frac{f(Y_t)\,q(X^{(t)}|Y_t)}{f(X^{(t)})\,q(Y_t|X^{(t)})} \ge 1 \right] < 1, \qquad (1)$$
      that is, if the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain is aperiodic.
  • 79. The Metropolis–Hastings algorithm: Convergence properties (2)
      4. If
      $$q(y|x) > 0 \text{ for every } (x, y), \qquad (2)$$
      the chain is irreducible.
      5. For M-H, $f$-irreducibility implies Harris recurrence.
      6. Thus, for M-H satisfying (1) and (2):
         (i) for $h$ with $E_f|h(X)| < \infty$,
         $$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} h(X^{(t)}) = \int h(x)\,df(x) \quad \text{a.e. } f;$$
         (ii) and
         $$\lim_{n \to \infty} \left\| \int K^n(x, \cdot)\,\mu(dx) - f \right\|_{TV} = 0$$
         for every initial distribution $\mu$, where $K^n(x, \cdot)$ denotes the kernel for $n$ transitions.
  • 80. Random-walk Metropolis-Hastings algorithms: Random walk Metropolis–Hastings. Use of a local perturbation as proposal,
      $$Y_t = X^{(t)} + \varepsilon_t,$$
      where $\varepsilon_t \sim g$, independent of $X^{(t)}$. The instrumental density is of the form $g(y - x)$ and the Markov chain is a random walk if we take $g$ to be symmetric, $g(x) = g(-x)$.
  • 81. Random-walk Metropolis-Hastings algorithms: Random walk Metropolis–Hastings. Algorithm (Random walk Metropolis). Given $x^{(t)}$,
      1. Generate $Y_t \sim g(y - x^{(t)})$.
      2. Take
      $$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{1, \dfrac{f(Y_t)}{f(x^{(t)})}\right\}, \\ x^{(t)} & \text{otherwise.} \end{cases}$$
  • 82. Random-walk Metropolis-Hastings algorithms: The original example. Example (Random walk and normal target): generate $\mathcal{N}(0, 1)$ based on the uniform proposal on $[-\delta, \delta]$. The probability of acceptance is then
      $$\rho(x^{(t)}, y_t) = \exp\{(x^{(t)2} - y_t^2)/2\} \wedge 1.$$
      [Hastings (1970)]
  • 83. Random-walk Metropolis-Hastings algorithms: The original example. Example (Random walk & normal (2)). Sample statistics:
      δ          0.1      0.5      1.0
      mean       0.399   -0.111    0.10
      variance   0.698    1.11     1.06
      As δ ↑, we get better histograms and a faster exploration of the support of $f$.
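The Hastings (1970) example can be reproduced in spirit with the R sketch below. The function name `hastings`, the starting value and the seed are assumptions, so the numbers it produces will not match the table above exactly; the 15,000 iterations follow the figure caption of the example.

```r
# Random walk Metropolis for a N(0,1) target with a U[-delta, delta] perturbation.
hastings <- function(delta, niter = 15000, x0 = 0) {
  x <- numeric(niter)
  x[1] <- x0
  for (t in 2:niter) {
    y <- x[t - 1] + runif(1, -delta, delta)        # local uniform perturbation
    rho <- min(1, exp((x[t - 1]^2 - y^2) / 2))     # acceptance probability
    x[t] <- if (runif(1) < rho) y else x[t - 1]
  }
  x
}

set.seed(4)
out <- sapply(c(0.1, 0.5, 1.0), function(d) {
  x <- hastings(d)
  c(mean = mean(x), variance = var(x), accept = mean(diff(x) != 0))
})
colnames(out) <- c("delta=0.1", "delta=0.5", "delta=1.0")
```

The `accept` row gives the empirical acceptance rate, which decreases as δ grows while the sample moments move closer to 0 and 1.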
  • 84. Random-walk Metropolis-Hastings algorithms: The original example. [Figure, panels (a), (b), (c): samples based on $U[-\delta, \delta]$ with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 si…]
  • 86. Random-walk Metropolis-Hastings algorithms: Mixtures by random walk MH. Example (Mixture models)
      $$\pi(\theta|x) \propto \prod_{j=1}^{n} \left( \sum_{\ell=1}^{k} p_\ell\, f(x_j | \mu_\ell, \sigma_\ell) \right) \pi(\theta).$$
      Metropolis-Hastings proposal:
      $$\theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega\varepsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)}, \\ \theta^{(t)} & \text{otherwise,} \end{cases}$$
      where
      $$\rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega\varepsilon^{(t)} | x)}{\pi(\theta^{(t)} | x)} \wedge 1$$
      and $\omega$ is scaled for a good acceptance rate.
  • 87. Random-walk Metropolis-Hastings algorithms: Mixtures by random walk MH. [Figure: random walk sampling (50000 iterations), pairwise plots and marginals of theta, p and tau. General case of a 3 component normal mixture.] [Celeux & al., 2000]
  • 88. Random-walk Metropolis-Hastings algorithms: Mixtures by random walk MH. [Figure, axes $\mu_1$ and $\mu_2$: random walk MCMC output for $.7\,\mathcal{N}(\mu_1, 1) + .3\,\mathcal{N}(\mu_2, 1)$]
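A hedged R sketch of a random walk sampler for the two-mean mixture $.7\,\mathcal{N}(\mu_1, 1) + .3\,\mathcal{N}(\mu_2, 1)$ is given below. Everything beyond the mixture itself is an assumption made for illustration: the simulated data (true means 0 and 2.5, n = 200), the $\mathcal{N}(0, 10^2)$ prior on each mean, the starting point, the scale $\omega = 0.3$ and the names `logpost` and `rwmh`.

```r
# Random walk MH on (mu1, mu2) for the mixture .7 N(mu1, 1) + .3 N(mu2, 1).
set.seed(5)
n <- 200
x <- ifelse(runif(n) < 0.7, rnorm(n, 0), rnorm(n, 2.5))   # hypothetical data

# Log posterior up to a constant, with an assumed N(0, 10^2) prior on each mean.
logpost <- function(mu) {
  sum(log(0.7 * dnorm(x, mu[1]) + 0.3 * dnorm(x, mu[2]))) +
    sum(dnorm(mu, 0, 10, log = TRUE))
}

rwmh <- function(mu0, omega, niter) {
  mu <- matrix(NA_real_, niter, 2)
  mu[1, ] <- mu0
  lp <- logpost(mu0)
  for (t in 2:niter) {
    prop <- mu[t - 1, ] + omega * rnorm(2)        # theta + omega * epsilon
    lpp <- logpost(prop)
    if (log(runif(1)) < lpp - lp) {               # u < rho, on the log scale
      mu[t, ] <- prop
      lp <- lpp
    } else {
      mu[t, ] <- mu[t - 1, ]
    }
  }
  mu
}

chain <- rwmh(c(1, 1), omega = 0.3, niter = 5000)
acc <- mean(rowSums(abs(diff(chain))) > 0)        # empirical acceptance rate
```

Plotting `chain[, 1]` against `chain[, 2]` gives the kind of path shown in the figure. Depending on the starting point, such a chain may settle in a secondary mode of the posterior, so a single run is not a convergence check.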
  • 90. Random-walk Metropolis-Hastings algorithms: Convergence properties. Uniform ergodicity is prohibited by the random walk structure. At best, geometric ergodicity:
      Theorem (Sufficient ergodicity). For a symmetric density $f$, log-concave in the tails, and a positive and symmetric density $g$, the chain $(X^{(t)})$ is geometrically ergodic. [Mengersen & Tweedie, 1996]
  • 91. Random-walk Metropolis-Hastings algorithms: Illustration of the tail effect. Example (Comparison of tails): random walk Metropolis–Hastings algorithms based on a $\mathcal{N}(0, 1)$ instrumental for the generation of (left) a $\mathcal{N}(0, 1)$ distribution and (right) a distribution with density $\psi(x) \propto (1 + |x|)^{-3}$.
      [Figure, panels (a) and (b): 90% confidence envelopes of the means, derived from 500 parallel independent…]
  • 92. Random-walk Metropolis-Hastings algorithms: Further convergence properties. Under assumptions
      (A1) $f$ is super-exponential, i.e. it is positive with continuous first derivatives such that $\lim_{|x| \to \infty} n(x)' \nabla \log f(x) = -\infty$, where $n(x) := x/|x|$. In words: exponential decay of $f$ in every direction with rate tending to $\infty$;
      (A2) $\limsup_{|x| \to \infty} n(x)' m(x) < 0$, where $m(x) = \nabla f(x)/|\nabla f(x)|$. In words: non-degeneracy of the contour manifold $C_{f(x)} = \{y : f(y) = f(x)\}$;
      $Q$ is geometrically ergodic, and $V(x) \propto f(x)^{-1/2}$ verifies the drift condition. [Jarner & Hansen, 2000]
  • 93. Random-walk Metropolis-Hastings algorithms: Further [further] convergence properties. If $P$ is $\psi$-irreducible and aperiodic, for $r = (r(n))_{n \in \mathbb{N}}$ a real-valued non-decreasing sequence such that, for all $n, m \in \mathbb{N}$, $r(n + m) \le r(n)\,r(m)$ and $r(0) = 1$, for $C$ a small set, $\tau_C = \inf\{n \ge 1, X_n \in C\}$, and $h \ge 1$, assume
      $$\sup_{x \in C} E_x\left[ \sum_{k=0}^{\tau_C - 1} r(k)\,h(X_k) \right] < \infty,$$
  • 94. Random-walk Metropolis-Hastings algorithms: Further [further] convergence properties. then
      $$S(f, C, r) := \left\{ x \in X,\; E_x\left[ \sum_{k=0}^{\tau_C - 1} r(k)\,h(X_k) \right] < \infty \right\}$$
      is full and absorbing and, for $x \in S(f, C, r)$,
      $$\lim_{n \to \infty} r(n)\, \| P^n(x, \cdot) - f \|_h = 0.$$
      [Tuominen & Tweedie, 1994]
  • 95. Random-walk Metropolis-Hastings algorithms: Comments
      [CLT, Rosenthal's inequality...] $h$-ergodicity implies the CLT for additive (possibly unbounded) functionals of the chain, Rosenthal's inequality and so on.
      [Control of the moments of the return-time] The condition implies (because $h \ge 1$) that
      $$\sup_{x \in C} E_x[r_0(\tau_C)] \le \sup_{x \in C} E_x\left[ \sum_{k=0}^{\tau_C - 1} r(k)\,h(X_k) \right] < \infty,$$
      where $r_0(n) = \sum_{l=0}^{n} r(l)$. This can be used to derive bounds for the coupling time, an essential step to determine computable bounds, using coupling inequalities. [Roberts & Tweedie, 98; Fort & Moulines, 00; Jones et al., 02]
  • 96. Random-walk Metropolis-Hastings algorithms: Alternative conditions. The condition is not really easy to work with... [Possible alternative conditions]
      (a) [Tuominen, Tweedie, 1994] There exists a sequence $(V_n)_{n \in \mathbb{N}}$, $V_n \ge r(n)h$, such that
          (i) $\sup_C V_0 < \infty$,
          (ii) $\{V_0 = \infty\} \subset \{V_1 = \infty\}$, and
          (iii) $P V_{n+1} \le V_n - r(n)h + b\,r(n)\,\mathbb{I}_C$.
  • 97. Random-walk Metropolis-Hastings algorithms: Alternative conditions
      (b) [Fort 2000] $\exists\, V \ge f \ge 1$ and $b < \infty$, such that $\sup_C V < \infty$ and
      $$P V(x) + E_x\left[ \sum_{k=0}^{\sigma_C} \Delta r(k)\, f(X_k) \right] \le V(x) + b\,\mathbb{I}_C(x),$$
      where $\sigma_C$ is the hitting time on $C$ and $\Delta r(k) = r(k) - r(k - 1)$, $k \ge 1$, and $\Delta r(0) = r(0)$.
      Result: (a) ⇔ (b) ⇔ $\sup_{x \in C} E_x\left[ \sum_{k=0}^{\tau_C - 1} r(k)\, f(X_k) \right] < \infty$.
  • 98. Extensions: Langevin Algorithms. Proposal based on the Langevin diffusion $L_t$, defined by the stochastic differential equation
      $$dL_t = dB_t + \frac{1}{2} \nabla \log f(L_t)\,dt,$$
      where $B_t$ is the standard Brownian motion.
      Theorem. The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to $f$.
  • 100. Extensions: Discretization. Instead, consider the sequence
      $$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) + \sigma\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}_p(0, I_p),$$
      where $\sigma^2$ corresponds to the discretization step. Unfortunately, the discretized chain may be transient, for instance when
      $$\lim_{x \to \pm\infty} \left| \sigma^2\, \nabla \log f(x)\, |x|^{-1} \right| > 1.$$
  • 101. Extensions: MH correction. Accept the new value $Y_t$ with probability
      $$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp\left\{ -\left\| x^{(t)} - Y_t - \frac{\sigma^2}{2} \nabla \log f(Y_t) \right\|^2 \big/ 2\sigma^2 \right\}}{\exp\left\{ -\left\| Y_t - x^{(t)} - \frac{\sigma^2}{2} \nabla \log f(x^{(t)}) \right\|^2 \big/ 2\sigma^2 \right\}} \wedge 1,$$
      that is, the ratio $f(y)\,q(x|y) / \{f(x)\,q(y|x)\}$ of the MH algorithm with $q(y|x)$ the Gaussian density of the discretized Langevin move.
      Choice of the scaling factor $\sigma$: it should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of $x$ are uncorrelated). [Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]
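The Langevin proposal with its MH correction can be sketched in R as follows, in one dimension. The function name `mala`, the $\mathcal{N}(0, 1)$ target, the scale $\sigma = 1.5$ and the seed are assumptions for illustration; the acceptance ratio is written as $f(y)\,q(x|y) / \{f(x)\,q(y|x)\}$, matching the general MH algorithm.

```r
# Metropolis-adjusted Langevin sketch for a scalar target.
# logf: log target up to a constant; grad: derivative of log f.
mala <- function(logf, grad, x0, sigma, niter) {
  x <- numeric(niter)
  x[1] <- x0
  acc <- 0
  # log q(y|x) up to a constant: y ~ N(x + sigma^2/2 * grad(x), sigma^2)
  logq <- function(y, x) -(y - x - sigma^2 / 2 * grad(x))^2 / (2 * sigma^2)
  for (t in 2:niter) {
    cur <- x[t - 1]
    y <- cur + sigma^2 / 2 * grad(cur) + sigma * rnorm(1)   # discretized move
    logrho <- logf(y) - logf(cur) + logq(cur, y) - logq(y, cur)
    if (log(runif(1)) < logrho) {
      x[t] <- y
      acc <- acc + 1
    } else {
      x[t] <- cur
    }
  }
  list(x = x, rate = acc / (niter - 1))
}

# Assumed toy target: N(0,1), so log f(x) = -x^2/2 and grad log f(x) = -x.
set.seed(6)
res <- mala(function(x) -x^2 / 2, function(x) -x,
            x0 = 0, sigma = 1.5, niter = 20000)
```

In practice `sigma` would be tuned by monitoring `res$rate` against the 0.574 target quoted on the slide, which is an asymptotic high-dimensional result and only a rough guide for a scalar example like this one.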
  • 102. Extensions: Optimizing the Acceptance Rate. Problem of choice of the transition kernel from a practical point of view. Most common alternatives:
      (a) a fully automated algorithm like ARMS;
      (b) an instrumental density $g$ which approximates $f$, such that $f/g$ is bounded for uniform ergodicity to apply;
      (c) a random walk.
      In both cases (b) and (c), the choice of $g$ is critical.
  • 104. Extensions: Case of the random walk. Different approach to acceptance rates. A high acceptance rate does not necessarily indicate that the algorithm is moving correctly, since it may indicate that the random walk is moving too slowly on the surface of $f$. If $x^{(t)}$ and $y_t$ are close, i.e. $f(x^{(t)}) \simeq f(y_t)$, $y_t$ is accepted with probability
      $$\min\left\{ \frac{f(y_t)}{f(x^{(t)})},\; 1 \right\} \simeq 1.$$
      For multimodal densities with well separated modes, the negative effect of limited moves on the surface of $f$ clearly shows.
  • 105. Extensions: Case of the random walk (2). If the average acceptance rate is low, the successive values of $f(y_t)$ tend to be small compared with $f(x^{(t)})$, which means that the random walk moves quickly on the surface of $f$ since it often reaches the "borders" of the support of $f$.
  • 107. Extensions: Rule of thumb. In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, aim at an average acceptance rate of 25%. [Gelman, Gilks and Roberts, 1995] This rule is to be taken with a pinch of salt!
  • 109. Extensions: Role of scale. Example (Noisy AR(1)): hidden Markov chain from a regular AR(1) model,
      $$x_{t+1} = \varphi x_t + \epsilon_{t+1}, \qquad \epsilon_t \sim \mathcal{N}(0, \tau^2),$$
      and observables
      $$y_t | x_t \sim \mathcal{N}(x_t^2, \sigma^2).$$
      The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is proportional to
      $$\exp\left\{ \frac{-1}{2\tau^2} \left[ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2} (y_t - x_t^2)^2 \right] \right\}.$$
  • 110. Extensions: Role of scale. Example (Noisy AR(1) continued): for a Gaussian random walk with scale $\omega$ small enough, the random walk never jumps to the other mode. But if the scale $\omega$ is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
  • 111. Extensions: Role of scale. [Figure: Markov chain based on a random walk with scale ω = .1.]
  • 112. Extensions: Role of scale. [Figure: Markov chain based on a random walk with scale ω = .5.]
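The effect of the scale on the noisy AR(1) conditional can be sketched in R as below. The slides do not give the parameter values behind the figures, so all numbers here ($\varphi = 0.5$, $\tau = 1$, $\sigma = 0.08$, $x_{t-1} = x_{t+1} = 0$, $y_t = 0.5$) are hypothetical, picked so that the target has two narrow modes near $\pm\sqrt{y_t}$ separated by a deep valley; the function name `rw` and the starting value are likewise assumptions.

```r
# Gaussian random walk MH on the conditional of x_t in the noisy AR(1) model.
# All parameter values below are hypothetical, chosen to give two separated modes.
phi <- 0.5; tau <- 1; sigma <- 0.08
xprev <- 0; xnext <- 0; y <- 0.5

# Log of the conditional density of x_t, up to an additive constant.
logf <- function(x) {
  -((x - phi * xprev)^2 + (xnext - phi * x)^2 +
      (tau^2 / sigma^2) * (y - x^2)^2) / (2 * tau^2)
}

rw <- function(omega, niter = 20000, x0 = 0.7) {
  x <- numeric(niter)
  x[1] <- x0
  for (t in 2:niter) {
    prop <- x[t - 1] + omega * rnorm(1)
    x[t] <- if (log(runif(1)) < logf(prop) - logf(x[t - 1])) prop else x[t - 1]
  }
  x
}

set.seed(7)
small <- rw(0.1)                                   # scale omega = .1
large <- rw(0.5)                                   # scale omega = .5
frac_neg <- c(small = mean(small < 0), large = mean(large < 0))
```

`frac_neg` reports the share of iterations spent in the negative mode: with the small scale the chain is expected to stay in the mode where it started, while the larger scale lets it jump across the valley.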
  • 113. Extensions: MA(2). Since the constraints on $(\vartheta_1, \vartheta_2)$ are well-defined, use a flat prior over the triangle. Simple representation of the likelihood:
      library(mnormt)
      # log-likelihood of an MA(2) series y (taken from the global environment)
      ma2like = function(theta){
        n = length(y)
        # Toeplitz covariance matrix built from the MA(2) autocovariances at lags 0, 1, 2
        sigma = toeplitz(c(1 + theta[1]^2 + theta[2]^2,
                           theta[1] + theta[1]*theta[2],
                           theta[2], rep(0, n-3)))
        dmnorm(y, rep(0, n), sigma, log=TRUE)
      }
  • 114. Extensions: Basic RWMH for MA(2). Algorithm 1: RW-MH-MA(2) sampler
      set $\omega$ and $\vartheta^{(1)}$
      for $i = 2$ to $T$ do
        generate $\tilde\vartheta_j \sim \mathcal{U}(\vartheta_j^{(i-1)} - \omega,\; \vartheta_j^{(i-1)} + \omega)$
        set $p = 0$ and $\vartheta^{(i)} = \vartheta^{(i-1)}$
        if $\tilde\vartheta$ within the triangle then
          $p = \exp(\text{ma2like}(\tilde\vartheta) - \text{ma2like}(\vartheta^{(i-1)}))$
        end if
        if $U < p$ then
          $\vartheta^{(i)} = \tilde\vartheta$
        end if
      end for
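Algorithm 1 can be turned into runnable R along the following lines. This is a sketch under assumptions, not the author's code: the data are simulated with hypothetical true values $(\vartheta_1, \vartheta_2) = (0.6, 0.2)$, the likelihood is recomputed in base R through a Cholesky factor rather than with `mnormt::dmnorm`, the triangle is taken to be the usual MA(2) invertibility region ($|\vartheta_2| < 1$, $\vartheta_1 + \vartheta_2 > -1$, $\vartheta_1 - \vartheta_2 < 1$), and the names `in_triangle` and `rwmh_ma2` are illustrative.

```r
# Random walk MH for the MA(2) parameters, following Algorithm 1.
set.seed(8)
n <- 100
eps <- rnorm(n + 2)
y <- eps[3:(n + 2)] + 0.6 * eps[2:(n + 1)] + 0.2 * eps[1:n]  # hypothetical data

# Gaussian log-likelihood via Cholesky (base-R stand-in for dmnorm(..., log = TRUE)).
ma2like <- function(theta) {
  acov <- c(1 + theta[1]^2 + theta[2]^2, theta[1] + theta[1] * theta[2],
            theta[2], rep(0, n - 3))
  R <- chol(toeplitz(acov))                  # Sigma = t(R) %*% R
  z <- backsolve(R, y, transpose = TRUE)     # solves t(R) z = y
  -sum(log(diag(R))) - sum(z^2) / 2 - n / 2 * log(2 * pi)
}

# Assumed form of the triangle: the MA(2) invertibility region.
in_triangle <- function(th) {
  abs(th[2]) < 1 && th[1] + th[2] > -1 && th[1] - th[2] < 1
}

rwmh_ma2 <- function(omega, niter, theta0 = c(0, 0)) {
  theta <- matrix(NA_real_, niter, 2)
  theta[1, ] <- theta0
  ll <- ma2like(theta0)
  for (i in 2:niter) {
    prop <- theta[i - 1, ] + runif(2, -omega, omega)   # uniform perturbation
    theta[i, ] <- theta[i - 1, ]                        # default: stay put
    if (in_triangle(prop)) {                            # flat prior on the triangle
      llp <- ma2like(prop)
      if (log(runif(1)) < llp - ll) {                   # U < p, on the log scale
        theta[i, ] <- prop
        ll <- llp
      }
    }
  }
  theta
}

out <- rwmh_ma2(omega = 0.2, niter = 3000)
```

Comparing log-likelihoods on the log scale avoids the overflow that `exp()` of a large difference could cause, and caching `ll` saves one likelihood evaluation per iteration compared with the pseudocode.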