Approximate Bayesian Computation:
        how Bayesian can it be?

             Christian P. Robert
ISBA LECTURES ON BAYESIAN FOUNDATIONS
          ISBA 2012, Kyoto, Japan

        Université Paris-Dauphine, IuF, & CREST
       http://www.ceremade.dauphine.fr/~xian


                  June 24, 2012
This talk is dedicated to the memory of our dear friend
and fellow Bayesian, George Casella, 1951–2012
Outline
Introduction

Approximate Bayesian computation

ABC as an inference machine
Approximate Bayesian computation



   Introduction
       Monte Carlo basics
       simulation-based methods in Econometrics

   Approximate Bayesian computation

   ABC as an inference machine
General issue




   Given a density π known up to a normalizing constant, e.g. a
   posterior density, and an integrable function h, how can one use

          I_h = ∫ h(x) π(x) µ(dx) = ∫ h(x) π̃(x) µ(dx) / ∫ π̃(x) µ(dx)

   when ∫ h(x) π̃(x) µ(dx) is intractable?
Monte Carlo basics

   Generate an iid sample x_1, . . . , x_N from π and estimate I_h by

          Î_N^MC(h) = N^{-1} Σ_{i=1}^N h(x_i),

   since [LLN]  Î_N^MC(h) → I_h  a.s.

   Furthermore, if I_{h²} = ∫ h²(x) π(x) µ(dx) < ∞,

          [CLT]  √N ( Î_N^MC(h) − I_h ) → N( 0, I[{h − I_h}²] )  in distribution.

   Caveat
   Often impossible or inefficient to simulate directly from π
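
   A minimal R sketch of this estimator (not from the slides; the target π is
   taken as a standard normal and h(x) = x², so that I_h = 1, as illustration):

   # Minimal sketch, assuming pi = N(0,1) and h(x) = x^2 (so I_h = 1)
   N <- 1e5
   x <- rnorm(N)                      # iid sample from pi
   h <- function(x) x^2
   Ihat <- mean(h(x))                 # Monte Carlo estimate of I_h
   se   <- sd(h(x)) / sqrt(N)         # CLT-based standard error
   c(estimate = Ihat, std.error = se)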
Importance Sampling



   For Q proposal distribution with density q,

          I_h = ∫ h(x) {π/q}(x) q(x) µ(dx).

   Principle
   Generate an iid sample x_1, . . . , x_N ∼ Q and estimate I_h by

          Î_{Q,N}^IS(h) = N^{-1} Σ_{i=1}^N h(x_i) {π/q}(x_i).
Importance Sampling (convergence)



   Then
   [LLN]  Î_{Q,N}^IS(h) → I_h  a.s.   and, if Q((hπ/q)²) < ∞,

   [CLT]  √N ( Î_{Q,N}^IS(h) − I_h ) → N( 0, Q{(hπ/q − I_h)²} )  in distribution.

   Caveat
   If normalizing constant unknown, impossible to use Î_{Q,N}^IS(h)

   Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
Self-normalised importance Sampling


   Self-normalised version

          Î_{Q,N}^SNIS(h) = ( Σ_{i=1}^N {π/q}(x_i) )^{-1} Σ_{i=1}^N h(x_i) {π/q}(x_i).

   [LLN]  Î_{Q,N}^SNIS(h) → I_h  a.s.
   and, if I((1 + h²)(π/q)) < ∞,

   [CLT]  √N ( Î_{Q,N}^SNIS(h) − I_h ) → N( 0, π{(π/q)(h − I_h)²} )  in distribution.

   Caveat
   If π cannot be computed, impossible to use Î_{Q,N}^SNIS(h)
Perspectives



   What is the fundamental issue?
       a mere computational issue (optimism: can be solved /
       pessimism: too costly in the short term)
       more of an inferential issue (optimism: gathering legitimacy from the
       classical B approach / pessimism: lacking the coherence of the
       classical B approach)
       calling for a new methodology (optimism: equivalent to the
       classical B approach / pessimism: not always convergent)
Econom’ections



                                                               Model choice



   Similar exploration of simulation-based and approximation
   techniques in Econometrics
       Simulated method of moments
       Method of simulated moments
       Simulated pseudo-maximum-likelihood
       Indirect inference
                                       [Gouriéroux & Monfort, 1996]
Simulated method of moments

  Given observations y^o_{1:n} from a model

          y_t = r(y_{1:(t−1)}, ε_t, θ),    ε_t ∼ g(·)

    1. simulate ε_{1:n}, derive

          y_t(θ) = r(y_{1:(t−1)}, ε_t, θ)

    2. and estimate θ by

          arg min_θ Σ_{t=1}^n ( y_t^o − y_t(θ) )²
Simulated method of moments

  Given observations y^o_{1:n} from a model

          y_t = r(y_{1:(t−1)}, ε_t, θ),    ε_t ∼ g(·)

    1. simulate ε_{1:n}, derive

          y_t(θ) = r(y_{1:(t−1)}, ε_t, θ)

    2. and estimate θ by

          arg min_θ ( Σ_{t=1}^n y_t^o − Σ_{t=1}^n y_t(θ) )²
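
  A hedged R sketch of the simulated-moment criterion above (not from the
  slides; the toy model y_t = exp(θ) + ε_t and the use of common random
  numbers across values of θ are illustration choices):

   # Sketch: simulated method of moments, matching observed and simulated sums
   set.seed(1)
   n      <- 200
   theta0 <- 0.7
   yobs   <- exp(theta0) + rnorm(n)        # "observed" data y^o_{1:n}
   eps    <- rnorm(n)                      # simulated noise, held fixed across theta
   crit <- function(theta) {
     ysim <- exp(theta) + eps              # y_t(theta) = r(., eps_t, theta)
     (sum(yobs) - sum(ysim))^2             # criterion of the second slide
   }
   optimize(crit, interval = c(-5, 5))$minimum   # estimate of theta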
Method of simulated moments

  Given a statistic vector K(y) with

          E_θ[ K(Y_t) | y_{1:(t−1)} ] = k(y_{1:(t−1)}; θ),

  find an unbiased estimator of k(y_{1:(t−1)}; θ),

          k̃(ε_t, y_{1:(t−1)}; θ).

  Estimate θ by

          arg min_θ ‖ Σ_{t=1}^n { K(y_t) − Σ_{s=1}^S k̃(ε_s^t, y_{1:(t−1)}; θ)/S } ‖

                                                       [Pakes & Pollard, 1989]
Indirect inference




   Minimise [in θ] a distance between estimators β̂ based on a
   pseudo-model for genuine observations and for observations
   simulated under the true model and the parameter θ.

                             [Gouriéroux, Monfort, & Renault, 1993;
                              Smith, 1993; Gallant & Tauchen, 1996]
Indirect inference (PML vs. PSE)


   Example of the pseudo-maximum-likelihood (PML)

          β̂(y) = arg max_β Σ_t log f(y_t | β, y_{1:(t−1)})

   leading to

          arg min_θ ‖ β̂(y^o) − β̂(y_1(θ), . . . , y_S(θ)) ‖²

   when
          y_s(θ) ∼ f(y|θ),    s = 1, . . . , S
Indirect inference (PML vs. PSE)


   Example of the pseudo-score estimator (PSE)

          β̂(y) = arg min_β Σ_t { ∂ log f/∂β (y_t | β, y_{1:(t−1)}) }²

   leading to

          arg min_θ ‖ β̂(y^o) − β̂(y_1(θ), . . . , y_S(θ)) ‖²

   when
          y_s(θ) ∼ f(y|θ),    s = 1, . . . , S
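
   A hedged R sketch of indirect inference (not from the slides; the MA(1)
   model and the lag-one autocorrelation of an AR(1) pseudo-model as the
   auxiliary estimator β̂ are illustration choices):

   # Sketch: indirect inference for y_t = eps_t + theta*eps_{t-1}
   set.seed(2)
   n      <- 500
   theta0 <- 0.6
   yobs   <- arima.sim(list(ma = theta0), n = n)                    # "observed" series
   betahat <- function(y) acf(y, lag.max = 1, plot = FALSE)$acf[2]  # auxiliary estimator
   S   <- 10
   eps <- matrix(rnorm(S * (n + 1)), nrow = S)                      # common random numbers
   simbeta <- function(theta) {
     b <- numeric(S)
     for (s in 1:S) {
       e <- eps[s, ]
       y <- e[-1] + theta * e[-(n + 1)]         # simulate MA(1) series with fixed noise
       b[s] <- betahat(y)
     }
     mean(b)
   }
   crit <- function(theta) (betahat(yobs) - simbeta(theta))^2
   optimize(crit, interval = c(-0.95, 0.95))$minimum   # indirect inference estimate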
Consistent indirect inference



           ...in order to get a unique solution the dimension of
       the auxiliary parameter β must be larger than or equal to
       the dimension of the initial parameter θ. If the problem is
       just identified the different methods become easier...

   Consistency depending on the criterion and on the asymptotic
   identifiability of θ
                                 [Gouriéroux & Monfort, 1996, p. 66]
Choice of pseudo-model




   Arbitrariness of the pseudo-model
   Pick the pseudo-model such that
     1. β̂(θ) is not flat (i.e. sensitive to changes in θ)
     2. β̂(θ) is not dispersed (i.e. robust against changes in y_s(θ))
                                           [Frigessi & Heggland, 2004]
Empirical likelihood

   Another approximation method (not yet related with simulation)

   Definition
   For a dataset y = (y_1, . . . , y_n) and a parameter of interest θ, pick
   constraints
                            E[h(Y, θ)] = 0
   uniquely identifying θ and define the empirical likelihood as

          L_el(θ|y) = max_p Π_{i=1}^n p_i

   for p in the set { p ∈ [0, 1]^n : Σ_i p_i = 1, Σ_i p_i h(y_i, θ) = 0 }.

                                                           [Owen, 1988]
Empirical likelihood



   Another approximation method (not yet related with simulation)

   Example
   When θ = Ef [Y ], empirical likelihood is the maximum of

                                p1 · · · pn

   under constraint
                         p1 y1 + . . . + pn yn = θ
ABCel


  Another approximation method (now related with simulation!)
  Importance sampling implementation
  Algorithm 1: Raw ABCel sampler
    Given observation y
    for i = 1 to M do
      Generate θ i from the prior distribution π(·)
      Set the weight ωi = Lel (θ i |y)
    end for
    Proceed with pairs (θi , ωi ) as in regular importance sampling

                                 [Mengersen, Pudlo & Robert, 2012]
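
   A hedged R sketch of the raw ABCel sampler (not from the slides; the mean
   constraint h(y, θ) = y − θ, the Lagrange-multiplier solution of the
   empirical-likelihood program, and the N(0, 10²) prior are illustration choices):

   # Sketch: empirical likelihood for a mean, then the raw ABCel importance sampler
   logEL.mean <- function(theta, y) {
     d <- y - theta
     if (min(d) >= 0 || max(d) <= 0) return(-Inf)   # theta outside the convex hull
     lo <- -1 / max(d) + 1e-8
     hi <- -1 / min(d) - 1e-8
     g  <- function(lam) sum(d / (1 + lam * d))     # Lagrange first-order condition
     lam <- uniroot(g, c(lo, hi))$root
     p   <- 1 / (length(y) * (1 + lam * d))         # EL weights p_i
     sum(log(p))
   }
   set.seed(3)
   y <- rnorm(50, mean = 1.5)                       # data
   M <- 1e4
   theta <- rnorm(M, sd = 10)                       # theta_i from the prior
   logw  <- sapply(theta, logEL.mean, y = y)        # log L_el(theta_i | y)
   omega <- exp(logw - max(logw))                   # stabilised importance weights
   sum(omega * theta) / sum(omega)                  # importance-sampling posterior mean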
A?B?C?




    A stands for approximate
    [wrong likelihood]
    B stands for Bayesian
    C stands for computation
    [producing a parameter
    sample]
How much Bayesian?




      asymptotically so (meaningful?)
      approximation error unknown (w/o simulation)
      pragmatic Bayes (there is no other solution!)
Approximate Bayesian computation



   Introduction

   Approximate Bayesian computation
      Genetics of ABC
      ABC basics
      Advances and interpretations
      Alphabet (compu-)soup

   ABC as an inference machine
Genetic background of ABC




   ABC is a recent computational technique that only requires being
   able to sample from the likelihood f (·|θ)
   This technique stemmed from population genetics models, about
   15 years ago, and population geneticists still contribute
   significantly to methodological developments of ABC.
                             [Griffiths & al., 1997; Tavaré & al., 1999]
Demo-genetic inference



   Each model is characterized by a set of parameters θ that cover
   historical (divergence times, admixture times, ...), demographic
   (population sizes, admixture rates, migration rates, ...) and genetic
   (mutation rates, ...) factors
   The goal is to estimate these parameters from a dataset of
   polymorphism (DNA sample) y observed at the present time

   Problem:
   most of the time, we cannot calculate the likelihood of the
   polymorphism data f (y|θ).
A genuine example of application
   Pygmy populations: do they have a common origin? Was there much
   exchange between Pygmy and non-Pygmy populations?
Alternative scenarios

         Different possible scenarios; choice of scenario by ABC

                                                              Verdu et al. 2009
Intractable likelihood

   Missing (too missing!) data structure:

          f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG

   cannot be computed in a manageable way...
   The genealogies are considered as nuisance parameters
       This modelling clearly differs from the phylogenetic perspective
       where the tree is the parameter of interest.
Intractable likelihood

So, what can we do when the
likelihood function f (y|θ) is
well-defined but impossible / too
costly to compute...?
    MCMC cannot be implemented!
    shall we give up Bayesian
    inference altogether?!
    or settle for an almost Bayesian
    inference...?
ABC methodology

  Bayesian setting: target is π(θ)f (x|θ)
  When likelihood f (x|θ) not in closed form, likelihood-free rejection
  technique:
  Foundation
  For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps
  jointly simulating
                       θ' ∼ π(θ),   z ∼ f (z|θ'),
  until the auxiliary variable z is equal to the observed value, z = y,
  then the selected
                               θ' ∼ π(θ|y)

          [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
Why does it work?!



   The proof is trivial:

          f(θ_i) ∝ Σ_{z∈D} π(θ_i) f(z|θ_i) I_y(z)
                 ∝ π(θ_i) f(y|θ_i)
                 ∝ π(θ_i |y) .

                                                  [Accept–Reject 101]
A as A...pproximative



   When y is a continuous random variable, strict equality z = y is
   replaced with a tolerance zone

                              ρ(y, z) ≤ ε

   where ρ is a distance
   Output distributed from

          π(θ) P_θ{ ρ(y, z) < ε }  ∝  π(θ | ρ(y, z) < ε )   [by definition]

                                               [Pritchard et al., 1999]
ABC algorithm


  In most implementations, a further degree of A...pproximation:

  Algorithm 1 Likelihood-free rejection sampler
    for i = 1 to N do
      repeat
         generate θ' from the prior distribution π(·)
         generate z from the likelihood f (·|θ')
      until ρ{η(z), η(y)} ≤ ε
      set θ_i = θ'
    end for

  where η(y) defines a (not necessarily sufficient) statistic
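
  A hedged R sketch of this rejection sampler (not from the slides; the toy
  model y_1, . . . , y_n ∼ N(θ, 1), the N(0, 10²) prior, the sample mean as η
  and the quantile-based tolerance are illustration choices):

   # Sketch: likelihood-free rejection sampler on a toy normal model
   set.seed(4)
   n      <- 20
   yobs   <- rnorm(n, mean = 2)                     # observed data
   etaobs <- mean(yobs)                             # eta(y)
   N      <- 1e5
   theta  <- rnorm(N, sd = 10)                      # theta' ~ prior
   z      <- matrix(rnorm(N * n, mean = rep(theta, each = n)), nrow = n)  # z ~ f(.|theta')
   etaz   <- colMeans(z)                            # eta(z) for each simulated dataset
   rho    <- abs(etaz - etaobs)                     # distance between summaries
   eps    <- quantile(rho, 0.001)                   # tolerance as a small quantile
   abc    <- theta[rho <= eps]                      # accepted parameter sample
   c(mean = mean(abc), sd = sd(abc))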
Output


  The likelihood-free algorithm samples from the marginal in z of:

          π_ε(θ, z|y) = π(θ) f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ,

  where A_{ε,y} = {z ∈ D : ρ(η(z), η(y)) < ε}.
  The idea behind ABC is that the summary statistics coupled with a
  small tolerance should provide a good approximation of the
  posterior distribution:

          π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y) .


                                                               ...does it?!
Convergence of ABC (first attempt)

   What happens when ε → 0?
   If f (·|θ) is continuous in y, uniformly in θ [!], given an arbitrary
   δ > 0, there exists ε_0 such that ε < ε_0 implies

      π(θ) ∫ f(z|θ) I_{A_{ε,y}}(z) dz / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ
            ∈  π(θ) f(y|θ) (1 ∓ δ) µ(B_ε) / [ ∫_Θ π(θ) f(y|θ) dθ (1 ± δ) µ(B_ε) ]

   where the µ(B_ε) factors cancel.

                    [Proof extends to other continuous-in-0 kernels K ]
Convergence of ABC (second attempt)

   What happens when ε → 0?
   For B ⊂ Θ, we have

   ∫_B [ ∫_{A_{ε,y}} f(z|θ) dz / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] π(θ) dθ
      = ∫_{A_{ε,y}} [ ∫_B f(z|θ) π(θ) dθ / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] dz
      = ∫_{A_{ε,y}} [ ∫_B f(z|θ) π(θ) dθ / m(z) ] [ m(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] dz
      = ∫_{A_{ε,y}} π(B|z) [ m(z) / ∫_{A_{ε,y}×Θ} π(θ) f(z|θ) dz dθ ] dz

   which indicates convergence for a continuous π(B|z).
Convergence (do not attempt!)


   ...and the above does not apply to insufficient statistics:
   If η(y) is not a sufficient statistic, the best one can hope for is

                         π(θ|η(y)),   not π(θ|y)

   If η(y) is an ancillary statistic, the whole information contained in
   y is lost!, and the “best” one can hope for is

                             π(θ|η(y)) = π(θ)

                                                               Bummer!!!
Probit modelling on Pima Indian women

   Example (R benchmark)
   200 Pima Indian women with observed variables
          plasma glucose concentration in oral glucose tolerance test
          diastolic blood pressure
          diabetes pedigree function
          presence/absence of diabetes


   Probability of diabetes as function of variables

          P(y = 1|x) = Φ(x_1 β_1 + x_2 β_2 + x_3 β_3 ) ,

   200 observations of Pima.tr and g-prior modelling:

          β ∼ N_3( 0, n (X^T X)^{−1} )

   importance function inspired from the MLE distribution

          β ∼ N( β̂, Σ̂ )
Pima Indian benchmark

   [Figure: Comparison between density estimates of the marginals on β1
   (left), β2 (center) and β3 (right) from ABC rejection samples (red) and
   MCMC samples (black).]
MA example



  MA(q) model

          x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i}

  Simple prior: uniform over the inverse [real and complex] roots in

          Q(u) = 1 − Σ_{i=1}^q ϑ_i u^i

  under the identifiability conditions
MA example




  MA(q) model

          x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i}

  Simple prior: uniform prior over the identifiability zone, i.e. the triangle
  for MA(2)
MA example (2)

  ABC algorithm thus made of
    1. picking a new value (ϑ_1, ϑ_2) in the triangle
    2. generating an iid sequence (ε_t)_{−q<t≤T}
    3. producing a simulated series (x'_t)_{1≤t≤T}
  Distance: basic distance between the series

          ρ( (x_t)_{1≤t≤T}, (x'_t)_{1≤t≤T} ) = Σ_{t=1}^T (x_t − x'_t)²

  or distance between summary statistics like the q = 2
  autocorrelations
          τ_j = Σ_{t=j+1}^T x_t x_{t−j}
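
  A hedged R sketch of this ABC scheme for the MA(2) model (not from the
  slides; the invertibility triangle is written here as −2 < ϑ1 < 2,
  ϑ1 + ϑ2 > −1, ϑ1 − ϑ2 < 1, and the quantile-based tolerance is an
  illustration choice):

   # Sketch: ABC for x_t = eps_t + th1*eps_{t-1} + th2*eps_{t-2}
   set.seed(5)
   Tn   <- 100
   th0  <- c(0.6, 0.2)
   xobs <- arima.sim(list(ma = th0), n = Tn)
   tau  <- function(x) c(sum(x[-1] * x[-Tn]),                        # tau_1
                         sum(x[-(1:2)] * x[-((Tn - 1):Tn)]))         # tau_2
   tobs <- tau(xobs)
   N   <- 1e4
   th1 <- runif(N, -2, 2); th2 <- runif(N, -1, 1)
   keep <- (th1 + th2 > -1) & (th1 - th2 < 1)        # restrict to the triangle
   th1 <- th1[keep]; th2 <- th2[keep]
   dst <- sapply(seq_along(th1), function(i) {
     e <- rnorm(Tn + 2)
     x <- e[-(1:2)] + th1[i] * e[2:(Tn + 1)] + th2[i] * e[1:Tn]      # simulated series
     sum((tau(x) - tobs)^2)                                          # summary distance
   })
   eps.tol <- quantile(dst, 0.01)
   abc <- cbind(th1, th2)[dst <= eps.tol, ]          # ABC posterior sample
   colMeans(abc)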
Comparison of distance impact




   Evaluation of the tolerance on the ABC sample against both
   distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
Comparison of distance impact

   [Figure: ABC samples for θ1 (left) and θ2 (right) at tolerances
   ε = 100%, 10%, 1%, 0.1% for the MA(2) model.]
Comments




      Role of the distance paramount (because ε ≠ 0)
      ε matters little if “small enough”
      representative of the “curse of dimensionality”
      the data as a whole may be paradoxically weakly informative
      for ABC
ABC (simul’) advances


                                                        how approximative is ABC?

   Simulating from the prior is often poor in efficiency
   Either modify the proposal distribution on θ to increase the density
   of x’s within the vicinity of y ...
        [Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007]

   ...or view the problem as conditional density estimation
   and develop techniques that allow for a larger ε
                                               [Beaumont et al., 2002]

   .....or even include ε in the inferential framework [ABCµ]
                                                    [Ratmann et al., 2009]
ABC-NP

  Better usage of [prior] simulations by
  adjustment: instead of throwing away
  θ' such that ρ(η(z), η(y)) > ε, replace
  the θ's with locally regressed transforms

          θ* = θ − {η(z) − η(y)}^T β̂           [Csilléry et al., TEE, 2010]

   where β̂ is obtained by [NP] weighted least squares regression on
   (η(z) − η(y)) with weights

                            K_δ {ρ(η(z), η(y))}

                                      [Beaumont et al., 2002, Genetics]
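
   A hedged R sketch of this local-linear adjustment (not from the slides; the
   toy normal model of the earlier sketch, an Epanechnikov kernel and a 5%
   quantile tolerance are illustration choices):

   # Sketch: Beaumont et al. (2002) regression adjustment on a toy normal model
   set.seed(6)
   n <- 20; yobs <- rnorm(n, mean = 2); etaobs <- mean(yobs)
   N     <- 1e5
   theta <- rnorm(N, sd = 10)
   etaz  <- colMeans(matrix(rnorm(N * n, mean = rep(theta, each = n)), nrow = n))
   rho   <- abs(etaz - etaobs)
   delta <- quantile(rho, 0.05)                       # keep the closest 5%
   keep  <- rho <= delta
   w     <- 1 - (rho[keep] / delta)^2                 # Epanechnikov weights K_delta
   fit   <- lm(theta[keep] ~ I(etaz[keep] - etaobs), weights = w)
   thetastar <- theta[keep] - coef(fit)[2] * (etaz[keep] - etaobs)   # adjusted draws
   c(raw = mean(theta[keep]), adjusted = weighted.mean(thetastar, w))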
ABC-NP (regression)



   Also found in the subsequent literature, e.g. in Fearnhead & Prangle (2012):
   weight the simulations directly by

                           K_δ {ρ(η(z(θ)), η(y))}

   or
                           (1/S) Σ_{s=1}^S K_δ {ρ(η(z_s(θ)), η(y))}

                                       [consistent estimate of f (η|θ)]
   Curse of dimensionality: poor estimate when d = dim(η) is large...
ABC-NP (density estimation)



   Use of the kernel weights

                             K_δ {ρ(η(z(θ)), η(y))}

   leads to the NP estimate of the posterior expectation

          Σ_i θ_i K_δ {ρ(η(z(θ_i)), η(y))} / Σ_i K_δ {ρ(η(z(θ_i)), η(y))}

                                                          [Blum, JASA, 2010]
ABC-NP (density estimation)



   Use of the kernel weights

                           K_δ {ρ(η(z(θ)), η(y))}

   leads to the NP estimate of the posterior conditional density

          Σ_i K̃_b(θ_i − θ) K_δ {ρ(η(z(θ_i)), η(y))} / Σ_i K_δ {ρ(η(z(θ_i)), η(y))}

                                                     [Blum, JASA, 2010]
ABC-NP (density estimations)

   Other versions incorporating regression adjustments

          Σ_i K̃_b(θ*_i − θ) K_δ {ρ(η(z(θ_i)), η(y))} / Σ_i K_δ {ρ(η(z(θ_i)), η(y))}

   In all cases, error

      E[ĝ(θ|y)] − g(θ|y) = c_b b² + c_δ δ² + O_P(b² + δ²) + O_P(1/nδ^d)

              var(ĝ(θ|y)) = c/(n b δ^d) (1 + o_P(1))

                                 [Blum, JASA, 2010; standard NP calculations]
ABC-NCH


  Incorporating non-linearities and heteroscedasticities:

          θ* = m̂(η(y)) + [θ − m̂(η(z))] σ̂(η(y)) / σ̂(η(z))

  where
      m̂(η) is estimated by non-linear regression (e.g., a neural network)
      σ̂(η) is estimated by non-linear regression on the residuals

          log{θ_i − m̂(η_i)}² = log σ²(η_i) + ξ_i

                                                [Blum & François, 2009]
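
   A hedged R sketch of this non-linear, heteroscedastic adjustment (not from
   the slides; a loess smoother stands in for the neural network of Blum &
   François, on the toy normal model used above):

   # Sketch: ABC-NCH adjustment with a generic smoother replacing the neural net
   set.seed(7)
   n <- 20; yobs <- rnorm(n, mean = 2); etaobs <- mean(yobs)
   N     <- 2e4
   theta <- rnorm(N, sd = 10)
   etaz  <- colMeans(matrix(rnorm(N * n, mean = rep(theta, each = n)), nrow = n))
   keep  <- abs(etaz - etaobs) <= quantile(abs(etaz - etaobs), 0.1)
   th    <- theta[keep]; et <- etaz[keep]
   mfit  <- loess(th ~ et)                            # m-hat: conditional mean of theta
   mhat  <- predict(mfit, newdata = data.frame(et = et))
   sfit  <- loess(log((th - mhat)^2) ~ et)            # log sigma^2-hat from residuals
   sdet  <- exp(predict(sfit, newdata = data.frame(et = et)) / 2)
   m0    <- predict(mfit, newdata = data.frame(et = etaobs))
   s0    <- exp(predict(sfit, newdata = data.frame(et = etaobs)) / 2)
   thetastar <- m0 + (th - mhat) * s0 / sdet          # adjusted parameter values
   c(raw = mean(th), adjusted = mean(thetastar))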
ABC-NCH (2)



  Why a neural network?
      fights the curse of dimensionality
      selects relevant summary statistics
      provides automated dimension reduction
      offers a model choice capability
      improves upon multinomial logistic regression
                                             [Blum & François, 2009]
ABC-MCMC



                                                      how approximative is ABC?

  Markov chain (θ^{(t)}) created via the transition function

   θ^{(t+1)} =  θ'       ∼ K_ω(θ'|θ^{(t)})   if x ∼ f(x|θ') is such that x = y
                                             and u ∼ U(0, 1) ≤ π(θ') K_ω(θ^{(t)}|θ') / [π(θ^{(t)}) K_ω(θ'|θ^{(t)})],
                θ^{(t)}                      otherwise,

  has the posterior π(θ|y) as stationary distribution
                                                 [Marjoram et al, 2003]
ABC-MCMC (2)


  Algorithm 2 Likelihood-free MCMC sampler
    Use Algorithm 1 to get (θ^{(0)}, z^{(0)})
    for t = 1 to N do
      Generate θ' from K_ω(·|θ^{(t−1)}),
      Generate z' from the likelihood f (·|θ'),
      Generate u from U_{[0,1]},
      if u ≤ [ π(θ') K_ω(θ^{(t−1)}|θ') / π(θ^{(t−1)}) K_ω(θ'|θ^{(t−1)}) ] I_{A_{ε,y}}(z') then
         set (θ^{(t)}, z^{(t)}) = (θ', z')
      else
         (θ^{(t)}, z^{(t)}) = (θ^{(t−1)}, z^{(t−1)}),
      end if
    end for
Why does it work?

   Acceptance probability does not involve calculating the likelihood,
   since

      π_ε(θ', z'|y) / π_ε(θ^{(t−1)}, z^{(t−1)}|y)
          × q(θ^{(t−1)}|θ') f(z^{(t−1)}|θ^{(t−1)}) / [ q(θ'|θ^{(t−1)}) f(z'|θ') ]

      = [ π(θ') f(z'|θ') I_{A_{ε,y}}(z') / π(θ^{(t−1)}) f(z^{(t−1)}|θ^{(t−1)}) I_{A_{ε,y}}(z^{(t−1)}) ]
          × q(θ^{(t−1)}|θ') f(z^{(t−1)}|θ^{(t−1)}) / [ q(θ'|θ^{(t−1)}) f(z'|θ') ]

      = π(θ') q(θ^{(t−1)}|θ') / [ π(θ^{(t−1)}) q(θ'|θ^{(t−1)}) ]  ×  I_{A_{ε,y}}(z') ,

   the likelihoods f(z'|θ') and f(z^{(t−1)}|θ^{(t−1)}) cancelling, while
   I_{A_{ε,y}}(z^{(t−1)}) = 1 for the current (accepted) state.
A toy example




   Case of
                       x ∼ (1/2) N(θ, 1) + (1/2) N(−θ, 1)
   under prior θ ∼ N(0, 10)

   ABC sampler
   thetas=rnorm(N,sd=10)
   zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1)
   eps=quantile(abs(zed-x),.01)
   abc=thetas[abs(zed-x)<eps]   # keep the 1% of draws closest to x
A toy example


   Case of
                       x ∼ (1/2) N(θ, 1) + (1/2) N(−θ, 1)
   under prior θ ∼ N(0, 10)

   ABC-MCMC sampler
   metas=rep(0,N)
   zed=rep(0,N)
   metas[1]=rnorm(1,sd=10)
   zed[1]=x
   for (t in 2:N){
     metas[t]=rnorm(1,mean=metas[t-1],sd=5)
     zed[t]=rnorm(1,mean=(1-2*(runif(1)<.5))*metas[t],sd=1)
     if ((abs(zed[t]-x)>eps)||(runif(1)>dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){
      metas[t]=metas[t-1]
      zed[t]=zed[t-1]}
     }
A toy example

   [Figure: ABC output for x = 2, θ ranging over (−4, 4).]

A toy example

   [Figure: ABC output for x = 10, θ ranging over (−10, 10).]

A toy example

   [Figure: ABC output for x = 50, θ ranging over (−40, 40).]
ABC-PMC

  Use of a transition kernel as in population Monte Carlo with
  manageable IS correction
  Generate a sample at iteration t by

          π̂_t(θ^{(t)}) ∝ Σ_{j=1}^N ω_j^{(t−1)} K_t(θ^{(t)} | θ_j^{(t−1)})

  modulo acceptance of the associated x_t, and use an importance
  weight associated with an accepted simulation θ_i^{(t)},

          ω_i^{(t)} ∝ π(θ_i^{(t)}) / π̂_t(θ_i^{(t)}) .

  Still likelihood-free
                                                       [Beaumont et al., 2009]
Our ABC-PMC algorithm

  Given a decreasing sequence of approximation levels ε_1 ≥ . . . ≥ ε_T,

    1. At iteration t = 1,
            For i = 1, . . . , N
                 Simulate θ_i^{(1)} ∼ π(θ) and x ∼ f(x|θ_i^{(1)}) until ρ(x, y) < ε_1
                 Set ω_i^{(1)} = 1/N
       Take τ² as twice the empirical variance of the θ_i^{(1)}'s
    2. At iteration 2 ≤ t ≤ T,
            For i = 1, . . . , N, repeat
                 Pick θ*_i from the θ_j^{(t−1)}'s with probabilities ω_j^{(t−1)}
                 Generate θ_i^{(t)} | θ*_i ∼ N(θ*_i, σ_t²) and x ∼ f(x|θ_i^{(t)})
            until ρ(x, y) < ε_t
            Set ω_i^{(t)} ∝ π(θ_i^{(t)}) / Σ_{j=1}^N ω_j^{(t−1)} ϕ( σ_t^{−1}( θ_i^{(t)} − θ_j^{(t−1)} ) )
       Take τ²_{t+1} as twice the weighted empirical variance of the θ_i^{(t)}'s
Sequential Monte Carlo


   SMC is a simulation technique to approximate a sequence of
   related probability distributions π_n, with π_0 “easy” and π_T as
   target.
   Iterated IS as in PMC: particles are moved from time n−1 to time n via a
   kernel K_n, using a sequence of extended targets π̃_n,

          π̃_n(z_{0:n}) = π_n(z_n) Π_{j=0}^{n−1} L_j(z_{j+1}, z_j)

   where the L_j's are backward Markov kernels [check that π_n(z_n) is a
   marginal]
                           [Del Moral, Doucet & Jasra, Series B, 2006]
Sequential Monte Carlo (2)

   Algorithm 3 SMC sampler
     sample z_i^{(0)} ∼ γ_0(x) (i = 1, . . . , N)
     compute weights w_i^{(0)} = π_0(z_i^{(0)}) / γ_0(z_i^{(0)})
     for t = 1 to N do
       if ESS(w^{(t−1)}) < N_T then
          resample N particles z^{(t−1)} and set weights to 1
       end if
       generate z_i^{(t)} ∼ K_t(z_i^{(t−1)}, ·) and set weights to

          w_i^{(t)} = w_i^{(t−1)} π_t(z_i^{(t)}) L_{t−1}(z_i^{(t)}, z_i^{(t−1)}) / [ π_{t−1}(z_i^{(t−1)}) K_t(z_i^{(t−1)}, z_i^{(t)}) ]

     end for

                              [Del Moral, Doucet & Jasra, Series B, 2006]
ABC-SMC

                                      [Del Moral, Doucet & Jasra, 2009]

  True derivation of an SMC-ABC algorithm
  Use of a kernel Kn associated with target πεn and derivation of the
  backward kernel

                     Ln−1 (z, z′) = πεn (z′) Kn (z′, z) / πεn (z)

  Update of the weights

       win ∝ wi(n−1)  [ Σ_{m=1}^M I_{Aεn}(xin^m) ] / [ Σ_{m=1}^M I_{Aεn−1}(xi(n−1)^m) ]

  when xin^m ∼ K (xi(n−1), ·)
ABC-SMCM




  Modification: make M repeated simulations of the pseudo-data z
  given the parameter, rather than using a single [M = 1] simulation,
  leading to a weight proportional to the number of accepted zi's

                   ω(θ) = (1/M) Σ_{i=1}^M I_{ρ(η(y),η(zi)) < ε}

  [limit in M means exact simulation from (tempered) target]
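In code the ABC-SMCM weight is just the proportion of the M pseudo-datasets falling within the tolerance. A minimal helper, with user-supplied simulate and distance functions (illustrative names, not a fixed interface; the summary statistic η can be folded into distance):

import numpy as np

def abc_smcm_weight(theta, y_obs, simulate, distance, eps, M=25, rng=None):
    # omega(theta) = (1/M) sum_i I{ distance(z_i, y_obs) < eps }
    # with z_1, ..., z_M simulated from the model at parameter theta
    rng = rng or np.random.default_rng()
    hits = sum(distance(simulate(theta, rng), y_obs) < eps for _ in range(M))
    return hits / M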
Properties of ABC-SMC



   The ABC-SMC method properly uses a backward kernel L(z, z′) to
   simplify the importance weight and to remove the dependence on
   the unknown likelihood from this weight. Update of importance
   weights is reduced to the ratio of the proportions of surviving
   particles
   Major assumption: the forward kernel K is supposed to be invariant
   against the true target [tempered version of the true posterior]
   Adaptivity in the ABC-SMC algorithm is only found in the on-line
   construction of the thresholds εt , slowly enough to keep a large
   number of accepted transitions
A mixture example (1)


   Toy model of Sisson et al. (2007): if

        θ ∼ U(−10, 10) ,   x|θ ∼ 0.5 N (θ, 1) + 0.5 N (θ, 1/100) ,

   then the posterior distribution associated with y = 0 is the normal
   mixture
                θ|y = 0 ∼ 0.5 N (0, 1) + 0.5 N (0, 1/100)
   restricted to [−10, 10].
    Furthermore, the true target is available as

    π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)) .
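As a quick numerical check, a short Python sketch (assuming scipy is available) evaluates this closed-form target on a grid and draws a plain rejection-ABC sample to compare against; the tolerance ε = 0.5 is an arbitrary choice.

import numpy as np
from scipy.stats import norm

eps = 0.5
grid = np.linspace(-3, 3, 601)
target = (norm.cdf(eps - grid) - norm.cdf(-eps - grid)
          + norm.cdf(10 * (eps - grid)) - norm.cdf(-10 * (eps + grid)))
target /= np.trapz(target, grid)          # normalise over the plotted range

rng = np.random.default_rng(1)
thetas = rng.uniform(-10, 10, 200_000)    # prior draws
scales = np.where(rng.random(thetas.size) < 0.5, 1.0, 0.1)
x = rng.normal(thetas, scales)            # pseudo-data from the mixture
abc_sample = thetas[np.abs(x) < eps]      # accept when |x - 0| < eps
print(abc_sample.size, "accepted draws; their histogram should match `target`")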
A mixture example (2)

   Recovery of the target, whether using a fixed standard deviation of
   τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τt ’s.
   [Figure: five panels showing the recovered posterior density of θ over (−3, 3), one per choice of the kernel scale]
ABC inference machine



   Introduction

   Approximate Bayesian computation

   ABC as an inference machine
     Error inc.
     Exact BC and approximate targets
     summary statistic
      Series B discussion
How much Bayesian?


    maybe a convergent method
    of inference (meaningful?
    sufficient? foreign?)
    approximation error
    unknown (w/o simulation)
    pragmatic Bayes (there is no
    other solution!)
    many calibration issues
    (tolerance, distance,
    statistics)

                                   ...should Bayesians care?!
ABCµ



  Idea Infer about the error ε as well:
  Use of a joint density

                  f (θ, ε|y) ∝ ξ(ε|y, θ) × πθ (θ) × πε (ε)

  where y is the data, and ξ(ε|y, θ) is the prior predictive density of
  ρ(η(z), η(y)) given θ and y when z ∼ f (z|θ)
  Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel
  approximation.
             [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
ABCµ details


    Multidimensional distances ρk (k = 1, . . . , K ) and errors
    εk = ρk (ηk (z), ηk (y)), with

     εk ∼ ξk (ε|y, θ) ≈ ξ̂k (ε|y, θ) = (1/Bhk) Σ_b K [{εk − ρk (ηk (zb ), ηk (y))}/hk ]

    then used in replacing ξ(ε|y, θ) with mink ξ̂k (ε|y, θ)
    ABCµ involves acceptance probability

      [ π(θ′, ε′) q(θ′, θ) q(ε′, ε) mink ξ̂k (ε′|y, θ′) ] / [ π(θ, ε) q(θ, θ′) q(ε, ε′) mink ξ̂k (ε|y, θ) ]
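For concreteness, a minimal sketch of the kernel estimate ξ̂k above, with a Gaussian kernel as an illustrative choice; rho_samples stands for the distances ρk(ηk(zb), ηk(y)) computed from B pseudo-datasets zb ∼ f(·|θ).

import numpy as np

def xi_hat(eps_k, rho_samples, h_k):
    # kernel density estimate of the error density xi_k(eps | y, theta)
    u = (eps_k - np.asarray(rho_samples)) / h_k
    return float(np.exp(-0.5 * u ** 2).sum()
                 / (np.sqrt(2.0 * np.pi) * h_k * len(rho_samples)))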
ABCµ multiple errors




                       [ c Ratmann et al., PNAS, 2009]
ABCµ for model choice




                        [ c Ratmann et al., PNAS, 2009]
Questions about ABCµ [and model choice]


    For each model under comparison, marginal posterior on ε used to
    assess the fit of the model (HPD includes 0 or not).
        Is the data informative about ε? [Identifiability]
        How much does the prior π(ε) impact the comparison?
        How is using both ξ(ε|x0 , θ) and πε (ε) compatible with a
        standard probability model? [remindful of Wilkinson’s eABC ]
        Where is the penalisation for complexity in the model
        comparison?
                                  [X, Mengersen & Chen, 2010, PNAS]
Wilkinson’s exact BC (not exactly!)

    ABC approximation error (i.e. non-zero tolerance) replaced with
    exact simulation from a controlled approximation to the target,
    convolution of the true posterior with a kernel function

        πε (θ, z|y) = π(θ) f (z|θ) Kε (y − z) / ∫ π(θ) f (z|θ) Kε (y − z) dz dθ ,

    with Kε a kernel parameterised by bandwidth ε.
                                                      [Wilkinson, 2008]

    Theorem
    The ABC algorithm based on the assumption of a randomised
    observation y = ỹ + ξ, ξ ∼ Kε , and an acceptance probability of

                                Kε (y − z)/M

    gives draws from the posterior distribution π(θ|y).
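A sketch of the corresponding sampler in Python, with prior_sample, simulate and kernel as user-supplied callables (illustrative names); the kernel Kε must be bounded by M for the acceptance probability Kε(y − z)/M to be valid.

import numpy as np

def wilkinson_abc_draw(y_obs, prior_sample, simulate, kernel, M, rng=None):
    # returns one draw from the posterior of the noise-convolved model
    rng = rng or np.random.default_rng()
    while True:
        theta = prior_sample(rng)
        z = simulate(theta, rng)
        if rng.random() < kernel(y_obs - z) / M:   # accept w.p. K_eps(y - z)/M
            return theta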
How exact a BC?




       “Using ε to represent measurement error is
       straightforward, whereas using ε to model the model
       discrepancy is harder to conceptualize and not as
       commonly used”
                                        [Richard Wilkinson, 2008]
How exact a BC?

  Pros
         Pseudo-data from true model and observed data from noisy
         model
         Interesting perspective in that outcome is completely
         controlled
         Link with ABCµ and assuming y is observed with a
         measurement error with density K
         Relates to the theory of model approximation
                                               [Kennedy  O’Hagan, 2001]
  Cons
         Requires K to be bounded by M
         True approximation error never assessed
         Requires a modification of the standard ABC algorithm
ABC for HMMs



  Specific case of a hidden Markov model

                             Xt+1 ∼ Qθ (Xt , ·)
                             Yt+1 ∼ gθ (·|xt )

  where only y1:n^0 is observed.
                                   [Dean, Singh, Jasra, & Peters, 2011]
  Use of specific constraints, adapted to the Markov structure:

                 y1 ∈ B(y1^0 , ε) × · · · × yn ∈ B(yn^0 , ε)
ABC-MLE for HMMs


  ABC-MLE defined by

         θ̂n = arg maxθ Pθ( Y1 ∈ B(y1^0 , ε), . . . , Yn ∈ B(yn^0 , ε) )

  Exact MLE for the likelihood: same basis as Wilkinson!

                            pθ^ε (y1^0 , . . . , yn^0 )

  corresponding to the perturbed process

                 (xt , yt + ε zt )1≤t≤n ,     zt ∼ U(B(0, 1))

                                  [Dean, Singh, Jasra, & Peters, 2011]
ABC-MLE is biased



      ABC-MLE is asymptotically (in n) biased with target

                      lε (θ) = Eθ∗ [log pθ^ε (Y1 |Y−∞:0 )]

      but ABC-MLE converges to the true value in the sense

                              lε^n (θn ) → lε (θ)

      for all sequences (θn ) converging to θ and tolerances εn
Noisy ABC-MLE



  Idea: Modify instead the data from the start

                       (y1^0 + ε ζ1 , . . . , yn^0 + ε ζn )

                                                      [see Fearnhead-Prangle]
  noisy ABC-MLE estimate

      arg maxθ Pθ( Y1 ∈ B(y1^0 + ε ζ1 , ε), . . . , Yn ∈ B(yn^0 + ε ζn , ε) )

                                  [Dean, Singh, Jasra, & Peters, 2011]
Consistent noisy ABC-MLE




      Degrading the data improves the estimation performances:
          Noisy ABC-MLE is asymptotically (in n) consistent
          under further assumptions, the noisy ABC-MLE is
          asymptotically normal
          increase in variance of order ε^−2
      likely degradation in precision or computing time due to the
      lack of summary statistic [curse of dimensionality]
SMC for ABC likelihood

   Algorithm 4 SMC ABC for HMMs
     Given θ
     for k = 1, . . . , n do
       generate proposals (xk^1 , yk^1 ), . . . , (xk^N , yk^N ) from the model
       weigh each proposal with ωk^l = I_{B(yk^0 + ζk , ε)} (yk^l )
       renormalise the weights and sample the xk^l ’s accordingly
     end for
     approximate the likelihood by

                        ∏_{k=1}^n ( Σ_{l=1}^N ωk^l / N )

                                    [Jasra, Singh, Martin, & McCoy, 2010]
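A Python sketch of this SMC approximation to the ABC likelihood; init, transition and emit are user-supplied samplers for the hidden chain and the observations (assumed interfaces, not taken from the cited paper), and y_obs holds the observations, possibly already noise-perturbed as in noisy ABC-MLE.

import numpy as np

def smc_abc_loglik(theta, y_obs, init, transition, emit, eps, N=1000, rng=None):
    rng = rng or np.random.default_rng()
    x = init(N, theta, rng)                      # N copies of the hidden state
    loglik = 0.0
    for y_k in y_obs:
        x = transition(x, theta, rng)            # propose x_k^1, ..., x_k^N
        y_sim = emit(x, theta, rng)              # propose y_k^1, ..., y_k^N
        w = (np.abs(y_sim - y_k) < eps).astype(float)   # indicator weights
        if w.sum() == 0:
            return -np.inf                       # every proposal rejected
        loglik += np.log(w.mean())               # factor sum_l w_k^l / N
        idx = rng.choice(N, size=N, p=w / w.sum())
        x = x[idx]                               # resample surviving particles
    return loglik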
Which summary?


   Fundamental difficulty of the choice of the summary statistic when
   there is no non-trivial sufficient statistic
   Starting from a large collection of available summary statistics,
   Joyce and Marjoram (2008) consider their sequential inclusion into
   the ABC target, with a stopping rule based on a likelihood ratio
   test
       Not taking into account the sequential nature of the tests
       Depends on parameterisation
       Order of inclusion matters
       likelihood ratio test?!
Which summary for model choice?



    Depending on the choice of η(·), the Bayes factor based on this
    insufficient statistic,

       B12^η (y) = ∫ π1 (θ1 ) f1^η (η(y)|θ1 ) dθ1  /  ∫ π2 (θ2 ) f2^η (η(y)|θ2 ) dθ2 ,

    is consistent or not.
                                     [X, Cornuet, Marin, & Pillai, 2012]
    Consistency only depends on the range of Ei [η(y)] under both
    models.
                                 [Marin, Pillai, X, & Rousseau, 2012]
Semi-automatic ABC


  Fearnhead and Prangle (2010) study ABC and the selection of the
  summary statistic in close proximity to Wilkinson’s proposal
  ABC then considered from a purely inferential viewpoint and
  calibrated for estimation purposes
  Use of a randomised (or ‘noisy’) version of the summary statistics

                           η̃(y) = η(y) + τ ε

  Derivation of a well-calibrated version of ABC, i.e. an algorithm
  that gives proper predictions for the distribution associated with
  this randomised summary statistic [calibration constraint: ABC
  approximation with same posterior mean as the true randomised
  posterior]
Summary [of FP/statistics]




      Optimality of the posterior expectation E[θ|y] of the
      parameter of interest as summary statistics η(y)!
      Use of the standard quadratic loss function

                           (θ − θ0 )T A(θ − θ0 ) .
Details on Fearnhead and Prangle (FP) ABC


    Use of a summary statistic S(·), an importance proposal g (·), a
    kernel K (·) ≤ 1 and a bandwidth h > 0 such that

                         (θ, ysim ) ∼ g (θ)f (ysim |θ)

    is accepted with probability (hence the bound)

                          K [{S(ysim ) − sobs }/h]

    and the corresponding importance weight defined by

                                 π(θ) / g (θ)

                                             [Fearnhead & Prangle, 2012]
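A sketch of this accept/reweight step in Python; every argument (g_sample, g_pdf, prior_pdf, simulate, summary, kernel) is an illustrative stand-in for a user-supplied component, and the kernel is assumed to satisfy K(·) ≤ 1 so it can serve directly as an acceptance probability.

import numpy as np

def fp_abc_sample(s_obs, g_sample, g_pdf, prior_pdf, simulate, summary,
                  kernel, h, n_accept, rng=None):
    rng = rng or np.random.default_rng()
    draws, weights = [], []
    while len(draws) < n_accept:
        theta = g_sample(rng)                      # theta ~ g
        s_sim = summary(simulate(theta, rng))      # S(y_sim), y_sim ~ f(.|theta)
        if rng.random() < kernel((s_sim - s_obs) / h):
            draws.append(theta)
            weights.append(prior_pdf(theta) / g_pdf(theta))   # IS weight pi/g
    return np.array(draws), np.array(weights)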
Errors, errors, and errors


   Three levels of approximation
       π(θ|yobs ) by π(θ|sobs ) loss of information
                                                                    [ignored]
       π(θ|sobs ) by

          πABC (θ|sobs ) = ∫ π(s)K [{s − sobs }/h] π(θ|s) ds  /  ∫ π(s)K [{s − sobs }/h] ds

       noisy observations
       πABC (θ|sobs ) by importance Monte Carlo based on N
       simulations, represented by var(a(θ)|sobs )/Nacc [expected
       number of acceptances]
Average acceptance asymptotics



   For the average acceptance probability/approximate likelihood

          p(θ|sobs ) = ∫ f (ysim |θ) K [{S(ysim ) − sobs }/h] dysim ,

    overall acceptance probability

           p(sobs ) = ∫ p(θ|sobs ) π(θ) dθ = π(sobs ) h^d + o(h^d)

                                                          [FP, Lemma 1]
Optimal importance proposal



   Best choice of importance proposal in terms of effective sample size

                      g (θ|sobs ) ∝ π(θ) p(θ|sobs )^{1/2}

                                   [Not particularly useful in practice]

       note that p(θ|sobs ) is an approximate likelihood
       reminiscent of parallel tempering
       could be approximately achieved by attrition of half of the
       data
Calibration of h


        “This result gives insight into how S(·) and h affect the Monte
        Carlo error. To minimize Monte Carlo error, we need h^d to be not
        too small. Thus ideally we want S(·) to be a low dimensional
        summary of the data that is sufficiently informative about θ that
        π(θ|sobs ) is close, in some sense, to π(θ|yobs )” (FP, p.5)


        turns h into an absolute value while it should be
        context-dependent and user-calibrated
        only addresses one term in the approximation error and
        acceptance probability (“curse of dimensionality”)
        h large prevents πABC (θ|sobs ) from being close to π(θ|sobs )
        d small prevents π(θ|sobs ) from being close to π(θ|yobs ) (“curse of
        [dis]information”)
Calibrating ABC


       “If πABC is calibrated, then this means that probability statements
       that are derived from it are appropriate, and in particular that we
       can use πABC to quantify uncertainty in estimates” (FP, p.5)


   Definition
    For 0 < q < 1 and subset A, event Eq (A) made of sobs such that
   PrABC (θ ∈ A|sobs ) = q. Then ABC is calibrated if

                            Pr(θ ∈ A|Eq (A)) = q



       unclear meaning of conditioning on Eq (A)
Calibrated ABC




   Theorem (FP)
   Noisy ABC, where

                    sobs = S(yobs ) + hε ,    ε ∼ K (·)

   is calibrated
                                                      [Wilkinson, 2008]
                      no condition on h!!
Calibrated ABC




   Consequence: when h = ∞

   Theorem (FP)
   The prior distribution is always calibrated

   is this a relevant property then?
More questions about calibrated ABC


      “Calibration is not universally accepted by Bayesians. It is even more
      questionable here as we care how statements we make relate to the
      real world, not to a mathematically defined posterior.” R. Wilkinson


      Same reluctance about the prior being calibrated
      Property depending on prior, likelihood, and summary
      Calibration is a frequentist property (almost a p-value!)
       More sensible to account for the simulator’s imperfections
       than using noisy-ABC against a meaningless ε-based measure
                                                            [Wilkinson, 2012]
Converging ABC



  Theorem (FP)
  For noisy ABC, the expected noisy-ABC log-likelihood,

     E {log[p(θ|sobs )]} = ∫∫ log[p(θ|S(yobs ) + ε)] π(yobs |θ0 ) K (ε) dyobs dε ,

  has its maximum at θ = θ0 .

   True for any choice of summary statistic? even ancillary statistics?!
                                   [Imposes at least identifiability...]
  Relevant in asymptotia and not for the data
Converging ABC



  Corollary
  For noisy ABC, the ABC posterior converges onto a point mass on
  the true parameter value as m → ∞.

  For standard ABC, not always the case (unless h goes to zero).
   Strength of regularity conditions (c1) and (c2) in Bernardo &
   Smith, 1994?
                [out-of-reach constraints on likelihood and posterior]
  Again, there must be conditions imposed upon summary
  statistics...
Loss motivated statistic

   Under quadratic loss function,

   Theorem (FP)
     (i) The minimal posterior error E[L(θ, θ̂)|yobs ] occurs when
         θ̂ = E(θ|yobs ) (!)
    (ii) When h → 0, EABC (θ|sobs ) converges to E(θ|yobs )
   (iii) If S(yobs ) = E[θ|yobs ] then for θ̂ = EABC [θ|sobs ]

           E[L(θ, θ̂)|yobs ] = trace(AΣ) + h² ∫ xT Ax K (x) dx + o(h²).

    measure-theoretic difficulties?
    dependence of sobs on h makes me uncomfortable [inherent to noisy
    ABC]
   Relevant for choice of K ?
Optimal summary statistic

       “We take a different approach, and weaken the requirement for
       πABC to be a good approximation to π(θ|yobs ). We argue for πABC
       to be a good approximation solely in terms of the accuracy of
       certain estimates of the parameters.” (FP, p.5)

   From this result, FP
       derive their choice of summary statistic,

                                  S(y) = E(θ|y)

                                                               [almost sufficient]
       suggest

                 h = O(N −1/(2+d) ) and h = O(N −1/(4+d) )

       as optimal bandwidths for noisy and standard ABC.
Optimal summary statistic

       “We take a different approach, and weaken the requirement for
       πABC to be a good approximation to π(θ|yobs ). We argue for πABC
       to be a good approximation solely in terms of the accuracy of
       certain estimates of the parameters.” (FP, p.5)

   From this result, FP
       derive their choice of summary statistic,

                                  S(y) = E(θ|y)

                                           [wow! EABC [θ|S(yobs )] = E[θ|yobs ]]
       suggest

                 h = O(N −1/(2+d) ) and h = O(N −1/(4+d) )

       as optimal bandwidths for noisy and standard ABC.
Caveat



  Since E(θ|yobs ) is most usually unavailable, FP suggest
    (i) use a pilot run of ABC to determine a region of non-negligible
        posterior mass;
   (ii) simulate sets of parameter values and data;
  (iii) use the simulated sets of parameter values and data to
        estimate the summary statistic; and
   (iv) run ABC with this choice of summary statistic.
  where is the assessment of the first stage error?
Approximating the summary statistic




    As in Beaumont et al. (2002) and Blum and François (2010), FP
    use a linear regression to approximate E(θ|yobs ):

                       θi = β0(i) + β (i) f (yobs ) + εi

    with f made of standard transforms
                                                 [Further justifications?]
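A minimal least-squares version of this regression step; the interface (a matrix of pilot parameter draws and the matching matrix of transformed pseudo-data) is an assumption of the sketch, not FP's own code.

import numpy as np

def fit_summary(theta_pilot, f_pilot):
    # theta_pilot: (n,) or (n, p) pilot parameter draws
    # f_pilot:     (n, q) transforms f(y) of the matching pseudo-data
    X = np.column_stack([np.ones(len(f_pilot)), f_pilot])
    beta, *_ = np.linalg.lstsq(X, theta_pilot, rcond=None)
    def summary(f_new):
        f_new = np.atleast_2d(f_new)
        return np.column_stack([np.ones(len(f_new)), f_new]) @ beta
    return summary          # estimated S(y) = beta_0 + beta' f(y) ~ E[theta|y]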
[my] questions about semi-automatic ABC

      dependence on h and S(·) in the early stage
      reduction of Bayesian inference to point estimation
      approximation error in step (i) not accounted for
      not parameterisation invariant
      practice shows that proper approximation to genuine posterior
      distributions stems from using a (much) larger number of
      summary statistics than the dimension of the parameter
      the validity of the approximation to the optimal summary
      statistic depends on the quality of the pilot run
      important inferential issues like model choice are not covered
      by this approach.
                                                            [X, 2012]
More about semi-automatic ABC

                              [End of section derived from comments on Read Paper, just appeared]

      “The apparently arbitrary nature of the choice of summary statistics
      has always been perceived as the Achilles heel of ABC.” M.
      Beaumont

      “Curse of dimensionality” linked with the increase of the
      dimension of the summary statistic
      Connection with principal component analysis
                                                                            [Itan et al., 2010]
      Connection with partial least squares
                                                                    [Wegman et al., 2009]
      Beaumont et al. (2002) postprocessed output is used as input
      by FP to run a second ABC
Wood’s alternative

    Instead of a non-parametric kernel approximation to the likelihood

                       (1/R) Σ_r Kε {η(yr ) − η(yobs )}

    Wood (2010) suggests a normal approximation

                           η(y(θ)) ∼ Nd (µθ , Σθ )

    whose parameters can be approximated based on the R simulations
    (for each value of θ).
       Parametric versus non-parametric rate [Uh?!]
       Automatic weighting of components of η(·) through Σθ
       Dependence on normality assumption (pseudo-likelihood?)
                                 [Cornebise, Girolami  Kosmidis, 2012]
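A small sketch of Wood's synthetic log-likelihood for one value of θ, given the R simulated summary vectors; this is just the Gaussian log-density evaluated at η(yobs) with the plug-in moment estimates.

import numpy as np

def synthetic_loglik(eta_obs, eta_sims):
    # eta_sims: (R, d) array of simulated summaries eta(y_r(theta))
    mu = eta_sims.mean(axis=0)
    sigma = np.atleast_2d(np.cov(eta_sims, rowvar=False))
    diff = np.atleast_1d(eta_obs) - mu
    _, logdet = np.linalg.slogdet(sigma)
    d = diff.size
    return float(-0.5 * (d * np.log(2.0 * np.pi) + logdet
                         + diff @ np.linalg.solve(sigma, diff)))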
Reinterpretation and extensions

    Reinterpretation of the ABC output as joint simulation from

                         π̄ε (x, y |θ) = f (x|θ) π̄ε,Y |X (y |x)

    where
                            π̄ε,Y |X (y |x) = Kε (y − x)

    Reinterpretation of noisy ABC
    if ȳ |y obs ∼ π̄ε,Y |X (·|y obs ), then marginally

                                 ȳ ∼ π̄ε,Y |θ (·|θ0 )

   © Explains the consistency of Bayesian inference based on ȳ and π̄ε
                                        [Lee, Andrieu & Doucet, 2012]
ABC: How Bayesian can it be?

Mais conteúdo relacionado

Mais procurados

QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
The Statistical and Applied Mathematical Sciences Institute
 
Jyokyo-kai-20120605
Jyokyo-kai-20120605Jyokyo-kai-20120605
Jyokyo-kai-20120605
ketanaka
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
ijceronline
 
Principal component analysis and matrix factorizations for learning (part 3) ...
Principal component analysis and matrix factorizations for learning (part 3) ...Principal component analysis and matrix factorizations for learning (part 3) ...
Principal component analysis and matrix factorizations for learning (part 3) ...
zukun
 

Mais procurados (20)

Micro to macro passage in traffic models including multi-anticipation effect
Micro to macro passage in traffic models including multi-anticipation effectMicro to macro passage in traffic models including multi-anticipation effect
Micro to macro passage in traffic models including multi-anticipation effect
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
QMC: Transition Workshop - Probabilistic Integrators for Deterministic Differ...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Divergence center-based clustering and their applications
Divergence center-based clustering and their applicationsDivergence center-based clustering and their applications
Divergence center-based clustering and their applications
 
Jyokyo-kai-20120605
Jyokyo-kai-20120605Jyokyo-kai-20120605
Jyokyo-kai-20120605
 
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Cat...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Divergence clustering
Divergence clusteringDivergence clustering
Divergence clustering
 
YSC 2013
YSC 2013YSC 2013
YSC 2013
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
Representation formula for traffic flow estimation on a network
Representation formula for traffic flow estimation on a networkRepresentation formula for traffic flow estimation on a network
Representation formula for traffic flow estimation on a network
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Principal component analysis and matrix factorizations for learning (part 3) ...
Principal component analysis and matrix factorizations for learning (part 3) ...Principal component analysis and matrix factorizations for learning (part 3) ...
Principal component analysis and matrix factorizations for learning (part 3) ...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Classification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metricsClassification with mixtures of curved Mahalanobis metrics
Classification with mixtures of curved Mahalanobis metrics
 
Andreas Eberle
Andreas EberleAndreas Eberle
Andreas Eberle
 
Loss Calibrated Variational Inference
Loss Calibrated Variational InferenceLoss Calibrated Variational Inference
Loss Calibrated Variational Inference
 

Semelhante a ABC: How Bayesian can it be?

Xian's abc
Xian's abcXian's abc
Xian's abc
Deb Roy
 
GradStudentSeminarSept30
GradStudentSeminarSept30GradStudentSeminarSept30
GradStudentSeminarSept30
Ryan White
 
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence MeasuresLinear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
Anmol Dwivedi
 

Semelhante a ABC: How Bayesian can it be? (20)

Xian's abc
Xian's abcXian's abc
Xian's abc
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.
 
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 On Clustering Histograms with k-Means by Using Mixed α-Divergences On Clustering Histograms with k-Means by Using Mixed α-Divergences
On Clustering Histograms with k-Means by Using Mixed α-Divergences
 
Unbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte CarloUnbiased Hamiltonian Monte Carlo
Unbiased Hamiltonian Monte Carlo
 
Shanghai tutorial
Shanghai tutorialShanghai tutorial
Shanghai tutorial
 
Conceptual Introduction to Gaussian Processes
Conceptual Introduction to Gaussian ProcessesConceptual Introduction to Gaussian Processes
Conceptual Introduction to Gaussian Processes
 
Two dimensional Pool Boiling
Two dimensional Pool BoilingTwo dimensional Pool Boiling
Two dimensional Pool Boiling
 
How to find a cheap surrogate to approximate Bayesian Update Formula and to a...
How to find a cheap surrogate to approximate Bayesian Update Formula and to a...How to find a cheap surrogate to approximate Bayesian Update Formula and to a...
How to find a cheap surrogate to approximate Bayesian Update Formula and to a...
 
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
 
Madrid easy
Madrid easyMadrid easy
Madrid easy
 
Non-sampling functional approximation of linear and non-linear Bayesian Update
Non-sampling functional approximation of linear and non-linear Bayesian UpdateNon-sampling functional approximation of linear and non-linear Bayesian Update
Non-sampling functional approximation of linear and non-linear Bayesian Update
 
GradStudentSeminarSept30
GradStudentSeminarSept30GradStudentSeminarSept30
GradStudentSeminarSept30
 
Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration Automatic Bayesian method for Numerical Integration
Automatic Bayesian method for Numerical Integration
 
A Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann HypothesisA Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann Hypothesis
 
A Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann HypothesisA Proof of the Generalized Riemann Hypothesis
A Proof of the Generalized Riemann Hypothesis
 
Markov Tutorial CDC Shanghai 2009
Markov Tutorial CDC Shanghai 2009Markov Tutorial CDC Shanghai 2009
Markov Tutorial CDC Shanghai 2009
 
Scientific Computing with Python Webinar 9/18/2009:Curve Fitting
Scientific Computing with Python Webinar 9/18/2009:Curve FittingScientific Computing with Python Webinar 9/18/2009:Curve Fitting
Scientific Computing with Python Webinar 9/18/2009:Curve Fitting
 
cswiercz-general-presentation
cswiercz-general-presentationcswiercz-general-presentation
cswiercz-general-presentation
 
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence MeasuresLinear Discriminant Analysis (LDA) Under f-Divergence Measures
Linear Discriminant Analysis (LDA) Under f-Divergence Measures
 
Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data
Let's Practice What We Preach: Likelihood Methods for Monte Carlo DataLet's Practice What We Preach: Likelihood Methods for Monte Carlo Data
Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data
 

Mais de Christian Robert

Mais de Christian Robert (20)

Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 
the ABC of ABC
the ABC of ABCthe ABC of ABC
the ABC of ABC
 

Último

Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Último (20)

Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

ABC: How Bayesian can it be?

  • 1. Approximate Bayesian Computation: how Bayesian can it be? Christian P. Robert ISBA LECTURES ON BAYESIAN FOUNDATIONS ISBA 2012, Kyoto, Japan Universit´ Paris-Dauphine, IuF, & CREST e http://www.ceremade.dauphine.fr/~xian June 24, 2012
  • 2. This talk is dedicated to the memory of our dear friend and fellow Bayesian, George Casella, 1951–2012
  • 4. Approximate Bayesian computation Introduction Monte Carlo basics simulation-based methods in Econometrics Approximate Bayesian computation ABC as an inference machine
  • 5. General issue Given a density π known up to a normalizing constant, e.g. a posterior density, and an integrable function h, how can one use h(x)˜ (x)µ(dx) π Ih = h(x)π(x)µ(dx) = π (x)µ(dx) ˜ when h(x)˜ (x)µ(dx) is intractable? π
  • 6. Monte Carlo basics Generate an iid sample x1 , . . . , xN from π and estimate Ih by N Imc (h) = N −1 ˆ N h(xi ). i=1 ˆ as since [LLN] IMC (h) −→ Ih N Furthermore, if Ih2 = h2 (x)π(x)µ(dx) < ∞, √ L [CLT] ˆ N IMC (h) − Ih N N 0, I [h − Ih ]2 . Caveat Often impossible or inefficient to simulate directly from π
  • 7. Monte Carlo basics Generate an iid sample x1 , . . . , xN from π and estimate Ih by N Imc (h) = N −1 ˆ N h(xi ). i=1 ˆ as since [LLN] IMC (h) −→ Ih N Furthermore, if Ih2 = h2 (x)π(x)µ(dx) < ∞, √ L [CLT] ˆ N IMC (h) − Ih N N 0, I [h − Ih ]2 . Caveat Often impossible or inefficient to simulate directly from π
  • 8. Importance Sampling For Q proposal distribution with density q Ih = h(x){π/q}(x)q(x)µ(dx). Principle Generate an iid sample x1 , . . . , xN ∼ Q and estimate Ih by N IIS (h) = N −1 ˆ Q,N h(xi ){π/q}(xi ). i=1
  • 9. Importance Sampling For Q proposal distribution with density q Ih = h(x){π/q}(x)q(x)µ(dx). Principle Generate an iid sample x1 , . . . , xN ∼ Q and estimate Ih by N IIS (h) = N −1 ˆ Q,N h(xi ){π/q}(xi ). i=1
  • 10. Importance Sampling (convergence) Then ˆ as [LLN] IIS (h) −→ Ih Q,N and if Q((hπ/q)2 ) < ∞, √ L [CLT] ˆ N(IIS (h) − Ih ) Q,N N 0, Q{(hπ/q − Ih )2 } . Caveat ˆ If normalizing constant unknown, impossible to use IIS (h) Q,N Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
  • 11. Importance Sampling (convergence) Then ˆ as [LLN] IIS (h) −→ Ih Q,N and if Q((hπ/q)2 ) < ∞, √ L [CLT] ˆ N(IIS (h) − Ih ) Q,N N 0, Q{(hπ/q − Ih )2 } . Caveat ˆ If normalizing constant unknown, impossible to use IIS (h) Q,N Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).
  • 12. Self-normalised importance sampling Self-normalised version Î^SNIS_{Q,N}(h) = { Σ_{i=1}^N (π/q)(x_i) }^{-1} Σ_{i=1}^N h(x_i)(π/q)(x_i). [LLN] Î^SNIS_{Q,N}(h) → I_h (a.s.) and, if I((1+h²)(π/q)) < ∞, [CLT] √N (Î^SNIS_{Q,N}(h) − I_h) ⇝ N(0, π{(π/q)(h − I_h)²}). Caveat If π cannot be computed, impossible to use Î^SNIS_{Q,N}(h)
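A minimal R sketch of self-normalised importance sampling as defined above; the unnormalised Gaussian target, the Cauchy proposal and h(x) = x² are illustrative assumptions.

# self-normalised IS: target pi known only up to a normalising constant
pitilde <- function(x) exp(-x^2 / 2)      # unnormalised N(0,1) density
q       <- function(x) dcauchy(x)         # proposal density
N  <- 1e5
x  <- rcauchy(N)                          # sample from the proposal Q
w  <- pitilde(x) / q(x)                   # unnormalised importance weights
h  <- function(x) x^2
sum(w * h(x)) / sum(w)                    # SNIS estimate of E_pi[h(X)], close to 1 here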
  • 15. Perspectives What is the fundamental issue? a mere computational issue (optimism: can be solved / pessimism: too costly in the short term) more of an inferential issue (optimism: gathering legitimacy from the classical B approach / pessimism: lacking the coherence of the classical B approach) calling for a new methodology (optimism: equivalent to the classical B approach / pessimism: not always convergent)
  • 18. Econom’ections Model choice Similar exploration of simulation-based and approximation techniques in Econometrics Simulated method of moments Method of simulated moments Simulated pseudo-maximum-likelihood Indirect inference [Gouri´roux & Monfort, 1996] e
  • 19. Simulated method of moments Given observations y^o_{1:n} from a model y_t = r(y_{1:(t−1)}, ε_t, θ), ε_t ∼ g(·) 1. simulate ε_{1:n}, derive y_t(θ) = r(y_{1:(t−1)}, ε_t, θ) 2. and estimate θ by arg min_θ Σ_{t=1}^n (y^o_t − y_t(θ))²
  • 20. Simulated method of moments Given observations y^o_{1:n} from a model y_t = r(y_{1:(t−1)}, ε_t, θ), ε_t ∼ g(·) 1. simulate ε_{1:n}, derive y_t(θ) = r(y_{1:(t−1)}, ε_t, θ) 2. and estimate θ by arg min_θ ( Σ_{t=1}^n y^o_t − Σ_{t=1}^n y_t(θ) )²
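A schematic R sketch of the simulated-moments idea above, for a hypothetical AR(1)-type recursion y_t = θ y_{t−1} + ε_t; the true value θ = 0.6, the Gaussian shocks and the pointwise squared criterion mirror the slide's display and are illustrative assumptions, not a definitive implementation.

set.seed(1)
n      <- 200
theta0 <- 0.6
y.obs  <- filter(rnorm(n), theta0, method = "recursive")   # observed series y^o_t
eps.s  <- rnorm(n)                                         # simulated shocks, held fixed across theta
crit <- function(theta) {
  y.sim <- filter(eps.s, theta, method = "recursive")      # simulated series y_t(theta)
  sum((y.obs - y.sim)^2)                                   # criterion of slide 19
}
optimize(crit, c(-0.95, 0.95))$minimum                     # simulation-based estimate of theta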
  • 21. Method of simulated moments Given a statistic vector K(y) with E_θ[K(Y_t)|y_{1:(t−1)}] = k(y_{1:(t−1)}; θ), find an unbiased estimator of k(y_{1:(t−1)}; θ), k̃(ε_t, y_{1:(t−1)}; θ) Estimate θ by arg min_θ Σ_{t=1}^n { K(y_t) − Σ_{s=1}^S k̃(ε_s, y_{1:(t−1)}; θ)/S } [Pakes & Pollard, 1989]
  • 22. Indirect inference ˆ Minimise [in θ] a distance between estimators β based on a pseudo-model for genuine observations and for observations simulated under the true model and the parameter θ. [Gouri´roux, Monfort, & Renault, 1993; e Smith, 1993; Gallant & Tauchen, 1996]
  • 23. Indirect inference (PML vs. PSE) Example of the pseudo-maximum-likelihood (PML) β̂(y) = arg max_β Σ_t log f(y_t | β, y_{1:(t−1)}), leading to arg min_θ ||β̂(y^o) − β̂(y_1(θ), . . . , y_S(θ))||² when y_s(θ) ∼ f(y|θ), s = 1, . . . , S
  • 24. Indirect inference (PML vs. PSE) Example of the pseudo-score-estimator (PSE) β̂(y) = arg min_β Σ_t { ∂ log f/∂β (y_t | β, y_{1:(t−1)}) }², leading to arg min_θ ||β̂(y^o) − β̂(y_1(θ), . . . , y_S(θ))||² when y_s(θ) ∼ f(y|θ), s = 1, . . . , S
  • 25. Consistent indirect inference “...in order to get a unique solution the dimension of the auxiliary parameter β must be larger than or equal to the dimension of the initial parameter θ. If the problem is just identified the different methods become easier...” Consistency depending on the criterion and on the asymptotic identifiability of θ [Gouriéroux & Monfort, 1996, p. 66]
  • 27. Choice of pseudo-model Arbitrariness of pseudo-model Pick model such that 1. β̂(θ) not flat (i.e. sensitive to changes in θ) 2. β̂(θ) not dispersed (i.e. robust against changes in y_s(θ)) [Frigessi & Heggland, 2004]
  • 28. Empirical likelihood Another approximation method (not yet related with simulation) Definition For a dataset y = (y_1, . . . , y_n) and a parameter of interest θ, pick constraints E[h(Y, θ)] = 0 uniquely identifying θ and define the empirical likelihood as L_el(θ|y) = max_p Π_{i=1}^n p_i for p in the set {p ∈ [0, 1]^n, Σ_i p_i = 1, Σ_i p_i h(y_i, θ) = 0}. [Owen, 1988]
  • 29. Empirical likelihood Another approximation method (not yet related with simulation) Example When θ = Ef [Y ], empirical likelihood is the maximum of p1 · · · pn under constraint p1 y1 + . . . + pn yn = θ
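A small R sketch of the empirical likelihood for a mean, i.e. the example on the slide above: the Lagrange multiplier of the constrained maximisation is obtained by one-dimensional root finding. The simulated data and the trial values of θ are illustrative assumptions.

el.mean <- function(theta, y) {              # log empirical likelihood of the mean at theta
  z <- y - theta
  n <- length(y)
  if (min(z) >= 0 || max(z) <= 0) return(-Inf)   # theta must lie inside the data range
  g   <- function(lam) sum(z / (1 + lam * z))    # profile equation for the multiplier
  lam <- uniroot(g, c(-1 / max(z) + 1e-8, -1 / min(z) - 1e-8))$root
  -sum(log(n * (1 + lam * z)))                   # log of prod p_i with p_i = 1/(n(1 + lam z_i))
}
y <- rnorm(50, mean = 2)
sapply(c(1.5, 2, mean(y)), el.mean, y = y)       # maximised at theta = mean(y)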
  • 30. ABCel Another approximation method (now related with simulation!) Importance sampling implementation Algorithm 1: Raw ABCel sampler Given observation y for i = 1 to M do Generate θ i from the prior distribution π(·) Set the weight ωi = Lel (θ i |y) end for Proceed with pairs (θi , ωi ) as in regular importance sampling [Mengersen, Pudlo & Robert, 2012]
  • 31. A?B?C? A stands for approximate [wrong likelihood] B stands for Bayesian C stands for computation [producing a parameter sample]
  • 34. How much Bayesian? asymptotically so (meaningful?) approximation error unknown (w/o simulation) pragmatic Bayes (there is no other solution!)
  • 35. Approximate Bayesian computation Introduction Approximate Bayesian computation Genetics of ABC ABC basics Advances and interpretations Alphabet (compu-)soup ABC as an inference machine
  • 36. Genetic background of ABC ABC is a recent computational technique that only requires being able to sample from the likelihood f(·|θ) This technique stemmed from population genetics models, about 15 years ago, and population geneticists still contribute significantly to methodological developments of ABC. [Griffiths & al., 1997; Tavaré & al., 1999]
  • 37. Demo-genetic inference Each model is characterized by a set of parameters θ that cover historical (time divergence, admixture time, ...), demographic (population sizes, admixture rates, migration rates, ...) and genetic (mutation rate, ...) factors The goal is to estimate these parameters from a dataset of polymorphism (DNA sample) y observed at the present time Problem: most of the time, we cannot calculate the likelihood of the polymorphism data f(y|θ).
  • 39. A genuine example of application Pygmy populations: do they have a common origin? Was there a lot of exchange between pygmy and non-pygmy populations?
  • 40. Alternative scenarios Different possible scenarios; choice of scenario by ABC [Verdu et al. 2009]
  • 41. Untractable likelihood Missing (too missing!) data structure: f(y|θ) = ∫_G f(y|G, θ) f(G|θ) dG cannot be computed in a manageable way... The genealogies are considered as nuisance parameters This modelling clearly differs from the phylogenetic perspective where the tree is the parameter of interest.
  • 43. Untractable likelihood So, what can we do when the likelihood function f(y|θ) is well-defined but impossible / too costly to compute...? MCMC cannot be implemented! shall we give up Bayesian inference altogether?! or settle for an almost Bayesian inference...?
  • 45. ABC methodology Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: Foundation For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating θ′ ∼ π(θ), z ∼ f(z|θ′), until the auxiliary variable z is equal to the observed value, z = y, then the selected θ′ ∼ π(θ|y) [Rubin, 1984; Diggle & Gratton, 1984; Tavaré et al., 1997]
  • 48. Why does it work?! The proof is trivial: f(θ_i) ∝ Σ_{z∈D} π(θ_i) f(z|θ_i) I_y(z) ∝ π(θ_i) f(y|θ_i) = π(θ_i|y). [Accept–Reject 101]
  • 49. A as A...pproximative When y is a continuous random variable, strict equality z = y is replaced with a tolerance zone ρ(y, z) ≤ ε, where ρ is a distance Output distributed from π(θ) P_θ{ρ(y, z) < ε}, proportional by definition to π(θ | ρ(y, z) < ε) [Pritchard et al., 1999]
  • 51. ABC algorithm In most implementations, further degree of A...pproximation: Algorithm 1 Likelihood-free rejection sampler for i = 1 to N do repeat generate θ′ from the prior distribution π(·) generate z from the likelihood f(·|θ′) until ρ{η(z), η(y)} ≤ ε set θ_i = θ′ end for where η(y) defines a (not necessarily sufficient) statistic
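A compact R version of Algorithm 1 on a hypothetical toy problem, with a Normal mean θ, a N(0, 10) prior (variance 10), the sample mean as summary statistic η, and a quantile-based tolerance; all numerical settings are illustrative assumptions.

set.seed(2)
n     <- 20
y.obs <- rnorm(n, mean = 1.5)                      # observed data (hypothetical)
eta   <- mean(y.obs)                               # summary statistic eta(y)
N     <- 1e5
theta <- rnorm(N, sd = sqrt(10))                   # draws from the prior
eta.z <- rnorm(N, mean = theta, sd = 1/sqrt(n))    # eta(z) simulated directly: mean of n N(theta,1) draws
eps   <- quantile(abs(eta.z - eta), 0.001)         # tolerance chosen as a small distance quantile
abc   <- theta[abs(eta.z - eta) <= eps]            # accepted values approximate pi(theta | eta(y))
summary(abc)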
  • 52. Output The likelihood-free algorithm samples from the marginal in z of: π_ε(θ, z|y) = π(θ)f(z|θ) I_{A_{ε,y}}(z) / ∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ, where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}. The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution: π_ε(θ|y) = ∫ π_ε(θ, z|y) dz ≈ π(θ|y). ...does it?!
  • 55. Convergence of ABC (first attempt) What happens when ε → 0? If f(·|θ) is continuous in y, uniformly in θ [!], given an arbitrary δ > 0, there exists ε_0 such that ε < ε_0 implies that π(θ) ∫ f(z|θ) I_{A_{ε,y}}(z) dz / ∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ lies between π(θ)f(y|θ)(1−δ)µ(B_ε) / [∫_Θ π(θ)f(y|θ) dθ (1+δ)µ(B_ε)] and π(θ)f(y|θ)(1+δ)µ(B_ε) / [∫_Θ π(θ)f(y|θ) dθ (1−δ)µ(B_ε)], with the µ(B_ε) terms cancelling [Proof extends to other continuous-in-0 kernels K]
  • 59. Convergence of ABC (second attempt) What happens when ε → 0? For B ⊂ Θ, we have ∫_B π(θ) [∫_{A_{ε,y}} f(z|θ) dz] dθ / ∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ = ∫_{A_{ε,y}} [∫_B f(z|θ)π(θ) dθ / m(z)] × m(z) / [∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ] dz = ∫_{A_{ε,y}} π(B|z) × m(z) / [∫_{A_{ε,y}×Θ} π(θ)f(z|θ) dz dθ] dz, which indicates convergence for a continuous π(B|z).
  • 61. Convergence (do not attempt!) ...and the above does not apply to insufficient statistics: If η(y) is not a sufficient statistic, the best one can hope for is π(θ|η(y)), not π(θ|y) If η(y) is an ancillary statistic, the whole information contained in y is lost!, and the “best” one can hope for is π(θ|η(y)) = π(θ) Bummer!!!
  • 65. Probit modelling on Pima Indian women Example (R benchmark) 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes Probability of diabetes as function of variables P(y = 1|x) = Φ(x_1β_1 + x_2β_2 + x_3β_3), 200 observations of Pima.tr and g-prior modelling: β ∼ N_3(0, n (X^T X)^{−1}) importance function inspired from the MLE estimator distribution β ∼ N(β̂, Σ̂)
  • 69. Pima Indian benchmark Figure: Comparison between density estimates of the marginals on β1 (left), β2 (center) and β3 (right) from ABC rejection samples (red) and MCMC samples (black).
  • 70. MA example MA(q) model x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i} Simple prior: uniform over the inverse [real and complex] roots in Q(u) = 1 − Σ_{i=1}^q ϑ_i u^i under the identifiability conditions
  • 71. MA example MA(q) model x_t = ε_t + Σ_{i=1}^q ϑ_i ε_{t−i} Simple prior: uniform prior over the identifiability zone, i.e. triangle for MA(2)
  • 72. MA example (2) ABC algorithm thus made of 1. picking a new value (ϑ_1, ϑ_2) in the triangle 2. generating an iid sequence (ε_t)_{−q<t≤T} 3. producing a simulated series (x′_t)_{1≤t≤T} Distance: basic distance between the series ρ((x_t)_{1≤t≤T}, (x′_t)_{1≤t≤T}) = Σ_{t=1}^T (x_t − x′_t)², or distance between summary statistics like the q = 2 autocorrelations τ_j = Σ_{t=j+1}^T x_t x_{t−j}
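A short R sketch of this MA(2) ABC scheme, drawing the prior uniformly over the identifiability triangle and using the two empirical autocovariance-type summaries τ_1, τ_2; the true parameter, series length, number of simulations and tolerance are illustrative assumptions.

set.seed(3)
Tn    <- 100
x.obs <- arima.sim(n = Tn, model = list(ma = c(0.6, 0.2)))       # observed series (hypothetical truth)
acov  <- function(x) c(sum(x[-1] * x[-length(x)]),               # tau_1 = sum x_t x_{t-1}
                       sum(x[-(1:2)] * x[1:(length(x) - 2)]))    # tau_2 = sum x_t x_{t-2}
s.obs <- acov(x.obs)
N   <- 1e4
res <- matrix(NA, N, 3)
for (i in 1:N) {
  repeat {                                                       # uniform prior on the MA(2) triangle
    th <- c(runif(1, -2, 2), runif(1, -1, 1))
    if (th[1] + th[2] > -1 && th[1] - th[2] < 1) break
  }
  x <- arima.sim(n = Tn, model = list(ma = th))
  res[i, ] <- c(th, sum((acov(x) - s.obs)^2))                    # distance between summaries
}
abc <- res[res[, 3] <= quantile(res[, 3], 0.01), 1:2]            # keep the closest 1%
colMeans(abc)                                                    # ABC estimate of (theta1, theta2)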
  • 74. Comparison of distance impact Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 75. Comparison of distance impact [density estimates for θ1 and θ2] Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
  • 77. Comments Role of distance paramount (because ε ≠ 0) ε matters little if “small enough” representative of “curse of dimensionality” the data as a whole may be paradoxically weakly informative for ABC
  • 78. ABC (simul’) advances how approximative is ABC? Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y... [Marjoram et al, 2003; Bortot et al., 2007; Sisson et al., 2007] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger ε [Beaumont et al., 2002] ...or even by including ε in the inferential framework [ABCµ] [Ratmann et al., 2009]
  • 82. ABC-NP Better usage of [prior] simulations by adjustment: instead of throwing away θ′ such that ρ(η(z), η(y)) > ε, replace θ’s with locally regressed transforms (use with BIC) θ* = θ − {η(z) − η(y)}^T β̂ [Csilléry et al., TEE, 2010], where β̂ is obtained by [NP] weighted least square regression on (η(z) − η(y)) with weights K_δ{ρ(η(z), η(y))} [Beaumont et al., 2002, Genetics]
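A minimal R sketch of this local regression adjustment, reusing hypothetical inputs theta (vector of simulated parameters), S (matrix of simulated summaries, one row per simulation) and s.obs (observed summaries); the Epanechnikov-type weights and the quantile bandwidth are illustrative choices.

abc.adjust <- function(theta, S, s.obs, q = 0.1) {
  d     <- sqrt(rowSums(sweep(S, 2, s.obs)^2))        # distances rho(eta(z), eta(y))
  delta <- quantile(d, q)                             # bandwidth = q-quantile of the distances
  keep  <- d <= delta
  w     <- 1 - (d[keep] / delta)^2                    # kernel weights K_delta
  X     <- sweep(S[keep, , drop = FALSE], 2, s.obs)   # centred summaries eta(z) - eta(y)
  fit   <- lm(theta[keep] ~ X, weights = w)           # weighted local-linear regression
  theta.star <- theta[keep] - X %*% coef(fit)[-1]     # theta* = theta - {eta(z)-eta(y)}' beta.hat
  list(theta.star = as.numeric(theta.star), weights = w)
}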
  • 83. ABC-NP (regression) Also found in the subsequent literature, e.g. in Fearnhead–Prangle (2012): weight directly simulation by K_δ{ρ(η(z(θ)), η(y))} or (1/S) Σ_{s=1}^S K_δ{ρ(η(z_s(θ)), η(y))} [consistent estimate of f(η|θ)] Curse of dimensionality: poor estimate when d = dim(η) is large...
  • 85. ABC-NP (density estimation) Use of the kernel weights K_δ{ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior expectation Σ_i θ_i K_δ{ρ(η(z(θ_i)), η(y))} / Σ_i K_δ{ρ(η(z(θ_i)), η(y))} [Blum, JASA, 2010]
  • 86. ABC-NP (density estimation) Use of the kernel weights K_δ{ρ(η(z(θ)), η(y))} leads to the NP estimate of the posterior conditional density Σ_i K̃_b(θ_i − θ) K_δ{ρ(η(z(θ_i)), η(y))} / Σ_i K_δ{ρ(η(z(θ_i)), η(y))} [Blum, JASA, 2010]
  • 87. ABC-NP (density estimations) Other versions incorporating regression adjustments Σ_i K̃_b(θ*_i − θ) K_δ{ρ(η(z(θ_i)), η(y))} / Σ_i K_δ{ρ(η(z(θ_i)), η(y))} In all cases, error E[ĝ(θ|y)] − g(θ|y) = cb² + cδ² + O_P(b² + δ²) + O_P(1/nδ^d), var(ĝ(θ|y)) = c/(nbδ^d) (1 + o_P(1)) [Blum, JASA, 2010; standard NP calculations]
  • 90. ABC-NCH Incorporating non-linearities and heteroscedasticities: θ* = m̂(η(y)) + [θ − m̂(η(z))] σ̂(η(y))/σ̂(η(z)), where m̂(η) is estimated by non-linear regression (e.g., neural network) and σ̂(η) is estimated by non-linear regression on the residuals log{θ_i − m̂(η_i)}² = log σ²(η_i) + ξ_i [Blum & François, 2009]
  • 92. ABC-NCH (2) Why neural network? fights curse of dimensionality selects relevant summary statistics provides automated dimension reduction offers a model choice capability improves upon multinomial logistic [Blum & François, 2009]
  • 94. ABC-MCMC how approximative is ABC? Markov chain (θ^(t)) created via the transition function θ^(t+1) = θ′ ∼ K_ω(θ′|θ^(t)) if x ∼ f(x|θ′) is such that x = y and u ∼ U(0,1) ≤ π(θ′)K_ω(θ^(t)|θ′) / π(θ^(t))K_ω(θ′|θ^(t)), θ^(t) otherwise; this chain has the posterior π(θ|y) as stationary distribution [Marjoram et al, 2003]
  • 96. ABC-MCMC (2) Algorithm 2 Likelihood-free MCMC sampler Use Algorithm 1 to get (θ^(0), z^(0)) for t = 1 to N do Generate θ′ from K_ω(·|θ^(t−1)), Generate z′ from the likelihood f(·|θ′), Generate u from U[0,1], if u ≤ π(θ′)K_ω(θ^(t−1)|θ′) / π(θ^(t−1))K_ω(θ′|θ^(t−1)) × I_{A_{ε,y}}(z′) then set (θ^(t), z^(t)) = (θ′, z′) else (θ^(t), z^(t)) = (θ^(t−1), z^(t−1)) end if end for
  • 97. Why does it work? Acceptance probability does not involve calculating the likelihood: π_ε(θ′, z′|y) / π_ε(θ^(t−1), z^(t−1)|y) × q(θ^(t−1)|θ′)f(z^(t−1)|θ^(t−1)) / [q(θ′|θ^(t−1))f(z′|θ′)] = π(θ′)f(z′|θ′)I_{A_{ε,y}}(z′) / [π(θ^(t−1))f(z^(t−1)|θ^(t−1))I_{A_{ε,y}}(z^(t−1))] × q(θ^(t−1)|θ′)f(z^(t−1)|θ^(t−1)) / [q(θ′|θ^(t−1))f(z′|θ′)] = π(θ′)q(θ^(t−1)|θ′) / [π(θ^(t−1))q(θ′|θ^(t−1))] × I_{A_{ε,y}}(z′), after cancellation of the f terms and of I_{A_{ε,y}}(z^(t−1)) = 1
  • 100. A toy example Case of x ∼ (1/2) N(θ, 1) + (1/2) N(−θ, 1) under prior θ ∼ N(0, 10) ABC sampler thetas=rnorm(N,sd=10) zed=sample(c(1,-1),N,rep=TRUE)*thetas+rnorm(N,sd=1) eps=quantile(abs(zed-x),.01) abc=thetas[abs(zed-x)<eps]
  • 102. A toy example Case of x ∼ (1/2) N(θ, 1) + (1/2) N(−θ, 1) under prior θ ∼ N(0, 10) ABC-MCMC sampler metas=rep(0,N) metas[1]=rnorm(1,sd=10) zed[1]=x for (t in 2:N){ metas[t]=rnorm(1,mean=metas[t-1],sd=5) zed[t]=rnorm(1,mean=(1-2*(runif(1)<.5))*metas[t],sd=1) if ((abs(zed[t]-x)>eps)||(runif(1)>dnorm(metas[t],sd=10)/dnorm(metas[t-1],sd=10))){ metas[t]=metas[t-1] zed[t]=zed[t-1]} }
  • 103. A toy example x =2
  • 104. A toy example [density plot over θ] x = 2
  • 105. A toy example x = 10
  • 106. A toy example [density plot over θ] x = 10
  • 107. A toy example x = 50
  • 108. A toy example [density plot over θ] x = 50
  • 109. ABC-PMC Use of a transition kernel as in population Monte Carlo with manageable IS correction Generate a sample at iteration t by π̂_t(θ^(t)) ∝ Σ_{j=1}^N ω_j^(t−1) K_t(θ^(t)|θ_j^(t−1)), modulo acceptance of the associated x_t, and use an importance weight associated with an accepted simulation θ_i^(t): ω_i^(t) ∝ π(θ_i^(t)) / π̂_t(θ_i^(t)). Still likelihood free [Beaumont et al., 2009]
  • 110. Our ABC-PMC algorithm Given a decreasing sequence of approximation levels ε_1 ≥ ... ≥ ε_T, 1. At iteration t = 1, For i = 1, ..., N: Simulate θ_i^(1) ∼ π(θ) and x ∼ f(x|θ_i^(1)) until ρ(x, y) < ε_1; Set ω_i^(1) = 1/N; Take τ² as twice the empirical variance of the θ_i^(1)’s 2. At iteration 2 ≤ t ≤ T, For i = 1, ..., N, repeat: Pick θ*_i from the θ_j^(t−1)’s with probabilities ω_j^(t−1), generate θ_i^(t)|θ*_i ∼ N(θ*_i, σ_t²) and x ∼ f(x|θ_i^(t)), until ρ(x, y) < ε_t; Set ω_i^(t) ∝ π(θ_i^(t)) / Σ_{j=1}^N ω_j^(t−1) ϕ(σ_t^{−1}(θ_i^(t) − θ_j^(t−1))); Take τ²_{t+1} as twice the weighted empirical variance of the θ_i^(t)’s
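A compact R sketch of this ABC-PMC scheme on a simplified toy model (x ∼ N(θ, 1), a single observation, and the sd = 10 prior convention of the toy code earlier in the slides); the number of particles and the epsilon schedule are illustrative assumptions, not recommendations.

set.seed(4)
x.obs <- 2
N     <- 500
eps   <- c(2, 1, 0.5, 0.25)                            # decreasing tolerance schedule
theta <- numeric(N)
for (i in 1:N) {                                       # t = 1: plain ABC rejection from the prior
  repeat { th <- rnorm(1, sd = 10)
           if (abs(rnorm(1, th) - x.obs) <= eps[1]) break }
  theta[i] <- th
}
w <- rep(1 / N, N)
for (t in 2:length(eps)) {
  tau2 <- 2 * sum(w * (theta - sum(w * theta))^2)      # twice the weighted empirical variance
  old  <- theta; oldw <- w
  for (i in 1:N) {
    repeat {                                           # move a resampled particle, keep if accepted
      th <- rnorm(1, sample(old, 1, prob = oldw), sqrt(tau2))
      if (abs(rnorm(1, th) - x.obs) <= eps[t]) break
    }
    theta[i] <- th
    w[i] <- dnorm(th, 0, 10) /                         # prior over the PMC mixture proposal
            sum(oldw * dnorm(th, old, sqrt(tau2)))
  }
  w <- w / sum(w)
}
sum(w * theta)                                         # weighted ABC-PMC posterior mean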
  • 111. Sequential Monte Carlo SMC is a simulation technique to approximate a sequence of related probability distributions πn with π0 “easy” and πT as target. Iterated IS as PMC: particles moved from time n to time n via kernel Kn and use of a sequence of extended targets πn˜ n πn (z0:n ) = πn (zn ) ˜ Lj (zj+1 , zj ) j=0 where the Lj ’s are backward Markov kernels [check that πn (zn ) is a marginal] [Del Moral, Doucet Jasra, Series B, 2006]
  • 112. Sequential Monte Carlo (2) Algorithm 3 SMC sampler (0) sample zi ∼ γ0 (x) (i = 1, . . . , N) (0) (0) (0) compute weights wi = π0 (zi )/γ0 (zi ) for t = 1 to N do if ESS(w (t−1) ) NT then resample N particles z (t−1) and set weights to 1 end if (t−1) (t−1) generate zi ∼ Kt (zi , ·) and set weights to (t) (t) (t−1) (t) (t−1) πt (zi ))Lt−1 (zi ), zi )) wi = wi−1 (t−1) (t−1) (t) πt−1 (zi ))Kt (zi ), zi )) end for [Del Moral, Doucet Jasra, Series B, 2006]
  • 113. ABC-SMC [Del Moral, Doucet & Jasra, 2009] True derivation of an SMC-ABC algorithm Use of a kernel K_n associated with target π_{ε_n} and derivation of the backward kernel L_{n−1}(z, z′) = π_{ε_n}(z′)K_n(z′, z) / π_{ε_n}(z) Update of the weights w_{in} ∝ w_{i(n−1)} Σ_{m=1}^M I_{A_{ε_n}}(x^m_{in}) / Σ_{m=1}^M I_{A_{ε_{n−1}}}(x^m_{i(n−1)}) when x^m_{in} ∼ K(x_{i(n−1)}, ·)
  • 114. ABC-SMCM Modification: Makes M repeated simulations of the pseudo-data z given the parameter, rather than using a single [M = 1] simulation, leading to a weight proportional to the number of accepted z_i’s, ω(θ) = (1/M) Σ_{i=1}^M I_{ρ(η(y),η(z_i)) ≤ ε} [limit in M means exact simulation from (tempered) target]
  • 115. Properties of ABC-SMC The ABC-SMC method properly uses a backward kernel L(z, z′) to simplify the importance weight and to remove the dependence on the unknown likelihood from this weight. Update of importance weights is reduced to the ratio of the proportions of surviving particles Major assumption: the forward kernel K is supposed to be invariant against the true target [tempered version of the true posterior] Adaptivity in ABC-SMC algorithm only found in on-line construction of the thresholds ε_t, slowly enough to keep a large number of accepted transitions
  • 117. A mixture example (1) Toy model of Sisson et al. (2007): if θ ∼ U(−10, 10), x|θ ∼ 0.5 N(θ, 1) + 0.5 N(θ, 1/100), then the posterior distribution associated with y = 0 is the normal mixture θ|y = 0 ∼ 0.5 N(0, 1) + 0.5 N(0, 1/100) restricted to [−10, 10]. Furthermore, true target available as π(θ | |x| < ε) ∝ Φ(ε − θ) − Φ(−ε − θ) + Φ(10(ε − θ)) − Φ(−10(ε + θ)).
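For reference, an R transcription of this closed-form ABC target (assuming observation y = 0 and tolerance ε on |x|), useful for overlaying on an ABC or ABC-SMC output; the grid and the value of ε are illustrative.

abc.target <- function(theta, eps) {                  # unnormalised pi(theta | |x| < eps), prior U(-10,10)
  (pnorm(eps - theta) - pnorm(-eps - theta) +
   pnorm(10 * (eps - theta)) - pnorm(-10 * (eps + theta))) * (abs(theta) <= 10)
}
th <- seq(-3, 3, length = 401)
f  <- abc.target(th, eps = 0.1)
f  <- f / (sum(f) * diff(th)[1])                      # numerical normalisation over the grid
# plot(th, f, type = "l")                             # overlay against an ABC histogram of theta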
  • 118. A mixture example (2) Recovery of the target, whether using a fixed standard deviation of τ = 0.15 or τ = 1/0.15, or a sequence of adaptive τ_t’s. [five density plots over θ ∈ (−3, 3)]
  • 119. ABC inference machine Introduction Approximate Bayesian computation ABC as an inference machine Error inc. Exact BC and approximate targets summary statistic Series B discussion2
  • 120. How much Bayesian? maybe a convergent method of inference (meaningful? sufficient? foreign?) approximation error unknown (w/o simulation) pragmatic Bayes (there is no other solution!) many calibration issues (tolerance, distance, statistics) ...should Bayesians care?!
  • 122. ABCµ Idea Infer about the error as well: Use of a joint density f(θ, ε|y) ∝ ξ(ε|y, θ) × π_θ(θ) × π_ε(ε), where y is the data, and ξ(ε|y, θ) is the prior predictive density of ρ(η(z), η(y)) given θ and y when z ∼ f(z|θ) Warning! Replacement of ξ(ε|y, θ) with a non-parametric kernel approximation. [Ratmann, Andrieu, Wiuf and Richardson, 2009, PNAS]
  • 125. ABCµ details Multidimensional distances ρ_k (k = 1, . . . , K) and errors ε_k = ρ_k(η_k(z), η_k(y)), with ε_k ∼ ξ_k(ε|y, θ) ≈ ξ̂_k(ε|y, θ) = (1/Bh_k) Σ_b K[{ε_k − ρ_k(η_k(z_b), η_k(y))}/h_k], then used in replacing ξ(ε|y, θ) with min_k ξ̂_k(ε|y, θ) ABCµ involves acceptance probability π(θ′, ε′) q(θ′, θ) q(ε′, ε) min_k ξ̂_k(ε′|y, θ′) / [π(θ, ε) q(θ, θ′) q(ε, ε′) min_k ξ̂_k(ε|y, θ)]
  • 127. ABCµ multiple errors [© Ratmann et al., PNAS, 2009]
  • 128. ABCµ for model choice [© Ratmann et al., PNAS, 2009]
  • 129. Questions about ABCµ [and model choice] For each model under comparison, marginal posterior on ε used to assess the fit of the model (HPD includes 0 or not). Is the data informative about ε? [Identifiability] How much does the prior π(ε) impact the comparison? How is using both ξ(ε|x_0, θ) and π_ε(ε) compatible with a standard probability model? [remindful of Wilkinson’s eABC] Where is the penalisation for complexity in the model comparison? [X, Mengersen & Chen, 2010, PNAS]
  • 131. Wilkinson’s exact BC (not exactly!) ABC approximation error (i.e. non-zero tolerance) replaced with exact simulation from a controlled approximation to the target, convolution of true posterior with kernel function π_ε(θ, z|y) = π(θ)f(z|θ)K_ε(y − z) / ∫ π(θ)f(z|θ)K_ε(y − z) dz dθ, with K_ε kernel parameterised by bandwidth ε. [Wilkinson, 2008] Theorem The ABC algorithm based on the assumption of a randomised observation ỹ = y + ξ, ξ ∼ K_ε, and an acceptance probability of K_ε(y − z)/M gives draws from the posterior distribution π(θ|y).
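A minimal R illustration of this result on a conjugate toy model (N(θ, 1) likelihood, N(0, 10) prior with variance 10, uniform kernel K_ε of half-width ε, all illustrative assumptions): with a bounded uniform kernel the acceptance probability K_ε(y − z)/M reduces to acceptance when |y − z| ≤ ε, and the accepted draws can be compared with the exact posterior given the single observation.

set.seed(5)
y.obs <- 1.2
eps   <- 0.3
N     <- 2e5
theta <- rnorm(N, sd = sqrt(10))                 # prior draws
z     <- rnorm(N, mean = theta)                  # pseudo-data z ~ f(z | theta)
abc   <- theta[abs(z - y.obs) <= eps]            # uniform-kernel ABC = exact BC for y + xi, xi ~ U(-eps, eps)
c(mean(abc), sd(abc))
c(10 * y.obs / 11, sqrt(10 / 11))                # exact N(theta,1)/N(0,10) posterior mean and sd, for reference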
  • 133. How exact a BC? “Using to represent measurement error is straightforward, whereas using to model the model discrepancy is harder to conceptualize and not as commonly used” [Richard Wilkinson, 2008]
  • 134. How exact a BC? Pros Pseudo-data from true model and observed data from noisy model Interesting perspective in that outcome is completely controlled Link with ABCµ and assuming y is observed with a measurement error with density K Relates to the theory of model approximation [Kennedy O’Hagan, 2001] Cons Requires K to be bounded by M True approximation error never assessed Requires a modification of the standard ABC algorithm
  • 135. ABC for HMMs Specific case of a hidden Markov model X_{t+1} ∼ Q_θ(X_t, ·), Y_{t+1} ∼ g_θ(·|x_t), where only y⁰_{1:n} is observed. [Dean, Singh, Jasra & Peters, 2011] Use of specific constraints, adapted to the Markov structure: y_1 ∈ B(y⁰_1, ε) × · · · × y_n ∈ B(y⁰_n, ε)
  • 137. ABC-MLE for HMMs ABC-MLE defined by θ̂_n = arg max_θ P_θ(Y_1 ∈ B(y⁰_1, ε), . . . , Y_n ∈ B(y⁰_n, ε)) Exact MLE for the likelihood p^ε_θ(y⁰_1, . . . , y⁰_n) corresponding to the perturbed process (x_t, y_t + εz_t)_{1≤t≤n}, z_t ∼ U(B(0, 1)) — same basis as Wilkinson! [Dean, Singh, Jasra & Peters, 2011]
  • 139. ABC-MLE is biased ABC-MLE is asymptotically (in n) biased with target l_ε(θ) = E_{θ*}[log p^ε_θ(Y_1|Y_{−∞:0})] but ABC-MLE converges to the true value in the sense l^{ε_n}(θ_n) → l(θ) for all sequences (θ_n) converging to θ and ε_n → 0
  • 141. Noisy ABC-MLE Idea: Modify instead the data from the start 0 (y1 + ζ1 , . . . , yn + ζn ) [ see Fearnhead-Prangle ] noisy ABC-MLE estimate 0 0 arg max Pθ Y1 ∈ B(y1 + ζ1 , ), . . . , Yn ∈ B(yn + ζn , ) θ [Dean, Singh, Jasra, Peters, 2011]
  • 142. Consistent noisy ABC-MLE Degrading the data improves the estimation performances: Noisy ABC-MLE is asymptotically (in n) consistent under further assumptions, the noisy ABC-MLE is asymptotically normal increase in variance of order ε^{−2} likely degradation in precision or computing time due to the lack of summary statistic [curse of dimensionality]
  • 143. SMC for ABC likelihood Algorithm 4 SMC ABC for HMMs Given θ for k = 1, . . . , n do generate proposals (x¹_k, y¹_k), . . . , (x^N_k, y^N_k) from the model weigh each proposal with ω^l_k = I_{B(y⁰_k + ζ_k, ε)}(y^l_k) renormalise the weights and sample the x^l_k’s accordingly end for approximate the likelihood by Π_{k=1}^n ( Σ_{l=1}^N ω^l_k / N ) [Jasra, Singh, Martin & McCoy, 2010]
  • 144. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistic Starting from a large collection of available summary statistics, Joyce and Marjoram (2008) consider their sequential inclusion into the ABC target, with a stopping rule based on a likelihood ratio test Not taking into account the sequential nature of the tests Depends on parameterisation Order of inclusion matters likelihood ratio test?!
  • 147. Which summary for model choice? Depending on the choice of η(·), the Bayes factor based on this insufficient statistic, B^η_{12}(y) = ∫ π_1(θ_1) f^η_1(η(y)|θ_1) dθ_1 / ∫ π_2(θ_2) f^η_2(η(y)|θ_2) dθ_2, is consistent or not. [X, Cornuet, Marin & Pillai, 2012] Consistency only depends on the range of E_i[η(y)] under both models. [Marin, Pillai, X & Rousseau, 2012]
  • 149. Semi-automatic ABC Fearnhead and Prangle (2010) study ABC and the selection of the summary statistic in close proximity to Wilkinson’s proposal ABC then considered from a purely inferential viewpoint and calibrated for estimation purposes Use of a randomised (or ‘noisy’) version of the summary statistics η̃(y) = η(y) + τε Derivation of a well-calibrated version of ABC, i.e. an algorithm that gives proper predictions for the distribution associated with this randomised summary statistic [calibration constraint: ABC approximation with same posterior mean as the true randomised posterior]
  • 151. Summary [of FP/statistics] Optimality of the posterior expectation E[θ|y] of the parameter of interest as summary statistics η(y)! Use of the standard quadratic loss function (θ − θ_0)^T A (θ − θ_0).
  • 153. Details on Fearnhead and Prangle (FP) ABC Use of a summary statistic S(·), an importance proposal g(·), a kernel K(·) ≤ 1 and a bandwidth h > 0 such that (θ, y_sim) ∼ g(θ)f(y_sim|θ) is accepted with probability (hence the bound) K[{S(y_sim) − s_obs}/h] and the corresponding importance weight defined by π(θ)/g(θ) [Fearnhead & Prangle, 2012]
  • 154. Errors, errors, and errors Three levels of approximation: π(θ|y_obs) by π(θ|s_obs), loss of information [ignored]; π(θ|s_obs) by π_ABC(θ|s_obs) = ∫ π(s)K[{s − s_obs}/h]π(θ|s) ds / ∫ π(s)K[{s − s_obs}/h] ds, noisy observations; π_ABC(θ|s_obs) by importance Monte Carlo based on N simulations, represented by var(a(θ)|s_obs)/N_acc [expected number of acceptances]
  • 155. Average acceptance asymptotics For the average acceptance probability/approximate likelihood p(θ|s_obs) = ∫ f(y_sim|θ) K[{S(y_sim) − s_obs}/h] dy_sim, the overall acceptance probability p(s_obs) = ∫ p(θ|s_obs) π(θ) dθ = π(s_obs)h^d + o(h^d) [FP, Lemma 1]
  • 156. Optimal importance proposal Best choice of importance proposal in terms of effective sample size g(θ|s_obs) ∝ π(θ) p(θ|s_obs)^{1/2} [Not particularly useful in practice] note that p(θ|s_obs) is an approximate likelihood reminiscent of parallel tempering could be approximately achieved by attrition of half of the data
  • 158. Calibration of h “This result gives insight into how S(·) and h affect the Monte Carlo error. To minimize Monte Carlo error, we need h^d to be not too small. Thus ideally we want S(·) to be a low dimensional summary of the data that is sufficiently informative about θ that π(θ|s_obs) is close, in some sense, to π(θ|y_obs)” (FP, p.5) turns h into an absolute value while it should be context-dependent and user-calibrated only addresses one term in the approximation error and acceptance probability (“curse of dimensionality”) h large prevents π_ABC(θ|s_obs) from being close to π(θ|s_obs) d small prevents π(θ|s_obs) from being close to π(θ|y_obs) (“curse of [dis]information”)
  • 159. Calibrating ABC “If π_ABC is calibrated, then this means that probability statements that are derived from it are appropriate, and in particular that we can use π_ABC to quantify uncertainty in estimates” (FP, p.5) Definition For 0 < q < 1 and subset A, event E_q(A) made of s_obs such that Pr_ABC(θ ∈ A|s_obs) = q. Then ABC is calibrated if Pr(θ ∈ A|E_q(A)) = q unclear meaning of conditioning on E_q(A)
  • 161. Calibrated ABC Theorem (FP) Noisy ABC, where s_obs = S(y_obs) + hε, ε ∼ K(·), is calibrated [Wilkinson, 2008] no condition on h!!
  • 162. Calibrated ABC Consequence: when h = ∞ Theorem (FP) The prior distribution is always calibrated Is this a relevant property then?
  • 163. More questions about calibrated ABC “Calibration is not universally accepted by Bayesians. It is even more questionable here as we care how statements we make relate to the real world, not to a mathematically defined posterior.” R. Wilkinson Same reluctance about the prior being calibrated Property depending on prior, likelihood, and summary Calibration is a frequentist property (almost a p-value!) More sensible to account for the simulator’s imperfections than using noisy-ABC against a meaningless based measure [Wilkinson, 2012]
  • 164. Converging ABC Theorem (FP) For noisy ABC, the expected noisy-ABC log-likelihood, E{log[p(θ|s_obs)]} = ∫∫ log[p(θ|S(y_obs) + ε)] π(y_obs|θ_0) K(ε) dy_obs dε, has its maximum at θ = θ_0. True for any choice of summary statistic? even ancillary statistics?! [Imposes at least identifiability...] Relevant in asymptotia and not for the data
  • 165. Converging ABC Corollary For noisy ABC, the ABC posterior converges onto a point mass on the true parameter value as m → ∞. For standard ABC, not always the case (unless h goes to zero). Strength of regularity conditions (c1) and (c2) in Bernardo & Smith, 1994? [out-of-reach constraints on likelihood and posterior] Again, there must be conditions imposed upon summary statistics...
  • 166. Loss motivated statistic Under quadratic loss function, Theorem (FP) (i) The minimal posterior error E[L(θ, θ̂)|y_obs] occurs when θ̂ = E(θ|y_obs) (!) (ii) When h → 0, E_ABC(θ|s_obs) converges to E(θ|y_obs) (iii) If S(y_obs) = E[θ|y_obs] then for θ̂ = E_ABC[θ|s_obs], E[L(θ, θ̂)|y_obs] = trace(AΣ) + h² ∫ x^T A x K(x) dx + o(h²). measure-theoretic difficulties? dependence of s_obs on h makes me uncomfortable inherent to noisy ABC Relevant for choice of K?
  • 167. Optimal summary statistic “We take a different approach, and weaken the requirement for πABC to be a good approximation to π(θ|yobs ). We argue for πABC to be a good approximation solely in terms of the accuracy of certain estimates of the parameters.” (FP, p.5) From this result, FP derive their choice of summary statistic, S(y) = E(θ|y) [almost sufficient] suggest h = O(N −1/(2+d) ) and h = O(N −1/(4+d) ) as optimal bandwidths for noisy and standard ABC.
  • 168. Optimal summary statistic “We take a different approach, and weaken the requirement for πABC to be a good approximation to π(θ|yobs ). We argue for πABC to be a good approximation solely in terms of the accuracy of certain estimates of the parameters.” (FP, p.5) From this result, FP derive their choice of summary statistic, S(y) = E(θ|y) [wow! EABC [θ|S(yobs )] = E[θ|yobs ]] suggest h = O(N −1/(2+d) ) and h = O(N −1/(4+d) ) as optimal bandwidths for noisy and standard ABC.
  • 169. Caveat Since E(θ|y_obs) is most usually unavailable, FP suggest (i) use a pilot run of ABC to determine a region of non-negligible posterior mass; (ii) simulate sets of parameter values and data; (iii) use the simulated sets of parameter values and data to estimate the summary statistic; and (iv) run ABC with this choice of summary statistic. where is the assessment of the first stage error?
  • 171. Approximating the summary statistic As Beaumont et al. (2002) and Blum and François (2010), FP use a linear regression to approximate E(θ|y_obs): θ^(i) = β_0 + β f(y^(i)) + ε_i, with f made of standard transforms [Further justifications?]
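A schematic R version of this construction for a scalar parameter: a pilot set of (θ, y) simulations, a linear regression of θ on transforms f(y) to approximate E[θ|y], and the fitted combination used as the one-dimensional summary in a second ABC run. The data-generating model, the transforms in f(·), the pilot region and all sample sizes are illustrative assumptions, not the FP recommendations.

set.seed(6)
simdata <- function(theta, n = 30) rnorm(n, theta, exp(theta / 2))        # hypothetical model
y.obs   <- simdata(0.5)
f       <- function(y) c(mean(y), sd(y), mean(y)^2, quantile(y, c(.1, .9)))  # candidate transforms
pilot.t <- runif(5000, -2, 2)                                             # pilot region for theta
pilot.F <- t(sapply(pilot.t, function(th) f(simdata(th))))
reg     <- lm(pilot.t ~ pilot.F)                                          # E[theta|y] approx beta0 + beta' f(y)
S       <- function(y) sum(coef(reg) * c(1, f(y)))                        # semi-automatic summary S(y)
s.obs   <- S(y.obs)
theta   <- runif(5000, -2, 2)                                             # second, plain ABC run with S(y)
s.sim   <- sapply(theta, function(th) S(simdata(th)))
keep    <- abs(s.sim - s.obs) <= quantile(abs(s.sim - s.obs), 0.01)
mean(theta[keep])                                                         # ABC posterior mean under S(y)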
  • 172. [my]questions about semi-automatic ABC dependence on h and S(·) in the early stage reduction of Bayesian inference to point estimation approximation error in step (i) not accounted for not parameterisation invariant practice shows that proper approximation to genuine posterior distributions stems from using a (much) larger number of summary statistics than the dimension of the parameter the validity of the approximation to the optimal summary statistic depends on the quality of the pilot run important inferential issues like model choice are not covered by this approach. [X, 2012]
  • 173. More about semi-automatic ABC [End of section derived from comments on the Read Paper, just appeared] “The apparently arbitrary nature of the choice of summary statistics has always been perceived as the Achilles heel of ABC.” M. Beaumont “Curse of dimensionality” linked with the increase of the dimension of the summary statistic Connection with principal component analysis [Itan et al., 2010] Connection with partial least squares [Wegmann et al., 2009] Beaumont et al. (2002) postprocessed output is used as input by FP to run a second ABC
  • 175. Wood’s alternative Instead of a non-parametric kernel approximation to the likelihood, (1/R) Σ_r K{η(y_r) − η(y_obs)}, Wood (2010) suggests a normal approximation η(y(θ)) ∼ N_d(µ_θ, Σ_θ) whose parameters can be approximated based on the R simulations (for each value of θ). Parametric versus non-parametric rate [Uh?!] Automatic weighting of components of η(·) through Σ_θ Dependence on normality assumption (pseudo-likelihood?) [Cornebise, Girolami & Kosmidis, 2012]
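A small R sketch of Wood's synthetic likelihood for a single parameter value: simulate R replicates, fit a multivariate normal to the simulated summaries, and evaluate the observed summary under that fit. The model (N(θ, 1) with 50 observations), the summaries (mean and variance) and R are illustrative assumptions; in practice the resulting log-likelihood is plugged into an MCMC or optimisation loop over θ.

set.seed(7)
simsum <- function(theta) {                            # hypothetical model and summaries eta(y)
  y <- rnorm(50, theta, 1)
  c(mean(y), var(y))
}
synlik <- function(theta, s.obs, R = 200) {            # synthetic log-likelihood at theta
  S   <- t(replicate(R, simsum(theta)))                # R x d matrix of simulated summaries
  mu  <- colMeans(S); Sig <- cov(S); d <- s.obs - mu
  as.numeric(-0.5 * (determinant(Sig, logarithm = TRUE)$modulus +
                     t(d) %*% solve(Sig, d)))          # log N_d(s.obs; mu, Sig), up to a constant
}
y.obs <- rnorm(50, 1, 1)
s.obs <- c(mean(y.obs), var(y.obs))
sapply(c(0, 0.5, 1, 1.5), synlik, s.obs = s.obs)       # highest near the true theta = 1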
  • 177. Reinterpretation and extensions Reinterpretation of ABC output as joint simulation from π̄_ε(x, ȳ|θ) = f(x|θ) π̄_{Y|X}(ȳ|x), where π̄_{Y|X}(ȳ|x) = K_ε(ȳ − x) Reinterpretation of noisy ABC: if ȳ|y_obs ∼ π̄_{Y|X}(·|y_obs), then marginally ȳ ∼ π̄_{Y|θ}(·|θ_0), which explains the consistency of Bayesian inference based on ȳ and π̄_ε [Lee, Andrieu & Doucet, 2012]