These are the slides for my conference talk at the 2013 WSC, in the session "Jacob Bernoulli's Ars Conjectandi and the emergence of probability" organised by Adam Jakubowski.
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
1. An [under]view of Monte Carlo methods, from
importance sampling to MCMC, to ABC
(& kudos to Bernoulli)
Christian P. Robert
Université Paris-Dauphine, University of Warwick, & CREST, Paris
2013 WSC, Hong Kong
bayesianstatistics@gmail.com
3. Bernoulli as founding father of Monte Carlo methods
The weak law of large numbers (or Bernoulli’s [Golden] theorem)
provides the justification for Monte Carlo approximations:
if x1, . . . , xn are i.i.d. rv’s with density f ,
\[ \lim_{n\to\infty} \frac{h(x_1) + \cdots + h(x_n)}{n} = \int_{\mathcal{X}} h(x)\, f(x)\,\mathrm{d}x \]
Stigler's Law of Eponymy: Cardano (1501–1576) first stated the result
5. Bernoulli as founding father of Monte Carlo methods
...and indeed
\[ \frac{h(x_1) + \cdots + h(x_n)}{n} \quad\text{converges to}\quad I = \int_{\mathcal{X}} h(x)\, f(x)\,\mathrm{d}x \]
...meaning that provided we can simulate xi ∼ f (·) long and fast
“enough”, the empirical mean will be a good “enough”
approximation to I
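To make the principle concrete, here is a minimal plain Monte Carlo sketch in Python (my own illustration, not part of the original slides), with f the standard normal density and h(x) = x², so that the exact value is I = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(h, sampler, n=100_000):
    """Plain Monte Carlo approximation of I = E_f[h(X)] from n i.i.d. draws x_i ~ f."""
    x = sampler(n)
    return h(x).mean()

# Toy case: f = N(0,1) and h(x) = x^2, so I = E[X^2] = 1.
print(mc_estimate(lambda x: x**2, rng.standard_normal))
```

By Bernoulli's weak law of large numbers, the printed value concentrates around 1 as n grows.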
6. Early implementations of the LLN
While Jakob Bernoulli himself apparently did not engage in simulation, Buffon (1707–1788) resorted to a (not-yet-Monte-Carlo) experiment in 1735 to estimate the value of the Saint Petersburg game (even though he did not perform a similar experiment for estimating π)
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
7. Early implementations of the LLN
While Jakob Bernoulli himself apparently did not engage in simulation, De Forest (1834–1888) found the median of a log-Cauchy distribution, using normal simulations approximated to the second digit (in 1876)
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
8. Early implementations of the LLN
While Jakob Bernoulli himself apparently did not engage in simulation, De Forest was followed closely by the ubiquitous Galton, who used "normal" dice in 1890, after developing the Quincunx, used both for checking the CLT and for simulating from a posterior distribution as early as 1877
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
9. Importance Sampling
When focussing on integral approximation, a very loose principle in that any proposal distribution with pdf q(·) leads to the alternative representation
\[ I = \int_{\mathcal{X}} h(x)\,\{f/q\}(x)\; q(x)\,\mathrm{d}x \]
Principle of importance
Generate an iid sample x_1, . . . , x_n ∼ q(·) and estimate I by
\[ \hat{I}_{IS} = n^{-1} \sum_{i=1}^{n} h(x_i)\,\{f/q\}(x_i) . \]
...provided q is positive on the right set
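Here is a minimal importance-sampling sketch in Python (my own illustration, not from the slides): the target f is N(0,1), the quantity of interest is the tail probability P(X > 4), and the proposal q is N(4,1), shifted towards the region where h is non-zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Target f = N(0,1); I = P(X > 4) = E_f[h(X)] with h the indicator of (4, inf).
# Proposal q = N(4,1) concentrates the draws where h is non-zero.
n = 100_000
x = rng.normal(loc=4.0, scale=1.0, size=n)          # x_i ~ q
w = stats.norm.pdf(x) / stats.norm.pdf(x, loc=4.0)  # importance weights {f/q}(x_i)
i_hat = np.mean((x > 4) * w)                        # n^{-1} sum h(x_i) {f/q}(x_i)
print(i_hat, "exact:", stats.norm.sf(4))            # both around 3.2e-5
```

A plain Monte Carlo estimate with the same n would almost never hit the event {X > 4}.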
12. things aren’t all rosy...
The LLN is not sufficient to justify Monte Carlo methods: if
\[ n^{-1} \sum_{i=1}^{n} h(x_i)\,\{f/q\}(x_i) \]
has an infinite variance, the estimator \hat{I}_{IS} is useless.
[Figure: importance sampling estimation of P(2 ≤ Z ≤ 6), with Z Cauchy and a normal importance distribution, compared with the exact value, 0.095]
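The sketch below is my own reconstruction of the slide's example: the target f is standard Cauchy, the importance distribution q is standard normal (lighter tails than f), and h is the indicator of [2, 6]; the weight f/q explodes in the right tail, so the running estimate of P(2 ≤ Z ≤ 6) is driven by rare draws with enormous weights and behaves erratically:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Target f: standard Cauchy; proposal q: standard normal (too-light tails).
# Estimate P(2 <= Z <= 6) = 0.095 by importance sampling.
n = 500_000
x = rng.standard_normal(n)
w = stats.cauchy.pdf(x) / stats.norm.pdf(x)          # f/q, huge for large |x|
h = (x >= 2) & (x <= 6)
running = np.cumsum(h * w) / np.arange(1, n + 1)     # running estimate, jumps wildly
print(running[-1], "exact:", (np.arctan(6) - np.arctan(2)) / np.pi)
```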
13. The harmonic mean estimator
Bayesian posterior distribution defined as
π(θ|x) = π(θ)L(θ|x)/m(x)
When θ_t ∼ π(θ|x),
\[ \frac{1}{T} \sum_{t=1}^{T} \frac{1}{L(\theta_t|x)} \]
is an unbiased estimator of 1/m(x)
[Gelfand & Dey, 1994; Newton & Raftery, 1994]
Highly hazardous material: Most often leads to an infinite
variance!!!
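A small demonstration of the hazard, on a conjugate normal toy model (my own choice, not in the slides) where the evidence m(x) is available in closed form:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Conjugate toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1),
# so theta | x ~ N(x/2, 1/2) and the evidence is m(x) = N(x; 0, 2).
x, T = 1.5, 100_000
theta = rng.normal(x / 2, np.sqrt(0.5), size=T)      # posterior draws
lik = stats.norm.pdf(x, loc=theta, scale=1.0)        # L(theta_t | x)
m_hat = 1.0 / np.mean(1.0 / lik)                     # harmonic mean estimate of m(x)
print(m_hat, "exact:", stats.norm.pdf(x, scale=np.sqrt(2.0)))
# Rare theta_t with tiny likelihood dominate the sum: the estimator has
# infinite variance in this model, so repeated runs vary wildly.
```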
15. “The Worst Monte Carlo Method Ever”
“The good news is that the Law of Large Numbers guarantees that this
estimator is consistent, i.e., it will very likely be very close to the correct
answer if you use a sufficiently large number of points from the posterior
distribution.
The bad news is that the number of points required for this estimator to
get close to the right answer will often be greater than the number of
atoms in the observable universe. The even worse news is that it’s easy
for people to not realize this, and to naïvely accept estimates that are
nowhere close to the correct value of the marginal likelihood.”
[Radford Neal’s blog, Aug. 23, 2008]
16. Comparison with regular importance sampling
Harmonic mean: Constraint opposed to usual importance sampling constraints: the proposal ϕ(·) must have lighter (rather than fatter) tails than π(·)L(·) for the approximation
\[ 1 \Big/ \frac{1}{T} \sum_{t=1}^{T} \frac{\varphi(\theta_t)}{\pi(\theta_t)\, L(\theta_t)} , \qquad \theta_t \sim \pi(\theta|x), \]
to have a finite variance.
E.g., use finite-support kernels (like Epanechnikov's kernel) for ϕ
18. HPD indicator as ϕ
Use the convex hull of MCMC simulations (θ_t)_{t=1,...,T} corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:
\[ \varphi(\theta) = \frac{10}{T} \sum_{t \in \mathrm{HPD}} \mathbb{I}_{\,d(\theta,\theta_t) \le \epsilon} \]
[X & Wraith, 2009]
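As a rough illustration of the idea, and not the exact construction of X & Wraith (2009), the sketch below replaces the convex-hull HPD region by the interval spanned by the 10% highest-posterior draws, takes ϕ uniform on that interval, and plugs it into the Gelfand–Dey identity on the same conjugate toy model as above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Conjugate toy model again: theta | x ~ N(x/2, 1/2), evidence m(x) = N(x; 0, 2).
x, T = 1.5, 100_000
theta = rng.normal(x / 2, np.sqrt(0.5), size=T)
post_unnorm = stats.norm.pdf(theta) * stats.norm.pdf(x, loc=theta)   # pi(theta) L(theta|x)

# Simplified stand-in for the HPD construction: keep the 10% draws with the highest
# unnormalised posterior value and take phi = uniform over the interval they span.
top = theta[post_unnorm >= np.quantile(post_unnorm, 0.9)]
lo, hi = top.min(), top.max()
phi = np.where((theta >= lo) & (theta <= hi), 1.0 / (hi - lo), 0.0)

# Gelfand-Dey estimator: 1 / mean( phi(theta_t) / [pi(theta_t) L(theta_t|x)] ),
# with finite variance since phi has compact support inside the posterior bulk.
m_hat = 1.0 / np.mean(phi / post_unnorm)
print(m_hat, "exact:", stats.norm.pdf(x, scale=np.sqrt(2.0)))
```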
20. computational jam
In the 1970’s and early 1980’s, theoretical foundations of Bayesian
statistics were sound, but methodology was lagging for lack of
computing tools.
restriction to conjugate priors
limited complexity of models
small sample sizes
The field was desperately in need of a new computing paradigm!
[X & Casella, STS, 2012]
21. MCMC as in Markov Chain Monte Carlo
Notion that i.i.d. simulation is definitely not necessary, all that
matters is the ergodic theorem
Realization that Markov chains could be used in a wide variety of
situations only came to mainstream statisticians with Gelfand and
Smith (1990) despite earlier publications in the statistical literature
like Hastings (1970) and growing awareness in spatial statistics
(Besag, 1986)
Reasons:
lack of computing machinery
lack of background on Markov chains
lack of trust in the practicality of the method
22. pre-Gibbs/pre-Hastings era
Early 1970’s, Hammersley, Clifford, and Besag were working on the
specification of joint distributions from conditional distributions
and on necessary and sufficient conditions for the conditional
distributions to be compatible with a joint distribution.
[Hammersley and Clifford, 1971]
23. pre-Gibbs/pre-Hastings era
“What is the most general form of the conditional
probability functions that define a coherent joint
function? And what will the joint look like?”
[Besag, 1972]
24. Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
Joint distribution of vector associated with a dependence graph
must be represented as a product of functions over the cliques of the graph, i.e., of functions depending only on the components indexed by the labels in the clique.
[Cressie, 1993; Lauritzen, 1996]
25. Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
A probability distribution P with positive and continuous density f
satisfies the pairwise Markov property with respect to an
undirected graph G if and only if it factorizes according to G, i.e.,
(F) ≡ (G)
[Cressie, 1993; Lauritzen, 1996]
26. Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
Under the positivity condition, the joint distribution g satisfies
\[ g(y_1, \ldots, y_p) \propto \prod_{j=1}^{p} \frac{g_{\ell_j}\big(y_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p}\big)}{g_{\ell_j}\big(y'_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p}\big)} \]
for every permutation \ell on {1, 2, . . . , p} and every y' ∈ 𝒴.
[Cressie, 1993; Lauritzen, 1996]
27. Clicking in
After Peskun (1973), MCMC mostly dormant in mainstream
statistical world for about 10 years, then several papers/books
highlighted its usefulness in specific settings:
Geman and Geman (1984)
Besag (1986)
Strauss (1986)
Ripley (Stochastic Simulation, 1987)
Tanner and Wong (1987)
Younes (1988)
28. [Re-]Enters the Gibbs sampler
Geman and Geman (1984), building on
Metropolis et al. (1953), Hastings (1970), and
Peskun (1973), constructed a Gibbs sampler
for optimisation in a discrete image processing
problem with a Gibbs random field without
completion.
Back to Metropolis et al., 1953: the Gibbs
sampler is already in use therein and ergodicity
is proven on the collection of global maxima
30. Removing the jam
In early 1990s, researchers found that Gibbs and then Metropolis -
Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;
Wang & al., 1993, 1994)
generalized linear mixed models (Albert & Chib, 1993)
mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994;
Escobar & West, 1993)
changepoint analysis (Carlin & al., 1992)
point processes (Grenander & Møller, 1994)
&tc
31. Removing the jam
In early 1990s, researchers found that Gibbs and then Metropolis -
Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
genomics (Stephens & Smith, 1993; Lawrence & al., 1993;
Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,
2000)
ecology (George & X, 1992)
variable selection in regression (George & McCulloch, 1993; Green,
1995; Chen & al., 2000)
spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)
longitudinal studies (Lange & al., 1992)
&tc
32. MCMC and beyond
reversible jump MCMC which impacted considerably Bayesian model
choice (Green, 1995)
adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal,
2009)
exact approximations to targets (Tanner & Wong, 1987; Beaumont,
2003; Andrieu & Roberts, 2009)
comp'al stats catching up with comp'al physics: free energy sampling
(e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead,
2011)
sequential Monte Carlo (SMC) for non-sequential problems (Chopin,
2002; Neal, 2001; Del Moral et al 2006)
retrospective sampling
intractability: EP – GIMH – PMCMC – SMC² – INLA
QMC[MC] (Owen, 2011)
33. Particles
Iterating/sequential importance sampling is about as old as Monte
Carlo methods themselves!
[Hammersley and Morton, 1954; Rosenbluth and Rosenbluth, 1955]
Found in the molecular simulation literature of the 50’s with
self-avoiding random walks and signal processing
[Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
et al. (1997) coined the term “particle filter”.
35. pMC & pMCMC
Recycling of past simulations legitimate to build better
importance sampling functions as in population Monte Carlo
[Iba, 2000; Cappé et al., 2004; Del Moral et al., 2007]
synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel \hat{p}_\theta(y_{1:T}) in state space models p(x_{1:T})\, p(y_{1:T}|x_{1:T})
importance sampling on discretely observed diffusions
[Beskos et al., 2006; Fearnhead et al., 2008, 2010]
36. Metropolis-Hastings revisited
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Reinterpretation and
Rao-Blackwellisation
Russian roulette
Approximate Bayesian computation
(ABC)
37. Metropolis Hastings algorithm
1. We wish to approximate
\[ I = \frac{\int h(x)\,\pi(x)\,\mathrm{d}x}{\int \pi(x)\,\mathrm{d}x} = \int h(x)\,\bar\pi(x)\,\mathrm{d}x \]
2. π(x) is known but not ∫ π(x) dx.
3. Approximate I with
\[ \delta = \frac{1}{n} \sum_{t=1}^{n} h(x^{(t)}) \]
where (x^{(t)}) is a Markov chain with limiting distribution π̄.
4. Convergence obtained from the Law of Large Numbers or the CLT for Markov chains.
40. Metropolis–Hastings algorithm
Suppose that x^{(t)} is drawn.
1. Simulate y_t ∼ q(·|x^{(t)}).
2. Set x^{(t+1)} = y_t with probability
\[ \alpha(x^{(t)}, y_t) = \min\left(1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\, \frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right) \]
Otherwise, set x^{(t+1)} = x^{(t)}.
3. α is such that the detailed balance equation is satisfied:
\[ \pi(x)\,q(y|x)\,\alpha(x, y) = \pi(y)\,q(x|y)\,\alpha(y, x) , \]
so that π̄ is the stationary distribution of (x^{(t)}).
The accepted candidates are simulated with the rejection algorithm.
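A minimal random-walk Metropolis–Hastings sketch in Python (my own illustration); the proposal is symmetric, so the q-ratio cancels in the acceptance probability:

```python
import numpy as np

rng = np.random.default_rng(5)

def rw_metropolis_hastings(log_pi, x0, n_iter=50_000, scale=1.0):
    """Random-walk Metropolis-Hastings with a symmetric N(0, scale^2) proposal,
    targeting the (possibly unnormalised) density exp(log_pi)."""
    chain = np.empty(n_iter)
    x, lp = x0, log_pi(x0)
    for t in range(n_iter):
        y = x + scale * rng.standard_normal()
        lp_y = log_pi(y)
        if np.log(rng.uniform()) < lp_y - lp:   # accept with prob min(1, pi(y)/pi(x))
            x, lp = y, lp_y
        chain[t] = x
    return chain

# Toy target: unnormalised N(3, 2^2) density; delta = mean of h(x^(t)) estimates E[X] = 3.
chain = rw_metropolis_hastings(lambda x: -0.5 * ((x - 3.0) / 2.0) ** 2, x0=0.0)
print(chain[1000:].mean())   # close to 3 after discarding a short burn-in
```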
42. Some properties of the HM algorithm
An alternative representation of the estimator δ is
\[ \delta = \frac{1}{n} \sum_{t=1}^{n} h(x^{(t)}) = \frac{1}{n} \sum_{i=1}^{M_n} n_i\, h(z_i) , \]
where
the z_i's are the accepted y_j's,
M_n is the number of accepted y_j's till time n,
n_i is the number of times z_i appears in the sequence (x^{(t)})_t.
43. The "accepted candidates"
\[ \tilde{q}(\cdot|z_i) = \frac{\alpha(z_i, \cdot)\; q(\cdot|z_i)}{p(z_i)} , \]
where p(z_i) = ∫ α(z_i, y) q(y|z_i) dy. To simulate from \tilde{q}(·|z_i):
1. Propose a candidate y ∼ q(·|z_i)
2. Accept with probability
\[ \frac{\tilde{q}(y|z_i)}{q(y|z_i)/p(z_i)} = \alpha(z_i, y) \]
Otherwise, reject it and start again.
This is the transition of the HM algorithm. The transition kernel \tilde{q} enjoys \tilde{\pi} as a stationary distribution:
\[ \tilde{\pi}(x)\,\tilde{q}(y|x) = \tilde{\pi}(y)\,\tilde{q}(x|y) , \]
45. "accepted" Markov chain
Lemma (Douc & X., AoS, 2011)
The sequence (z_i, n_i) satisfies
1. (z_i, n_i)_i is a Markov chain;
2. z_{i+1} and n_i are independent given z_i;
3. n_i is distributed as a geometric random variable with probability parameter
\[ p(z_i) := \int \alpha(z_i, y)\, q(y|z_i)\, \mathrm{d}y ; \qquad (1) \]
4. (z_i)_i is a Markov chain with transition kernel \tilde{Q}(z, dy) = \tilde{q}(y|z)\,dy and stationary distribution \tilde{\pi} such that
\[ \tilde{q}(\cdot|z) \propto \alpha(z, \cdot)\, q(\cdot|z) \quad\text{and}\quad \tilde{\pi}(\cdot) \propto \pi(\cdot)\, p(\cdot) . \]
52. Importance sampling perspective
1. A natural idea:
\[ \delta^* = \frac{\sum_{i=1}^{M_n} h(z_i)/p(z_i)}{\sum_{i=1}^{M_n} 1/p(z_i)} = \frac{\sum_{i=1}^{M_n} \frac{\pi(z_i)}{\tilde\pi(z_i)}\, h(z_i)}{\sum_{i=1}^{M_n} \frac{\pi(z_i)}{\tilde\pi(z_i)}} . \]
2. But p not available in closed form.
3. The geometric n_i is the replacement, an obvious solution that is used in the original Metropolis–Hastings estimate since E[n_i] = 1/p(z_i).
53. The Bernoulli factory
The crude estimate of 1/p(z_i),
\[ n_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\} , \]
can be improved:
Lemma (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i), the quantity
\[ \hat\xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \{1 - \alpha(z_i, y_\ell)\} \]
is an unbiased estimator of 1/p(z_i) whose variance, conditional on z_i, is lower than the conditional variance of n_i, {1 − p(z_i)}/p²(z_i).
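A toy sketch of the estimator (my own illustration, assuming a N(0,1) target and a symmetric random-walk proposal); the infinite sum is truncated once the running product becomes negligible, a practical shortcut rather than the exact finite-horizon device of the next slide:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def alpha(z, y):
    """Metropolis acceptance probability for a N(0,1) target and symmetric proposal."""
    return min(1.0, stats.norm.pdf(y) / stats.norm.pdf(z))

def xi_hat(z, scale=1.0, tol=1e-12, max_terms=10_000):
    """Rao-Blackwellised estimate of 1/p(z): 1 + sum_j prod_{l<=j} (1 - alpha(z, y_l))."""
    total, prod = 1.0, 1.0
    for _ in range(max_terms):
        y = z + scale * rng.standard_normal()
        prod *= 1.0 - alpha(z, y)
        total += prod
        if prod < tol:
            break
    return total

z = 2.0
estimates = np.array([xi_hat(z) for _ in range(2_000)])
print(estimates.mean())   # approximates 1/p(z), the mean number of proposals until acceptance
```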
54. Rao-Blackwellised, for sure?
\[ \hat\xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \{1 - \alpha(z_i, y_\ell)\} \]
1. Infinite sum, but only finitely many non-zero terms with positive probability, since the product vanishes as soon as some α(z_i, y_ℓ) = 1:
\[ \alpha(x^{(t)}, y_t) = \min\left(1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\, \frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right) \]
For example: take a symmetric random walk as a proposal.
2. What if we wish to be sure that the sum is finite?
Finite horizon k version:
\[ \hat\xi_i^k = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le k\wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{\ell = k+1}^{j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\} \]
56. which Bernoulli factory?!
Not the spice warehouse of Leon Bernoulli!
Query:
Given an algorithm delivering iid B(p) rv's, is it possible to derive an algorithm delivering iid B(f(p)) rv's when f is known and p unknown?
[von Neumann, 1951; Keane & O'Brien, 1994]
existence (e.g., impossible for f(p) = min(2p, 1))
condition: for some n,
\[ \min\{f(p), 1 - f(p)\} \ge \min\{p, 1 - p\}^n \]
implementation (polynomial vs. exponential time)
use of sandwiching polynomials/power series
59. Variance improvement
Theorem (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an iid uniform sequence, for any k ≥ 0, the quantity
\[ \hat\xi_i^k = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le k\wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{\ell = k+1}^{j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\} \]
is an unbiased estimator of 1/p(z_i) with an almost surely finite number of terms. Moreover, for k ≥ 1,
\[ \mathbb{V}\big[\hat\xi_i^k \mid z_i\big] = \frac{1 - p(z_i)}{p^2(z_i)} - \frac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)}\; \frac{2 - p(z_i)}{p^2(z_i)}\; \big(p(z_i) - r(z_i)\big) , \]
where p(z_i) := ∫ α(z_i, y) q(y|z_i) dy and r(z_i) := ∫ α²(z_i, y) q(y|z_i) dy.
60. Variance improvement
As a consequence (Douc & X., AoS, 2011),
\[ \mathbb{V}\big[\hat\xi_i \mid z_i\big] \le \mathbb{V}\big[\hat\xi_i^k \mid z_i\big] \le \mathbb{V}\big[\hat\xi_i^0 \mid z_i\big] = \mathbb{V}\big[n_i \mid z_i\big] . \]
61. B motivation for Russian roulette
prior π(θ), data density p(y|θ) = f(y; θ)/Z(θ) with
\[ Z(\theta) = \int f(x; \theta)\,\mathrm{d}x \]
intractable (e.g., Ising spin model, MRF, diffusion processes, networks, &tc)
doubly-intractable posterior follows as
\[ \pi(\theta|y) = p(y|\theta) \times \pi(\theta) \times \frac{1}{Z(y)} = \frac{f(y; \theta)}{Z(\theta)} \times \pi(\theta) \times \frac{1}{Z(y)} \]
where Z(y) = ∫ p(y|θ) π(θ) dθ
both Z(θ) and Z(y) are intractable, with massively different consequences
[thanks to Mark Girolami for his Russian slides!]
63. B motivation for Russian roulette
If Z(θ) is intractable, the Metropolis–Hastings acceptance probability
\[ \alpha(\theta', \theta) = \min\left(1,\ \frac{f(y; \theta')\,\pi(\theta')}{f(y; \theta)\,\pi(\theta)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)} \times \frac{Z(\theta)}{Z(\theta')}\right) \]
is not available
Use instead biased approximations, e.g. pseudo-likelihoods, or plug-in estimates \hat{Z}(\theta'), without sacrificing exactness of MCMC
65. Existing solution
Unbiased plug-in estimate
\[ \frac{Z(\theta)}{Z(\theta')} \approx \frac{f(x; \theta)}{f(x; \theta')} \qquad\text{where}\quad x \sim \frac{f(x; \theta')}{Z(\theta')} \]
[Møller et al., Bka, 2006; Murray et al., 2006]
auxiliary variable method
removes Z(θ)/Z(θ') from the picture
requires simulations from the model (e.g., via perfect sampling)
66. Exact approximate methods
Pseudo-marginal construction that allows for the use of unbiased, positive estimates of the target in the acceptance probability
\[ \alpha(\theta', \theta) = \min\left(1,\ \frac{\hat\pi(\theta'|y)}{\hat\pi(\theta|y)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)}\right) \]
[Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al., 2012]
The transition kernel has an invariant distribution with the exact target density π(θ|y)
69. Infinite series estimator
For each (θ, y), construct rv's {V_θ^{(j)}, j ≥ 0} such that
\[ \hat\pi(\theta, \{V_\theta^{(j)}\} \,|\, y) := \sum_{j=0}^{\infty} V_\theta^{(j)} \]
is a.s. finite with finite expectation
\[ \mathbb{E}\big[\hat\pi(\theta, \{V_\theta^{(j)}\} \,|\, y)\big] = \pi(\theta|y) \]
Introduce a random stopping time τ_θ, such that with ξ := (τ_θ, {V_θ^{(j)}, 0 ≤ j ≤ τ_θ}) the estimate
\[ \hat\pi(\theta, \xi \,|\, y) := \sum_{j=0}^{\tau_\theta} V_\theta^{(j)} \]
satisfies
\[ \mathbb{E}\big[\hat\pi(\theta, \xi \,|\, y) \,\big|\, \{V_\theta^{(j)}, j \ge 0\}\big] = \hat\pi(\theta, \{V_\theta^{(j)}\} \,|\, y) \]
70. Infinite series estimator
Warning: the unbiased estimate \hat\pi(\theta, \xi|y) obtained by this series construction comes with no general guarantee of positivity
73. Russian roulette
Method that requires unbiased truncation of a series
\[ S(\theta) = \sum_{i=0}^{\infty} \phi_i(\theta) \]
Russian roulette is employed extensively in the simulation of neutron scattering and in computer graphics
Assign probabilities {q_j, j ≥ 1}, q_j ∈ (0, 1], and generate U(0, 1) i.i.d. r.v.'s {U_j, j ≥ 1}
Find the first time k ≥ 1 such that U_k > q_k
Russian roulette estimate of S(θ) is
\[ \hat{S}(\theta) = \sum_{j=0}^{k} \phi_j(\theta) \prod_{i=1}^{j-1} q_i^{-1} , \]
If \lim_{n\to\infty} \prod_{j=1}^{n} q_j = 0, the Russian roulette terminates with probability one
[Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
74. Russian roulette
The Russian roulette estimate is unbiased, E[\hat{S}(\theta)] = S(\theta), and its variance is finite under certain known conditions
[Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
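A toy Python sketch of the Russian roulette truncation (my own illustration, with constant continuation probabilities q_j = q and a geometric series standing in for S(θ)):

```python
import numpy as np

rng = np.random.default_rng(7)

def russian_roulette(phi, q=0.7, max_terms=10_000):
    """Unbiased truncation of S = sum_{j>=0} phi(j): term j survives only if the
    first j roulette tests U_i <= q all pass, and is reweighted by 1/q^j."""
    total, weight = phi(0), 1.0          # term j = 0 is always kept
    for j in range(1, max_terms):
        if rng.uniform() > q:            # the roulette kills the series here
            break
        weight /= q                      # inverse of the continuation probability
        total += phi(j) * weight
    return total

# Toy series: S = sum_{j>=0} 0.5^j = 2; the roulette estimates are unbiased for S.
estimates = np.array([russian_roulette(lambda j: 0.5 ** j) for _ in range(50_000)])
print(estimates.mean())   # close to 2
```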
75. towards ever more complexity
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation
(ABC)
76. New challenges
Novel statistical issues that force a different Bayesian answer:
very large datasets
complex or unknown dependence structures with maybe p ≫ n
multiple and involved random effects
missing data structures containing most of the information
sequential structures involving most of the above
77. New paradigm?
“Surprisingly, the confident prediction of the previous
generation that Bayesian methods would ultimately supplant
frequentist methods has given way to a realization that Markov
chain Monte Carlo (MCMC) may be too slow to handle
modern data sets. Size matters because large data sets stress
computer storage and processing power to the breaking point.
The most successful compromises between Bayesian and
frequentist methods now rely on penalization and
optimization.”
[Lange et al., ISR, 2013]
78. New paradigm?
sad reality constraint that
size does matter
focus on much smaller
dimensions and on sparse
summaries
many (fast if non-Bayesian)
ways of producing those
summaries
Bayesian inference can kick
in almost automatically at
this stage
79. Approximate Bayesian computation (ABC)
Case of a well-defined statistical model where the likelihood
function
ℓ(θ|y) = f(y_1, . . . , y_n|θ)
is out of reach!
Empirical approximations to the original
Bayesian inference problem
Degrading the data precision down
to a tolerance ε
Replacing the likelihood with a
non-parametric approximation
Summarising/replacing the data
with insufficient statistics
83. ABC methodology
Bayesian setting: target is π(θ)f (x|θ)
When likelihood f (x|θ) not in closed form, likelihood-free rejection
technique:
Foundation
For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps
jointly simulating
θ' ∼ π(θ) , z ∼ f(z|θ') ,
until the auxiliary variable z is equal to the observed value, z = y, then the selected
θ' ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
86. ABC algorithm
In most implementations, degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
repeat
generate θ' from the prior distribution π(·)
generate z from the likelihood f(·|θ')
until ρ{η(z), η(y)} ≤ ε
set θ_i = θ'
end for
where η(y) defines a (not necessarily sufficient) statistic
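A minimal likelihood-free rejection sampler in Python (my own toy example: a normal mean model, with the sample mean as summary statistic η and ρ the absolute difference):

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy model: y_1..y_n ~ N(theta, 1), prior theta ~ N(0, 10); eta = sample mean.
n, eps, N = 50, 0.1, 500
y_obs = rng.normal(2.0, 1.0, size=n)
eta_obs = y_obs.mean()

samples = []
while len(samples) < N:
    theta = rng.normal(0.0, np.sqrt(10.0))            # theta' ~ prior
    z = rng.normal(theta, 1.0, size=n)                # z ~ f(.|theta')
    if abs(z.mean() - eta_obs) <= eps:                # rho{eta(z), eta(y)} <= eps
        samples.append(theta)

samples = np.array(samples)
print(samples.mean(), samples.std())   # approximates the posterior of theta given eta(y)
```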
87. Comments
role of distance paramount (because ε ≠ 0)
scaling of components of η(y) also capital
ε matters little if ε "small enough"
representative of the "curse of dimensionality"
small is beautiful!, i.e. data as a whole may be weakly informative for ABC
non-parametric method at its core
88. ABC simulation advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density of x's within the vicinity of y...
[Marjoram et al., 2003; Beaumont et al., 2009; Del Moral et al., 2012]
...or by viewing the problem as a conditional density estimation problem and by developing techniques to allow for a larger ε
[Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]
.....or even by including ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
92. ABC as an inference machine
Starting point is summary statistic
η(y), either chosen for computational
realism or imposed by external
constraints
ABC can produce a distribution on the parameter of interest
conditional on this summary statistic η(y)
inference based on ABC may be consistent or not, so it needs
to be validated on its own
the choice of the tolerance level is dictated by both
computational and convergence constraints
94. How Bayesian aBc is..?
At best, ABC approximates π(θ|η(y)):
approximation error unknown (w/o massive simulation)
pragmatic or empirical Bayes (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole Bayesian
picture
the approximation error should also be part of the Bayesian
inference
95. Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, convolution of true posterior with kernel function
\[ \pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)}{\int \pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)\,\mathrm{d}z\,\mathrm{d}\theta} , \]
with K_ε kernel parameterised by bandwidth ε.
[Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation y = ỹ + ξ, ξ ∼ K_ε, and an acceptance probability of
\[ K_\epsilon(y - z)/M \]
gives draws from the posterior distribution π(θ|y).
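A sketch of the noisy ABC variant on the same toy model (my own illustration): the observed summary is first randomised with the kernel, y = ỹ + ξ with ξ ∼ K_ε = N(0, ε²), and proposals are accepted with probability K_ε(y − z)/M, M = K_ε(0):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

n, eps, N = 50, 0.1, 500
y_data = rng.normal(2.0, 1.0, size=n)
y = y_data.mean() + rng.normal(0.0, eps)              # randomised observed summary

samples = []
while len(samples) < N:
    theta = rng.normal(0.0, np.sqrt(10.0))            # theta' ~ prior N(0, 10)
    z = rng.normal(theta, 1.0, size=n).mean()         # simulated summary
    accept_prob = stats.norm.pdf(y - z, scale=eps) / stats.norm.pdf(0.0, scale=eps)
    if rng.uniform() < accept_prob:                   # K_eps(y - z) / K_eps(0)
        samples.append(theta)

print(np.mean(samples), np.std(samples))
```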
98. Which summary?
Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
Loss of statistical information balanced against gain in data
roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation
borrowing tools from data analysis (LDA) and machine learning
[Estoup et al., ME, 2012]
99. Which summary?
Fundamental difficulty of the choice of the summary statistic when there are no non-trivial sufficient statistics [except when done by the experimenters in the field]
may be imposed for external/practical reasons
may gather several non-B point estimates
we can learn about efficient combination
distance can be provided by estimation techniques
100. Which summary for model choice?
‘This is also why focus on model discrimination typically (...)
proceeds by (...) accepting that the Bayes Factor that one obtains
is only derived from the summary statistics and may in no way
correspond to that of the full model.’
[S. Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this insufficient statistic,
\[ B_{12}^{\eta}(y) = \frac{\int \pi_1(\theta_1)\, f_1^{\eta}(\eta(y)|\theta_1)\,\mathrm{d}\theta_1}{\int \pi_2(\theta_2)\, f_2^{\eta}(\eta(y)|\theta_2)\,\mathrm{d}\theta_2} , \]
is either consistent or not
[X et al., PNAS, 2012]
101. Which summary for model choice?
[Figure: boxplots of the ABC approximations under the Gauss and Laplace models, n = 100]
102. Selecting proper summaries
Consistency only depends on the range of
\[ \mu_i(\theta) = \mathbb{E}_i[\eta(y)] \]
under both models against the asymptotic mean µ_0 of η(y)
Theorem
If P^n belongs to one of the two models and if µ_0 cannot be attained by the other one:
\[ 0 = \min\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big) < \max\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big) , \]
then the Bayes factor B^η_{12} is consistent
[Marin et al., JRSS B, 2013]
103. Selecting proper summaries
[Figure: boxplots of the ABC approximations under models M1 and M2]
[Marin et al., JRSS B, 2013]