1. MCMC and likelihood-free methods Part/day I: Markov chain methods
MCMC and likelihood-free methods
Part/day I: Markov chain methods
Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
Monash University, EBS, July 18, 2012
2. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Motivations and leading example
Computational issues in Bayesian
statistics
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Population Monte Carlo
5. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
abc of Bayesian perspective
What is Bayesian statistics?
Statistical model defined by a likelihood function
$$f(x_1, \ldots, x_n \mid \theta) = L(\theta \mid x_1, \ldots, x_n)$$
[inversion of what varies]
Bayesian approach turns the likelihood
into a conditional density:
$$\pi(\theta \mid x_1, \ldots, x_n) \propto \pi(\theta)\, L(\theta \mid x_1, \ldots, x_n)$$
using a reference measure (or a prior) $\pi(\theta)$
[Thomas Bayes, 1701–1761]
7. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
abc of Bayesian perspective
New perspective
Uncertainty on the parameters θ of a model modeled through
a probability distribution π on Θ, called prior distribution
Inference processed through distribution of θ conditional on x,
π(θ|x), called posterior distribution
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\, d\theta}.$$
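When θ is one-dimensional, the posterior formula can be illustrated by brute force on a grid. The sketch below is only an illustration: the Bernoulli likelihood, the uniform prior, the data (7 successes, 3 failures) and the helper name `grid_posterior` are assumptions that do not come from the slides.

```python
def grid_posterior(successes, failures, grid_size=1000):
    """Approximate pi(theta|x) on a regular grid of (0, 1).

    Illustrative assumptions: Bernoulli likelihood and uniform prior.
    """
    step = 1.0 / grid_size
    thetas = [(i + 0.5) * step for i in range(grid_size)]  # cell midpoints
    # Numerator f(x|theta) * pi(theta), with pi(theta) = 1 on (0, 1)
    unnorm = [t ** successes * (1.0 - t) ** failures for t in thetas]
    # Denominator: midpoint-rule approximation of the integral over theta
    evidence = sum(unnorm) * step
    density = [u / evidence for u in unnorm]
    return thetas, density, step


thetas, density, step = grid_posterior(successes=7, failures=3)
posterior_mean = sum(t * d for t, d in zip(thetas, density)) * step
print(posterior_mean)  # close to 2/3, the mean of the exact Beta(8, 4) posterior
```

The grid approach stops being practical as soon as θ has more than a few dimensions, which is one motivation for the simulation methods discussed later.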
8. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
abc of Bayesian perspective
Justifications
Semantic drift from unknown to random
Updating of the information on θ by extracting the
information on θ contained in the observation x
Allows incorporation of imperfect information in the decision
process
Unique mathematical way to condition upon the observations
(conditional perspective)
Penalization factor
13. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
abc of Bayesian perspective
Posterior distribution
π(θ|x) central to Bayesian inference
Operates conditional upon the observations
Incorporates the requirement of the Likelihood Principle
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ,
independent of the order in which i.i.d. observations are
collected
Provides a complete inferential scope
16. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Latent structures make life harder!
Even simple models may lead to computational complications, as
in latent variable models
$$f(x \mid \theta) = \int f(x, x^\star \mid \theta)\, dx^\star$$
If $(x, x^\star)$ observed, fine!
If only x observed, trouble!
18. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
example: mixture models
Models of mixtures of distributions:
$$X \sim f_j \quad \text{with probability } p_j,$$
for $j = 1, 2, \ldots, k$, with overall density
$$X \sim p_1 f_1(x) + \cdots + p_k f_k(x).$$
For a sample of independent random variables $(X_1, \ldots, X_n)$,
sample density
$$\prod_{i=1}^{n} \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}.$$
19. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
example: mixture models
Expanding this product of sums into a sum of products involves $k^n$
elementary terms: prohibitive to compute in large samples.
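The count of $k^n$ terms can be checked on a toy case. The sketch below evaluates the sample density both as a product of sums and as a sum over all allocations of observations to components. The data, weights and means are made up for illustration, and unit-variance normal components are an assumption.

```python
import itertools
import math


def normal_pdf(x, mu, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_product(xs, weights, means):
    """Product over observations of the mixture density: O(n k) operations."""
    out = 1.0
    for x in xs:
        out *= sum(p * normal_pdf(x, m) for p, m in zip(weights, means))
    return out


def likelihood_expanded(xs, weights, means):
    """Same quantity as a sum over all k**n allocations of observations to components."""
    k = len(weights)
    total, n_terms = 0.0, 0
    for alloc in itertools.product(range(k), repeat=len(xs)):
        term = 1.0
        for x, j in zip(xs, alloc):
            term *= weights[j] * normal_pdf(x, means[j])
        total += term
        n_terms += 1
    return total, n_terms


xs = [-0.3, 0.1, 2.4, 1.9, 0.7, 2.8]      # made-up observations
weights, means = [0.3, 0.7], [0.0, 2.5]   # made-up mixture parameters
direct = likelihood_product(xs, weights, means)
expanded, n_terms = likelihood_expanded(xs, weights, means)
print(n_terms, direct, expanded)
```

The product form is cheap to evaluate at a given parameter value; the exponential cost appears when the expansion itself is needed, as in closed-form posterior expressions.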
20. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Simple mixture (1)
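[Figure: plot over the $(\mu_1, \mu_2)$ plane, horizontal axis $\mu_1$ and vertical axis $\mu_2$, both from −1 to 3]
Case of the $0.3\,\mathcal{N}(\mu_1, 1) + 0.7\,\mathcal{N}(\mu_2, 1)$ likelihood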
21. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Simple mixture (2)
For mixture of two normal distributions,
$$0.3\,\mathcal{N}(\mu_1, 1) + 0.7\,\mathcal{N}(\mu_2, 1),$$
likelihood proportional to
$$\prod_{i=1}^{n} \left[0.3\,\varphi(x_i - \mu_1) + 0.7\,\varphi(x_i - \mu_2)\right]$$
containing $2^n$ terms once expanded.
23. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Complex maximisation
Standard maximization techniques often fail to find the global
maximum because of multimodality or undesirable behavior
(usually at the frontier of the domain) of the likelihood function.
Example
In the special case
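$$f(x \mid \mu, \sigma) = (1 - \epsilon) \exp\{(-1/2)x^2\} + \frac{\epsilon}{\sigma} \exp\{(-1/2\sigma^2)(x - \mu)^2\}$$
with $\epsilon > 0$ known, whatever $n$, the likelihood is unbounded:
$$\lim_{\sigma \to 0} L(x_1, \ldots, x_n \mid \mu = x_1, \sigma) = \infty$$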
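The divergence can be observed numerically by fixing µ at the first observation and letting σ shrink. This is a sketch under assumptions: the mixing weight is written `eps` and set to 0.1, and the five observations are made up.

```python
import math


def log_likelihood(xs, mu, sigma, eps=0.1):
    """Log-likelihood of (1 - eps) exp(-x^2/2) + (eps / sigma) exp(-(x - mu)^2 / (2 sigma^2))."""
    total = 0.0
    for x in xs:
        dens = (1.0 - eps) * math.exp(-0.5 * x * x) \
            + (eps / sigma) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        total += math.log(dens)
    return total


xs = [0.8, -1.2, 0.3, 2.1, -0.5]           # made-up observations
sigmas = [1.0, 1e-1, 1e-2, 1e-4, 1e-8]     # sigma decreasing towards 0
values = [log_likelihood(xs, mu=xs[0], sigma=s) for s in sigmas]
for s, v in zip(sigmas, values):
    print(s, v)                            # the log-likelihood keeps increasing
```

Only the term attached to the first observation grows (like log(eps/σ)); the other terms stay bounded below thanks to the first mixture component, so the sum diverges.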
27. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Mixture once again
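Observations from
$$x_1, \ldots, x_n \sim f(x \mid \theta) = p\,\varphi(x; \mu_1, \sigma_1) + (1 - p)\,\varphi(x; \mu_2, \sigma_2)$$
Prior
$$\mu_i \mid \sigma_i \sim \mathcal{N}(\xi_i, \sigma_i^2/n_i), \quad \sigma_i^2 \sim \mathcal{IG}(\nu_i/2, s_i^2/2), \quad p \sim \mathcal{B}e(\alpha, \beta)$$
Posterior
$$\pi(\theta \mid x_1, \ldots, x_n) \propto \prod_{j=1}^{n} \{p\,\varphi(x_j; \mu_1, \sigma_1) + (1 - p)\,\varphi(x_j; \mu_2, \sigma_2)\}\, \pi(\theta) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, \pi(\theta \mid (k_t))$$
[$O(2^n)$]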
28. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Mixture once again (cont’d)
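For a given permutation $(k_t)$, with $\ell$ the number of observations allocated to the first component, conditional posterior distribution
$$\pi(\theta \mid (k_t)) = \mathcal{N}\!\left(\xi_1(k_t), \frac{\sigma_1^2}{n_1 + \ell}\right) \times \mathcal{IG}((\nu_1 + \ell)/2,\, s_1(k_t)/2)$$
$$\times\, \mathcal{N}\!\left(\xi_2(k_t), \frac{\sigma_2^2}{n_2 + n - \ell}\right) \times \mathcal{IG}((\nu_2 + n - \ell)/2,\, s_2(k_t)/2)$$
$$\times\, \mathcal{B}e(\alpha + \ell,\, \beta + n - \ell)$$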
30. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Latent variables
Mixture once again
Bayes estimator of $\theta$:
$$\delta^\pi(x_1, \ldots, x_n) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, \mathbb{E}^\pi[\theta \mid x, (k_t)]$$
Too costly: $2^n$ terms
32. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The AR(p) model
AR(p) model
Auto-regressive representation of a time series,
$$x_t \mid x_{t-1}, \ldots \sim \mathcal{N}\!\left(\mu + \sum_{i=1}^{p} \varrho_i (x_{t-i} - \mu),\ \sigma^2\right)$$
Generalisation of AR(1)
Among the most commonly used models in dynamic settings
More challenging than the static models (stationarity
constraints)
Different models depending on the processing of the starting
value x0
33. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The AR(p) model
Unwieldy stationarity constraints
Practical difficulty: for complex models, stationarity constraints get
quite involved to the point of being unknown in some cases
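Example (AR(1))
Case of linear Markovian dependence on the last value
$$x_t = \mu + \varrho(x_{t-1} - \mu) + \epsilon_t, \quad \epsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$$
If $|\varrho| < 1$, $(x_t)_{t \in \mathbb{Z}}$ can be written as
$$x_t = \mu + \sum_{j=0}^{\infty} \varrho^j \epsilon_{t-j}$$
and this is a stationary representation.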
35. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The AR(p) model
Stationary but...
If $|\varrho| > 1$, alternative stationary representation
$$x_t = \mu - \sum_{j=1}^{\infty} \varrho^{-j} \epsilon_{t+j}.$$
This stationary solution is criticized as artificial because $x_t$ is
correlated with future white noises $(\epsilon_s)_{s>t}$, unlike the case when
$|\varrho| < 1$.
Non-causal representation...
36. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The AR(p) model
Stationarity+causality
Stationarity constraints in the prior as a restriction on the values of
θ.
Theorem
AR(p) model second-order stationary and causal iff the roots of the
polynomial
$$\mathcal{P}(x) = 1 - \sum_{i=1}^{p} \varrho_i x^i$$
are all outside the unit circle
38. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The AR(p) model
Stationarity constraints
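Under stationarity constraints, complex parameter space: each
value of $\varrho$ needs to be checked for roots of corresponding
polynomial with modulus less than 1
E.g., for an AR(2) process with autoregressive polynomial
$\mathcal{P}(u) = 1 - \varrho_1 u - \varrho_2 u^2$, constraint is
$$\varrho_1 + \varrho_2 < 1, \quad \varrho_2 - \varrho_1 < 1 \quad \text{and} \quad |\varrho_2| < 1$$
[Figure: stationarity region in the $(\theta_1, \theta_2)$ plane, with $\theta_1$ from −2 to 2 and $\theta_2$ from −1 to 1]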
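For AR(2), the root condition of the theorem can be compared directly with the triangle of inequalities. The sketch below is illustrative: the function names and the test points are assumptions, and the two coefficients are called `rho1` and `rho2`. The point (−1.5, 0) shows why the second inequality has to read rho2 − rho1 < 1: its single root −2/3 lies inside the unit circle.

```python
import cmath


def ar2_roots(rho1, rho2):
    """Roots of P(u) = 1 - rho1*u - rho2*u**2 (empty list when P is constant)."""
    if rho2 == 0.0:
        return [] if rho1 == 0.0 else [complex(1.0 / rho1)]
    disc = cmath.sqrt(rho1 ** 2 + 4.0 * rho2)
    return [(rho1 + disc) / (-2.0 * rho2), (rho1 - disc) / (-2.0 * rho2)]


def is_causal_stationary(rho1, rho2):
    """Root condition: every root of P lies strictly outside the unit circle."""
    return all(abs(u) > 1.0 for u in ar2_roots(rho1, rho2))


def in_triangle(rho1, rho2):
    """Equivalent closed-form constraint for p = 2."""
    return rho1 + rho2 < 1.0 and rho2 - rho1 < 1.0 and abs(rho2) < 1.0


points = [(0.5, 0.3), (0.5, 0.6), (0.0, -0.5), (1.5, -0.8), (-1.5, 0.0), (0.0, 1.2)]
checks = [(is_causal_stationary(a, b), in_triangle(a, b)) for a, b in points]
for point, (by_roots, by_triangle) in zip(points, checks):
    print(point, by_roots, by_triangle)
```

For p > 2 no such simple description of the region is available, and the roots have to be computed for each proposed parameter value.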
39. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
The MA(q) model
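Alternative type of time series
$$x_t = \mu + \epsilon_t - \sum_{j=1}^{q} \vartheta_j \epsilon_{t-j}, \quad \epsilon_t \sim \mathcal{N}(0, \sigma^2)$$
Stationary but, for identifiability considerations, the polynomial
$$\mathcal{Q}(x) = 1 - \sum_{j=1}^{q} \vartheta_j x^j$$
must have all its roots outside the unit circle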
40. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
Identifiability
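Example
For the MA(1) model, $x_t = \mu + \epsilon_t - \vartheta_1 \epsilon_{t-1}$,
$$\mathrm{var}(x_t) = (1 + \vartheta_1^2)\sigma^2$$
can also be written
$$x_t = \mu + \tilde\epsilon_{t-1} - \frac{1}{\vartheta_1}\tilde\epsilon_t, \quad \tilde\epsilon \sim \mathcal{N}(0, \vartheta_1^2\sigma^2),$$
Both pairs $(\vartheta_1, \sigma)$ & $(1/\vartheta_1, \vartheta_1\sigma)$ lead to alternative
representations of the same model.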
41. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
Properties of MA models
Non-Markovian model (but special case of hidden Markov)
Autocovariance γx (s) is null for |s| > q
42. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
Representations
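$x_{1:T}$ is a normal random variable with constant mean $\mu$ and
covariance matrix
$$\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \ldots & \gamma_q & 0 & \ldots & 0 & 0 \\
\gamma_1 & \gamma_0 & \gamma_1 & \ldots & \gamma_{q-1} & \gamma_q & \ldots & 0 & 0 \\
 & & & \ddots & & & & & \\
0 & 0 & 0 & \ldots & 0 & 0 & \ldots & \gamma_1 & \gamma_0
\end{pmatrix},$$
with ($|s| \le q$, and the convention $\vartheta_0 = -1$)
$$\gamma_s = \sigma^2 \sum_{i=0}^{q-|s|} \vartheta_i \vartheta_{i+|s|}$$
so that the diagonal term is $\gamma_0 = \sigma^2 (1 + \vartheta_1^2 + \cdots + \vartheta_q^2)$.
Not manageable in practice [large T's]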
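The autocovariance formula and the banded structure of Σ can be sketched as follows. The function names are illustrative, and the convention $\vartheta_0 = -1$ is an assumption chosen so that the formula matches $x_t = \mu + \epsilon_t - \sum_j \vartheta_j \epsilon_{t-j}$.

```python
def ma_autocovariance(thetas, sigma2, s):
    """gamma_s = sigma^2 * sum_{i=0}^{q-|s|} theta_i * theta_{i+|s|}, with theta_0 = -1."""
    coefs = [-1.0] + list(thetas)          # (theta_0, theta_1, ..., theta_q)
    q, s = len(thetas), abs(s)
    if s > q:
        return 0.0                         # autocovariance is null beyond lag q
    return sigma2 * sum(coefs[i] * coefs[i + s] for i in range(q - s + 1))


def ma_covariance_matrix(thetas, sigma2, T):
    """Dense T x T banded Toeplitz covariance matrix of x_{1:T}."""
    return [[ma_autocovariance(thetas, sigma2, i - j) for j in range(T)]
            for i in range(T)]


# MA(1) with theta_1 = 0.5 and sigma^2 = 2 (made-up values)
gammas = [ma_autocovariance([0.5], 2.0, s) for s in range(3)]
Sigma = ma_covariance_matrix([0.5], 2.0, T=4)
print(gammas)   # [2.5, -1.0, 0.0]
```

Storing the dense matrix costs O(T²) memory and a generic factorisation costs O(T³) time, which is one way to read the remark about large T.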
45. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
Representations (contd.)
Encompassing approach for general time series models
State-space representation
$$x_t = G y_t + \varepsilon_t, \qquad (1)$$
$$y_{t+1} = F y_t + \xi_t, \qquad (2)$$
(1) is the observation equation and (2) is the state equation
Note
This is a special case of hidden Markov model
46. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
MA(q) state-space representation
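For the MA(q) model, take
$$y_t = (\epsilon_{t-q}, \ldots, \epsilon_{t-1}, \epsilon_t)^\top$$
and then
$$y_{t+1} = \begin{pmatrix}
0 & 1 & 0 & \ldots & 0 \\
0 & 0 & 1 & \ldots & 0 \\
 & & & \ddots & \\
0 & 0 & 0 & \ldots & 1 \\
0 & 0 & 0 & \ldots & 0
\end{pmatrix} y_t + \epsilon_{t+1} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}$$
$$x_t = \mu - \begin{pmatrix} \vartheta_q & \vartheta_{q-1} & \ldots & \vartheta_1 & -1 \end{pmatrix} y_t.$$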
47. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
The MA(q) model
MA(q) state-space representation (cont’d)
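Example
For the MA(1) model, observation equation
$$x_t = \begin{pmatrix} 1 & 0 \end{pmatrix} y_t$$
with
$$y_t = (y_{1t}\ \ y_{2t})^\top$$
directed by the state equation
$$y_{t+1} = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} y_t + \epsilon_{t+1} \begin{pmatrix} 1 \\ \vartheta_1 \end{pmatrix}.$$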
53. MCMC and likelihood-free methods Part/day I: Markov chain methods
Computational issues in Bayesian statistics
Typology of problems
A typology of Bayes computational problems
(i). latent variable models in general
(ii). use of a complex parameter space, as for instance in
constrained parameter sets like those resulting from imposing
stationarity constraints in dynamic models;
(iii). use of a complex sampling model with an intractable
likelihood, as for instance in some graphical models;
(iv). use of a huge dataset;
(v). use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);
(vi). use of a particular inferential procedure as for instance, Bayes
factors
$$B^\pi_{01}(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \bigg/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}.$$
54. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
The Metropolis-Hastings Algorithm
Computational issues in Bayesian
statistics
The Metropolis-Hastings Algorithm
The Gibbs Sampler
Population Monte Carlo
55. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Monte Carlo basics
General purpose
A major computational issue in Bayesian statistics:
Given a density $\pi$ known up to a normalizing constant, $\pi \propto \tilde\pi$, and an
integrable function $h$, compute
$$\Pi(h) = \int h(x)\,\pi(x)\,\mu(dx) = \frac{\int h(x)\,\tilde\pi(x)\,\mu(dx)}{\int \tilde\pi(x)\,\mu(dx)}$$
when $\int h(x)\,\tilde\pi(x)\,\mu(dx)$ is intractable.
57. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Monte Carlo basics
Monte Carlo 101
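Generate an iid sample $x_1, \ldots, x_N$ from $\pi$ and estimate $\Pi(h)$ by
$$\hat\Pi^{MC}_N(h) = N^{-1} \sum_{i=1}^{N} h(x_i).$$
LLN: $\hat\Pi^{MC}_N(h) \overset{\text{as}}{\longrightarrow} \Pi(h)$
If $\Pi(h^2) = \int h^2(x)\,\pi(x)\,\mu(dx) < \infty$,
CLT: $\sqrt{N}\left(\hat\Pi^{MC}_N(h) - \Pi(h)\right) \overset{\mathcal{L}}{\rightsquigarrow} \mathcal{N}\left(0,\ \Pi\{[h - \Pi(h)]^2\}\right).$
Caveat leading to MCMC
Often impossible or inefficient to simulate directly from $\Pi$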
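A minimal sketch of the plain Monte Carlo estimator, with a standard error read off the CLT. The target N(0, 1), the function h(x) = x², the sample size and the helper name `monte_carlo` are illustrative assumptions; here Π(h) = 1.

```python
import math
import random


def monte_carlo(h, sampler, n, rng):
    """Plain Monte Carlo estimate of Pi(h) and its CLT-based standard error."""
    values = [h(sampler(rng)) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)   # estimates Pi{[h - Pi(h)]^2}
    return mean, math.sqrt(var / n)


rng = random.Random(0)
estimate, std_err = monte_carlo(
    h=lambda x: x * x,
    sampler=lambda r: r.gauss(0.0, 1.0),   # iid draws from pi = N(0, 1)
    n=20000,
    rng=rng,
)
print(estimate, std_err)
```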
59. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Importance Sampling
Importance Sampling
For $Q$ proposal distribution such that $Q(dx) = q(x)\mu(dx)$,
alternative representation
$$\Pi(h) = \int h(x)\{\pi/q\}(x)\,q(x)\,\mu(dx).$$
Principle of importance (!)
Generate an iid sample $x_1, \ldots, x_N \sim Q$ and estimate $\Pi(h)$ by
$$\hat\Pi^{IS}_{Q,N}(h) = N^{-1} \sum_{i=1}^{N} h(x_i)\{\pi/q\}(x_i).$$
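The importance sampling estimator can be sketched as follows. The target N(0, 1), the wider proposal N(0, 4), h(x) = x² and the function names are illustrative assumptions; both densities are fully normalised here, as this version of the estimator requires.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def importance_sampling(h, target_pdf, proposal_pdf, proposal_sampler, n, rng):
    """N^{-1} sum_i h(x_i) {pi/q}(x_i), with x_i drawn iid from Q."""
    total = 0.0
    for _ in range(n):
        x = proposal_sampler(rng)
        total += h(x) * target_pdf(x) / proposal_pdf(x)
    return total / n


rng = random.Random(0)
estimate = importance_sampling(
    h=lambda x: x * x,
    target_pdf=lambda x: normal_pdf(x, 0.0, 1.0),     # pi: N(0, 1)
    proposal_pdf=lambda x: normal_pdf(x, 0.0, 2.0),   # q: N(0, 4), heavier tails than pi
    proposal_sampler=lambda r: r.gauss(0.0, 2.0),
    n=20000,
    rng=rng,
)
print(estimate)   # should be close to Pi(h) = 1
```

Choosing a proposal with heavier tails than the target keeps the weights π/q bounded, which is one simple way to satisfy the variance condition of the CLT.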
61. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Importance Sampling
Properties of importance
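Then
LLN: $\hat\Pi^{IS}_{Q,N}(h) \overset{\text{as}}{\longrightarrow} \Pi(h)$
and if $Q((h\pi/q)^2) < \infty$,
CLT: $\sqrt{N}\left(\hat\Pi^{IS}_{Q,N}(h) - \Pi(h)\right) \overset{\mathcal{L}}{\rightsquigarrow} \mathcal{N}\left(0,\ Q\{(h\pi/q - \Pi(h))^2\}\right).$
Caveat
If normalizing constant of $\pi$ unknown, impossible to use $\hat\Pi^{IS}_{Q,N}$
Generic problem in Bayesian Statistics: $\pi(\theta|x) \propto f(x|\theta)\pi(\theta)$.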
64. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Importance Sampling
Self-Normalised Importance Sampling
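Self normalized version
$$\hat\Pi^{SNIS}_{Q,N}(h) = \left(\sum_{i=1}^{N} \{\pi/q\}(x_i)\right)^{-1} \sum_{i=1}^{N} h(x_i)\{\pi/q\}(x_i).$$
LLN: $\hat\Pi^{SNIS}_{Q,N}(h) \overset{\text{as}}{\longrightarrow} \Pi(h)$
and if $\Pi((1 + h^2)(\pi/q)) < \infty$,
CLT: $\sqrt{N}\left(\hat\Pi^{SNIS}_{Q,N}(h) - \Pi(h)\right) \overset{\mathcal{L}}{\rightsquigarrow} \mathcal{N}\left(0,\ \Pi\{(\pi/q)(h - \Pi(h))^2\}\right).$
The quality of the SNIS approximation depends on the
choice of Q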
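A sketch of the self-normalised estimator when the target is only known up to a constant. The unnormalised target exp(−x²/2), the N(0, 4) proposal, h(x) = x² and the name `snis` are illustrative assumptions.

```python
import math
import random


def snis(h, unnorm_target, proposal_pdf, proposal_sampler, n, rng):
    """Self-normalised importance sampling: sum_i w_i h(x_i) / sum_i w_i."""
    num = den = 0.0
    for _ in range(n):
        x = proposal_sampler(rng)
        w = unnorm_target(x) / proposal_pdf(x)   # weight, known up to a constant
        num += w * h(x)
        den += w
    return num / den


q_sigma = 2.0
rng = random.Random(0)
estimate = snis(
    h=lambda x: x * x,
    unnorm_target=lambda x: math.exp(-0.5 * x * x),   # N(0, 1) without its constant
    proposal_pdf=lambda x: math.exp(-0.5 * (x / q_sigma) ** 2)
    / (q_sigma * math.sqrt(2.0 * math.pi)),
    proposal_sampler=lambda r: r.gauss(0.0, q_sigma),
    n=20000,
    rng=rng,
)
print(estimate)   # should be close to 1, the variance of N(0, 1)
```

The unknown constant cancels between numerator and denominator, at the price of a small bias of order 1/N.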
66. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (MCMC)
It is not necessary to use a sample from the distribution f to
approximate the integral
$$I = \int h(x) f(x)\, dx,$$
[notation warning: $\pi$ turned to $f$!]
68. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (MCMC)
It is not necessary to use a sample from the distribution f to
approximate the integral
$$I = \int h(x) f(x)\, dx,$$
We can obtain $X_1, \ldots, X_n \sim f$
(approx) without directly simulating
from $f$, using an ergodic Markov
chain with stationary distribution $f$
[Portrait: Andreï Markov]
70. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value $x^{(0)}$, an ergodic chain $(X^{(t)})$ is
generated using a transition kernel with stationary distribution $f$
An irreducible Markov chain with stationary distribution $f$ is
ergodic with limiting distribution $f$ under weak conditions,
hence convergence in distribution of $(X^{(t)})$ to a random
variable from $f$.
For $T_0$ "large enough", $X^{(T_0)}$ is approximately distributed from $f$
The Markov sequence $X^{(T_0)}, X^{(T_0+1)}, \ldots$ is a dependent sample
generated from $f$
Birkhoff's ergodic theorem extends the LLN, sufficient for most
approximation purposes
71. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
Running Monte Carlo via Markov Chains (2)
Problem: How can one build a Markov chain with a given
stationary distribution?
72. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
The Metropolis–Hastings algorithm
Arguments: The algorithm uses the
objective (target) density
$f$
and a conditional density
$q(y|x)$
called the instrumental (or proposal) distribution
[Portrait: Nicholas Metropolis]
73. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
The MH algorithm
Algorithm (Metropolis–Hastings)
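Given $x^{(t)}$,
1. Generate $Y_t \sim q(y|x^{(t)})$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with prob. } 1 - \rho(x^{(t)}, Y_t), \end{cases}$$
where
$$\rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\, \frac{q(x|y)}{q(y|x)},\ 1 \right\}.$$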
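The two steps above can be sketched generically, working on the log scale to avoid overflow. Everything specific in this block is an assumption for illustration: the function names, the target $f(x) \propto x^2 e^{-x}$ on $x > 0$ (a Gamma(3, 1) density, mean 3) and the independent exponential proposal with rate 1/2. The starting value must satisfy $f(x^{(0)}) > 0$.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, x0, n_iter, rng):
    """Generic Metropolis-Hastings sampler.

    log_f(x)       : log target density, up to an additive constant
    propose(x, rng): draws y ~ q(.|x)
    log_q(y, x)    : log q(y|x), up to an additive constant
    """
    x, chain, accepted = x0, [x0], 0
    for _ in range(n_iter):
        y = propose(x, rng)
        # log of f(y) q(x|y) / (f(x) q(y|x))
        log_ratio = log_f(y) + log_q(x, y) - log_f(x) - log_q(y, x)
        rho = math.exp(min(0.0, log_ratio))   # acceptance probability
        if rng.random() < rho:
            x = y
            accepted += 1
        chain.append(x)                       # repeated value when the proposal is rejected
    return chain, accepted / n_iter


def log_f(x):
    return 2.0 * math.log(x) - x if x > 0.0 else -math.inf


def propose(x, rng):
    return rng.expovariate(0.5)               # independent proposal: q(y|x) = q(y)


def log_q(y, x):
    return -0.5 * y                           # log of 0.5 exp(-y/2), constant dropped


rng = random.Random(1)
chain, acc_rate = metropolis_hastings(log_f, propose, log_q, x0=1.0, n_iter=20000, rng=rng)
chain_mean = sum(chain) / len(chain)
print(chain_mean, acc_rate)                   # chain_mean should be near 3
```

Neither `log_f` nor `log_q` includes its normalizing constant, since only ratios enter the acceptance probability.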
74. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Features
Independent of normalizing constants for both $f$ and $q(\cdot|x)$
(i.e., those constants independent of $x$)
Never move to values with $f(y) = 0$
The chain $(x^{(t)})_t$ may take the same value several times in a
row, even though $f$ is a density wrt Lebesgue measure
The sequence $(y_t)_t$ is usually not a Markov chain
77. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties
1. The M-H Markov chain is reversible, with
invariant/stationary density $f$ since it satisfies the detailed
balance condition
$$f(y)\, K(y, x) = f(x)\, K(x, y)$$
2. As $f$ is a probability measure, the chain is positive recurrent
3. If
$$\Pr\left[ \frac{f(Y_t)\, q(X^{(t)}|Y_t)}{f(X^{(t)})\, q(Y_t|X^{(t)})} \ge 1 \right] < 1, \qquad (1)$$
that is, the event $\{X^{(t+1)} = X^{(t)}\}$ is possible, then the chain
is aperiodic
80. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
The Metropolis–Hastings algorithm
Convergence properties (2)
4. If
$$q(y|x) > 0 \text{ for every } (x, y), \qquad (2)$$
the chain is irreducible
5. For M-H, $f$-irreducibility implies Harris recurrence
6. Thus, for M-H satisfying (1) and (2)
(i) For $h$, with $\mathbb{E}_f|h(X)| < \infty$,
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} h(X^{(t)}) = \int h(x)\, df(x) \quad \text{a.e. } f.$$
(ii) and
$$\lim_{n \to \infty} \left\| \int K^n(x, \cdot)\,\mu(dx) - f \right\|_{TV} = 0$$
for every initial distribution $\mu$, where $K^n(x, \cdot)$ denotes the
kernel for $n$ transitions.
81. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
Random walk Metropolis–Hastings
Use of a local perturbation as proposal
$$Y_t = X^{(t)} + \varepsilon_t,$$
where $\varepsilon_t \sim g$, independent of $X^{(t)}$.
The instrumental density is of the form $g(y - x)$ and the Markov
chain is a random walk if we take $g$ to be symmetric, $g(x) = g(-x)$
82. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
Random walk Metropolis–Hastings
Algorithm (Random walk Metropolis)
Given $x^{(t)}$
1. Generate $Y_t \sim g(y - x^{(t)})$
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with prob. } \min\left\{1, \dfrac{f(Y_t)}{f(x^{(t)})}\right\}, \\ x^{(t)} & \text{otherwise.} \end{cases}$$
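A sketch of the random walk version with symmetric normal increments, so that the proposal ratio cancels. The function names, the target (the density 0.3 N(0, 1) + 0.7 N(2.5, 1) in x, with mean 1.75) and the scale 1.5 are assumptions chosen only for illustration.

```python
import math
import random


def random_walk_metropolis(log_f, x0, scale, n_iter, rng):
    """Random walk Metropolis with symmetric N(0, scale^2) increments."""
    x, log_fx, chain = x0, log_f(x0), [x0]
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, scale)          # Y_t = x^(t) + eps_t
        log_fy = log_f(y)
        # accept with probability min{1, f(y) / f(x)}
        if rng.random() < math.exp(min(0.0, log_fy - log_fx)):
            x, log_fx = y, log_fy
        chain.append(x)
    return chain


def log_mixture(x):
    """Log of 0.3 N(0, 1) + 0.7 N(2.5, 1), up to a constant (log-sum-exp form)."""
    a = math.log(0.3) - 0.5 * x * x
    b = math.log(0.7) - 0.5 * (x - 2.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


rng = random.Random(3)
chain = random_walk_metropolis(log_mixture, x0=0.0, scale=1.5, n_iter=30000, rng=rng)
chain_mean = sum(chain) / len(chain)
print(chain_mean)   # should be near 0.3 * 0 + 0.7 * 2.5 = 1.75
```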
83. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
The original example
Example (Random walk and normal target)
Generate $\mathcal{N}(0, 1)$ based on the uniform proposal $[-\delta, \delta]$
The probability of acceptance is then
$$\rho(x^{(t)}, y_t) = \exp\left\{\left((x^{(t)})^2 - y_t^2\right)/2\right\} \wedge 1.$$
[Hastings (1970)]
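This example is short enough to run directly for several values of δ. The sketch below uses 15,000 iterations per value; the seed, the starting point 0 and the function name are assumptions, so the numbers it produces will differ from any reported table.

```python
import math
import random


def hastings_example(delta, n_iter, rng, x0=0.0):
    """Random walk Metropolis for a N(0, 1) target with U[-delta, delta] increments.

    Returns the sample mean, the sample variance and the acceptance rate.
    """
    x, total, total_sq, accepted = x0, 0.0, 0.0, 0
    for _ in range(n_iter):
        y = x + rng.uniform(-delta, delta)
        # acceptance probability exp{(x^2 - y^2) / 2} ∧ 1
        if rng.random() < math.exp(min(0.0, (x * x - y * y) / 2.0)):
            x = y
            accepted += 1
        total += x
        total_sq += x * x
    mean = total / n_iter
    return mean, total_sq / n_iter - mean * mean, accepted / n_iter


rng = random.Random(2012)
results = {d: hastings_example(d, 15000, rng) for d in (0.1, 0.5, 1.0)}
for d, (mean, var, acc) in results.items():
    print(d, mean, var, acc)
```

Small values of δ give a high acceptance rate but slow exploration of the support, so the acceptance rate alone is not a measure of quality.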
84. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
The original example
Example (Random walk & normal (2))
Sample statistics
δ          0.1      0.5      1.0
mean       0.399    -0.111   0.10
variance   0.698    1.11     1.06
As δ ↑, we get better histograms and a faster exploration of the
support of f.
85. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
The original example
[Figure, panels (a), (b), (c): samples based on $\mathcal{U}[-\delta, \delta]$ with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 si...)]
87. MCMC and likelihood-free methods Part/day I: Markov chain methods
The Metropolis-Hastings Algorithm
Random-walk Metropolis-Hastings algorithms
Mixtures by random walk MH
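Example (Mixture models)
$$\pi(\theta|x) \propto \prod_{j=1}^{n} \left( \sum_{\ell=1}^{k} p_\ell f(x_j|\mu_\ell, \sigma_\ell) \right) \pi(\theta)$$
Metropolis-Hastings proposal:
$$\theta^{(t+1)} = \begin{cases} \theta^{(t)} + \omega\varepsilon^{(t)} & \text{if } u^{(t)} < \rho^{(t)} \\ \theta^{(t)} & \text{otherwise} \end{cases}$$
where
$$\rho^{(t)} = \frac{\pi(\theta^{(t)} + \omega\varepsilon^{(t)}|x)}{\pi(\theta^{(t)}|x)} \wedge 1$$
and $\omega$ scaled for good acceptance rate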
Mixtures by random walk MH
Random walk sampling (50000 iterations)
[Figure: scatterplots and histograms of the simulated values of theta, p and tau]
General case of a 3 component normal mixture
[Celeux & al., 2000]
Mixtures by random walk MH
[Figure: simulated values plotted in the (µ1, µ2) plane, µ1 on the horizontal axis and µ2 on the vertical axis]
Random walk MCMC output for .7N (µ1 , 1) + .3N (µ2 , 1)
Convergence properties
Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f , log-concave in the tails, and a positive
and symmetric density g, the chain (X (t) ) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
illustration of the tail effect
Example (Comparison of tails)
Random walk Metropolis Hastings algorithms based on a N (0, 1)
instrumental for the generation of (left) a N (0, 1) distribution and
(right) a distribution with density $\psi(x) \propto (1 + |x|)^{-3}$
[Figure, panels (a) and (b): 90% confidence envelopes of the means, derived from 500 parallel independent chains]
Further convergence properties
Under assumptions
(A1) f is super-exponential, i.e. it is positive with continuous
first derivative such that
$\lim_{|x|\to\infty} n(x) \cdot \nabla \log f(x) = -\infty$ where $n(x) := x/|x|$.
In words: exponential decay of f in every direction with rate
tending to ∞
(A2) $\limsup_{|x|\to\infty} n(x) \cdot m(x) < 0$, where $m(x) = \nabla f(x)/|\nabla f(x)|$
In words: non-degeneracy of the contour manifold
$C_{f(x)} = \{y : f(y) = f(x)\}$
Q is geometrically ergodic, and
$V(x) \propto f(x)^{-1/2}$ verifies the drift condition
[Jarner & Hansen, 2000]
Further [further] convergence properties
If P is ψ-irreducible and aperiodic, for $r = (r(n))_{n\in\mathbb{N}}$ a real-valued
non-decreasing sequence such that, for all $n, m \in \mathbb{N}$,
$r(n + m) \le r(n)r(m)$,
and $r(0) = 1$, for C a small set, $\tau_C = \inf\{n \ge 1, X_n \in C\}$, and
$h \ge 1$, assume
$$\sup_{x\in C} E_x\left[\sum_{k=0}^{\tau_C-1} r(k)h(X_k)\right] < \infty,$$
Further [further] convergence properties
then,
$$S(f, C, r) := \left\{ x \in X,\ E_x\left[\sum_{k=0}^{\tau_C-1} r(k)h(X_k)\right] < \infty \right\}$$
is full and absorbing and for $x \in S(f, C, r)$,
$$\lim_{n\to\infty} r(n) \left\| P^n(x, \cdot) - f \right\|_h = 0.$$
[Tuominen & Tweedie, 1994]
Comments
[CLT, Rosenthal’s inequality...] h-ergodicity implies CLT
for additive (possibly unbounded) functionals of the chain,
Rosenthal’s inequality and so on...
[Control of the moments of the return-time] The
condition implies (because h ≥ 1) that
$$\sup_{x\in C} E_x[r_0(\tau_C)] \le \sup_{x\in C} E_x\left[\sum_{k=0}^{\tau_C-1} r(k)h(X_k)\right] < \infty,$$
where $r_0(n) = \sum_{l=0}^{n} r(l)$. This can be used to derive bounds for
the coupling time, an essential step to determine computable
bounds, using coupling inequalities
[Roberts & Tweedie, 98; Fort & Moulines, 00; Jones et al., 02]
Alternative conditions
The condition is not really easy to work with...
[Possible alternative conditions]
(a) [Tuominen, Tweedie, 1994] There exists a sequence
$(V_n)_{n\in\mathbb{N}}$, $V_n \ge r(n)h$, such that
(i) $\sup_C V_0 < \infty$,
(ii) $\{V_0 = \infty\} \subset \{V_1 = \infty\}$ and
(iii) $PV_{n+1} \le V_n - r(n)h + b\,r(n)\mathbb{I}_C$.
Alternative conditions
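(b) [Fort 2000] $\exists V \ge f \ge 1$ and $b < \infty$, such that $\sup_C V < \infty$
and
$$PV(x) + E_x\left[\sum_{k=0}^{\sigma_C} \Delta r(k) f(X_k)\right] \le V(x) + b\,\mathbb{I}_C(x)$$
where $\sigma_C$ is the hitting time on C and
$\Delta r(k) = r(k) - r(k-1)$, $k \ge 1$, and $\Delta r(0) = r(0)$.
Result: (a) ⇔ (b) ⇔ $\sup_{x\in C} E_x\left[\sum_{k=0}^{\tau_C-1} r(k)f(X_k)\right] < \infty$.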
Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion $L_t$, defined by the
stochastic differential equation
$$dL_t = dB_t + \frac{1}{2}\nabla \log f(L_t)\,dt,$$
where $B_t$ is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is
reversible with respect to f .
Discretization
Instead, consider the sequence
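$$x^{(t+1)} = x^{(t)} + \frac{\sigma^2}{2}\nabla \log f(x^{(t)}) + \sigma\varepsilon_t, \qquad \varepsilon_t \sim N_p(0, I_p)$$
where $\sigma^2$ corresponds to the discretization step
Unfortunately, the discretized chain may be transient, for instance
when
$$\lim_{x\to\pm\infty} \left| \sigma^2\, \nabla \log f(x)\, |x|^{-1} \right| > 1$$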
MH correction
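Accept the new value $Y_t$ with probability
$$\frac{f(Y_t)}{f(x^{(t)})} \cdot \frac{\exp\left\{ -\left\| x^{(t)} - Y_t - \frac{\sigma^2}{2}\nabla \log f(Y_t) \right\|^2 \big/ 2\sigma^2 \right\}}{\exp\left\{ -\left\| Y_t - x^{(t)} - \frac{\sigma^2}{2}\nabla \log f(x^{(t)}) \right\|^2 \big/ 2\sigma^2 \right\}} \wedge 1.$$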
Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal
convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]
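A minimal R sketch of a Metropolis-adjusted Langevin step, assuming for illustration a standard normal target in dimension p (so that ∇ log f(x) = −x). The function name `mala` and all tuning values are hypothetical. The acceptance ratio is the usual Metropolis-Hastings one, f(y)q(y, x)/(f(x)q(x, y)), where q(x, ·) is the Gaussian density of the discretized Langevin move started at x.

```r
# Illustrative target: standard normal in dimension p (an assumption)
log_f <- function(x) -sum(x^2) / 2
grad_log_f <- function(x) -x

mala <- function(log_f, grad_log_f, x0, sigma, n_iter) {
  p <- length(x0)
  chain <- matrix(NA_real_, n_iter, p)
  chain[1, ] <- x0
  acc <- 0
  # log density, up to a constant, of proposing `to` when at `from`
  log_q <- function(to, from) {
    -sum((to - from - sigma^2 / 2 * grad_log_f(from))^2) / (2 * sigma^2)
  }
  for (t in 2:n_iter) {
    x <- chain[t - 1, ]
    # discretized Langevin move
    y <- x + sigma^2 / 2 * grad_log_f(x) + sigma * rnorm(p)
    # Metropolis-Hastings correction on the log scale
    log_rho <- log_f(y) - log_f(x) + log_q(x, y) - log_q(y, x)
    if (log(runif(1)) < log_rho) {
      chain[t, ] <- y
      acc <- acc + 1
    } else {
      chain[t, ] <- x
    }
  }
  list(chain = chain, acc_rate = acc / (n_iter - 1))
}

set.seed(2)
out_mala <- mala(log_f, grad_log_f, x0 = rep(0, 5), sigma = 1, n_iter = 5000)
cat("acceptance rate:", round(out_mala$acc_rate, 3), "\n")
```

In practice `sigma` would be adjusted until `acc_rate` gets close to the target acceptance rate quoted above.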
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of
view
Most common alternatives:
(a) a fully automated algorithm like ARMS;
[Gilks & Wild, 1992]
(b) an instrumental density g which approximates f , such that
f /g is bounded for uniform ergodicity to apply;
(c) a random walk
In both cases (b) and (c), the choice of g is critical.
Case of the random walk
Different approach to acceptance rates
A high acceptance rate does not necessarily indicate that the algorithm is
moving correctly, since it may indicate that the random walk is moving
too slowly on the surface of f .
If $x^{(t)}$ and $y_t$ are close, i.e. $f(x^{(t)}) \simeq f(y_t)$, $y_t$ is accepted with
probability
$$\min\left\{ \frac{f(y_t)}{f(x^{(t)})}, 1 \right\} \simeq 1.$$
For multimodal densities with well separated modes, the negative
effect of limited moves on the surface of f clearly shows.
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f (yt )
tend to be small compared with f (x(t) ), which means that the
random walk moves quickly on the surface of f since it often
reaches the “borders” of the support of f
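The link between scale, acceptance rate and exploration can be checked numerically. The sketch below is only an illustration under assumed settings: the bimodal target 0.5 N(−3, 1) + 0.5 N(3, 1), the four scales and the helper name `rw_scale` are not from the slides.

```r
# Bimodal target with well separated modes (illustrative choice)
log_f <- function(x) log(0.5 * dnorm(x, -3, 1) + 0.5 * dnorm(x, 3, 1))

# Gaussian random walk started in the left mode; returns the scale,
# the acceptance rate and the share of draws in the right mode
rw_scale <- function(omega, n_iter = 20000, x0 = -3) {
  x <- numeric(n_iter)
  x[1] <- x0
  acc <- 0
  for (t in 2:n_iter) {
    y <- x[t - 1] + omega * rnorm(1)
    if (log(runif(1)) < log_f(y) - log_f(x[t - 1])) {
      x[t] <- y
      acc <- acc + 1
    } else {
      x[t] <- x[t - 1]
    }
  }
  c(omega = omega, acc_rate = acc / (n_iter - 1), right_mode = mean(x > 0))
}

set.seed(3)
res <- t(sapply(c(0.1, 1, 3, 30), rw_scale))
print(round(res, 3))
```

Under the true target the share of draws in the right mode is 0.5, so the `right_mode` column gives a rough indication of how well each scale explores the support, next to its acceptance rate.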
Rule of thumb
In small dimensions, aim at an average acceptance rate of
50%. In large dimensions, at an average acceptance rate of
25%.
[Gelman, Gilks and Roberts, 1995]
Warning: rule to be taken with a pinch of salt!
Role of scale
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
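$x_{t+1} = \varphi x_t + \epsilon_{t+1}$, $\epsilon_t \sim N(0, \tau^2)$
and observables
$y_t | x_t \sim N(x_t^2, \sigma^2)$
The distribution of $x_t$ given $x_{t-1}$, $x_{t+1}$ and $y_t$ is proportional to
$$\exp\left\{ \frac{-1}{2\tau^2} \left[ (x_t - \varphi x_{t-1})^2 + (x_{t+1} - \varphi x_t)^2 + \frac{\tau^2}{\sigma^2}(y_t - x_t^2)^2 \right] \right\}.$$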
Role of scale
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the
random walk never jumps to the other mode. But if the scale ω is
sufficiently large, the Markov chain explores both modes and gives a
satisfactory approximation of the target distribution.
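A small R experiment illustrating this behaviour on the conditional density of x_t in the noisy AR(1) model. All numerical values (ϕ, τ, σ, the neighbouring states, the observation and the two scales) are assumptions chosen to produce two well separated modes near ±√y_t; they are not the values used for the slides, and `rw_gauss` is a hypothetical helper.

```r
# Hypothetical parameter and data values, chosen to give two sharp modes
phi <- 0.8; tau <- 1; sigma <- 0.5
x_prev <- 0; x_next <- 0; y_obs <- 6

# log of the (unnormalized) conditional density of x_t
log_cond <- function(x) {
  -((x - phi * x_prev)^2 + (x_next - phi * x)^2 +
      tau^2 / sigma^2 * (y_obs - x^2)^2) / (2 * tau^2)
}

# Gaussian random walk started near the negative mode
rw_gauss <- function(omega, n_iter = 20000, x0 = -sqrt(y_obs)) {
  x <- numeric(n_iter)
  x[1] <- x0
  for (t in 2:n_iter) {
    y <- x[t - 1] + omega * rnorm(1)
    x[t] <- if (log(runif(1)) < log_cond(y) - log_cond(x[t - 1])) y else x[t - 1]
  }
  x
}

set.seed(11)
small <- rw_gauss(0.1)
large <- rw_gauss(3)
cat("omega = 0.1: share of draws in the positive mode =", mean(small > 0), "\n")
cat("omega = 3  : share of draws in the positive mode =", mean(large > 0), "\n")
```

With these settings the small-scale chain stays in the mode where it started, while the large-scale chain visits both.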
Role of scale
Markov chain based on a random walk with scale ω = .1.
Role of scale
Markov chain based on a random walk with scale ω = .5.
MA(2)
Since the constraints on (ϑ1 , ϑ2 ) are well-defined, use a flat
prior over the triangle.
Simple representation of the likelihood
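library(mnormt)  # provides dmnorm
# Log-likelihood of the MA(2) model with unit innovation variance;
# the series y is read from the calling environment.
ma2like <- function(theta) {
  n <- length(y)
  # Toeplitz covariance: autocovariances at lags 0, 1, 2, zero beyond
  sigma <- toeplitz(c(1 + theta[1]^2 + theta[2]^2,
                      theta[1] + theta[1] * theta[2],
                      theta[2], rep(0, n - 3)))
  dmnorm(y, rep(0, n), sigma, log = TRUE)
}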
Basic RWHM for MA(2)
Algorithm 1 RW-HM-MA(2) sampler
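set $\omega$ and $\vartheta^{(1)}$
for i = 2 to T do
    generate $\tilde\vartheta_j \sim U(\vartheta_j^{(i-1)} - \omega, \vartheta_j^{(i-1)} + \omega)$
    set p = 0 and $\vartheta^{(i)} = \vartheta^{(i-1)}$
    if $\tilde\vartheta$ within the triangle then
        p = exp(ma2like($\tilde\vartheta$) − ma2like($\vartheta^{(i-1)}$))
    end if
    if U < p then
        $\vartheta^{(i)} = \tilde\vartheta$
    end if
end for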
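The algorithm can be turned into runnable R along the following lines. This is a sketch under stated assumptions: the data are simulated with hypothetical true values (0.6, 0.2), the log-likelihood is recomputed in base R through a Cholesky factor so that the block runs without the mnormt package, and the triangle is taken to be the MA(2) invertibility region ϑ2 < 1, ϑ1 + ϑ2 > −1, ϑ2 − ϑ1 > −1. The helper names `in_triangle` and `rw_ma2` are not from the slides.

```r
set.seed(7)
# Simulated MA(2) series with hypothetical true values (0.6, 0.2)
n <- 100
eps <- rnorm(n + 2)
theta_true <- c(0.6, 0.2)
y <- eps[3:(n + 2)] + theta_true[1] * eps[2:(n + 1)] + theta_true[2] * eps[1:n]

# Gaussian log-likelihood in base R (stands in for mnormt::dmnorm)
ma2like <- function(theta) {
  n <- length(y)
  sigma <- toeplitz(c(1 + theta[1]^2 + theta[2]^2,
                      theta[1] + theta[1] * theta[2],
                      theta[2], rep(0, n - 3)))
  R <- chol(sigma)                          # sigma = t(R) %*% R
  z <- backsolve(R, y, transpose = TRUE)    # solves t(R) z = y
  -0.5 * sum(z^2) - sum(log(diag(R))) - 0.5 * n * log(2 * pi)
}

# Assumed constraint region: invertibility triangle of the MA(2) model
in_triangle <- function(theta) {
  theta[2] < 1 && theta[1] + theta[2] > -1 && theta[2] - theta[1] > -1
}

rw_ma2 <- function(n_iter, omega, theta0 = c(0, 0)) {
  chain <- matrix(NA_real_, n_iter, 2)
  chain[1, ] <- theta0
  cur_ll <- ma2like(theta0)
  for (i in 2:n_iter) {
    prop <- chain[i - 1, ] + runif(2, -omega, omega)
    chain[i, ] <- chain[i - 1, ]
    # flat prior: proposals outside the triangle are always rejected
    if (in_triangle(prop)) {
      prop_ll <- ma2like(prop)
      if (log(runif(1)) < prop_ll - cur_ll) {
        chain[i, ] <- prop
        cur_ll <- prop_ll
      }
    }
  }
  chain
}

out_ma2 <- rw_ma2(n_iter = 2000, omega = 0.2)
print(colMeans(out_ma2[-(1:500), ]))
```

Storing the current log-likelihood in `cur_ll` avoids recomputing the Cholesky factor of the current state at every iteration.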