Monash University short course, part I

MCMC and likelihood-free methods Part/day I: Markov chain methods

MCMC and likelihood-free methods
Part/day I: Markov chain methods

Christian P. Robert

Universit´ Paris-Dauphine, IUF, & CREST
e

Monash University, EBS, July 18, 2012

Computational issues in Bayesian statistics

Motivations and leading example

Computational issues in Bayesian
statistics

The Metropolis-Hastings Algorithm

The Gibbs Sampler

Population Monte Carlo

abc of Bayesian perspective

What is Bayesian statistics?

Statistical model deﬁned by a likelihood function

f (x1 , . . . , xn |θ) = L(θ|x1 , . . . , xn )

[inversion of what varies]


What is Bayesian statistics?

Statistical model deﬁned by a likelihood function

f (x1 , . . . , xn |θ) = L(θ|x1 , . . . , xn )

[inversion of what varies]
Bayesian approach turns the likelihood
into a conditional density:

π(θ|x1 , . . . , xn ) ∝ π(θ)L(θ|x1 , . . . , xn )

using a reference measure (or a prior)
π(θ)
[Thomas Bayes, 1701–1761]


New perspective

Uncertainty on the parameters θ of a model modeled through
a probability distribution π on Θ, called prior distribution


New perspective

Uncertainty on the parameters θ of a model modeled through
a probability distribution π on Θ, called prior distribution
Inference processed through distribution of θ conditional on x,
π(θ|x), called posterior distribution

f (x|θ)π(θ)
π(θ|x) = .
f (x|θ)π(θ) dθ


Justiﬁcations

Semantic drift from unknown to random
Actualization of the information on θ by extracting the
information on θ contained in the observation x
Allows incorporation of imperfect information in the decision
process
Unique mathematical way to condition upon the observations
(conditional perspective)
Penalization factor


Posterior distribution

π(θ|x) central to Bayesian inference
Operates conditional upon the observation s



Incorporates the requirement of the Likelihood Principle



Avoids averaging over the unobserved values of x



Coherent updating of the information available on θ,
independent of the order in which i.i.d. observations are
collected



Coherent updating of the information available on θ,
independent of the order in which i.i.d. observations are
collected
Provides a complete inferential scope

Latent variables

Latent structures make life harder!

Even simple models may lead to computational complications, as
in latent variable models

f (x|θ) = f (x, x |θ) dx

Latent variables



f (x|θ) = f (x, x |θ) dx

If (x, x ) observed, ﬁne!

Latent variables



f (x|θ) = f (x, x |θ) dx

If (x, x ) observed, ﬁne!
If only x observed, trouble!

Latent variables

example: mixture models

Models of mixtures of distributions:

X ∼ fj with probability pj ,

for j = 1, 2, . . . , k, with overall density

X ∼ p1 f1 (x) + · · · + pk fk (x) .

Latent variables





X ∼ p1 f1 (x) + · · · + pk fk (x) .

For a sample of independent random variables (X1 , · · · , Xn ),
sample density
n
{p1 f1 (xi ) + · · · + pk fk (xi )} .
i=1

Latent variables





X ∼ p1 f1 (x) + · · · + pk fk (x) .

n
{p1 f1 (xi ) + · · · + pk fk (xi )} .
i=1

Expanding this product of sums into a sum of products involves k n
elementary terms: too prohibitive to compute in large samples.

Latent variables

Simple mixture (1)
3
2
µ2

1
0
−1

−1 0 1 2 3

µ1

Case of the 0.3N (µ1 , 1) + 0.7N (µ2 , 1) likelihood

Latent variables

Simple mixture (2)

For mixture of two normal distributions,

0.3N (µ1 , 1) + 0.7N (µ2 , 1) ,
likelihood proportional to

n
[0.3ϕ (xi − µ1 ) + 0.7 ϕ (xi − µ2 )]
i=1

containing 2n terms.

Latent variables

Complex maximisation

Standard maximization techniques often fail to ﬁnd the global
maximum because of multimodality or undesirable behavior
(usually at the frontier of the domain) of the likelihood function.

Example
In the special case

f (x|µ, σ) = (1 − ) exp{(−1/2)x2 } + exp{(−1/2σ 2 )(x − µ)2 }
σ
with > 0 known,

Latent variables

Complex maximisation

Standard maximization techniques often fail to ﬁnd the global
maximum because of multimodality or undesirable behavior
(usually at the frontier of the domain) of the likelihood function.

Example
In the special case

f (x|µ, σ) = (1 − ) exp{(−1/2)x2 } + exp{(−1/2σ 2 )(x − µ)2 }
σ
with > 0 known, whatever n, the likelihood is unbounded:

lim L(x1 , . . . , xn |µ = x1 , σ) = ∞
σ→0

Latent variables

Unbounded likelihood

n= 3 n= 6

4

4
3

3
σ
2

2
1

1
−2 0 2 4 6 −2 0 2 4 6

µ µ

n= 12 n= 24
4

4
3

3
σ
2

2
1

1
−2 0 2 4 6 −2 0 2 4 6

µ µ

n= 48 n= 96
4

4
3

3
σ
2

2
1

1

−2 0 2 4 6 −2 0 2 4 6

µ µ

Case of the 0.3N (0, 1) + 0.7N (µ, σ) likelihood

Latent variables

Mixture once again
press for MA Observations from

x1 , . . . , xn ∼ f (x|θ) = pϕ(x; µ1 , σ1 ) + (1 − p)ϕ(x; µ2 , σ2 )

Latent variables

Mixture once again


Prior

µi |σi ∼ N (ξi , σi /ni ),
2
σi ∼ I G (νi /2, s2 /2),
2
i p ∼ Be(α, β)

Latent variables

Mixture once again


Prior

µi |σi ∼ N (ξi , σi /ni ),
2
σi ∼ I G (νi /2, s2 /2),
2
i p ∼ Be(α, β)

Posterior
n
π(θ|x1 , . . . , xn ) ∝ {pϕ(xj ; µ1 , σ1 ) + (1 − p)ϕ(xj ; µ2 , σ2 )} π(θ)
j=1
n
= ω(kt )π(θ|(kt ))
=0 (kt )

[O(2n )]

Latent variables

Mixture once again (cont’d)

For a given permutation (kt ), conditional posterior distribution
2
σ1
π(θ|(kt )) = N ξ1 (kt ), × I G ((ν1 + )/2, s1 (kt )/2)
n1 +
2
σ2
×N ξ2 (kt ), × I G ((ν2 + n − )/2, s2 (kt )/2)
n2 + n −
×Be(α + , β + n − )

Latent variables

Mixture once again (cont’d)

where
1 2
x1 (kt ) =
¯ t=1 xkt , s1 (kt ) =
ˆ t=1 (xkt − x1 (kt )) ,
¯
1 n n 2
x2 (kt ) =
¯ n− t= +1 xkt , s2 (kt ) =
ˆ t= +1 (xkt − x2 (kt ))
¯

and
n1 ξ1 + x1 (kt )
¯ n2 ξ2 + (n − )¯2 (kt )
x
ξ1 (kt ) = , ξ2 (kt ) = ,
n1 + n2 + n −
n1
s1 (kt ) = s2 + s2 (kt ) +
1 ˆ1 (ξ1 − x1 (kt ))2 ,
¯
n1 +
n2 (n − )
s2 (kt ) = s2 + s2 (kt ) +
2 ˆ2 (ξ2 − x2 (kt ))2 ,
¯
n2 + n −

posterior updates of the hyperparameters

Latent variables

Mixture once again

Bayes estimator of θ:
n
δ π (x1 , . . . , xn ) = ω(kt )Eπ [θ|x, (kt )]
=0 (kt )

Too costly: 2n terms

The AR(p) model

AR(p) model

Auto-regressive representation of a time series,
p
xt |xt−1 , . . . ∼ N µ+ i (xt−i − µ), σ 2
i=1

The AR(p) model

AR(p) model

Auto-regressive representation of a time series,
p
xt |xt−1 , . . . ∼ N µ+ i (xt−i − µ), σ 2
i=1

Generalisation of AR(1)
Among the most commonly used models in dynamic settings
More challenging than the static models (stationarity
constraints)
Diﬀerent models depending on the processing of the starting
value x0

The AR(p) model

Unwieldy stationarity constraints

Practical diﬃculty: for complex models, stationarity constraints get
quite involved to the point of being unknown in some cases

Example (AR(1))
Case of linear Markovian dependence on the last value
i.i.d.
xt = µ + (xt−1 − µ) + t, t ∼ N (0, σ 2 )

If | | < 1, (xt )t∈Z can be written as
∞
j
xt = µ + t−j
j=0

and this is a stationary representation.

The AR(p) model

Stationary but...

If | | > 1, alternative stationary representation
∞
−j
xt = µ − t+j .
j=1

The AR(p) model

Stationary but...

If | | > 1, alternative stationary representation
∞
−j
xt = µ − t+j .
j=1

This stationary solution is criticized as artiﬁcial because xt is
correlated with future white noises ( t )s>t , unlike the case when
| | < 1.
Non-causal representation...

The AR(p) model

Stationarity+causality

Stationarity constraints in the prior as a restriction on the values of
θ.
Theorem
AR(p) model second-order stationary and causal iﬀ the roots of the
polynomial
p
P(x) = 1 − ix
i

i=1

are all outside the unit circle

The AR(p) model

Stationarity constraints

Under stationarity constraints, complex parameter space: each
value of needs to be checked for roots of corresponding
polynomial with modulus less than 1

The AR(p) model

Stationarity constraints

Under stationarity constraints, complex parameter space: each
value of needs to be checked for roots of corresponding
polynomial with modulus less than 1

E.g., for an AR(2) process with

1.0
autoregressive polynomial

0.5
P(u) = 1 − 1 u − 2 u2 , constraint is

0.0
θ2
q

1 + 2 < 1, 1 − 2 <1

−0.5
−1.0
and | 2 | < 1 −2 −1 0 1 2

θ1

The MA(q) model

The MA(q) model

Alternative type of time series
q
xt = µ + t − ϑj t−j , t ∼ N (0, σ 2 )
j=1

Stationary but, for identiﬁability considerations, the polynomial
q
Q(x) = 1 − ϑj xj
j=1

must have all its roots outside the unit circle

The MA(q) model

Identiﬁability

Example
For the MA(1) model, xt = µ + t − ϑ1 t−1 ,

var(xt ) = (1 + ϑ2 )σ 2
1

can also be written
1
xt = µ + ˜t−1 − ˜t , ˜ ∼ N (0, ϑ2 σ 2 ) ,
1
ϑ1
Both pairs (ϑ1 , σ) & (1/ϑ1 , ϑ1 σ) lead to alternative
representations of the same model.

The MA(q) model

Properties of MA models

Non-Markovian model (but special case of hidden Markov)
Autocovariance γx (s) is null for |s| > q

The MA(q) model

Representations

x1:T is a normal random variable with constant mean µ and
covariance matrix
σ2
 
γ1 γ2 ... γq 0 ... 0 0
 γ1 σ2 γ1 . . . γq−1 γq ... 0 0
Σ= ,
 
..
 . 
2
0 0 0 ... 0 0 ... γ1 σ

with (|s| ≤ q)
q−|s|
2
γs = σ ϑi ϑi+|s|
i=0

Not manageable in practice [large T’s]

The MA(q) model

Representations (contd.)

Conditional on past ( 0 , . . . , −q+1 ),

L(µ, ϑ1 , . . . , ϑq , σ|x1:T , 0 , . . . , −q+1 ) ∝
  2 
T 
 q 

−T  2σ 2 ,
σ exp − xt − µ + ϑj ˆt−j
 
t=1  j=1 

where (t > 0)
q
ˆt = xt − µ + ϑj ˆt−j , ˆ0 = 0, . . . , ˆ1−q = 1−q
j=1

Recursive deﬁnition of the likelihood, still costly O(T × q)

The MA(q) model


Encompassing approach for general time series models
State-space representation

xt = Gyt + εt , (1)
yt+1 = F yt + ξt , (2)

(1) is the observation equation and (2) is the state equation

The MA(q) model


Encompassing approach for general time series models
State-space representation

xt = Gyt + εt , (1)
yt+1 = F yt + ξt , (2)

(1) is the observation equation and (2) is the state equation

Note
This is a special case of hidden Markov model

The MA(q) model

MA(q) state-space representation

For the MA(q) model, take

yt = ( t−q , . . . , t−1 , t )

and then
 

0 1 0 ... 0
 0
0 0
0 1 ... 0  
  .
yt+1 =  ... t+1  . 
 yt +

0
 .
0 0 ... 1 0
0 0 0 ... 0 1
xt = µ − ϑq ϑq−1 ... ϑ1 −1 yt .

The MA(q) model

MA(q) state-space representation (cont’d)

Example
For the MA(1) model, observation equation

xt = (1 0)yt

with
yt = (y1t y2t )
directed by the state equation

0 1 1
yt+1 = yt + t+1 .
0 0 ϑ1

Typology of problems

c A typology of Bayes computational problems

(i). latent variable models in general



(ii). use of a complex parameter space, as for instance in
constrained parameter sets like those resulting from imposing
stationarity constraints in dynamic models;



(iii). use of a complex sampling model with an intractable
likelihood, as for instance in some graphical models;



(iv). use of a huge dataset;



(v). use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);



(v). use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);
(vi). use of a particular inferential procedure as for instance, Bayes
factors
π P (θ ∈ Θ0 | x) π(θ ∈ Θ0 )
B01 (x) = .
P (θ ∈ Θ1 | x) π(θ ∈ Θ1 )



Computational issues in Bayesian
statistics


The Gibbs Sampler

Population Monte Carlo

Monte Carlo basics

General purpose

A major computational issue in Bayesian statistics:

Given a density π known up to a normalizing constant, and an
integrable function h, compute

h(x)˜ (x)µ(dx)
π
Π(h) = h(x)π(x)µ(dx) =
π (x)µ(dx)
˜

when h(x)˜ (x)µ(dx) is intractable.
π

Monte Carlo basics

Monte Carlo 101

Generate an iid sample x1 , . . . , xN from π and estimate Π(h) by
N
ΠM C (h) = N −1
ˆ
N h(xi ).
i=1

ˆN as
LLN: ΠM C (h) −→ Π(h)
If Π(h2 ) = h2 (x)π(x)µ(dx) < ∞,
√ L
CLT: ˆN
N ΠM C (h) − Π(h) N 0, Π [h − Π(h)]2 .

Monte Carlo basics

Monte Carlo 101

Generate an iid sample x1 , . . . , xN from π and estimate Π(h) by
N
ΠM C (h) = N −1
ˆ
N h(xi ).
i=1

ˆN as
LLN: ΠM C (h) −→ Π(h)
If Π(h2 ) = h2 (x)π(x)µ(dx) < ∞,
√ L
CLT: ˆN
N ΠM C (h) − Π(h) N 0, Π [h − Π(h)]2 .

Caveat conducting to MCMC

Often impossible or ineﬃcient to simulate directly from Π

Importance Sampling

Importance Sampling

For Q proposal distribution such that Q(dx) = q(x)µ(dx),
alternative representation

Π(h) = h(x){π/q}(x)q(x)µ(dx).

Importance Sampling

Importance Sampling

For Q proposal distribution such that Q(dx) = q(x)µ(dx),
alternative representation

Π(h) = h(x){π/q}(x)q(x)µ(dx).

Principle of importance (!)
Generate an iid sample x1 , . . . , xN ∼ Q and estimate Π(h) by
N
ΠIS (h) = N −1
ˆ
Q,N h(xi ){π/q}(xi ).
i=1

return to pMC

Importance Sampling

Properties of importance

Then
ˆ as
LLN: ΠIS (h) −→ Π(h)
Q,N and if Q((hπ/q)2 ) < ∞,
√ L
CLT: ˆ Q,N
N (ΠIS (h) − Π(h)) N 0, Q{(hπ/q − Π(h))2 } .

Importance Sampling

Properties of importance

Then
ˆ as
LLN: ΠIS (h) −→ Π(h)
Q,N and if Q((hπ/q)2 ) < ∞,
√ L
CLT: ˆ Q,N
N (ΠIS (h) − Π(h)) N 0, Q{(hπ/q − Π(h))2 } .

Caveat
ˆ Q,N
If normalizing constant of π unknown, impossible to use ΠIS

Generic problem in Bayesian Statistics: π(θ|x) ∝ f (x|θ)π(θ).

Importance Sampling

Self-Normalised Importance Sampling

Self normalized version
N −1 N
ˆ Q,N
ΠSN IS (h) = {π/q}(xi ) h(xi ){π/q}(xi ).
i=1 i=1

Importance Sampling


N −1 N
ˆ Q,N
i=1 i=1

ˆ as
LLN : ΠSN IS (h) −→ Π(h)
Q,N
and if Π((1 + h2 )(π/q)) < ∞,
√ L
CLT : ˆ Q,N
N (ΠSN IS (h) − Π(h)) N 0, π {(π/q)(h − Π(h)}2 ) .

Importance Sampling


N −1 N
ˆ Q,N
i=1 i=1

ˆ as
LLN : ΠSN IS (h) −→ Π(h)
Q,N
and if Π((1 + h2 )(π/q)) < ∞,
√ L
CLT : ˆ Q,N
N (ΠSN IS (h) − Π(h)) N 0, π {(π/q)(h − Π(h)}2 ) .

c The quality of the SNIS approximation depends on the
choice of Q

Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains (MCMC)

It is not necessary to use a sample from the distribution f to
approximate the integral

I= h(x)f (x)dx ,




I= h(x)f (x)dx ,

[notation warnin: π turned to f !]




I= h(x)f (x)dx ,

We can obtain X1 , . . . , Xn ∼ f
(approx) without directly simulating
from f , using an ergodic Markov
chain with stationary distribution f

Andre¨ Markov
ı


Running Monte Carlo via Markov Chains (2)

Idea
For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is
generated using a transition kernel with stationary distribution f


Idea

irreducible Markov chain with stationary distribution f is
ergodic with limiting distribution f under weak conditions
hence convergence in distribution of (X (t) ) to a random
variable from f .
for T0 “large enough” T0 , X (T0 ) distributed from f
Markov sequence is dependent sample X (T0 ) , X (T0 +1) , . . .
generated from f
Birkoﬀ’s ergodic theorem extends LLN, suﬃcient for most
approximation purposes



Idea

Problem: How can one build a Markov chain with a given
stationary distribution?

The Metropolis–Hastings algorithm


Basics
The algorithm uses the objective
(target) density

f

and a conditional density

q(y|x)

called the instrumental (or proposal) Nicholas Metropolis
distribution


The MH algorithm

Algorithm (Metropolis–Hastings)
Given x(t) ,
1. Generate Yt ∼ q(y|x(t) ).
2. Take

Yt with prob. ρ(x(t) , Yt ),
X (t+1) =
x(t) with prob. 1 − ρ(x(t) , Yt ),

where
f (y) q(x|y)
ρ(x, y) = min ,1 .
f (x) q(y|x)


Features

Independent of normalizing constants for both f and q(·|x)
(ie, those constants independent of x)
Never move to values with f (y) = 0
The chain (x(t) )t may take the same value several times in a
row, even though f is a density wrt Lebesgue measure
The sequence (yt )t is usually not a Markov chain


Convergence properties

1. The M-H Markov chain is reversible, with
invariant/stationary density f since it satisﬁes the detailed
balance condition
f (y) K(y, x) = f (x) K(x, y)



balance condition
f (y) K(y, x) = f (x) K(x, y)
2. As f is a probability measure, the chain is positive recurrent



balance condition
f (y) K(y, x) = f (x) K(x, y)
2. As f is a probability measure, the chain is positive recurrent
3. If
f (Yt ) q(X (t) |Yt )
Pr ≥ 1 < 1. (1)
f (X (t) ) q(Yt |X (t) )

that is, the event {X (t+1) = X (t) } is possible, then the chain
is aperiodic


Convergence properties (2)
4. If
q(y|x) > 0 for every (x, y), (2)
the chain is irreducible


4. If
5. For M-H, f -irreducibility implies Harris recurrence


4. If
5. For M-H, f -irreducibility implies Harris recurrence
6. Thus, for M-H satisfying (1) and (2)
(i) For h, with Ef |h(X)| < ∞,
T
1
lim h(X (t) ) = h(x)df (x) a.e. f.
T →∞ T t=1

(ii) and
lim K n (x, ·)µ(dx) − f =0
n→∞
TV
for every initial distribution µ, where K n (x, ·) denotes the
kernel for n transitions.

Random-walk Metropolis-Hastings algorithms

Random walk Metropolis–Hastings

Use of a local perturbation as proposal

Yt = X (t) + εt ,

where εt ∼ g, independent of X (t) .
The instrumental density is of the form g(y − x) and the Markov
chain is a random walk if we take g to be symmetric g(x) = g(−x)


Random walk Metropolis–Hastings [code]

Algorithm (Random walk Metropolis)
Given x(t)
1. Generate Yt ∼ g(y − x(t) )
2. Take
f (Yt )

Y with prob. min 1, ,
(t+1) t
X = f (x(t) )
 (t)
x otherwise.


The original example

Example (Random walk and normal target)
Generate N (0, 1) based on the uniform proposal [−δ, δ]
forget History!

The probability of acceptance is then
2
ρ(x(t) , yt ) = exp{(x(t) − yt )/2} ∧ 1.
2

[Hastings (1970)]



Example (Random walk & normal (2))
Sample statistics
δ 0.1 0.5 1.0
mean 0.399 -0.111 0.10
variance 0.698 1.11 1.06

c As δ ↑, we get better histograms and a faster exploration of the
support of f .



400
400
250

0.5

0.5

0.5
300
200

300
0.0

0.0

0.0
150

200
200
-0.5

-0.5

-0.5
100

100
100
-1.0

-1.0

-1.0
50

-1.5

-1.5

-1.5
0

0

0

-1 0 1 2 -2 0 2 -3 -2 -1 0 1 2 3
(a) (b) (c)

ples based on U [−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15, 000 si


Mixtures by random walk MH

Example (Mixture models)
n k
π(θ|x) ∝ p f (xj |µ , σ ) π(θ)
j=1 =1



Example (Mixture models)
n k
π(θ|x) ∝ p f (xj |µ , σ ) π(θ)
j=1 =1

Metropolis-Hastings proposal:

θ(t) + ωε(t) if u(t) < ρ(t)
θ(t+1) =
θ(t) otherwise

where
π(θ(t) + ωε(t) |x)
ρ(t) = ∧1
π(θ(t) |x)
and ω scaled for good acceptance rate



Random walk sampling (50000 iterations)
2

2
1

1
theta

theta
0

0
-1

-1
0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8 1.0 1.2
p tau
1.2

0.0 1.0 2.0
1.0

-1 0 1 2
theta
0.8

0 1 2 3 4 5 6
tau
0.6
0.4

0.0 0.2 0.4 0.6 0.8 1.0
p
0 2 4
0.2

0.0 0.2 0.4 0.6 0.8 1.0 0.2 0.4 0.6 0.8
p tau

General case of a 3 component normal mixture
[Celeux & al., 2000]


3
2
µ2

1
0

X
−1

−1 0 1 2 3

µ1

Random walk MCMC output for .7N (µ1 , 1) + .3N (µ2 , 1)



Uniform ergodicity prohibited by random walk structure



Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:

Theorem (Suﬃcient ergodicity)
For a symmetric density f , log-concave in the tails, and a positive
and symmetric density g, the chain (X (t) ) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail eﬀect


illustration of the tail eﬀect

1.5

1.5
Example (Comparison of tails)

1.0

1.0
Random walk Metropolis

0.5

0.5
Hastings algorithms based on a

0.0

0.0
N (0, 1) instrumental for the

-0.5

-0.5
generation of (left) a N (0, 1)
distribution and (right) a
-1.0

-1.0
distribution with density -1.5

-1.5
ψ(x) ∝ (1 + |x|)−3
0 50 100 150 200 0 50 100 150 200

(a) (b)
90% conﬁdence envelopes of the means, derived from 500 parallel independent


Further convergence properties

Under assumptions skip detailed convergence

(A1) f is super-exponential, i.e. it is positive with positive continuous ﬁrst derivative such that

lim|x|→∞ n(x) log f (x) = −∞ where n(x) := x/|x|.

In words : exponential decay of f in every direction with rate
tending to ∞
(A2) lim sup|x|→∞ n(x) m(x) < 0, where m(x) = f (x)/| f (x)|

In words: non degeneracy of the countour manifold
Cf (y) = {y : f (y) = f (x)}
Q is geometrically ergodic, and
V (x) ∝ f (x)−1/2 veriﬁes the drift condition
[Jarner & Hansen, 2000]


Further [further] convergence properties

skip hyperdetailed convergence

If P ψ-irreducible and aperiodic, for r = (r(n))n∈N real-valued non
decreasing sequence, such that, for all n, m ∈ N,

r(n + m) ≤ r(n)r(m),

and r(0) = 1, for C a small set, τC = inf{n ≥ 1, Xn ∈ C}, and
h ≥ 1, assume
τC −1
sup Ex r(k)h(Xk ) < ∞,
x∈C k=0


Further [further] convergence properties

then,
τC −1
S(f, C, r) := x ∈ X, Ex r(k)h(Xk ) <∞
k=0

is full and absorbing and for x ∈ S(f, C, r),

lim r(n) P n (x, .) − f h = 0.
n→∞

[Tuominen & Tweedie, 1994]


Comments

[CLT, Rosenthal’s inequality...] h-ergodicity implies CLT
for additive (possibly unbounded) functionals of the chain,
Rosenthal’s inequality and so on...
[Control of the moments of the return-time] The
condition implies (because h ≥ 1) that
τC −1
sup Ex [r0 (τC )] ≤ sup Ex r(k)h(Xk ) < ∞,
x∈C x∈C k=0

where r0 (n) = n r(l) Can be used to derive bounds for
l=0
the coupling time, an essential step to determine computable
bounds, using coupling inequalities
[Roberts & Tweedie, 98; Fort & Moulines, 00; Jones et al., 02]


Alternative conditions

The condition is not really easy to work with...
[Possible alternative conditions]
(a) [Tuominen, Tweedie, 1994] There exists a sequence
(Vn )n∈N , Vn ≥ r(n)h, such that
(i) supC V0 < ∞,
(ii) {V0 = ∞} ⊂ {V1 = ∞} and
(iii) P Vn+1 ≤ Vn − r(n)h + br(n)IC .


Alternative conditions

(b) [Fort 2000] ∃V ≥ f ≥ 1 and b < ∞, such that supC V < ∞
and
σC
P V (x) + Ex ∆r(k)f (Xk ) ≤ V (x) + bIC (x)
k=0

where σC is the hitting time on C and

∆r(k) = r(k) − r(k − 1), k ≥ 1 and ∆r(0) = r(0).

τC −1
Result (a) ⇔ (b) ⇔ supx∈C Ex k=0 r(k)f (Xk ) < ∞.

Extensions

Langevin Algorithms

Proposal based on the Langevin diffusion Lt is defined by the
stochastic differential equation
1
dLt = dBt + log f (Lt )dt,
2
where Bt is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is
reversible with respect to f .

Extensions

Discretization

Instead, consider the sequence

σ2
x(t+1) = x(t) + log f (x(t) ) + σεt , εt ∼ Np (0, Ip )
2
where σ 2 corresponds to the discretization step

Extensions

Discretization

Instead, consider the sequence

σ2
x(t+1) = x(t) + log f (x(t) ) + σεt , εt ∼ Np (0, Ip )
2
where σ 2 corresponds to the discretization step
Unfortunately, the discretized chain may be be transient, for
instance when

lim σ2 log f (x)|x|−1 > 1
x→±∞

Extensions

MH correction

Accept the new value Yt with probability
2
σ2
exp − Yt − x(t) − 2 log f (x(t) ) 2σ 2
f (Yt )
· ∧1.
f (x(t) ) σ2
2
exp − x(t) − Yt − 2 log f (Yt ) 2σ 2

Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal
convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998; Girolami & Calderhead, 2011]

Extensions

Optimizing the Acceptance Rate

Problem of choice of the transition kernel from a practical point of
view
Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f , such that
f /g is bounded for uniform ergodicity to apply;
(c) a random walk
In both cases (b) and (c), the choice of g is critical,

Extensions

Case of the random walk

Diﬀerent approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f .

Extensions

Case of the random walk

Diﬀerent approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f .
If x(t) and yt are close, i.e. f (x(t) ) f (yt ) y is accepted with
probability
f (yt )
min ,1 1.
f (x(t) )
For multimodal densities with well separated modes, the negative
eﬀect of limited moves on the surface of f clearly shows.

Extensions

Case of the random walk (2)

If the average acceptance rate is low, the successive values of f (yt )
tend to be small compared with f (x(t) ), which means that the
random walk moves quickly on the surface of f since it often
reaches the “borders” of the support of f

Extensions

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]

Extensions

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In
large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]

This rule is to be taken with a pinch of salt!

Extensions

Role of scale

Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,

xt+1 = ϕxt + t+1 t ∼ N (0, τ 2 )

and observables
yt |xt ∼ N (x2 , σ 2 )
t

Extensions

Role of scale

Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,

xt+1 = ϕxt + t+1 t ∼ N (0, τ 2 )

and observables
yt |xt ∼ N (x2 , σ 2 )
t

The distribution of xt given xt−1 , xt+1 and yt is

−1 τ2
exp (xt − ϕxt−1 )2 + (xt+1 − ϕxt )2 + (yt − x2 )2
t .
2τ 2 σ2

Extensions

Role of scale

Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the
random walk never jumps to the other mode. But if the scale ω is
suﬃciently large, the Markov chain explores both modes and give a
satisfactory approximation of the target distribution.

Extensions

Role of scale

Markov chain based on a random walk with scale ω = .1.

Extensions

Role of scale

Markov chain based on a random walk with scale ω = .5.

Extensions

MA(2)

Since the constraints on (ϑ1 , ϑ2 ) are well-deﬁned, use of a ﬂat
prior over the triangle as prior.
Simple representation of the likelihood

library(mnormt)
ma2like=function(theta){
n=length(y)
sigma = toeplitz(c(1 +theta[1]^2+theta[2]^2,
theta[1]+theta[1]*theta[2],theta[2],rep(0,n-3)))
dmnorm(y,rep(0,n),sigma,log=TRUE)
}

Extensions

Basic RWHM for MA(2)

Algorithm 1 RW-HM-MA(2) sampler
set ω and ϑ(1)
for i = 2 to T do
˜ (i−1) (i−1)
generate ϑj ∼ U(ϑj − ω, ϑj + ω)
set p = 0 and ϑ (i) = ϑ(i−1)
˜
if ϑ within the triangle then
˜
p = exp(ma2like(ϑ) − ma2like(ϑ(i−1) ))
end if
if U < p then
˜
ϑ(i) = ϑ
end if
end for

Monash University short course, part I

Monash University short course, part I

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Monash University short course, part I

Semelhante a Monash University short course, part I (20)

Mais de Christian Robert

Mais de Christian Robert (20)

Último

Último (20)

Monash University short course, part I