Bayesian Nonparametrics Workshop

IV Workshop Bayesian Nonparametrics, Roma, 12 Giugno 2004 1
Bayesian Inference on Mixtures
Christian P. Robert
Universit´e Paris Dauphine
Joint work with
JEAN-MICHEL MARIN, KERRIE MENGERSEN AND JUDITH ROUSSEAU

IV Workshop Bayesian Nonparametrics, Roma, 12 Giugno 2004 2
What’s new?!
• Density approximation & consistency
• Scarsity phenomenon
• Label switching & Bayesian inference
• Nonconvergence of the Gibbs sampler & population Monte Carlo
• Comparison of RJMCM with B& D

Intro/Inference/Algorithms/Beyond ﬁxed k 3
1 Mixtures
Convex combination of “usual” densities (e.g., exponential family)
k
i=1
pif(x|θi) ,
k
i=1
pi = 1 k > 1 ,

−1 0 1 2 3
0.10.20.30.4
0 1 2 3 4 5
0.00.10.20.30.4
0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
0.00.20.40.60.81.0
0 2 4 6 8 10
0.000.100.200.30
0 2 4 6
0.000.050.100.150.200.25
−2 0 2 4 6 8 10
0.000.100.200.30
0 5 10 15
0.000.050.100.15
0 5 10 15
0.000.050.100.15
0 5 10 15
0.000.050.100.15
0.000.050.100.15
0.00.20.40.60.81.0
0.00.10.20.3
Normal mixture densities for K = 2, 5, 25, 50

Likelihood
L(θ, p|x) =
n
i=1
k
j=1
pjf (xi|θj)
c Computable in O(nk) time

Intro:Misg/Inference/Algorithms/Beyond ﬁxed k 6
Missing data representation
Demarginalisation
k
i=1
pif(x|θi) = f(x|θ, z) f(z|p) dz
where
X|Z = z ∼ f(x|θz), Z ∼ Mk(1; p1, ..., pk)
Missing “data” z1, . . . , zn that may be or may not be meaningful
[Auxiliary variables]

Nonparametric re-interpretation
Approximation of unknown distributions
E.g., Nadaraya–Watson kernel
ˆkn(x|x) =
1
nhn
n
i=1
ϕ (x; xi, hn)

Bernstein polynomials
Bounded continuous densities on [0, 1] approximated by Beta mixtures
(αk,βk)∈N2
+
pk Be(αk, βk) αk, βk ∈ N∗
[Consistency]
Associated predictive is then
ˆfn(x|x) =
∞
k=1
k
j=1
Eπ
[ωkj|x] Be(j, k + 1 − j) P(K = k|x) .
[Petrone and Wasserman, 2002]

0.0 0.2 0.4 0.6 0.8 1.0
02468
11,0.1,0.9
0.0 0.2 0.4 0.6 0.8 1.0
2468
31,0.6,0.3
0.0 0.2 0.4 0.6 0.8 1.0
0.91.01.11.21.3
5,0.8,0.9
0.0 0.2 0.4 0.6 0.8 1.0
01234
54,0.8,2.6
0.0 0.2 0.4 0.6 0.8 1.0
0.20.40.60.81.01.2
22,1.2,1.6
0.0 0.2 0.4 0.6 0.8 1.0
0.00.51.01.5
45,2.9,1.8
0.0 0.2 0.4 0.6 0.8 1.0
0.00.51.01.5
7,4.9,3.3
0.0 0.2 0.4 0.6 0.8 1.0
0.00.51.01.52.02.53.0
67,5.1,9.3
0.0 0.2 0.4 0.6 0.8 1.001234
91,19.1,17.5
Realisations from the Bernstein prior

0.0 0.2 0.4 0.6 0.8 1.0
0510152025
0.0 0.2 0.4 0.6 0.8 1.0
0.40.60.81.01.21.41.6
0.0 0.2 0.4 0.6 0.8 1.0
0.60.81.01.21.41.6
0.0 0.2 0.4 0.6 0.8 1.0
0.40.60.81.01.21.41.6
0.0 0.2 0.4 0.6 0.8 1.0
024681012
0.0 0.2 0.4 0.6 0.8 1.0
1.01.52.02.53.03.50.0 0.2 0.4 0.6 0.8 1.0
0.81.01.21.41.61.8
0.0 0.2 0.4 0.6 0.8 1.0
0.51.01.52.02.5
0.0 0.2 0.4 0.6 0.8 1.00.51.01.52.02.53.03.5
Realisations from a more general prior

Intro:Constancy/Inference/Algorithms/Beyond ﬁxed k 11
Density estimation
[CPR & Rousseau, 2000–04]
Reparameterisation of a Beta mixture
p0U(0, 1) + (1 − p0)
K
k=1
pkB(αkεk, αk(1 − εk))
k≥1
pk = 1 ,
with density fψ
Can approximate most distributions g on [0, 1]
Assumptions
– g is piecewise continuous on {x ; g(x) < M} for all M’s
– g(x) log g(x) d x < ∞

Prior distributions
– π(K) has a light tail
P(K ≥ tn/ log n) ≤ exp −rn
– p0 ∼ Be(a0, b0), a0 < 1, b0 > 1
– pk ∝ ωk and ωk ∼ Be(1, k)
– location-scale “hole” prior
(αk, εk) ∼ {1 − exp [− {β1(αk − 2)c3
+ β2(εk − .5)c4
}]}
exp −τ0αc0
k /2 − τ1/{α2c1
k εc1
k (1 − εk)c1
} ,

Consistency results
Hellinger neighbourhood
A (f0) = {f, d(f, f0) ≤ }
Then, for all > 0,
π[A (g)|x1:n] → 1, as n → ∞, g a.s.
and
Eπ
[d(g, fψ)|x1:n] → 0, g a.s.
Extension to general parametric distributions by the cdf transform Fθ(x)

2 [B] Inference
Difﬁculties:
• identiﬁability
• label switching
• loss function
• ordering constraints
• prior determination

Intro/Inference:Identifability/Algorithms/Beyond ﬁxed k 15
Central (non)identiﬁability issue
k
j=1 pjf(y|θj) is invariant to relabelling of the components
Consequence
((pj, θj))1≤i≤k
only known up to a permutation τ ∈ Sk

Example 1. Two component normal mixture
p N (µ1, 1) + (1 − p) N (µ2, 1)
where p = 0.5 is known
The parameters µ1 and µ2 are identiﬁable

Bimodal likelihood [500 observations and (µ1, µ2, p) = (0, 2.5, 0.7)]
−1 0 1 2 3 4
−101234
µ1
µ2

Inﬂuence of p on the modes
−2 0 2 4
−2024
µ1
µ2
p=0.5
−2 0 2 4
−2024
µ1
µ2
p=0.6
−2 0 2 4
−2024
µ1
µ2
p=0.75
−2 0 2 4
−2024
µ1
µ2
p=0.85

Intro/Inference:Com’ics/Algorithms/Beyond ﬁxed k 19
Combinatorics
For a normal mixture,
pϕ(x; µ1, σ1) + (1 − p)ϕ(x; µ2, σ2)
under the pseudo-conjugate priors (i = 1, 2)
µi|σi ∼ N (ζi, σ2
i /λi), σ−2
i ∼ G a(νi/2, s2
i /2), p ∼ Be(α, β) ,
the posterior is
π (θ, p|x) ∝
n
j=1
{pϕ(xj; µ1, σ1) + (1 − p)ϕ(xj; µ2, σ2)} π (θ, p) .
Computation: complexity O(2n)

Missing variables (2)
Auxiliary variables z = (z1, . . . , zn) ∈ Z associated with observations
x = (x1, . . . , xn)
For (n1, . . . , nk), where n1 + . . . + nk = n,
Zj = z :
n
i=1
Izi=1 = n1, . . . ,
n
i=1
Izi=k = nk
which consists of all allocations with the given allocation vector (n1, . . . , nk) (and
j corresponding lexicographic order).

Number of nonnegative integer solutions of this decomposition of n
r =
n + k − 1
n
.
Partition
Z = ∪r
i=1Zi
[Number of partition sets of order O(nk−1
)]

Posterior decomposition
π θ, p|x =
r
i=1 z∈Zi
ω (z) π θ, p|x, z
with ω (z) posterior probability of allocation z.
Corresponding representation of posterior expectation of θ, p
r
i=1 z∈Zi
ω (z) Eπ
θ, p|x, z

Very sensible from an inferential point of view:
1. consider each possible allocation z of the dataset,
2. allocates a posterior probability ω (z) to this allocation, and
3. constructs a posterior distribution for the parameters conditional on this
allocation.
All possible allocations: complexity O(kn
)

Posterior
For a given permutation/allocation (kt), conditional posterior distribution
π(θ|(kt)) = N ξ1(kt),
σ2
1
n1 +
× IG((ν1 + )/2, s1(kt)/2)
×N ξ2(kt),
σ2
2
n2 + n −
× IG((ν2 + n − )/2, s2(kt)/2)
×Be(α + , β + n − )

where
¯x1(kt) = 1
t=1 xkt , ˆs1(kt) = t=1(xkt − ¯x1(kt))2
,
¯x2(kt) = 1
n−
n
t= +1 xkt , ˆs2(kt) =
n
t= +1(xkt − ¯x2(kt))2
and
ξ1(kt) =
n1ξ1 + ¯x1(kt)
n1 +
, ξ2(kt) =
n2ξ2 + (n − )¯x2(kt)
n2 + n −
,
s1(kt) = s2
1 + ˆs2
1(kt) +
n1
n1 +
(ξ1 − ¯x1(kt))2
,
s2(kt) = s2
2 + ˆs2
2(kt) +
n2(n − )
n2 + n −
(ξ2 − ¯x2(kt))2
,
posterior updates of the hyperparameters

Intro/Inference:Scarcity/Algorithms/Beyond ﬁxed k 26
Scarcity
Frustrating barrier:
Almost all posterior probabilities ω (z) are zero
Example 2. Galaxy dataset with k = 4 components, Set of allocations with the
partition sizes (n1, n2, n3, n4) = (7, 34, 38, 3) with probability 0.59 and
(n1, n2, n3, n4) = (7, 30, 27, 18) with probability 0.32, and no other size group
getting a probability above 0.01.

Example 3. Normal mean mixture
For a same normal prior,
µ1, µ2 ∼ N(0, 10)
posterior weight associated with a z such that
n
i=1
Izi=1 = l
is
ω (z) ∝ (l + 1/4)(n − l + 1/4) pl
(1 − p)n−l
,
Thus posterior distribution of z only depends on l and repartition of the partition size
follows a distribution close to a binomial B(n, p) distribution.

For two different normal priors on the means,
µ1 ∼ N(0, 4) , µ2 ∼ N(2, 4) ,
posterior weight of z is
ω (z) ∝ (l + 1/4)(n − l + 1/4) pl
(1 − p)n−l
×
exp −[(l + 1/4)ˆs1 (z) + l{¯x1 (z)}2
/4]/2 ×
exp −[(n − l + 1/4)ˆs2 (z) + (n − l){¯x2 (z) − 2}2
/4]/2
where
¯x1 (z) =
1
l
n
i=1
Izi=1xi, ¯x2 (z) =
1
n − l
n
i=1
Izi=2xi
ˆs1 (z) =
n
i=1
Izi=1 (xi − ¯x1 (z))
2
, ˆs2 (z) =
n
i=1
Izi=2 (xi − ¯x2 (z))
2
.

Computation of exact weight of all partition sizes l impossible
Monte Carlo experiment by drawing z’s at random.
Example 4. Sample of 45 points simulated when p = 0.7, µ1 = 0 and µ2 = 2.5
leads to
l = 23 as the most likely partition, with a weight approximated by 0.962
For l = 27, weight approximated by 4.56 10−11
.

l=23
log(ω(kt))
−750 −700 −650 −600 −550
0.0000.0050.0100.0150.020
l=29
log(ω(kt))
−750 −700 −650 −600 −550
0.0000.0050.0100.0150.0200.025

Ten highest log-weights ω (z) (up to an additive constant)
0 10 20 30 40
−700−650−600−550
l

Most likely allocation z for a simulated dataset of 45 observations
−2 −1 0 1 2 3 4
0.00.10.20.30.40.5

Caution! We simulated 450, 000 permutations, to be compared with a
total of 245
permutations!

Intro/Inference: Priors/Algorithms/Beyond ﬁxed k 34
Prior selection
Basic difﬁculty: if exchangeable prior used on
θ = (θ1, . . . , θk)
all marginals on the θi’s are identical
Posterior expectation of θ1 identical to posterior expectation of θ2!

Identifiability constraints
Prior restriction by identifiability constraint on the mixture parameters, for instance
by ordering the means [or the variances or the weights]
Not so innocuous!
• truncation unrelated to the topology of the posterior distribution
• may induce a posterior expectation in a low probability region
• modifies the prior modelling
θ(1)
−4 −3 −2 −1 0
0.00.20.40.60.8
θ(10)
−1.0 −0.5 0.0 0.5 1.0
0.00.20.40.60.81.01.21.4
θ(19)
1 2 3 4
0.00.20.40.60.8

• with many components, ordering in terms of one type of parameter is unrealistic
• poor estimation (posterior mean)
-2 -1 0 1 2 3
0.00.10.20.30.40.5
Gibbs sampling
p
-2 -1 0 1 2 3
0.00.10.20.30.40.5
theta
-2 -1 0 1 2 3
0.00.10.20.30.40.5
tau
-2 -1 0 1 2 3
0.00.10.20.30.40.5
Random walk
p
-2 -1 0 1 2 3
0.00.10.20.30.40.5
theta
-2 -1 0 1 2 3
0.00.10.20.30.40.5
tau
-2 -1 0 1 2 3
0.00.10.20.30.40.5
Langevin
p
-2 -1 0 1 2 3
0.00.10.20.30.40.5
theta
-2 -1 0 1 2 3
0.00.10.20.30.40.5
tau
-2 -1 0 1 2 3
0.00.10.20.30.40.5
Tempered random walk
p
-2 -1 0 1 2 3
0.00.10.20.30.40.5
theta
-2 -1 0 1 2 3
0.00.10.20.30.40.5
tau
• poor exploration (MCMC)

Improper priors??
Independent improper priors,
π (θ) =
k
i=1
πi(θi) ,
cannot be used since, if
πi(θi)dθi = ∞
then for every n,
π(θ, p|x)dθdp = ∞
Still, some improper priors can be used when the impropriety is on a common
(location/scale) parameter
[CPR & Titterington, 1998]

Intro/Inference: Loss/Algorithms/Beyond ﬁxed k 38
Loss functions
Once a sample can be produced from the unconstrained posterior distribution, an
ordering constraint can be imposed ex post
[Stephens, 1997]
Good for MCMC exploration

Again, difﬁcult assesment of the true effect of the ordering constraints...
order p1 p2 p3 θ1 θ2 θ3 σ1 σ2 σ3
p 0.231 0.311 0.458 0.321 -0.55 2.28 0.41 0.471 0.303
θ 0.297 0.246 0.457 -1.1 0.83 2.33 0.357 0.543 0.284
σ 0.375 0.331 0.294 1.59 0.083 0.379 0.266 0.34 0.579
true 0.22 0.43 0.35 1.1 2.4 -0.95 0.3 0.2 0.5
−4 −2 0 2 4
0.00.10.20.30.40.50.6
x
y

Pivotal quantity
For a permutation τ ∈ Sk, corresponding permutation of the parameter
τ(θ, p) = (θτ(1), . . . , θτ(k)), (pτ(1), . . . , pτ(k))
does not modify the value of the likelihood (& posterior under exchangeability).
Label switching phenomenon

Reordering scheme:
Based on a simulated sample of size M,
(i) compute the pivot (θ, p)(i∗
)
such that
i∗
= arg max
i=1,...,M
π((θ, p)(i)
|x)
Monte Carlo approximation of the MAP estimator of (θ, p).
(ii) For i ∈ {1, . . . , M}:
1. Compute
τi = arg min
τ∈Sk
d τ((θ, p)(i)
), (θ, p)(i∗
)
2. Set (θ, p)(i)
= τi((θ, p)(i)
).

Step (ii) chooses the reordering the closest to the MAP estimator
After reordering, the Monte Carlo posterior expectation is
M
j=1
(θi)(j)
M .

Probabilistic alternative
[Jasra, Holmes & Stephens, 2004]
Also put a prior on permutations σ ∈ Sk
Deﬁnes a speciﬁc model M based on a preliminary estimate (e.g., by relabelling)
Computes
θj =
1
N
n
t=1 σ∈Sk
θ
(t)
σ(j)p(σ|θ(t)
, M)

3 Computations

Intro/Inference/Algorithms: Gibbs/Beyond ﬁxed k 45
3.1 Gibbs sampling
Same idea as the EM algorithm: take advantage of the missing data representation
General Gibbs sampling for mixture models
0. Initialization: choose p(0)
and θ(0)
arbitrarily
1. Step t. For t = 1, . . .
1.1 Generate z
(t)
i (i = 1, . . . , n) from (j = 1, . . . , k)
P z
(t)
i = j|p
(t−1)
j , θ
(t−1)
j , xi ∝ p
(t−1)
j f xi|θ
(t−1)
j
1.2 Generate p(t)
from π(p|z(t)
),
1.3 Generate θ(t)
from π(θ|z(t)
, x).

Trapping states
Gibbs sampling may lead to trapping states, concentrated local modes that require
an enormous number of iterations to escape from, e.g., components with a small
number of allocated observations and very small variance
[Diebolt & CPR, 1990]
Also, most MCMC samplers fail to reproduce the permutation invariance of the
posterior distribution, that is, do not visit the k! replications of a given mode.
[Celeux, Hurn & CPR, 2000]

Example 5. Mean normal mixture
0. Initialization. Choose µ
(0)
1 and µ
(0)
2 ,
1. Step t. For t = 1, . . .
1.1 Generate z
(t)
i (i = 1, . . . , n) from
P z
(t)
i = 1 = 1−P z
(t)
i = 2 ∝ p exp −
1
2
xi − µ
(t−1)
1
2
1.2 Compute n
(t)
j =
n
i=1
Iz
(t)
i =j
and (sx
j )(t)
=
n
i=1
Iz
(t)
i =j
xi
1.3 Generate µ
(t)
j (j = 1, 2) from N
λδ + (sx
j )(t)
λ + n
(t)
j
,
1
λ + n
(t)
j
.

−1 0 1 2 3 4
−101234
µ1
µ2

But...
−1 0 1 2 3 4
−101234
µ1
µ2

Intro/Inference/Algorithms: HM/Beyond ﬁxed k 50
3.2 Metropolis–Hastings
Missing data structure is not necessary for MCMC implementation: the mixture
likelihood is available in closed form and computable in O(kn) time:

Step t. For t = 1, . . .
1.1 Generate (θ, p) from q θ, p|θ(t−1)
, p(t−1)
,
1.2 Compute
r =
f(x|θ, p)π(θ, p)q(θ(t−1)
, p(t−1)
|θ, p)
f(x|θ(t−1)
, p(t−1))π(θ(t−1)
, p(t−1))q(θ, p|θ(t−1)
, p(t−1))
,
1.3 Generate u ∼ U[0,1]
If r < u then (θ(t)
, p(t)
) = (θ, p)
else (θ(t)
, p(t)
) = (θ(t−1)
, p(t−1)
).

Proposal
Use of random walk inefficient for constrained parameters like the weights and the
variances.
Reparameterisation:
For the weights p, overparameterise the model as
pj = wj
k
l=1
wl , wj > 0
[Cappé, Rydén & CPR]
The wj’s are not identifiable, but this is not a problem.
Proposed move on the wj’s is
log(wj) = log(w
(t−1)
j ) + uj, uj ∼ N (0, ζ2
)

Gaussian random walk proposal
µ1 ∼ N µ
(t−1)
1 , ζ2
and µ2 ∼ N µ
(t−1)
2 , ζ2
associated with

0. Initialization. Choose µ
(0)
1 and µ
(0)
2
1. Step t. For t = 1, . . .
1.1 Generate µj (j = 1, 2) from N µ
(t−1)
j , ζ2
,
1.2 Compute
r =
f (x|µ1, µ2, ) π (µ1, µ2)
f x|µ
(t−1)
1 , µ
(t−1)
2 π µ
(t−1)
1 , µ
(t−1)
2
,
1.3 Generate u ∼ U[0,1]
If r < u then µ
(t)
1 , µ
(t)
2 = (µ1, µ2)
else µ
(t)
1 , µ
(t)
2 = µ
(t−1)
1 , µ
(t−1)
2 .

−1 0 1 2 3 4
−101234
µ1
µ2

Intro/Inference/Algorithms: PMC/Beyond ﬁxed k 56
3.3 Population Monte Carlo
Idea Apply dynamic importance sampling to simulate a sequence of iid samples
x(t)
= (x
(t)
1 , . . . , x(t)
n )
iid
≈ π(x)
where t is a simulation iteration index (at sample level)

Dependent importance sampling
The importance distribution of the sample x(t)
qt(x(t)
|x(t−1)
)
can depend on the previous sample x(t−1)
in any possible way as long as marginal
distributions
qit(x) = qt(x(t)
) dx
(t)
−i
can be expressed to build importance weights
it =
π(x
(t)
i )
qit(x
(t)
i )

Special case
qt(x(t)
|x(t−1)
) =
n
i=1
qit(x
(t)
i |x(t−1)
)
[Independent proposals]
In that case,
var ˆIt =
1
n2
n
i=1
var
(t)
i h(x
(t)
i ) .

Population Monte Carlo (PMC)
Use previous sample (x(t)
) marginaly distributed from π
E ith(X
(t)
i ) = E
π(x
(t)
i )
qit(x
(t)
i )
h(x
(t)
i )qit(x
(t)
i ) dx
(t)
i = E [Eπ
[h(X)]]
to improve on approximation of π

Resampling
Over iterations (in t), weights may degenerate:
e.g.,
1 1
while 2, . . . , n negligible
Use instead Rubin’s (1987) systematic resampling: at each iteration resample the
x
(t)
i ’s according to their weight
(t)
i and reset the weights to 1 (preserves
“unbiasedness”/increases variance)

PMC for mixtures
Proposal distributions qit that simulate (θ
(i)
(t), p
(i)
(t)) and associated importance
weight
ρ
(i)
(t) =
f x|θ
(i)
(t), p
(i)
(t) π θ
(i)
(t), p
(i)
(t)
qit θ
(i)
(t), p
(i)
(t)
, i = 1, . . . , M
Approximations of the form
1
M
M
i=1
ρ
(i)
(t)
M
l=1 ρ
(l)
(t)
h θ
(i)
(t), p
(i)
(t)
give (almost) unbiased estimators of Eπ
x[h(θ, p)],

0. Initialization. Choose θ
(1)
(0), . . . , θ
(M)
(0) and p
(1)
(0), . . . , p
(M)
(0)
1. Step t. For t = 1, . . . , T
1.1 For i = 1, . . . , M
1.1.1 Generate θ
(i)
(t), p
(i)
(t) from qit (θ, p),
1.1.2 Compute
ρ(i)
= f x|θ
(i)
(t), p
(i)
(t) π θ
(i)
(t), p
(i)
(t) qit θ
(i)
(t), p
(i)
(t) ,
1.2 Compute ω(i)
= ρ(i)
M
l=1
ρ(l)
,
1.3 Resample M values with replacement from the θ
(i)
(t), p
(i)
(t) ’s using
the weights ω(i)

Implementation without the Gibbs augmentation step, using normal random walk
proposals based on the previous sample of (µ1, µ2)’s as in Metropolis–Hastings.
Selection of a “proper” scale:
bypassed by the adaptivity of the PMC algorithm
Several proposals associated with a range of variances vk, k = 1, . . . , K.
At each step, new variances can be selected proportionally to the performances of
the scales vk on the previous iterations, for instance, proportional to its
non-degeneracy rate

Step t. For t = 1, . . . , T
1.1 For i = 1, . . . , M
1.1.1 Generate k from M (1; r1, . . . , rK),
1.1.2 Generate (µj)
(i)
(t) (j = 1, 2) from N (µj)
(i)
(t−1) , vk
1.1.4 Compute
ρ(i)
=
f x|(µ1)
(i)
(t), (µ2)
(i)
(t) π (µ1)
(i)
(t), (µ2)
(i)
(t)
K
l=1
2
j=1
ϕ (µj)
(i)
(t); (µ1)
(i)
(t−1), vl
,
1.2 Compute ω(i)
= ρ(i)
M
l=1
ρ(l)
,
1.3 Resample the (µ1)
(i)
(t), (µ2)
(i)
(t)’s using the weights ω(i)
1.4 Update the rl’s: rl is proportional to the number of (µ1)
(i)
(t), (µ2)
(i)
(t)’s with
variance vl resampled.

−1 0 1 2 3 4
−101234
µ1
µ2

4 Unknown number of components
When k number of components is unknown, there are several models
Mk
with corresponding parameter sets
Θk
in competition.

Intro/Inference/Algorithms/Beyond ﬁxed k: RJ 67
Reversible jump MCMC
Reversibility constraint put on dimension-changing moves that bridge the sets Θk /
the models Mk
[Green, 1995]
Local reversibility for each pair (k1, k2) of possible values of k: supplement Θk1
and Θk2 with adequate artiﬁcial spaces in order to create a bijection between them:

Basic steps
Choice of probabilities
πij
j
πij = 1
of jumping to model Mkj while in model Mki
θ(k1)
is completed by a simulation u1 ∼ g1(u1) into (θ(k1)
, u1) and θ(k2)
by
u2 ∼ g2(u2) into (θ(k2)
, u2)
(θ(k2)
, u2) = Tk1→k2
(θ(k1)
, u1),

Green reversible jump algorithm
0. At iteration t, if x(t)
= (m, θ(m)
),
1. Select model Mn with probability πmn,
2. Generate umn ∼ ϕmn(u),
3. Set (θ(n)
, vnm) = Tm→n(θ(m)
, umn),
4. Take x(t+1)
= (n, θ(n)
) with probability
min
π(n, θ(n)
)
π(m, θ(m))
πnmϕnm(vnm)
πmnϕmn(umn)
∂Tm→n(θ(m)
, umn)
∂(θ(m), umn)
, 1 ,
and take x(t+1)
= x(t)
otherwise.

Example 8. For a normal mixture
Mk :
k
j=1
pjkN(µjk, σ2
jk) ,
restriction to moves from Mk to neighbouring models Mk+1 and Mk−1.
[Richardson & Green, 1997]

Birth and death steps
birth adds a new normal component generated from the prior
death removes one of the k components at random.
Birth acceptance probability
min
π(k+1)k
πk(k+1)
(k + 1)!
k!
πk+1(θk+1)
πk(θk) (k + 1)ϕk(k+1)(uk(k+1))
, 1
= min
π(k+1)k
πk(k+1)
(k + 1)
(k)
k+1(θk+1) (1 − pk+1)k−1
k(θk)
, 1 ,
where (k) is the prior probability of model Mk

Proposal that can work well in some settings, but can also be inefﬁcient (i.e. high
rejection rate), if the prior is vague.
Alternative: devise more local jumps between models,
(i). split



pjk = pj(k+1) + p(j+1)(k+1)
pjkµjk = pj(k+1)µj(k+1) + p(j+1)(k+1)µ(j+1)(k+1)
pjkσ2
jk = pj(k+1)σ2
j(k+1) + p(j+1)(k+1)σ2
(j+1)(k+1)
(ii). merge (reverse)

Histogram and rawplot of 100, 000 k’s produced by RJMCMC
Histogram of k
k
1 2 3 4 5
0.00.10.20.30.4
0e+00 2e+04 4e+04 6e+04 8e+04 1e+05
12345
Rawplot of k
k

Normalised enzyme dataset
0.0 0.5 1.0 1.5 2.0 2.5 3.0
0.00.51.01.52.02.53.0

Intro/Inference/Algorithms/Beyond ﬁxed k: b&d 75
Birth and Death processes
Use of an alternative methodology based on a Birth–&-Death (point) process
Idea: Create a Markov chain in continuous time, i.e. a Markov jump process,
moving between models Mk, by births (to increase the dimension), deaths (to
decrease the dimension), and other moves
[Preston, 1976; Ripley, 1977; Stephens, 1999]

Time till next modiﬁcation (jump) is exponentially distributed with rate depending on
current state
Remember: if ξ1, . . . , ξv are exponentially distributed, ξi ∼ Exp(λi),
min ξi ∼ Exp
i
λi
Difference with MH-MCMC: Whenever a jump occurs, the corresponding move is
always accepted. Acceptance probabilities replaced with holding times.

Balance condition
Sufficient to have detailed balance
L(θ)π(θ)q(θ, θ ) = L(θ )π(θ )q(θ , θ) for all θ, θ
for ˜π(θ) ∝ L(θ)π(θ) to be stationary.
Here q(θ, θ ) rate of moving from state θ to θ .
Possibility to add split/merge and fixed-k processes if balance condition satisfied.
[Cappé, Rydén & CPR, 2002]

Case of mixtures
Representation as a (marked) point process
Φ = {pj, (µj, σj)}
j
Birth rate λ0 (constant) and proposal from the prior
Death rate δj(Φ) for removal of component j
Overall death rate
k
j=1
δj(Φ) = δ(Φ)
Balance condition
(k + 1) d(Φ ∪ {p, (µ, σ)}) L(Φ ∪ {p, (µ, σ)}) = λ0L(Φ)
π(k)
π(k + 1)
with
d(Φ {pj, (µj, σj)}) = δj(Φ)

Stephen’s original algorithm:
For v = 0, 1, · · · , V
t ← v
Run till t > v + 1
1. Compute δj(Φ) =
L(Φ|Φj)
L(Φ)
λ0λ1
2. δ(Φ) ←
k
j=1
δj(Φj), ξ ← λ0 + δ(Φ), u ∼ U([0, 1])
3. t ← t − u log(u)

4. With probability δ(Φ)/ξ
Remove component j with probability δj(Φ)/δ(Φ)
k ← k − 1
p ← p /(1 − pj) ( = j)
Otherwise,
Add component j from the prior π(µj, σj)
pj ∼ Be(γ, kγ)
p ← p (1 − pj) ( = j)
k ← k + 1
5. Run I MCMC(k, β, p)

Rescaling time
In discrete-time RJMCMC, let the time unit be 1/N, put
βk = λk/N and δk = 1 − λk/N
As N → ∞, each birth proposal will be accepted, and having k components births
occur according to a Poisson process with rate λk

while component (w, φ) dies with rate
lim
N→∞
Nδk+1 ×
1
k + 1
× min(A−1
, 1)
= lim
N→∞
N
1
k + 1
× likelihood ratio
−1
×
βk
δk+1
×
b(w, φ)
(1 − w)k−1
= likelihood ratio
−1
×
λk
k + 1
×
b(w, φ)
(1 − w)k−1
.
Hence
“RJMCMC→BDMCMC”

Even closer to RJMCM
Exponential (random) sampling is not necessary, nor is continuous time!
Estimator of
I = g(θ)π(θ)dθ
by
ˆI =
1
N
N
1
g(θ(τi))
where {θ(t)} continuous time MCMC process and τ1, . . . , τN sampling instants.

New notations:
1. Tn time of the n-th jump of {θ(t)} with T0 = 0
2. {θn} jump chain of states visited by {θ(t)}
3. λ(θ) total rate of {θ(t)} leaving state θ
Then holding time Tn − Tn−1 of {θ(t)} in its n-th state θn exponential rv with rate
λ(θn)

Rao–Blackwellisation
If sampling interval goes to 0, limiting case
ˆI∞ =
1
TN
N
n=1
g(θn−1)(Tn − Tn−1)
Rao–Blackwellisation argument: replace ˆI∞ with
˜I =
1
TN
N
n=1
g(θn−1)
λ(θn−1)
=
1
TN
N
n=1
E[Tn − Tn−1 | θn−1] g(θn−1) .
Conclusion: Only simulate jumps and store average holding times!
Completely remove continuous time feature

Example 9. Galaxy dataset
Comparison of RJMCMC and CTMCMC in the Galaxy dataset
[Cappé & al., 2002]
Experiment:
• Same proposals (same C code)
• Moves proposed in equal proportions by both samplers (setting the probability
PF
of proposing a fixed k move in RJMCMC equal to the rate ηF
at which
fixed k moves are proposed in CTMCMC, and likewise PB
= ηB
for the birth
moves)
• Rao–Blackwellisation
• Number of jumps (number of visited configurations) in CTMCMC == number of
iterations of RJMCMC

Results:
• If one algorithm performs poorly, so does the other. (For RJMCMC
manifested as small A’s—birth proposals are rarely accepted—while for
BDMCMC manifested as large δ’s—new components are indeed born but die
again quickly.)
• No signiﬁcant difference between samplers for birth and death only
• CTMCMC slightly better than RJMCMC with split-and-combine moves
• Marginal advantage in accuracy for split-and-combine addition
• For split-and-combine moves, computation time associated with one step of
continuous time simulation is about 5 times longer than for reversible jump
simulation.

Box plot for the estimated posterior on k obtained from 200 independent runs:
RJMCMC (top) and BDMCMC (bottom). The number of iterations varies from 5 000
(left), to 50 000 (middle) and 500 000 (right).
2 4 6 8 10 12 14
0
0.1
0.2
0.3
CT (500 000 it.)
k
2 4 6 8 10 12 14
0
0.1
0.2
0.3
RJ (500 000 it.)
2 4 6 8 10 12 14
0
0.1
0.2
0.3
CT (50 000 it.)
k
2 4 6 8 10 12 14
0
0.1
0.2
0.3
RJ (50 000 it.)
2 4 6 8 10 12 14
0
0.1
0.2
0.3
CT (5 000 it.)
posteriorprobability
k
2 4 6 8 10 12 14
0
0.1
0.2
0.3
RJ (5 000 it.)

Same for the estimated posterior on k obtained from 500 independent runs: Top
RJMCMC and bottom, CTMCMC. The number of iterations varies from 5 000 (left
plots) to 50 000 (right plots).
2 4 6 8 10 12 14
0
0.1
0.2
0.3
CT (50 000 it.)
k
2 4 6 8 10 12 14
0
0.1
0.2
0.3
RJ (50 000 it.)
2 4 6 8 10 12 14
0
0.1
0.2
0.3
CT (5 000 it.)
k
2 4 6 8 10 12 14
0
0.1
0.2
0.3
RJ (5 000 it.)

Bayesian Nonparametrics Workshop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (14)

Similar to Bayesian Nonparametrics Workshop

Similar to Bayesian Nonparametrics Workshop (20)

More from Christian Robert

More from Christian Robert (17)

Recently uploaded

Recently uploaded (20)

Bayesian Nonparametrics Workshop