Posterior Concentration under Local and Sup-Norm Losses
1. On adaptation for the posterior distribution under local and sup-norm losses
Judith Rousseau, Marc Hoffmann and Johannes Schmidt-Hieber
ENSAE-CREST and CEREMADE, Université Paris-Dauphine
January
2. Outline
1 Bayesian nonparametrics: posterior concentration
    Generalities
    Idea of the proof
    Examples of loss functions where things become less nice
2 Bayesian upper and lower bounds
    Lower bound
    The case of $\ell_\infty$ and adaptation
3 Links with confidence bands
4. Generalities
Model: $Y_1^n \mid \theta \sim p_\theta^n$ (density wrt $\mu$), $\theta \in \Theta$
A priori: $\theta \sim \Pi$, the prior distribution
→ posterior distribution
$$d\Pi(\theta \mid Y_1^n) = \frac{p_\theta^n(Y_1^n)\, d\Pi(\theta)}{m(Y_1^n)}, \qquad Y_1^n = (Y_1, \dots, Y_n)$$
Posterior concentration: $d(\cdot,\cdot)$ a loss on $\Theta$, $\theta_0 \in \Theta$ the true parameter:
$$E_{\theta_0}\big(\Pi[U_{\epsilon_n} \mid Y_1^n]\big) = 1 + o(1), \qquad U_{\epsilon_n} = \{\theta :\, d(\theta, \theta_0) \le \epsilon_n\}, \quad \epsilon_n \downarrow 0$$
Why should we care?
• Gives insight into some aspects of the prior
• Gives some insight into inference: interpretation of posterior credible regions (loosely)
• Helps in understanding the links between frequentist and Bayesian approaches
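To make the posterior display concrete, here is a minimal numerical sketch (my illustration, not part of the talk) that evaluates Bayes' rule on a discretized $\Theta$: a toy Gaussian location model with a uniform grid prior, so $d\Pi(\theta \mid Y_1^n) \propto p_\theta^n(Y_1^n)\,d\Pi(\theta)$ becomes a normalized weight vector and the posterior mass of $U_\epsilon$ can be read off directly.

```python
import numpy as np

# Minimal illustration of Bayes' rule on a grid (assumed toy model,
# not the talk's nonparametric setting): Y_i | theta ~ N(theta, 1) i.i.d.
rng = np.random.default_rng(0)
theta0 = 0.3                       # true parameter
n = 200
y = theta0 + rng.standard_normal(n)

grid = np.linspace(-2, 2, 401)     # discretized Theta
log_prior = np.zeros_like(grid)    # uniform prior on the grid

# log-likelihood l_n(theta) = sum_i log N(y_i; theta, 1), up to constants
log_lik = np.array([-0.5 * np.sum((y - t) ** 2) for t in grid])

log_post = log_prior + log_lik
log_post -= log_post.max()         # stabilize before exponentiating
post = np.exp(log_post)
post /= post.sum()                 # normalize: the m(Y_1^n) denominator

# posterior mass of U_eps = {theta : |theta - theta0| <= eps}
eps = 0.2
print("posterior mass of U_eps:", post[np.abs(grid - theta0) <= eps].sum())
```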
5. Minimax concentration rates over a class $\Theta_\alpha(L)$
$$\sup_{\theta_0 \in \Theta_\alpha(L)} E_{\theta_0}\, \Pi\big[U^c_{M\epsilon_n(\alpha)} \mid Y_1^n\big] = o(1),$$
where $\epsilon_n(\alpha)$ = minimax rate under $d(\cdot,\cdot)$ over $\Theta_\alpha(L)$.
6. Examples of models and losses for which nice results exist
Density estimation: $Y_i \sim p_\theta$ i.i.d.
$$d(p_\theta, p_{\theta'})^2 = \int \big(\sqrt{p_\theta} - \sqrt{p_{\theta'}}\big)^2(x)\, dx \qquad \text{or} \qquad d(p_\theta, p_{\theta'}) = \int |p_\theta - p_{\theta'}|(x)\, dx$$
Regression function: $Y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal N(0, \sigma^2)$, $\theta = (f, \sigma)$
$$d(p_\theta, p_{\theta'}) = \|f - f'\|_2 \qquad \text{or} \qquad d(p_\theta, p_{\theta'})^2 = n^{-1} \sum_{i=1}^n H^2\big(p_\theta(\cdot \mid x_i),\, p_{\theta'}(\cdot \mid x_i)\big), \quad H = \text{Hellinger}$$
White noise:
$$dY(t) = f(t)\, dt + n^{-1/2}\, dW(t) \iff Y_i = \theta_i + n^{-1/2}\epsilon_i, \quad i \in \mathbb N, \qquad d(p_\theta, p_{\theta'}) = \|f - f'\|_2$$
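As a quick numerical companion (mine, not the talk's), both density losses above can be approximated by quadrature on a grid; the two toy Gaussian densities are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import norm

# Numerically approximate the squared Hellinger and L1 distances
# between two densities on a grid (toy choice: two Gaussians).
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=0.5, scale=1.2)

hellinger_sq = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx
l1 = np.sum(np.abs(p - q)) * dx
print(f"H^2 = {hellinger_sq:.4f}, L1 = {l1:.4f}")
```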
7. Examples: functional classes
$\Theta_\alpha(L)$ = Hölder class $H(\alpha, L)$; $\epsilon_n(\alpha) = n^{-\alpha/(2\alpha+1)}$ = minimax rate over $H(\alpha, L)$.
Density example: Hellinger loss. Prior = DPM (Dirichlet process mixture):
$$f(x) = f_{P,\sigma}(x) = \int \phi_\sigma(x - \mu)\, dP(\mu), \qquad \sigma \sim I\Gamma(a, b), \quad P \sim DP(A, G_0)$$
$$\sup_{f_0 \in \Theta_\alpha(L)} E_{f_0}\, \Pi\big[U^c_{M(n/\log n)^{-\alpha/(2\alpha+1)}}(f_0) \mid Y_1^n\big] = o(1),$$
where $U_\epsilon(f_0) = \{f :\, h(f_0, f) \le \epsilon\}$. [Is the $\log n$ term necessary?]
$$\Rightarrow\ E_{f_0}\, h(\hat f, f_0)^2 \lesssim (n/\log n)^{-2\alpha/(2\alpha+1)}, \qquad \hat f(x) = E^\pi[f(x) \mid Y^n]$$
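To see what this prior puts mass on, here is a small sketch (my illustration; the hyperparameters $A$, $a$, $b$, the base measure $G_0$ and the truncation level are arbitrary choices) drawing one random density $f_{P,\sigma}$ via the truncated stick-breaking representation of $P \sim DP(A, G_0)$.

```python
import numpy as np
from scipy.stats import invgamma, norm

# Draw one random density f_{P,sigma}(x) = int phi_sigma(x - mu) dP(mu)
# with P ~ DP(A, G0), via truncated stick-breaking (assumed hyperparameters).
rng = np.random.default_rng(1)
A, a, b = 1.0, 2.0, 1.0                         # DP mass, IGamma(a, b) params
G0 = lambda size: rng.normal(0.0, 2.0, size)    # base measure G0 = N(0, 4)
K = 200                                         # stick-breaking truncation

v = rng.beta(1.0, A, K)                         # stick proportions
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # mixture weights
mu = G0(K)                                      # atom locations from G0
sigma = invgamma.rvs(a, scale=b, random_state=rng)  # bandwidth ~ IGamma(a, b)

x = np.linspace(-6, 6, 500)
f = sum(w_k * norm.pdf(x, m_k, sigma) for w_k, m_k in zip(w, mu))
print("mass on grid:", f.sum() * (x[1] - x[0]))  # ~1 up to truncation error
```

Truncating at $K$ sticks leaves only an exponentially small residual weight, so the draw is a faithful picture of the countable Gaussian mixture the DPM prior generates.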
9. Outline of the proof: tests and KL
$U_n = U_{M\bar\epsilon_n}$ with $\bar\epsilon_n = (n/\log n)^{-\alpha/(2\alpha+1)}$, and $\ell_n(\theta) = \log p_\theta^n(Y_1^n)$:
$$\Pi[U_n^c \mid Y_1^n] = \frac{\int_{U_n^c} e^{\ell_n(\theta) - \ell_n(\theta_0)}\, d\Pi(\theta)}{\int_\Theta e^{\ell_n(\theta) - \ell_n(\theta_0)}\, d\Pi(\theta)} =: \frac{N_n}{D_n}$$
For tests $\phi_n = \phi_n(Y_1^n) \in [0, 1]$:
$$E_{\theta_0}\big(\Pi[U_n^c \mid Y_1^n]\big) \le E_{\theta_0}[\phi_n] + P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big) + e^{(c+\tau) n \bar\epsilon_n^2} \int_{U_n^c} E_\theta[1 - \phi_n]\, d\Pi(\theta)$$
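For completeness, here is where the three terms come from (the standard decomposition, my paraphrase): write $\Pi[U_n^c \mid Y_1^n] \le \phi_n + (1-\phi_n)\,\Pi[U_n^c \mid Y_1^n]$ and split the last term on the event $\{D_n \ge e^{-cn\bar\epsilon_n^2}\}$. On that event,
$$E_{\theta_0}\big[(1-\phi_n)\,\Pi[U_n^c \mid Y_1^n]\big] \le e^{cn\bar\epsilon_n^2}\, E_{\theta_0}\big[(1-\phi_n)\, N_n\big] = e^{cn\bar\epsilon_n^2} \int_{U_n^c} E_\theta[1-\phi_n]\, d\Pi(\theta),$$
using Fubini and the change of measure $E_{\theta_0}\big[(1-\phi_n)\, e^{\ell_n(\theta)-\ell_n(\theta_0)}\big] = E_\theta[1-\phi_n]$; the complementary event contributes $P_{\theta_0}(D_n < e^{-cn\bar\epsilon_n^2})$, and the extra factor $e^{\tau n \bar\epsilon_n^2}$ on the slide is harmless slack.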
10. Constraints
$$E_{\theta_0}[\phi_n] = o(1) \quad \& \quad \sup_{d(\theta, \theta_0) > M\bar\epsilon_n} E_\theta[1 - \phi_n] = o\big(e^{-cn\bar\epsilon_n^2}\big) \quad \to\ \text{constrains } d(\cdot,\cdot)$$
We need $P_{\theta_0}\big(D_n < e^{-cn\bar\epsilon_n^2}\big) = o(1)$:
$$D_n \ge \int_{S_n} e^{\ell_n(\theta) - \ell_n(\theta_0)}\, d\Pi(\theta) \ge e^{-2n\bar\epsilon_n^2}\, \Pi\big(S_n \cap \{\ell_n(\theta) - \ell_n(\theta_0) > -2n\bar\epsilon_n^2\}\big)$$
OK if $S_n = \{\theta :\, KL(p_{\theta_0}^n, p_\theta^n) \le n\bar\epsilon_n^2,\ V(p_{\theta_0}^n, p_\theta^n) \le n\bar\epsilon_n^2\}$ and
$$\Pi(S_n) \ge e^{-cn\bar\epsilon_n^2} \quad \to\ \text{links } d(\cdot,\cdot) \text{ with } KL(\cdot,\cdot)$$
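A short reason the prior-mass condition controls $D_n$ (standard argument, my paraphrase): for $\theta \in S_n$, $E_{\theta_0}[\ell_n(\theta_0) - \ell_n(\theta)] = KL(p^n_{\theta_0}, p^n_\theta) \le n\bar\epsilon_n^2$, so by Chebyshev with the variance bound $V \le n\bar\epsilon_n^2$,
$$P_{\theta_0}\big(\ell_n(\theta) - \ell_n(\theta_0) \le -2n\bar\epsilon_n^2\big) \le \frac{V(p^n_{\theta_0}, p^n_\theta)}{(n\bar\epsilon_n^2)^2} \le \frac{1}{n\bar\epsilon_n^2} = o(1)$$
whenever $n\bar\epsilon_n^2 \to \infty$; hence most of the $\Pi$-mass of $S_n$ survives the intersection in the display, and $D_n \gtrsim e^{-(c+2)n\bar\epsilon_n^2}$ with probability tending to one.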
12. White noise model and pointwise or sup-norm loss
White noise:
$$dY(t) = f(t)\, dt + n^{-1/2}\, dW(t) \iff Y_{j,k} = \theta_{j,k} + n^{-1/2}\epsilon_{j,k}, \quad (j,k) \text{ wavelet coefficient indices}$$
pointwise loss: $\ell(f, f_0) = (f(x_0) - f_0(x_0))^2$; sup-norm loss: $\ell_\infty(f, f_0) = \sup_x |f(x) - f_0(x)|$
Random truncation prior:
$$J \sim P, \qquad \theta_{j,k} \sim g(\cdot)\ \forall k,\ \forall j \le J, \qquad \theta_{j,k} = 0\ \forall k,\ \forall j > J$$
$L_2$ concentration:
$$\sup_{\alpha_1 \le \alpha \le \alpha_2}\ \sup_{f_0 \in H(\alpha, L)} E_{f_0} P^\pi\big(\|f - f_0\|_2 > M(n/\log n)^{-\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
$\ell$ concentration: $\forall \alpha\ \exists \epsilon > 0$
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0} P^\pi\big(\ell(f, f_0) > \epsilon\, (n/\log n)^{-2\alpha^2/(2\alpha+1)^2} \mid Y\big) = 1 + o(1)$$
(so under pointwise loss the posterior concentrates only at a strictly slower, deteriorated rate: a negative result)
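A small simulation sketch (my construction: a Gaussian slab $g = \mathcal N(0,1)$, a specific prior on $J$, and a specific toy truth are all assumptions) showing how the random-truncation posterior selects the resolution level $J$ in the sequence model; with a Gaussian $g$ the marginal likelihood of each $J$ is available in closed form.

```python
import numpy as np
from scipy.stats import norm

# Random-truncation prior in the white noise sequence model:
# Y_{j,k} = theta_{j,k} + n^{-1/2} eps_{j,k}; theta_{j,k} ~ g for j <= J, else 0.
# Assumed choices: g = N(0, 1), prior P(J = j) proportional to 2^{-2^j}.
rng = np.random.default_rng(2)
n = 2 ** 12
Jmax = 10
alpha = 1.0                                  # Hoelder smoothness of the truth

# Toy truth with |theta_{j,k}| ~ 2^{-j(alpha + 1/2)}, plus noise of size 1/sqrt(n)
Y = {j: 2.0 ** (-j * (alpha + 0.5)) * rng.choice([-1, 1], 2 ** j)
        + rng.standard_normal(2 ** j) / np.sqrt(n) for j in range(Jmax + 1)}

sigma2 = 1.0 / n
log_post_J = np.zeros(Jmax + 1)
for J in range(Jmax + 1):
    lp = -np.log(2.0) * 2.0 ** J             # log prior weight of J (assumed)
    for j, yj in Y.items():
        var = 1.0 + sigma2 if j <= J else sigma2  # slab + noise, or noise only
        lp += norm.logpdf(yj, scale=np.sqrt(var)).sum()
    log_post_J[J] = lp
post_J = np.exp(log_post_J - log_post_J.max())
post_J /= post_J.sum()
print("posterior mode of J:", post_J.argmax())
# Roughly tracks 2^J ~ (n/log n)^{1/(2 alpha + 1)}, the L2-optimal cutoff.
```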
13. Deterministic truncation prior
$$J := J_n(\alpha):\ 2^{J_n(\alpha)} = (n/\log n)^{1/(2\alpha+1)}, \qquad \theta_{j,k} \sim g(\cdot)\ \forall k,\ \forall j \le J_n(\alpha), \qquad \theta_{j,k} = 0\ \forall k,\ \forall j > J_n(\alpha)$$
$L_2$ concentration (for the $\alpha$ used in the prior):
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0} P^\pi\big(\|f - f_0\|_2 > M(n/\log n)^{-\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
$\ell$ concentration:
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0} P^\pi\big(\ell(f, f_0) > M(n/\log n)^{-2\alpha/(2\alpha+1)} \mid Y\big) = o(1)$$
• What does it mean? Can we have adaptation with $\ell$ or $\ell_\infty$?
14. Why didn't it work?
Same problem as in the frequentist literature (see M. Low's papers):
take $f_1 = 0$ and $f_0$ with $\theta_{j,0}^0 = \epsilon_n(\alpha)^2$ for all $j \ge 1$ with $2^j \le L(n/\log n)^{2\alpha/(2\alpha+1)}$; then, up to constants,
$$\sum_{j \ge 1,\, k} (\theta_{j,k}^0)^2 \le \epsilon_n(\alpha)^2, \qquad \sum_{j \ge 1} 2^{j/2}\, \theta_{j,0}^0 \asymp (n/\log n)^{-2\alpha^2/(2\alpha+1)^2}$$
(small in $L_2$, yet large at the point $x_0$), and
$$P^\pi[J = 0 \mid Y] = 1 + o_{P_0}(1):$$
$f_0$ looks too much like $f_1 = 0$.
15. Giné & Nickl: posterior concentration rates for $\ell_\infty$ via tests
At best they have
$$\sup_{f_0 \in H(\alpha, L)} E_{f_0} P^\pi\big(\ell_\infty(f, f_0) > M(n/\log n)^{-(\alpha - 1/2)/(2\alpha+1)} \mid Y\big) = o(1)$$
Proof based on tests → suboptimal. Can we do better?
17. Bayesian lower bounds: white noise model
Let $d(\theta, \theta')$ be a symmetric semi-metric, e.g. $d(\theta, \theta') = \ell(f_\theta, f_{\theta'})$ or $\ell_\infty(f_\theta, f_{\theta'})$.
Dual of the modulus of continuity:
$$\phi(\theta, \epsilon) = \inf_{\theta' \in \Theta}\{\|\theta - \theta'\|_2 :\, d(\theta, \theta') > \epsilon\}, \qquad \phi(\epsilon) = \inf_\theta \phi(\theta, \epsilon)$$
Theorem. Let $C > 0$ and $\epsilon_n$ be such that $Q_n(C) := \{\theta :\, \phi(\theta, 2\epsilon_n) \le C\phi(\epsilon_n)\} \ne \emptyset$. Then $\forall \theta_0 \in Q_n(C)$, $\forall \Pi$, $\exists K > 0$:
$$E_{\theta_0, n}\big[P^\pi[d(\theta, \theta_0) \ge \epsilon_n \mid Y]\big] \ge e^{-K n \phi(\epsilon_n)^2}$$
19. Consequences
Consequence 1: we obtain, as in Cai and Low, that for $\epsilon_n = \inf\{\epsilon :\, \phi(\epsilon) > M/\sqrt n\}$ and any $u_n = o(\epsilon_n)$,
$$E_{\theta_0, n}\big[P^\pi[d(\theta, \theta_0) \ge u_n \mid Y]\big] \ne o(1):$$
the posterior cannot concentrate at a rate faster than $\epsilon_n$.
Consequence 2: if $\phi(\epsilon_n) = o(\epsilon_n)$, then
$$e^{-K n \phi(\epsilon_n)^2} \gg e^{-K n \epsilon_n^2}$$
→ proofs based on tests will lead to suboptimal concentration rates: at best $\bar\epsilon_n = \inf\{\epsilon :\, \phi(\epsilon) > M\epsilon_n\}$.
21. The case of $\ell_\infty$
$$Y_{j,k} = \theta_{j,k} + n^{-1/2}\epsilon_{j,k}, \qquad \epsilon_{j,k} \sim \mathcal N(0, 1) \text{ i.i.d.}, \qquad \ell_\infty(f_\theta, f_{\theta'}) = \max_j 2^{j/2} \max_k |\theta_{j,k} - \theta'_{j,k}|$$
$$\phi\big(\epsilon_n(\beta)\big) = O\Big(\sqrt{\tfrac{\log n}{n}}\Big), \qquad \epsilon_n(\beta) = (n/\log n)^{-\beta/(2\beta+1)}, \quad \Theta = H(\beta, L)$$
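A heuristic one-coefficient computation (my reconstruction, not spelled out on the slide) of why $\phi(\epsilon_n(\beta)) \asymp \sqrt{\log n / n}$: perturb a single coefficient at level $j$ by $\delta$, so $\ell_\infty = 2^{j/2}\delta$ while $\|\theta - \theta'\|_2 = \delta$. To stay in $H(\beta, L)$ one needs $\delta \lesssim L\, 2^{-j(\beta + 1/2)}$, so the deepest usable level satisfies $2^{j\beta} \asymp L/\epsilon$, giving
$$\phi(\epsilon) \asymp \epsilon\, 2^{-j/2} \asymp L^{-1/(2\beta)}\, \epsilon^{(2\beta+1)/(2\beta)}, \qquad \phi\big(\epsilon_n(\beta)\big) \asymp \Big((n/\log n)^{-\beta/(2\beta+1)}\Big)^{(2\beta+1)/(2\beta)} = \sqrt{\tfrac{\log n}{n}}.$$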
Theorem. There is a prior $\Pi$ s.t. $\forall C < 1/2$:
$$\sup_{\beta_1 \le \beta \le \beta_2}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C \log n}, \qquad \epsilon_n(\beta) := (n/\log n)^{-\beta/(2\beta+1)}$$
Examples: sieve prior (discrete prior); spike and slab.
22. Spike and slab prior
For all $j \le J_n$ with $2^{J_n} \approx n$, and all $k$:
$$\theta_{j,k} \sim \Big(1 - \frac{1}{n}\Big)\, \delta_{(0)} + \frac{1}{n}\, g(\cdot)$$
with $\log g$ smooth (Laplace, Gaussian, Student).
Adaptive posterior concentration in $L_2$ (losing a $\log n$ factor) and in $\ell_\infty$ at the rate $(n/\log n)^{-\alpha/(2\alpha+1)}$.
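Because the coefficients are independent a priori and the noise is Gaussian, the posterior factorizes over $(j,k)$; here is a short sketch (assuming a Gaussian slab $g = \mathcal N(0, \tau^2)$, one of the slide's admissible choices) of the exact per-coefficient posterior.

```python
import numpy as np
from scipy.stats import norm

# Spike-and-slab posterior for one coefficient in the white noise model:
# Y = theta + eps / sqrt(n),  theta ~ (1 - 1/n) delta_0 + (1/n) N(0, tau^2).
def spike_slab_posterior(y, n, tau2=1.0):
    sigma2 = 1.0 / n                       # noise variance of Y given theta
    w_slab = 1.0 / n                       # prior slab weight
    # Marginal densities of Y under spike (theta = 0) and slab (theta ~ g)
    m_spike = norm.pdf(y, scale=np.sqrt(sigma2))
    m_slab = norm.pdf(y, scale=np.sqrt(sigma2 + tau2))
    # Posterior probability that theta != 0
    p_slab = w_slab * m_slab / (w_slab * m_slab + (1 - w_slab) * m_spike)
    # Conditional on the slab, theta | Y is Gaussian (conjugacy)
    post_mean = tau2 / (tau2 + sigma2) * y
    post_var = tau2 * sigma2 / (tau2 + sigma2)
    return p_slab, post_mean, post_var

n = 10_000
for y in (0.0, 3 * np.sqrt(1.0 / n), 0.5):   # noise-level vs. large observation
    p, m, v = spike_slab_posterior(y, n)
    print(f"y = {y:.4f}: P(theta != 0 | Y) = {p:.3f}, slab mean = {m:.4f}")
```

Only coefficients well above the noise level $n^{-1/2}$ receive slab probability close to one, which is the mechanism behind the sup-norm adaptation stated above.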
23. Some connections with confidence sets
Adaptive confidence sets $C_n$:
$$\inf_\theta P_\theta(\theta \in C_n) \ge 1 - \alpha \qquad \text{and} \qquad \sup_{\beta_1 \le \beta \le \beta_2}\ \sup_{\theta_0 \in H(\beta, L)} \epsilon_n(\beta)^{-1}\, E_{\theta_0}[|C_n|] < +\infty,$$
with $|C_n| = \sup_{\theta, \theta' \in C_n} d(\theta, \theta')$.
If $d(\cdot,\cdot) = \ell_\infty$: such sets do not exist (M. Low).
Hoffmann and Nickl: over $H(\beta_1, L) \cup H(\beta_2, L)$ with $\beta_2 > \beta_1$, set
$$\tilde\Theta_n = H(\beta_2, L) \cup \{\theta \in H(\beta_1, L) :\, \ell_\infty(\theta, H(\beta_2, L)) > M\epsilon_n(\beta_1)\}.$$
Then an adaptive confidence set exists over $\tilde\Theta_n$.
24. First Bayesian perspective
Over $H(\beta_1, L) \cup H(\beta_2, L)$ with $\beta_2 > \beta_1$: if
$$\sup_{\beta \in \{\beta_1, \beta_2\}}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C \log n},$$
set $C_n = \{\theta_0 :\, P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y] < n^{-C}/\alpha\}$. Then
$$P_{\theta_0}[\theta_0 \in C_n^c] \le \alpha n^C\, E_{\theta_0}\big[P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y]\big] \le \alpha.$$
Problem: control of $E_\theta[|C_n|]$:
$$\sup_{\theta \in H(\beta_1, L)} E_\theta[|C_n|] \lesssim \epsilon_n(\beta_1) \ \to\ \text{OK}, \qquad \sup_{\theta \in H(\beta_2, L)} E_\theta[|C_n|] \lesssim \epsilon_n(\beta_1) \ \to\ \text{BAD}$$
But on $\tilde\Theta := H(\beta_2, L) \cup \{\theta \in H(\beta_1, L) :\, \ell_\infty(\theta, H(\beta_2, L)) > \epsilon_n(\beta_1)\}$, the set $\tilde C_n = C_n \cap \tilde\Theta$ is an adaptive confidence set.
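The coverage display is just Markov's inequality applied to the posterior probability, combined with the theorem's exponential bound (my unpacking of the slide's chain):
$$P_{\theta_0}\big(\theta_0 \notin C_n\big) = P_{\theta_0}\Big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta, \theta_0) \mid Y] \ge \tfrac{n^{-C}}{\alpha}\Big) \le \alpha n^{C}\, E_{\theta_0}\big[P^\pi[\cdots \mid Y]\big] \le \alpha n^{C}\, e^{-C \log n} = \alpha.$$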
25. A better (?) Bayesian perspective: back to basics
If
$$\sup_{\beta \in [\beta_1, \beta_2]}\ \sup_{\theta_0 \in H(\beta, L)} E_{\theta_0}\big(P^\pi[\ell_\infty(\theta_0, \theta) > M\epsilon_n(\beta) \mid Y]\big) \le e^{-C \log n}$$
and
$$\sup_{\beta \in [\beta_1, \beta_2]}\ \sup_{\theta_0 \in H(\beta, L)} \epsilon_n(\beta)^{-1}\, E_{\theta_0}\big[\ell_\infty(\theta_0, \hat\theta)\big] < +\infty,$$
take $C_n = \{\theta :\, \ell_\infty(\theta, \hat\theta) \le k_n(\alpha_n)\}$ with $k_n(\alpha_n)$ chosen so that $P^\pi[\theta \in C_n \mid Y] \ge 1 - \alpha_n$. Then
$$\int_\Theta P_\theta[\theta \in C_n]\, d\pi(\theta) \ge 1 - \alpha_n, \qquad \sup_{\theta \in H(\beta, L)} E_\theta[|C_n|] \le 2M\epsilon_n(\beta),$$
if $\Theta$ is bounded.
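A sketch (my illustration, with simulated Gaussian draws standing in for a real posterior sampler) of how $C_n$ is built in practice: compute $k_n(\alpha_n)$ as the $(1 - \alpha_n)$ posterior quantile of $\ell_\infty(\theta, \hat\theta)$ and report the sup-norm band around the posterior mean.

```python
import numpy as np

# Build the credible band C_n = {theta : ||theta - theta_hat||_inf <= k_n}
# from posterior draws. The "posterior draws" below are simulated Gaussians
# around a fixed curve (an assumption standing in for a real sampler).
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200)
f_mean = np.sin(2 * np.pi * x)                    # stand-in posterior mean curve
draws = f_mean + 0.05 * rng.standard_normal((2000, x.size))  # theta ~ posterior

theta_hat = draws.mean(axis=0)                    # posterior mean estimator
sup_dist = np.abs(draws - theta_hat).max(axis=1)  # ell_inf(theta, theta_hat)

alpha_n = 0.05
k_n = np.quantile(sup_dist, 1 - alpha_n)          # radius: posterior quantile
lower, upper = theta_hat - k_n, theta_hat + k_n   # the band C_n
print(f"k_n(alpha_n) = {k_n:.4f}")
print("posterior mass of C_n:", (sup_dist <= k_n).mean())  # >= 1 - alpha_n
```

The frequentist coverage and the diameter bound $2M\epsilon_n(\beta)$ then come from the two displayed assumptions on the posterior and on $\hat\theta$, not from the sampler itself.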
26. Conclusion
• Bayesian methods are great for risks related to Kullback–Leibler: $L_2$ in regression, Hellinger or $L_1$ in density estimation, etc.
• Understanding more specific features of these big models is trickier.
• Good nonparametric priors have good properties for a wide range of loss functions.
• Why should we care? → interpretation of credible bands!?
• Extension to models other than white noise. [Done]
• Can we go further than the 2nd Bayesian interpretation (confidence sets)?