These slides were created for a NIPS 2016 study meetup.
IAF and related research are briefly explained.
paper:
Diederik P. Kingma et al., "Improving Variational Inference with Inverse Autoregressive Flow", 2016
https://papers.nips.cc/paper/6581-improving-variational-autoencoders-with-inverse-autoregressive-flow
Diagonal/Full Covariance Gaussian Distribution
Diagonal: Efficient but not flexible
q(z|x) = Π_i N(z_i | μ_i(x), σ_i(x))
Full covariance: not efficient and still not flexible (unimodal)
q(z|x) = N(z | μ(x), Σ(x))
(marks: diagonal / full covariance)
1. Computationally cheap to compute and differentiate: ✓ / ✗
2. Computationally cheap to sample from: ✓ / ✗
3. Parallel computation: ✓ / ✗
4. Sufficiently flexible to match the true posterior p(z|x): ✗
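As a rough illustration of the efficiency gap, here is a minimal NumPy sketch (all parameter values are synthetic) of reparameterized sampling from both families. The diagonal case is an O(d) elementwise operation, while the full-covariance case needs a Cholesky factorization, which costs O(d^3):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Diagonal Gaussian: q(z|x) = prod_i N(z_i | mu_i(x), sigma_i(x)).
# Reparameterized sampling is elementwise -- O(d) and trivially parallel.
mu = rng.normal(size=d)
sigma = np.exp(rng.normal(size=d))      # positive standard deviations
eps = rng.normal(size=d)
z_diag = mu + sigma * eps

# Full-covariance Gaussian: q(z|x) = N(z | mu(x), Sigma(x)).
# Sampling needs a Cholesky factor of Sigma, an O(d^3) operation.
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)         # a synthetic positive-definite covariance
L = np.linalg.cholesky(Sigma)           # Sigma = L @ L.T
z_full = mu + L @ eps

print(z_diag.shape, z_full.shape)
```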
Change of Variables based methods
Transform q(z_0|x) into a more powerful distribution q(z_T|x) via sequential application of change of variables:
z_t = f_t(z_{t−1})
q(z_t|x) = q(z_{t−1}|x) |det(∂f_t(z_{t−1})/∂z_{t−1})|^{−1}
⇒ log q(z_T|x) = log q(z_0|x) − Σ_{t=1}^{T} log |det(∂f_t(z_{t−1})/∂z_{t−1})|
• NICE
L. Dinh et al., "NICE: Non-linear Independent Components Estimation", 2014
• Normalizing Flow
D. J. Rezende et al., "Variational Inference with Normalizing Flows", ICML 2015
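The log-density update above can be checked on a toy 1-D flow where the transformed density is known in closed form; the sketch below (not from the paper) applies f(z) = 2z + 1 to a standard Gaussian, so the result must match N(1, 4) exactly:

```python
import numpy as np

# Change of variables on a 1-D example with a known answer:
# z_1 = f(z_0) = 2*z_0 + 1 applied to z_0 ~ N(0, 1) gives z_1 ~ N(1, 4).
def log_q0(z):                      # log N(z | 0, 1)
    return -0.5 * (z**2 + np.log(2 * np.pi))

def f(z):
    return 2.0 * z + 1.0

def log_abs_det_jac(z):             # log |df/dz| = log 2, constant here
    return np.log(2.0) * np.ones_like(z)

z0 = np.array([0.3, -1.2, 2.0])
z1 = f(z0)

# log q(z_1) = log q(z_0) - log |det df/dz_0|
log_q1 = log_q0(z0) - log_abs_det_jac(z0)

# Closed-form density of N(1, 4) evaluated at z_1:
exact = -0.5 * ((z1 - 1.0) ** 2 / 4.0 + np.log(2 * np.pi * 4.0))
print(np.allclose(log_q1, exact))   # True
```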
Normalizing Flow
Transformation via
z_t = z_{t−1} + u_t f_t(w_t^T z_{t−1} + b_t)
Key Features
- Jacobian determinants are cheap to compute
Drawbacks
- Information goes through single bottleneck
1. Computationally cheap to compute and differentiate ✓
2. Computationally cheap to sample from ✓
3. Parallel computation ✗
4. Sufficiently flexible to match the true posterior p(z|x) ✗
[Figure: planar-flow diagram — every coordinate of z_{t−1} must pass through the single scalar bottleneck w_t^T z_{t−1} + b_t before u_t f_t(w_t^T z_{t−1} + b_t) is added to produce z_t]
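A minimal sketch of one planar-flow step with its log-determinant (parameter values are arbitrary, and the usual invertibility constraint on u is omitted for brevity); the analytic log-det is cross-checked against a brute-force numerical Jacobian:

```python
import numpy as np

def planar_flow(z, u, w, b):
    """One planar-flow step z_t = z + u * tanh(w.z + b) and its log |det J|.

    The Jacobian is I + u psi(z)^T with psi(z) = tanh'(w.z + b) * w, so by
    the matrix determinant lemma det J = 1 + tanh'(w.z + b) * (u.w) -- an
    O(d) computation despite the d x d Jacobian.
    """
    a = w @ z + b                        # the scalar bottleneck
    z_new = z + u * np.tanh(a)
    h_prime = 1.0 - np.tanh(a) ** 2
    log_det = np.log(np.abs(1.0 + h_prime * (u @ w)))
    return z_new, log_det

rng = np.random.default_rng(1)
d = 5
z = rng.normal(size=d)
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.1
z_new, log_det = planar_flow(z, u, w, b)

# Cross-check against a brute-force finite-difference Jacobian.
eps = 1e-6
J = np.empty((d, d))
for j in range(d):
    dz = np.zeros(d); dz[j] = eps
    J[:, j] = (planar_flow(z + dz, u, w, b)[0] - z_new) / eps
print(np.isclose(np.log(np.abs(np.linalg.det(J))), log_det, atol=1e-4))
```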
Hamiltonian Flow / Hamiltonian Variational Inference
ELBO with auxiliary variables y
log p(x) ≥ log p(x) − D_KL(q(z|x) ∥ p(z|x)) − D_KL(q(y|x,z) ∥ r(y|x,z)) =: L(x)
Drawing (y, z) via HMC
(y_t, z_t) ~ HMC(y_t, z_t | y_{t−1}, z_{t−1})
Key Features
- Capability to sample from exact posterior
Drawbacks
- Long mixing time and lower ELBO
1. Computationally cheap to compute and differentiate ✗
2. Computationally cheap to sample from ✗
3. Parallel computation ✗
4. Sufficiently flexible to match the true posterior p(z|x) ✓
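For illustration, here is a minimal HMC transition on a standard Gaussian target — a toy sketch, not the paper's Hamiltonian Variational Inference setup. The momentum plays the role of the auxiliary variable y: each transition resamples y, runs leapfrog dynamics, and applies a Metropolis accept/reject step:

```python
import numpy as np

def hmc_step(z, log_p_grad, rng, step=0.1, n_leap=20):
    """One HMC transition targeting log p(z) = -0.5 ||z||^2 (standard Gaussian)."""
    y = rng.normal(size=z.shape)                # fresh auxiliary momentum
    z_new, y_new = z.copy(), y.copy()
    y_new += 0.5 * step * log_p_grad(z_new)     # leapfrog: half momentum step
    for _ in range(n_leap - 1):
        z_new += step * y_new                   # full position step
        y_new += step * log_p_grad(z_new)       # full momentum step
    z_new += step * y_new
    y_new += 0.5 * step * log_p_grad(z_new)     # final half momentum step
    # Hamiltonian H = -log p(z) + 0.5 ||y||^2; accept with prob min(1, e^{-dH})
    def H(zz, yy):
        return 0.5 * zz @ zz + 0.5 * yy @ yy
    accept = rng.random() < np.exp(H(z, y) - H(z_new, y_new))
    return z_new if accept else z

rng = np.random.default_rng(2)
grad = lambda z: -z                             # grad log N(0, I)
z = np.zeros(3)
samples = []
for _ in range(2000):
    z = hmc_step(z, grad, rng)
    samples.append(z.copy())
samples = np.array(samples[500:])               # drop burn-in (note the mixing cost)
print(np.round(samples.mean(axis=0), 1))        # should be near zero
```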
NICE
Transform only half of z at each step
z_t = (z_t^α, z_t^β) = (z_{t−1}^α, z_{t−1}^β + f_t(x, z_{t−1}^α))
Key Features
- The determinant of the Jacobian, det(∂z_t/∂z_{t−1}), is always 1
Drawbacks
- Limited form of transformation
- Less powerful than Normalizing Flow (next)
1. Computationally cheap to compute and differentiate ✓
2. Computationally cheap to sample from ✓
3. Parallel computation ✗
4. Sufficiently flexible to match the true posterior p(z|x) ✗
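A minimal sketch of one additive coupling step, where f below is a hypothetical stand-in for the learned coupling network (any function works, since the α half passes through unchanged):

```python
import numpy as np

def f(alpha):
    # Hypothetical coupling function; it need not be invertible.
    return np.tanh(alpha * 3.0 + 0.5)

def nice_forward(z):
    """Split z into halves (alpha, beta); keep alpha, shift beta by f(alpha)."""
    d = len(z) // 2
    alpha, beta = z[:d], z[d:]
    return np.concatenate([alpha, beta + f(alpha)])

def nice_inverse(z):
    """Exact inverse: subtract the same shift."""
    d = len(z) // 2
    alpha, beta = z[:d], z[d:]
    return np.concatenate([alpha, beta - f(alpha)])

rng = np.random.default_rng(3)
z = rng.normal(size=6)
z_t = nice_forward(z)

# The Jacobian is unit-triangular, so det = 1: the map reshapes the
# density but never rescales volume -- hence the limited flexibility.
print(np.allclose(nice_inverse(z_t), z))   # True
```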
Autoregressive Flow (proposed)
Autoregressive Flow (∂μ_{t,i}/∂z_{t,j} = ∂σ_{t,i}/∂z_{t,j} = 0 if i ≤ j)
z_{t,i} = μ_{t,i}(z_{t,0:i−1}) + σ_{t,i}(z_{t,0:i−1}) ⊙ z_{t−1,i}
Key features
- Powerful
- Easy to compute: det(∂z_t/∂z_{t−1}) = Π_i σ_{t,i}
Drawbacks
- Difficult to parallelize
1. Computationally cheap to compute and differentiate ✓
2. Computationally cheap to sample from ✓
3. Parallel computation ✗
4. Sufficiently flexible to match the true posterior p(z|x) ✓
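The difficulty of parallelizing shows up directly in code: each output dimension depends on previously computed outputs, so the update is an unavoidable sequential loop. A toy sketch, with mu_fn and sigma_fn as hypothetical stand-ins for the autoregressive networks:

```python
import numpy as np

def mu_fn(prefix):
    # Hypothetical autoregressive shift: depends only on z_{t,0:i-1}.
    return 0.1 * np.sum(prefix)

def sigma_fn(prefix):
    # Hypothetical autoregressive scale, kept positive via exp.
    return np.exp(0.05 * np.sum(prefix))

def af_step(z_prev):
    """z_{t,i} = mu_i(z_{t,0:i-1}) + sigma_i(z_{t,0:i-1}) * z_{t-1,i}."""
    d = len(z_prev)
    z_t = np.empty(d)
    log_det = 0.0
    for i in range(d):                 # sequential: z_t[i] needs z_t[:i]
        s = sigma_fn(z_t[:i])
        z_t[i] = mu_fn(z_t[:i]) + s * z_prev[i]
        log_det += np.log(s)           # Jacobian is triangular with diag sigma_i
    return z_t, log_det

rng = np.random.default_rng(4)
z0 = rng.normal(size=5)
z1, log_det = af_step(z0)
print(z1.shape, log_det)
```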
Inverse Autoregressive Flow (proposed)
Inverting AF (μ_t and σ_t are also autoregressive)
z_t = (z_{t−1} − μ_t(z_{t−1})) / σ_t(z_{t−1})
Key Features
- As powerful as AF
- Easy to compute: det(∂z_t/∂z_{t−1}) = 1 / Π_i σ_{t,i}(z_{t−1})
- Parallelizable
1. Computationally cheap to compute and differentiate ✓
2. Computationally cheap to sample from ✓
3. Parallel computation ✓
4. Sufficiently flexible to match the true posterior p(z|x) ✓
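A minimal sketch of one IAF step. Because μ and σ depend only on z_{t−1}, which is fully known, all dimensions are computed at once; the masked linear maps below are hypothetical stand-ins for the autoregressive networks (the paper uses MADE-style networks):

```python
import numpy as np

d = 5
rng = np.random.default_rng(5)
mask = np.tril(np.ones((d, d)), k=-1)      # strictly lower-triangular mask
W_mu = rng.normal(size=(d, d)) * mask      # mu_i depends on z_{t-1,0:i-1} only
W_s = rng.normal(size=(d, d)) * mask * 0.1

def iaf_step(z_prev):
    """z_t = (z_{t-1} - mu(z_{t-1})) / sigma(z_{t-1}), fully vectorized."""
    mu = W_mu @ z_prev                     # one masked matrix product each --
    sigma = np.exp(W_s @ z_prev)           # no per-dimension loop needed
    z_t = (z_prev - mu) / sigma
    log_det = -np.sum(np.log(sigma))       # det dz_t/dz_{t-1} = 1 / prod sigma_i
    return z_t, log_det

z0 = rng.normal(size=d)
z1, log_det = iaf_step(z0)
print(z1.shape)
```

The contrast with the AF loop is the whole point: the same triangular-Jacobian math, but the known input z_{t−1} makes sampling a single parallel expression.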