Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
1. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations (ICML 2019 Best Paper)
2019.07.17.
Sangwoo Mo
2. Outline
• Quick Review
• What is a disentangled representation (DR)?
• Prior work on the unsupervised learning of DR
• Theoretical Results
• Unsupervised learning of DR is impossible without inductive biases
• Empirical Results
• Q1. Which method should be used?
• Q2. How to choose the hyperparameters?
• Q3. How to select the best model from a set of trained models?
3. Quick Review
• Disentangled representation: learn a representation 𝑧 from the data 𝑥 s.t.
• it contains all the information of 𝑥 in a compact and interpretable structure
• Currently, there is no single formal definition (many definitions of the "factor of variation" exist)
* Image from BetaVAE (ICLR 2017)
4-6. Quick Review: Prior Methods
• BetaVAE (ICLR 2017)
• Use 𝛽 > 1 in the VAE objective (forces the posterior toward the factorized Gaussian prior)
• FactorVAE (ICML 2018) & 𝜷-TCVAE (NeurIPS 2018)
• Penalize the total correlation of the representation, which is estimated¹ by adversarial learning (FactorVAE) or by a (biased) mini-batch approximation (𝛽-TCVAE)
• DIP-VAE (ICLR 2018)
• Match 𝑞(𝒛) to the disentangled prior 𝑝(𝒛), where the divergence 𝐷 is a (tractable) moment-matching penalty
1. Estimating the total correlation requires the aggregated posterior 𝑞(𝒛)
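For reference, a compact restatement of the objectives behind these bullets (the slides show them as images; this is a paraphrase of the respective papers, using the usual VAE notation: encoder $q_\phi(\mathbf{z}|\mathbf{x})$, decoder $p_\theta(\mathbf{x}|\mathbf{z})$, aggregated posterior $q(\mathbf{z})$):

```latex
% BetaVAE: ELBO with the KL term up-weighted by beta > 1
\mathcal{L}_{\beta\text{-VAE}}
  = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{z})\big]
  - \beta\, D_{\mathrm{KL}}\!\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)

% FactorVAE / beta-TCVAE: additionally penalize the total correlation of q(z)
\mathrm{TC}(\mathbf{z}) = D_{\mathrm{KL}}\!\Big(q(\mathbf{z}) \,\Big\|\, \textstyle\prod_j q(z_j)\Big)

% DIP-VAE: moment-match the aggregated posterior to the disentangled prior
\mathcal{L}_{\text{DIP-VAE}} = \mathrm{ELBO} - \lambda\, D\big(q(\mathbf{z}) \,\|\, p(\mathbf{z})\big)
```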
7-9. Quick Review: Evaluation Metrics
• Many heuristics have been proposed to quantitatively evaluate disentanglement
• Basic idea: factors and representation dimensions should have a 1-1 correspondence
• BetaVAE (ICLR 2017) & FactorVAE (ICML 2018) metrics
• Given a factor $c_i$, generate two (simulated) data points $x, x'$ with the same $c_i$ but different $c_{-i}$, then train a classifier to predict $c_i$ from the difference of the representations $|z - z'|$
• Intuitively, the classifier maps the (near-)zero-valued index of $|z - z'|$ to the factor $c_i$
• Mutual Information Gap (NeurIPS 2018; a sketch follows below)
• Compute the mutual information between each factor $c_i$ and each dimension $z_j$ of the representation
• For the dimensions $i_1$ and $i_2$ with the highest and second-highest mutual information, measure the gap between them: $I(c_i, z_{i_1}) - I(c_i, z_{i_2})$
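A minimal sketch of the MIG computation (assuming discrete ground-truth factors; the latents are discretized with a histogram so that scikit-learn's discrete `mutual_info_score` applies):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information_gap(factors, codes, n_bins=20):
    """MIG sketch: `factors` is (N, K) discrete ground-truth factor values,
    `codes` is (N, D) latent codes. Returns the mean normalized gap."""
    # Discretize each latent dimension so a discrete MI estimator applies.
    binned = np.stack(
        [np.digitize(codes[:, j], np.histogram_bin_edges(codes[:, j], n_bins))
         for j in range(codes.shape[1])], axis=1)
    gaps = []
    for k in range(factors.shape[1]):
        # Mutual information between factor k and every latent dimension.
        mi = np.array([mutual_info_score(factors[:, k], binned[:, j])
                       for j in range(binned.shape[1])])
        second, best = np.sort(mi)[-2:]
        # Normalize the gap by the factor's entropy, H(c_k) = I(c_k; c_k).
        gaps.append((best - second) / mutual_info_score(factors[:, k], factors[:, k]))
    return float(np.mean(gaps))
```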
10-13. Theoretical Results
• "Unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the models and the data"
• Theorem. For $p(\mathbf{z}) = \prod_{i=1}^{d} p(z_i)$, there exists an infinite family of bijective functions $f$ s.t.
• $\mathbf{z}$ and $f(\mathbf{z})$ are completely entangled (i.e., $\partial f_i(\mathbf{u}) / \partial u_j \neq 0$ a.e. for all $i, j$)
• $\mathbf{z}$ and $f(\mathbf{z})$ have the same marginal distribution (i.e., $P(\mathbf{z} \leq \mathbf{u}) = P(f(\mathbf{z}) \leq \mathbf{u})$ for all $\mathbf{u}$)
• Proof sketch. By construction (see the numerical illustration below).
• Let $g: \operatorname{supp}(\mathbf{z}) \to [0,1]^d$ s.t. $g_i(\mathbf{v}) = P(z_i \leq v_i)$
• Let $h: [0,1]^d \to \mathbb{R}^d$ s.t. $h_i(\mathbf{v}) = \psi^{-1}(v_i)$, where $\psi$ is the c.d.f. of a standard normal distribution
• Then for any orthogonal matrix $\mathbf{A}$, the following $f$ satisfies the conditions:
$f(\mathbf{u}) = (h \circ g)^{-1}\big(\mathbf{A}\,(h \circ g)(\mathbf{u})\big)$
• Intuition: $h \circ g$ maps $\mathbf{z}$ to an isotropic Gaussian, whose distribution is invariant to the rotation $\mathbf{A}$
• Corollary. One cannot identify the disentangled representation $r(\mathbf{x})$ (w.r.t. the generative model $G(\mathbf{x}|\mathbf{z})$), since there exist two equivalent generative models $G$ and $G'$ with the same marginal distribution $p(\mathbf{x})$ whose latent $\mathbf{z}' = f(\mathbf{z})$ is completely entangled w.r.t. $\mathbf{z}$ (and hence so is $r(\mathbf{x})$)
• Namely, inferring the representation $\mathbf{z}$ from the observation $\mathbf{x}$ alone is not a well-defined problem
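For intuition, a small numerical check of the construction in the Gaussian case, where $h \circ g$ is the identity and $f$ reduces to a rotation (a sketch for illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
z = rng.standard_normal((100_000, d))  # z ~ N(0, I): a factorized prior

# With a standard normal prior, g_i(v) = psi(v_i) and h_i(v) = psi^{-1}(v_i),
# so h∘g is the identity and f(u) = (h∘g)^{-1}(A (h∘g)(u)) reduces to f(u) = A u.
A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a random orthogonal matrix
fz = z @ A.T

# Same marginal distribution: a rotation of N(0, I) is again N(0, I).
print(np.round(np.cov(fz, rowvar=False), 2))  # ~ identity covariance

# Completely entangled: df_i/du_j = A_ij is (generically) nonzero for all i, j,
# i.e., every coordinate of f(z) mixes every coordinate of z.
print(np.round(A, 2))
```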
14. Theoretical Results
• 𝛽-VAE learns some decorrelated features, but they are not semantically decomposed
• E.g., the width is entangled with the leg style in 𝛽-VAE
* Image from BetaVAE (ICLR 2017)
15. Empirical Results
• Q1. Which method should be used?
• A. Hyperparameters and random seeds matter more than the choice of the model
16. Empirical Results
• Q2. How to choose the hyperparameters?
• A. Selecting the best hyperparameter is extremely hard due to the randomness
17. Empirical Results
• Q2. How to choose the hyperparameters?
• A. Also, there is no obvious trend as the hyperparameters vary
18. Empirical Results
• Q2. How to choose the hyperparameters?
• A. Good hyperparameters can often be transferred (e.g., dSprites → color-dSprites)
* Figure: rank correlation matrix
19. Empirical Results
• Q3. How to select the best model from a set of trained models?
• A. Unsupervised (training) scores do not correlate with the disentanglement metrics (see the sketch below)
* Figure: unsupervised scores vs. disentanglement metrics
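A minimal sketch of the check behind this plot: rank-correlate an unsupervised training score with a disentanglement metric across a sweep of trained models (`spearmanr` is from SciPy; the score arrays below are hypothetical placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical placeholders: one entry per trained model in a sweep.
elbo = np.array([-55.1, -48.3, -60.2, -52.7, -49.9])  # unsupervised score
mig  = np.array([0.12, 0.35, 0.08, 0.22, 0.31])       # disentanglement metric

# A rank correlation near zero means the unsupervised score is useless for
# picking the most disentangled model, which is what the paper reports.
rho, pval = spearmanr(elbo, mig)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
```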
20-22. Summary
• TL;DR: Current unsupervised learning of disentangled representations has fundamental limitations!
• Summary of findings:
• Q1. Which method should be used?
• A. Current methods should be rigorously validated (there is no significant difference among them)
• Q2. How to choose the hyperparameters?
• A. There is no rule of thumb, but transfer across datasets seems to help!
• Q3. How to select the best model from a set of trained models?
• A. (Unsupervised) model selection remains a key challenge!
23-26. Following Work & Future Direction
• "Disentangling Factors of Variation Using Few Labels" (ICLR Workshop 2019, NeurIPS 2019 submission)
• Summary of findings: using a few labels greatly improves disentanglement!
1. Existing disentanglement metrics + a few labels perform well for model selection, even though the models are trained in a completely unsupervised manner
2. One can obtain even better results by incorporating the few labels into the learning process (via a simple supervised regularizer; see the sketch below)
• Take-home message: future research should focus on "how to better utilize inductive biases," e.g., via a few labels, rather than on the previous total-correlation-style approaches
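A minimal sketch of what such a supervised regularizer could look like (a hypothetical illustration in PyTorch, not the paper's exact loss): on the few labeled examples, tie the first K latent dimensions to the K ground-truth factors.

```python
import torch
import torch.nn.functional as F

def few_label_regularizer(z: torch.Tensor, factors: torch.Tensor) -> torch.Tensor:
    """Hypothetical supervised regularizer (sketch).

    z:       (B, D) latent means from the encoder
    factors: (B, K) ground-truth factors rescaled to [0, 1], with K <= D
    Ties latent dimension k to factor k on the few labeled examples.
    """
    k = factors.shape[1]
    return F.mse_loss(torch.sigmoid(z[:, :k]), factors)

# Usage: on a labeled mini-batch, add the regularizer to the usual VAE loss.
# total_loss = vae_loss + gamma * few_label_regularizer(z, factors)
```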