AGREEMENT
• If you plan to share these slides or to use their content in your own work,
please include the following reference:
Tejero-de-Pablos A. (2022) “VAEs for multimodal disentanglement”. All Japan Computer Vision Study Group.
VAEs for multimodal
disentanglement
2022/05/15
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp
1. Self-introduction
2. Background
3. Paper introduction
4. Final remarks
Self-introduction
Antonio TEJERO DE PABLOS
Background
• Present: Research scientist @ CyberAgent (AI Lab)
• ~2021: Researcher @ U-Tokyo (Harada Lab) & RIKEN (AIP)
• ~2017: PhD @ NAIST (Yokoya Lab)
Research interests
• Learning of multimodal data (RGB, depth, audio, text)
• and its applications (action recognition, advertisement
classification, etc.)
Field: Computer Vision
Background
What is a VAE?
• Auto-encoder • Variational auto-encoder
With the proper regularization:
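The pieces of a standard VAE mentioned above (a Gaussian encoder, reparameterized sampling, and a KL regularizer toward a standard normal prior) can be sketched in a few lines of numpy. This is an illustrative toy, not code from any of the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I): the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# A posterior that already matches the prior incurs zero KL cost.
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # → 0.0
```

The KL term is the "proper regularization": it pulls every encoded posterior toward N(0, I), which is what makes the latent space smooth enough to sample from.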
There is more!
• Vector Quantized-VAE
Quantize the bottleneck using a discrete codebook
There are a number of algorithms (like transformers) that are designed to work on discrete data, so we
would like to have a discrete representation of the data for these algorithms to use.
Advantages of VQ-VAE:
- Simplified latent space (easier to train)
- Likelihood-based model: does not suffer from
mode collapse or lack of diversity
- Real-world data favors a discrete representation
(the number of images that make sense is, in a way, finite)
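The quantization step at the VQ-VAE bottleneck is just a nearest-neighbor lookup into the codebook. A minimal numpy sketch (toy 2-D codebook; shapes and names are illustrative):

```python
import numpy as np

def quantize(z_e, codebook):
    """Replace each continuous encoder output by its nearest codebook entry (L2).

    z_e:      (batch, d) continuous encoder outputs
    codebook: (K, d) learned discrete embeddings
    Returns (indices, z_q).
    """
    # Pairwise squared distances between encoder outputs and codebook entries.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)       # discrete codes, usable by e.g. a transformer
    return idx, codebook[idx]     # z_q: the quantized bottleneck

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
idx, z_q = quantize(np.array([[0.9, 1.1]]), codebook)
print(idx)  # → [1]
```

The returned `idx` is the discrete representation that downstream discrete-data algorithms consume; `z_q` is what the decoder sees.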
Why are VAEs cool?
• Usage of VAEs (state-of-the-art)
Multimodal generation (DALL-E)
Representation learning, latent space disentanglement
Paper introduction
An interesting usage of VAEs for disentangling
multimodal data that grabbed my attention
Today I’m introducing:
1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal
deep generative models. Advances in Neural Information Processing Systems, 32.
2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of
latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-
1700).
3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning
Multimodal VAEs through Mutual Supervision. International Conference on Learning
Representations.
Motivation and goal
• Importance of multimodal data
Learning in the real world involves multiple perspectives: visual, auditory, linguistic
Understanding them individually allows only a partial learning of concepts
• Understanding how different modalities work together is not trivial
A similar joint-embedding process happens in the brain for reasoning and understanding
• Multimodal VAEs facilitate representation learning on data with multiple views/modalities
Capture common underlying factors between the modalities
Motivation and goal
• Normally, only the shared aspects of modalities are modeled
The private information of each modality is totally LOST
E.g., image captioning
• Leverage VAE’s latent space for disentanglement
Private spaces are leveraged for modeling the disjoint properties of each
modality, and cross-modal generation
• Basically, such disentanglement can be used as:
An analytical tool to understand how modalities intertwine
A way of cross-generating modalities
Motivation and goal
• [1] and [2] propose a similar methodology
According to [1], a true multimodal generative model should meet four criteria:
Today I will introduce [2] (most recent), and explain briefly the differences with [3]
Dataset
• Digit images: MNIST & SVHN
- Shared features: Digit class
- Private features: Number style, background, etc.
Image domains as different modalities?
• Flower images and text description: Oxford-102 Flowers
- Shared features: Words and image features present in both
modalities
- Private features: Words and image features exclusive from
their modality
Related work
• Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE)
The learning of a common disentangled embedding (i.e., private-shared) is often ignored
Only some works in image-to-image translation separate "content" (~shared) and "style" (~private) in the
latent space (e.g., via adversarial loss)
Exclusively for between-image modalities: Not suitable for different modalities such as image and text
• Domain adaptation
Learning joint embeddings of multimodal observations
Proposed method: DMVAE
• Generative variational model: Introducing separate shared and private spaces
Usage: Cross-generation (analytical tool)
• Representations induced using pairs of individual modalities (encoder, decoder)
• Consistency of representations via Product of Experts (PoE). For a number of modalities N:
q(z_s | x_1, x_2, …, x_N) ∝ p(z_s) ∏_{i=1}^{N} q(z_s | x_i)
In the VAE, inference networks and priors assume conditional Gaussian forms:
p(z) = N(z | 0, I),  q(z | x_i) = N(z | μ_i, C_i)
Each modality's encoder infers a private and a shared latent:
z_1 ~ q_{φ1}(z | x_1), z_2 ~ q_{φ2}(z | x_2), with z_1 = (z_{p1}, z_{s1}) and z_2 = (z_{p2}, z_{s2})
We want: z_s = z_{s1} = z_{s2} → PoE
Proposed method: DMVAE
• Reconstruction inference
PoE-induced shared inference allows for inference when one or more modalities are missing
Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: x_1, x_2 → x̂_1, x̂_2 using (z_{p1}, z_{p2}, z_s)
- Reconstruct a single modality from its own input: x_1 → x̂_1 using (z_{p1}, z_s), or x_2 → x̂_2 using (z_{p2}, z_s)
- Reconstruct a single modality from the opposite modality's input: x_2 → x̂_1 using (z_{p1}, z_s), or x_1 → x̂_2 using (z_{p2}, z_s)
• Loss function
Accuracy of reconstruction for jointly learned shared latent + KL-divergence of each normal distribution
Accuracy of cross-modal and self reconstruction + KL-divergence
Experiments: Digits (image-image)
• Evaluation
Qualitative: Cross-generation between modalities
Quantitative: Accuracy of the cross-generated images using a pre-trained classifier for each modality
- Joint: a sample from z_s generates both image modalities, which must be assigned the same class
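The classifier-based accuracy can be sketched as follows; `preds_generated` stands for a pretrained classifier's predictions on the cross-generated images, and `labels_input` for the classes of the conditioning inputs (names are illustrative):

```python
import numpy as np

def cross_generation_accuracy(preds_generated, labels_input):
    """Fraction of cross-generated samples whose classifier prediction
    matches the class of the input that conditioned the generation."""
    return (np.asarray(preds_generated) == np.asarray(labels_input)).mean()

print(cross_generation_accuracy([3, 3, 7, 1], [3, 5, 7, 1]))  # → 0.75
```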
[Figure: cross-generation examples — given an input in one modality, outputs for different samples of z_p2 (MNIST→SVHN) and of z_p1 (SVHN→MNIST)]
Experiments: Digits (image-image)
• Ablation study: DMVAE [2] vs MMVAE [1]’s shared latent space
Experiments: Flowers (image-text)
• This task is more complex
Instead of the raw image and text, intermediate features are reconstructed
• Quantitative evaluation
Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space
• Qualitative evaluation
Retrieval
Conclusions
• Multimodal VAE for disentangling private and shared spaces
Improves the representational performance of multimodal VAEs
Successful application to image-image and image-text modalities
• Shaping a latent space into subspaces that capture the private-shared aspects of the
modalities
“is important from the perspective of downstream tasks, where better decomposed representations are more
amenable for using on a wider variety of tasks”
[3] Multimodal VAEs via mutual supervision
• Main differences with [1] and [2]
A type of multimodal VAE, without private-shared disentanglement
Does not rely on factorizations such as MoE or PoE for modeling modality-shared information
Instead, it repurposes semi-supervised VAEs for combining inter-modality information
- Allows learning from partially-observed modalities (Reg. = KL divergence)
• Proposed method: Mutually supErvised Multimodal vaE (MEME)
[3] Multimodal VAEs via mutual supervision
• Qualitative evaluation
Cross-modal generation
• Quantitative evaluation
Coherence: Percentage of matching predictions of the cross-generated modality using a pretrained classifier
Relatedness: Wasserstein distance between the representations of two modalities (closer if same class)
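For Gaussian representations with diagonal covariance, the 2-Wasserstein distance has a simple closed form. This sketch assumes that diagonal case and is not necessarily the exact estimator used in [3]:

```python
import numpy as np

def w2_diag_gaussians(mu1, var1, mu2, var2):
    """2-Wasserstein distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    The general formula W2² = ||mu1 - mu2||² + Tr(C1 + C2 - 2 (C2^½ C1 C2^½)^½)
    reduces, for diagonal covariances, to a sum over dimensions.
    """
    return np.sqrt(((mu1 - mu2) ** 2).sum()
                   + ((np.sqrt(var1) - np.sqrt(var2)) ** 2).sum())

# Identical representations give distance zero; relatedness then compares
# whether same-class pairs land closer than different-class pairs.
print(w2_diag_gaussians(np.zeros(2), np.ones(2), np.zeros(2), np.ones(2)))  # → 0.0
```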
Final remarks
Final remarks
• VAEs are useful not only for generation, but also for reconstruction and disentanglement tasks
Recommended textbook: “An Introduction to Variational Autoencoders”, Kingma & Welling
• Private-shared latent spaces as an effective tool for analyzing multimodal data
• There is still a lot of potential for this research
It has been only applied to a limited number of multimodal problems
• PhD students interested in this topic → we are recruiting interns
https://www.cyberagent.co.jp/news/detail/id=27453
Other topics are fine too!
Joint research is also very welcome!
Thank you very much!
Antonio TEJERO DE PABLOS
antonio_tejero@cyberagent.co.jp
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 

VAEs for multimodal disentanglement

  • 11. Today I'm introducing:
1) Shi, Y., Paige, B., & Torr, P. (2019). Variational mixture-of-experts autoencoders for multi-modal deep generative models. Advances in Neural Information Processing Systems, 32.
2) Lee, M., & Pavlovic, V. (2021). Private-shared disentangled multimodal VAE for learning of latent representations. Conference on Computer Vision and Pattern Recognition (pp. 1692-1700).
3) Joy, T., Shi, Y., Torr, P. H., Rainforth, T., Schmon, S. M., & Siddharth, N. (2022). Learning multimodal VAEs through mutual supervision. International Conference on Learning Representations.
  • 12. Motivation and goal
• Importance of multimodal data
Learning in the real world involves multiple perspectives: visual, auditory, linguistic
Understanding them individually allows only a partial learning of concepts
• Understanding how different modalities work together is not trivial
A similar joint-embedding process happens in the brain for reasoning and understanding
• Multimodal VAEs facilitate representation learning on data with multiple views/modalities
They capture common underlying factors between the modalities
  • 13. Motivation and goal
• Normally, only the shared aspects of modalities are modeled
The private information of each modality is totally LOST
E.g., image captioning
• Leverage the VAE's latent space for disentanglement
Private spaces are leveraged for modeling the disjoint properties of each modality, and for cross-modal generation
• Basically, such disentanglement can be used as:
An analytical tool to understand how modalities intertwine
A way of cross-generating modalities
  • 14. Motivation and goal
• [1] and [2] propose a similar methodology
According to [1], a true multimodal generative model should meet four criteria
Today I will introduce [2] (the most recent), and briefly explain the differences with [3]
  • 15. Dataset
• Digit images: MNIST & SVHN
- Shared features: digit class
- Private features: number style, background, etc.
Image domains as different modalities?
• Flower images and text descriptions: Oxford-102 Flowers
- Shared features: words and image features present in both modalities
- Private features: words and image features exclusive to their modality
  • 16. Related work
• Multimodal generation and joint multimodal VAEs (e.g., JMVAE, MVAE)
The learning of a common disentangled embedding (i.e., private-shared) is often ignored
Only some works in image-to-image translation separate "content" (~shared) and "style" (~private) in the latent space (e.g., via an adversarial loss)
Exclusively for between-image modalities: not suitable for heterogeneous modalities such as image and text
• Domain adaptation
Learning joint embeddings of multimodal observations
  • 17. Proposed method: DMVAE
• Generative variational model: introducing separate shared and private spaces
Usage: cross-generation (analytical tool)
• Representations induced using pairs of individual modalities (encoder, decoder)
• Consistency of representations via Product of Experts (PoE). For N modalities:
q(z_s | x_1, x_2, ..., x_N) ∝ p(z_s) ∏_{i=1}^{N} q(z_s | x_i)
As in the VAE, inference networks and priors assume conditional Gaussian forms:
p(z) = N(z; 0, I),  q(z | x_i) = N(z; μ_i, C_i)
z_1 ~ q_{φ1}(z | x_1),  z_2 ~ q_{φ2}(z | x_2)
z_1 = (z_{p1}, z_{s1}),  z_2 = (z_{p2}, z_{s2})
We want z_s = z_{s1} = z_{s2} → PoE
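The PoE combination on slide 17 has a closed form for Gaussian experts: the product of Gaussians is again Gaussian, with precision equal to the sum of the experts' precisions and a precision-weighted mean. A minimal numerical sketch (not the authors' code; function name and the diagonal-covariance assumption are mine):

```python
import numpy as np

def product_of_experts(mus, logvars):
    """Fuse Gaussian experts q(z_s|x_i) = N(mu_i, diag(var_i)) with the
    prior p(z_s) = N(0, I) via a Product of Experts.
    The product of Gaussians has precision = sum of precisions and
    mean = precision-weighted average of the experts' means."""
    # Prepend the prior as one more expert: mu = 0, logvar = 0 (unit variance)
    mus = np.concatenate([np.zeros((1,) + mus.shape[1:]), mus], axis=0)
    logvars = np.concatenate([np.zeros((1,) + logvars.shape[1:]), logvars], axis=0)
    precisions = np.exp(-logvars)              # 1 / var_i per expert
    joint_var = 1.0 / precisions.sum(axis=0)   # summed precisions
    joint_mu = joint_var * (precisions * mus).sum(axis=0)
    return joint_mu, np.log(joint_var)

# Two modality experts over a 3-dim shared latent, both with unit variance
mu = np.array([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]])
logvar = np.zeros((2, 3))
joint_mu, joint_logvar = product_of_experts(mu, logvar)
# With three unit-variance experts (prior + 2), the joint variance is 1/3
# and the joint mean is the simple average of (0, mu_1, mu_2).
```

Note how the prior acts as one more expert, pulling the joint mean toward zero; this is also what lets inference proceed when a modality is missing, since dropping an expert just removes its term from the sums.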
  • 18. Proposed method: DMVAE
• Reconstruction inference
PoE-induced shared inference allows for inference when one or more modalities are missing
Thus, we consider three reconstruction tasks:
- Reconstruct both modalities at the same time: x_1, x_2 → x̂_1, x̂_2 via (z_{p1}, z_{p2}, z_s)
- Reconstruct a single modality from its own input: x_1 → x̂_1 via (z_{p1}, z_s), or x_2 → x̂_2 via (z_{p2}, z_s)
- Reconstruct a single modality from the opposite modality's input: x_2 → x̂_1 via (z_{p1}, z_s), or x_1 → x̂_2 via (z_{p2}, z_s)
• Loss function
Accuracy of reconstruction for the jointly learned shared latent + KL divergence of each normal distribution
Accuracy of cross-modal and self-reconstruction + KL divergence
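The three reconstruction tasks on slide 18 can be sketched as one routine. This is an illustrative skeleton, not the paper's implementation: `enc1`/`enc2`, `dec1`/`dec2`, and `poe` are hypothetical stand-ins for the per-modality encoders, decoders, and the PoE fusion of the shared codes.

```python
def dmvae_reconstruction_paths(x1, x2, enc1, enc2, dec1, dec2, poe):
    """DMVAE's three reconstruction tasks (sketch).
    enc_i maps modality i to (z_private, z_shared); dec_i maps
    (z_private, z_shared) back to modality i; poe fuses shared codes."""
    zp1, zs1 = enc1(x1)
    zp2, zs2 = enc2(x2)
    zs = poe(zs1, zs2)  # PoE-consistent shared code from both modalities
    return {
        # 1) joint: both modalities reconstructed from the fused shared code
        "joint": (dec1(zp1, zs), dec2(zp2, zs)),
        # 2) self: each modality reconstructed from its own shared code
        "self": (dec1(zp1, zs1), dec2(zp2, zs2)),
        # 3) cross: shared code taken from the opposite modality
        "cross": (dec1(zp1, zs2), dec2(zp2, zs1)),
    }

# Toy scalar "modalities": encoder splits a value into private/shared halves,
# decoder sums them back, and PoE is replaced by a simple average.
enc = lambda x: (x / 2, x / 2)
dec = lambda zp, zs: zp + zs
poe = lambda a, b: (a + b) / 2
paths = dmvae_reconstruction_paths(4.0, 8.0, enc, enc, dec, dec, poe)
```

In the toy example only the self path reconstructs exactly; the joint and cross paths mix shared information across modalities, which is precisely what the corresponding loss terms penalize during training.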
  • 19. Experiments: Digits (image-image)
• Evaluation
Qualitative: cross-generation between modalities
Quantitative: accuracy of the cross-generated images using a pre-trained classifier for each modality
- Joint: a sample from z_s generates two image modalities that must be assigned the same class
(Figure: input digits and cross-generated outputs for different samples of z_{p1} and z_{p2})
  • 20. Experiments: Digits (image-image)
• Ablation study: DMVAE [2] vs. MMVAE [1]'s shared latent space
  • 21. Experiments: Flowers (image-text)
• This task is more complex
Instead of the raw image/text, intermediate features are reconstructed
• Quantitative evaluation
Class recognition (image-to-text) and cosine-similarity retrieval (text-to-image) on the shared latent space
• Qualitative evaluation
Retrieval
  • 22. Conclusions
• Multimodal VAE for disentangling private and shared spaces
Improves the representational performance of multimodal VAEs
Successful application to image-image and image-text modalities
• Shaping a latent space into subspaces that capture the private-shared aspects of the modalities "is important from the perspective of downstream tasks, where better decomposed representations are more amenable for using on a wider variety of tasks"
  • 23. [3] Multimodal VAEs via mutual supervision
• Main differences with [1] and [2]
A type of multimodal VAE, without private-shared disentanglement
Does not rely on factorizations such as MoE or PoE for modeling modality-shared information
Instead, it repurposes semi-supervised VAEs for combining inter-modality information
- Allows learning from partially observed modalities (Reg. = KL divergence)
• Proposed method: Mutually supErvised Multimodal vaE (MEME)
  • 24. [3] Multimodal VAEs via mutual supervision
• Qualitative evaluation
Cross-modal generation
• Quantitative evaluation
Coherence: percentage of matching predictions for the cross-generated modality using a pretrained classifier
Relatedness: Wasserstein distance between the representations of two modalities (closer if same class)
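The relatedness metric on slide 24 compares two latent representations with a Wasserstein distance. As a minimal sketch (my simplification, not the authors' exact computation), the squared 2-Wasserstein distance between two diagonal-Gaussian posteriors has a closed form:

```python
import numpy as np

def w2_squared_diag_gaussians(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between N(mu1, diag(var1)) and
    N(mu2, diag(var2)). For diagonal covariances the general Bures term
    reduces to sum((sqrt(var1) - sqrt(var2))**2)."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return mean_term + cov_term

# Two unit-variance posteriors: the distance reduces to the mean gap
d = w2_squared_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
# d == 3**2 + 4**2 = 25.0
```

With equal covariances the covariance term vanishes, so representations of the same class (nearby means) score as more "related", matching the slide's intuition.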
  • 26. Final remarks
• VAEs are useful not only for generation but also for reconstruction and disentanglement tasks
Recommended textbook: "An Introduction to Variational Autoencoders", Kingma & Welling
• Private-shared latent spaces as an effective tool for analyzing multimodal data
• There is still a lot of potential in this research
It has been applied to only a limited number of multimodal problems
• PhD students interested in this topic → we are recruiting interns: https://www.cyberagent.co.jp/news/detail/id=27453
Other topics are also fine! Joint research is also very welcome!
  • 27. Thank you! Antonio TEJERO DE PABLOS antonio_tejero@cyberagent.co.jp