2022-04-21
Sangmin Woo
Computational Intelligence Lab.
School of Electrical Engineering
Korea Advanced Institute of Science and Technology (KAIST)
Masked Autoencoders
Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Facebook AI Research (FAIR)
Submitted to CVPR 2022
Video: https://www.youtube.com/watch?v=Dp6iICL2dVI
In a nutshell…
Masked Autoencoder (MAE)
 MAE is a scalable self-supervised learner for computer vision
 Mask random patches of the input image and reconstruct the missing
pixels
 Asymmetric Encoder-Decoder Architecture
• Visible subset of patches (without mask tokens) → Encoder
• Latent representation & Mask tokens → Decoder (lightweight)
 Masking a high proportion of the input image, e.g., 75%, yields a non-trivial and meaningful self-supervisory task
 Accelerates training (by 3× or more) and improves accuracy
 Enables learning high-capacity models that generalize well
• Vanilla ViT-H [1] achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data
 Transfer performance in downstream tasks outperforms supervised pre-
training and shows promising scaling behavior
[1] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021.
Background
Masked Language Modeling
 Success of self-supervised pre-training in NLP
 Masked language modeling (e.g., BERT [1])
 Autoregressive language modeling (e.g., GPT [2])
 Method: remove a portion of the input sequence and learn to
predict the missing content
 These methods have been shown to scale excellently
 These pre-trained representations generalize well to various
downstream tasks
[1] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL 2019.
[2] Radford, Alec, et al. "Improving language understanding by generative pre-training." OpenAI 2018.
Background
Autoencoder
 Encoder maps an input to a latent representation
 Decoder reconstructs the input
 E.g., PCA and k-means are autoencoders
 Denoising autoencoders (DAE) [1] are a class of autoencoders that corrupt an
input signal and learn to reconstruct the original, uncorrupted signal
 A series of methods can be thought of as a generalized DAE under different
corruptions, e.g., masking pixels or removing color channels
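To make the DAE recipe concrete, here is a minimal sketch of one denoising training step (illustrative PyTorch; the Gaussian-noise corruption and layer sizes are assumptions, not taken from any of the cited papers):

```python
import torch
import torch.nn as nn

# Minimal denoising-autoencoder sketch: corrupt the input, reconstruct the clean signal.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())   # input -> latent
decoder = nn.Linear(128, 784)                             # latent -> reconstruction
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(32, 784)                   # batch of clean (flattened) images
x_noisy = x + 0.3 * torch.randn_like(x)   # corruption: additive Gaussian noise

recon = decoder(encoder(x_noisy))         # reconstruct from the corrupted input
loss = nn.functional.mse_loss(recon, x)   # target is the original, uncorrupted signal
opt.zero_grad()
loss.backward()
opt.step()
```

Masking pixels (as in MAE) simply swaps the Gaussian corruption for patch removal.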
Masked Image Encoding
 DAE [1] presents masking as a noise type
 Convolution-based
 Context Encoder [2] inpaints large missing regions
 Transformer-based
 iGPT [3] operates on sequences of pixels and predicts unknown pixels
 The ViT [4] studies masked patch prediction for self-supervised learning
 Most recently, BEiT [5] proposes to predict discrete tokens
[1] Vincent, Pascal, et al. "Extracting and composing robust features with denoising autoencoders." ICML 2008.
[2] Pathak, Deepak, et al. "Context encoders: Feature learning by inpainting." CVPR 2016.
[3] Chen, Mark, et al. "Generative pretraining from pixels." ICML 2020.
[4] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021.
[5] Bao, Hangbo, et al. "BEiT: BERT pre-training of image transformers." arXiv 2021.
Background
Self-supervised Learning
 Early self-supervised learning approaches often focused on different pretext tasks for pre-training
 Contrastive learning has been popular; these methods model image similarity and dissimilarity (or only similarity) between two or more views
 Contrastive and related methods strongly depend on data augmentation
 Autoencoding pursues a conceptually different direction, and it exhibits
different behaviors
What makes masked autoencoding different between vision and language?
Architectural Gap
 Convolutions typically operate on regular grids
 It is not straightforward to integrate ‘indicators’ such as mask tokens or positional embeddings
into convolutional networks
 Solution: ViT [1]
Information Density
 Language: human-generated signals that are highly semantic and information-dense; predicting only a few missing words per sentence requires sophisticated language understanding
 Images: natural signals with heavy spatial redundancy; a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes
 Solution: masking a very high portion of random patches
 This strategy largely reduces redundancy and creates a challenging self-supervisory task that
requires holistic understanding beyond low-level image statistics
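As a rough sketch of what this looks like in tensors (the 16×16 patch size and 75% ratio follow the paper; the tensor-level details are illustrative, not the paper's code):

```python
import torch

# Split an image into non-overlapping 16x16 patches and keep a random 25%.
img = torch.rand(3, 224, 224)                                # (C, H, W)
p = 16
patches = img.unfold(1, p, p).unfold(2, p, p)                # (3, 14, 14, 16, 16)
patches = patches.reshape(3, -1, p, p).permute(1, 0, 2, 3)   # (196, 3, 16, 16)

mask_ratio = 0.75
num_keep = int(patches.shape[0] * (1 - mask_ratio))          # 49 of 196 patches
perm = torch.randperm(patches.shape[0])
visible = patches[perm[:num_keep]]                           # only these are encoded
print(visible.shape)                                         # torch.Size([49, 3, 16, 16])
```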
The Role of Autoencoder’s Decoder
 Language: decoder predicts missing words that contain rich semantic information
 Image: decoder reconstructs pixels, hence its output is of a lower semantic level than common
recognition tasks
 Decoder design plays a key role in determining the semantic level of the learned latent
representations for images
Masked Autoencoder (MAE)
MAE architecture
MAE: A Simple, Effective, and Scalable Self-supervised Learner
 MAE masks random patches from the input image and reconstructs the missing
patches in the pixel space
 Asymmetric encoder-decoder design
 Encoder operates only on the visible subset of patches (without mask tokens)
 (Lightweight) Decoder reconstructs the input from the latent representation along with
mask tokens
 Shifting the mask tokens to the small decoder results in a large reduction in computation
 A very high masking ratio (e.g., 75%) can achieve a win-win scenario
 It optimizes accuracy while allowing the encoder to process only a small portion (e.g.,
25%) of patches
 This can reduce overall pre-training time by 3× or more and likewise reduce memory consumption, enabling us to easily scale our MAE to large models
 With MAE pre-training,
 Vanilla ViT-H achieves 87.8% accuracy when fine-tuned on ImageNet-1K
 Transfer learning on object detection, instance segmentation, and semantic segmentation achieves better results than supervised pre-training counterparts
 Significant gains by scaling up models
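A back-of-the-envelope estimate of where the speedup comes from (the unit costs below are assumptions for illustration; attention cost is not exactly linear in token count, and the 3×-or-more figure above is a measured wall-clock number):

```python
# Rough compute estimate under simple assumptions: per-token encoder cost of
# 1 unit, per-token decoder cost of 0.1 units (<10%), 196 patches, 75% masked.
tokens_total = 196
encoder_tokens = int(tokens_total * 0.25)   # encoder sees only visible patches
decoder_tokens = tokens_total               # decoder sees all tokens

baseline = tokens_total * 1.0               # encoder over all tokens, no masking
mae = encoder_tokens * 1.0 + decoder_tokens * 0.1
print(baseline / mae)                       # ~2.9x less compute per image,
                                            # in line with the reported 3x+ speedup
```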
MAE Reconstruction
ImageNet validation images
 Masking ratio: 80%
 No loss is computed on visible patches (i.e., loss is only computed on masked patches)
COCO validation images
 MAE trained on ImageNet
ImageNet validation images
 MAE pre-trained with a masking ratio of 75%
 Applied to inputs with masking ratios of 75%, 85%, and 95%
Masked Autoencoder (MAE)
Masking
 Image is divided into regular non-overlapping patches
 Sample a subset of patches and mask (i.e., remove) the remaining ones
 High masking ratio (e.g., 75%)
MAE encoder
 Encoder is a ViT, but applied only on visible, unmasked patches
 Encoder only operates on a small subset (e.g., 25%) of the full set
 Masked patches are removed; no mask tokens are used
 This allows us to train very large encoders with only a fraction of compute and
memory
MAE decoder
 The input to the MAE decoder is the full set of tokens consisting of encoded visible
patches and mask tokens
 Positional embeddings are added to all tokens in this full set
 MAE decoder is only used during pre-training to perform the image reconstruction
task (only the encoder is used to produce image representations for recognition)
 Decoder has <10% computation per token vs. the encoder
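A sketch of how the decoder input can be assembled from these pieces (names and dimensions are illustrative assumptions, not the paper's code; the shuffle/unshuffle bookkeeping is shown in the implementation sketch in the next section):

```python
import torch
import torch.nn as nn

# Illustrative decoder-input assembly: 196 patches, 49 visible, decoder width 512.
B, N, N_vis, D = 8, 196, 49, 512
latent = torch.rand(B, N_vis, D)                 # encoded visible patches

mask_token = nn.Parameter(torch.zeros(1, 1, D))  # one shared, learned vector
full = torch.cat([latent, mask_token.expand(B, N - N_vis, D)], dim=1)  # (B, N, D)

# Positional embeddings are added to ALL tokens; without them, the identical
# mask tokens would carry no information about their location in the image.
pos_embed = nn.Parameter(torch.randn(1, N, D) * 0.02)
dec_in = full + pos_embed                        # input to the lightweight decoder
```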
Reconstruction Target
 MAE reconstructs the input by predicting the pixel values for each
masked patch
 Loss function computes the MSE between reconstructed and original
images in the pixel space
 Loss is only computed on masked patches, similar to BERT
Simple Implementation
1. Generate a token for every input patch
2. Randomly shuffle the list of tokens and remove the last portion of the list
3. After encoding, mask tokens are appended to the list of encoded patches, and the full list is unshuffled (inverting the random shuffle operation)
4. The decoder is applied to this full list (with positional embeddings added)
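These four steps reduce to a handful of tensor operations. Below is a hedged sketch in that spirit (stub modules stand in for the ViT encoder and decoder blocks; names and sizes are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

B, N, D, mask_ratio = 8, 196, 768, 0.75
patch_embed = nn.Linear(16 * 16 * 3, D)      # 1. a token for every input patch
encoder = nn.Identity()                      # stand-in for ViT encoder blocks
decoder = nn.Linear(D, 16 * 16 * 3)          # stand-in for the lightweight decoder

patches = torch.rand(B, N, 16 * 16 * 3)      # ground-truth flattened patches
tokens = patch_embed(patches)

# 2. random shuffle, drop the last portion
noise = torch.rand(B, N)
ids_shuffle = torch.argsort(noise, dim=1)
ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation
n_keep = int(N * (1 - mask_ratio))
ids_keep = ids_shuffle[:, :n_keep]
visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

latent = encoder(visible)                    # encoder sees visible tokens only

# 3. append mask tokens, then unshuffle back to the original patch order
mask_token = torch.zeros(1, 1, D)
full = torch.cat([latent, mask_token.expand(B, N - n_keep, D)], dim=1)
full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))

# 4. decode (positional embeddings omitted here) and take MSE on masked patches only
pred = decoder(full)
mask = torch.ones(B, N)
mask[:, :n_keep] = 0                         # 0 = visible, 1 = masked (shuffled order)
mask = torch.gather(mask, 1, ids_restore)    # back to the original patch order
loss = ((pred - patches) ** 2).mean(dim=-1)  # per-patch MSE
loss = (loss * mask).sum() / mask.sum()      # averaged over masked patches only
```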
ImageNet Experiments
Experimental Settings
 Self-supervised pre-training on the ImageNet-1K training set → supervised training to evaluate the representations with:
• End-to-end fine-tuning
• Linear probing (freeze the backbone and train only a linear classifier; see the sketch below)
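A minimal sketch of the difference between the two protocols (`backbone` is a placeholder for the pre-trained MAE encoder; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(768, 768), nn.GELU())  # stand-in encoder
head = nn.Linear(768, 1000)                               # ImageNet-1K classifier

# Linear probing: freeze the backbone, train only the linear head.
for p in backbone.parameters():
    p.requires_grad = False
probe_opt = torch.optim.SGD(head.parameters(), lr=0.1)

# End-to-end fine-tuning: unfreeze everything, typically at a lower learning rate.
for p in backbone.parameters():
    p.requires_grad = True
finetune_opt = torch.optim.AdamW(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
```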
ViT-L trained from scratch vs. fine-tuned from MAE
 Backbone: ViT-L/16
 Scratch: 200 epochs vs. Fine-tuning: 50 epochs
 Fine-tuning accuracy heavily depends on pre-training
Masking ratio
 For reference, BERT's typical masking ratio is 15%; MAE performs best with a much higher masking ratio (around 75%)
Ablations
Wall-clock time (encoder with vs. without mask tokens)
Mask sampling strategies (random vs. block vs. grid)
Training Schedules
 No saturation of linear probing accuracy is observed even at 1600 epochs
 In contrast, MoCo v3 saturates at 300 epochs for ViT-L
Comparisons with Previous Results
Comparisons with self-supervised methods
Comparisons with supervised pre-training
Partial Fine-tuning
MAE vs. MoCo v3 (w.r.t. the number of fine-tuned Transformer blocks)
Transfer Learning Experiments
Object Detection and Segmentation
Semantic Segmentation
Pixels vs. Tokens
Appendix: Vision Transformers
Sorted by Date
https://github.com/DirtyHarryLYL/Transformer-in-Vision
Sorted by Task
https://github.com/Yangzhangcst/Transformer-in-Computer-Vision
Sorted by Venue
https://github.com/dk-liang/Awesome-Visual-Transformer
Thank You!
Sangmin Woo
sangminwoo.github.io
smwoo95@kaist.ac.kr
Q&A

Mais conteúdo relacionado

Mais procurados

PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
Jinwon Lee
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
Jinwon Lee
 
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
NAVER Engineering
 

Mais procurados (20)

Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer Vision
 
Introduction of Deep Learning
Introduction of Deep LearningIntroduction of Deep Learning
Introduction of Deep Learning
 
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic SegmentationSemantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
 
Semantic segmentation with Convolutional Neural Network Approaches
Semantic segmentation with Convolutional Neural Network ApproachesSemantic segmentation with Convolutional Neural Network Approaches
Semantic segmentation with Convolutional Neural Network Approaches
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
210523 swin transformer v1.5
210523 swin transformer v1.5210523 swin transformer v1.5
210523 swin transformer v1.5
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Generative adversarial text to image synthesis
Generative adversarial text to image synthesisGenerative adversarial text to image synthesis
Generative adversarial text to image synthesis
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
Pr083 Non-local Neural Networks
Pr083 Non-local Neural NetworksPr083 Non-local Neural Networks
Pr083 Non-local Neural Networks
 
Convolutional neural network from VGG to DenseNet
Convolutional neural network from VGG to DenseNetConvolutional neural network from VGG to DenseNet
Convolutional neural network from VGG to DenseNet
 
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
1시간만에 GAN(Generative Adversarial Network) 완전 정복하기
 

Semelhante a Masked Autoencoders Are Scalable Vision Learners.pptx

Semelhante a Masked Autoencoders Are Scalable Vision Learners.pptx (20)

jefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretrainingjefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretraining
 
Sign Detection from Hearing Impaired
Sign Detection from Hearing ImpairedSign Detection from Hearing Impaired
Sign Detection from Hearing Impaired
 
| Assignment on Masked Autoencoders (MAE) |
| Assignment on Masked Autoencoders (MAE) || Assignment on Masked Autoencoders (MAE) |
| Assignment on Masked Autoencoders (MAE) |
 
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptxAn Empirical Study of Training Self-Supervised Vision Transformers.pptx
An Empirical Study of Training Self-Supervised Vision Transformers.pptx
 
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
5 ijaems sept-2015-9-video feature extraction based on modified lle using ada...
 
Color Detection & Segmentation based Invisible Cloak
Color Detection & Segmentation based Invisible CloakColor Detection & Segmentation based Invisible Cloak
Color Detection & Segmentation based Invisible Cloak
 
BEIT:BERT Pre-training of Image Transformers.pdf
BEIT:BERT Pre-training of Image Transformers.pdfBEIT:BERT Pre-training of Image Transformers.pdf
BEIT:BERT Pre-training of Image Transformers.pdf
 
False colouring
False colouringFalse colouring
False colouring
 
Depth-DensePose: an efficient densely connected deep learning model for came...
Depth-DensePose: an efficient densely connected deep learning  model for came...Depth-DensePose: an efficient densely connected deep learning  model for came...
Depth-DensePose: an efficient densely connected deep learning model for came...
 
PIES_Profile_INDIA
PIES_Profile_INDIAPIES_Profile_INDIA
PIES_Profile_INDIA
 
A Pointing Gesture-based Signal to Text Communication System Using OpenCV in ...
A Pointing Gesture-based Signal to Text Communication System Using OpenCV in ...A Pointing Gesture-based Signal to Text Communication System Using OpenCV in ...
A Pointing Gesture-based Signal to Text Communication System Using OpenCV in ...
 
ML Paper Tutorial - Video Face Manipulation Detection Through Ensemble of CNN...
ML Paper Tutorial - Video Face Manipulation Detection Through Ensemble of CNN...ML Paper Tutorial - Video Face Manipulation Detection Through Ensemble of CNN...
ML Paper Tutorial - Video Face Manipulation Detection Through Ensemble of CNN...
 
Image De-Noising Using Deep Neural Network
Image De-Noising Using Deep Neural NetworkImage De-Noising Using Deep Neural Network
Image De-Noising Using Deep Neural Network
 
IMAGE DE-NOISING USING DEEP NEURAL NETWORK
IMAGE DE-NOISING USING DEEP NEURAL NETWORKIMAGE DE-NOISING USING DEEP NEURAL NETWORK
IMAGE DE-NOISING USING DEEP NEURAL NETWORK
 
IRJET - Visual Question Answering – Implementation using Keras
IRJET -  	  Visual Question Answering – Implementation using KerasIRJET -  	  Visual Question Answering – Implementation using Keras
IRJET - Visual Question Answering – Implementation using Keras
 
Robust Malware Detection using Residual Attention Network
Robust Malware Detection using Residual Attention NetworkRobust Malware Detection using Residual Attention Network
Robust Malware Detection using Residual Attention Network
 
IMAGE SEGMENTATION AND ITS TECHNIQUES
IMAGE SEGMENTATION AND ITS TECHNIQUESIMAGE SEGMENTATION AND ITS TECHNIQUES
IMAGE SEGMENTATION AND ITS TECHNIQUES
 
IRJET- Efficient Face Detection from Video Sequences using KNN and PCA
IRJET-  	  Efficient Face Detection from Video Sequences using KNN and PCAIRJET-  	  Efficient Face Detection from Video Sequences using KNN and PCA
IRJET- Efficient Face Detection from Video Sequences using KNN and PCA
 
Introduction to Segmentation in Computer vision
Introduction to Segmentation in Computer vision Introduction to Segmentation in Computer vision
Introduction to Segmentation in Computer vision
 
What multimodal foundation models cannot perceive
What multimodal foundation models cannot perceiveWhat multimodal foundation models cannot perceive
What multimodal foundation models cannot perceive
 

Mais de Sangmin Woo

Mais de Sangmin Woo (12)

Multimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptxMultimodal Learning with Severely Missing Modality.pptx
Multimodal Learning with Severely Missing Modality.pptx
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptx
 
Visual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptxVisual Commonsense Reasoning.pptx
Visual Commonsense Reasoning.pptx
 
Video Grounding.pptx
Video Grounding.pptxVideo Grounding.pptx
Video Grounding.pptx
 
Action Recognition Datasets.pptx
Action Recognition Datasets.pptxAction Recognition Datasets.pptx
Action Recognition Datasets.pptx
 
Exploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation LearningExploring Simple Siamese Representation Learning
Exploring Simple Siamese Representation Learning
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
 
Towards Efficient Transformers
Towards Efficient TransformersTowards Efficient Transformers
Towards Efficient Transformers
 
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene GraphsAction Genome: Action As Composition of Spatio Temporal Scene Graphs
Action Genome: Action As Composition of Spatio Temporal Scene Graphs
 
Neural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global contextNeural motifs scene graph parsing with global context
Neural motifs scene graph parsing with global context
 
Attentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene GraphsAttentive Relational Networks for Mapping Images to Scene Graphs
Attentive Relational Networks for Mapping Images to Scene Graphs
 
Graph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph GenerationGraph R-CNN for Scene Graph Generation
Graph R-CNN for Scene Graph Generation
 

Último

Último (20)

Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 

Masked Autoencoders Are Scalable Vision Learners.pptx

  • 1. 2022-04-21 Sangmin Woo Computational Intelligence Lab. School of Electrical Engineering Korea Advanced Institute of Science and Technology (KAIST) Masked Autoencoders Are Scalable Vision Learners Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick Facebook AI Research (FAIR) Submitted to CVPR 2022 Video: https://www.youtube.com/watch?v=Dp6iICL2dVI
  • 2. 2 In a nutshell… Masked Autoencoder (MAE)  MAE are scalable self-supervised learners for computer vision  Mask random patches of the input image and reconstruct the missing pixels  Asymmetric Encoder-Decoder Architecture • Visible subset of patches (without mask tokens) → Encoder • Latent representation & Mask tokens → Decoder (lightweight)  Masking high proportion of the input image, e.g., 75%, yields a non-trivial and meaningful self-supervisory task  Accelerate training (by 3× or more) and improves accuracy  Learning high-capacity models that generalizes well • Vanilla ViT-H [] achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data  Transfer performance in downstream tasks outperforms supervised pre- training and shows promising scaling behavior [1] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021.
  • 3. 3 Background Masked Language Modeling  Success of self-supervised pre-training in NLP  Masked language modeling (e.g., BERT [])  Autoregressive language modeling (e.g., GPT [])  Method: remove a portion of the input sequence and learn to predict the missing content  These methods have been shown to scale excellently  These pre-trained representations generalize well to various downstream tasks [1] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. [2]
  • 4. 4 Background Autoencoder  Encoder maps an input to a latent representation  Decoder reconstructs the input  E.g., PCA and k-means are autoencoders  Denoising autoencoders (DAE) [1] are a class of autoencoders that corrupt an input signal and learn to reconstruct the original, uncorrupted signal  A series of methods can be thought of as a generalized DAE under different corruptions, e.g., masking pixels or removing color channels Masked Image Encoding  DAE [1] presents masking as a noise type  Convolution-based  Context Encoder [2] inpaints large missing regions  Transformer-based  iGPT [3] operates on sequences of pixels and predicts unknown pixels  The ViT [4] studies masked patch prediction for self-supervised learning  Most recently, BEiT [5] proposes to predict discrete tokens [1] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. [2] [3] [4] [5]
  • 5. 5 Background Self-supervised Learning  Early self-supervised learning approaches often focused on different pretext tasks [] for pre-training  Contrastive learning [] has been popular, e.g., [], which models image similarity and dissimilarity (or only similarity []) between two or more views  Contrastive and related methods strongly depend on data augmentation []  Autoencoding pursues a conceptually different direction, and it exhibits different behaviors [1] Vaswani, Ashish, et al. "Attention is all you need." NeurIPS 2017. [2] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. “Non-local neural networks.” CVPR 2018. [3] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. “Gcnet: Non-local networks meet squeeze-excitation networks and beyond.” ICCVW 2019. [4] Prajit Ramachandran, et al. “Stand-alone self-attention in vision models.” NeurIPS 2019. [5] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. “Exploring self-attention for image recognition.” CVPR 2020.. [6] Nicolas Carion, et al. “End-to-end object detection with Transformers.” ECCV 2020. [7] Alexey Dosovitskiy, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” ICLR 2021.
  • 6. 6 What makes masked autoencoding different between vision and language? Architectural Gap  Convolutions typically operate on regular grids  It is not straightforward to integrate ‘indicators’ such as mask tokens or positional embeddings into convolutional networks  Solution: ViT [1] Information Density  Language: human-generated signals, highly semantic, information-dense, Need sophisticated language understanding to train a model to predict only a few missing words per sentence  Images: natural signals, heavy spatial redundancy, A missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes  Solution: masking a very high portion of random patches  This strategy largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics The Role of Autoencoder’s Decoder  Language: decoder predicts missing words that contain rich semantic information  Image: decoder reconstructs pixels, hence its output is of a lower semantic level than common recognition tasks  Decoder design plays a key role in determining the semantic level of the learned latent representations for images He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021.
  • 7. 7 Masked Autoencoder (MAE) MAE architecture He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021.
  • 8. 8 MAE: A Simple, Effective, and Scalable Self-supervised Learner  MAE masks random patches from the input image and reconstructs the missing patches in the pixel space  Asymmetric encoder-decoder design  Encoder operates only on the visible subset of patches (without mask tokens)  (Lightweight) Decoder reconstructs the input from the latent representation along with mask tokens  Shifting the mask tokens to the small decoder results in a large reduction in computation  A very high masking ratio (e.g., 75%) can achieve a win-win scenario  It optimizes accuracy while allowing the encoder to process only a small portion (e.g., 25%) of patches  This can reduce overall pre-training time by 3 × or more and likewise reduce memory consumption, enabling us to easily scale our MAE to large models  With MAE pre-training,  Vanilla ViT-H achieves 87.8% accuracy when fine-tuned on ImageNet-1K  Transfer Learning on object detection, instance segmentation, semantic segmentation achieve better results than its supervised pre-training counterparts  Significant gains by scaling up models He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Masked Autoencoder (MAE)
  • 9. 9 ImageNet validation images  Masking ratio 80%  No loss is computed on visible patches (i.e., Loss is only computed on masked patches) He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. MAE Reconstruction
  • 10. 10 COCO validation images  MAE trained on ImageNet He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. MAE Reconstruction
  • 11. 11 ImageNet validation images  MAE pre-trained with a masking ratio 75%  Applied on inputs with higher masking ratios (75%, 85%, 95%) He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. MAE Reconstruction
  • 12. 12 Masking  Image is divided into regular non-overlapping patches  Sample a subset of patches and mask (i.e., remove) the remaining ones  High masking ratio (e.g., 75%) MAE encoder  Encoder is a ViT [] but applied only on visible, unmasked patches  Encoder only operates on a small subset (e.g., 25%) of the full set  Masked patches are removed; No mask tokens are used  This allows us to train very large encoders with only a fraction of compute and memory MAE decoder  The input to the MAE decoder is the full set of tokens consisting of encoded visible patches and mask tokens  Positional embeddings are added to all tokens in this full set  MAE decoder is only used during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition)  Decoder has <10% computation per token vs. the encoder He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Masked Autoencoder (MAE)
  • 13. 13 Reconstruction Target  MAE reconstructs the input by predicting the pixel values for each masked patch  Loss function computes the MSE between reconstructed and original images in the pixel space  Loss is only computed on masked patches, similar to BERT []  masks random patches from the input image and reconstructs the missing patches in the pixel space Simple Implementation 1. Generate a token for every input patch 2. Randomly shuffle the list of tokens and remove the last portion of the list 3. After encoding, mask tokens are appended to the list of encoded patches, and unshuffle the full list (inverting the random shuffle operation) 4. The decoder is applied to this full list (with positional embeddings added) He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Masked Autoencoder (MAE)
  • 14. 14 Experimental Settings  Self-supervised pre-training on the ImageNet-1K training set → supervised training to evaluate the representation with • End-to-end fine-tuning • Linear probing (freeze backbone and only train linear layer) ViT-L trained from scratch vs. fine-tuned from MAE  Backbone: ViT-L/16  Scratch: 200 epochs vs. Fine-tuning: 50 epochs  Fine-tuning accuracy heavily depends on pre-training He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. ImageNet Experiments
  • 15. 15 Masking ratio  BERT typical masking ratio: 15% He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. ImageNet Experiments
  • 16. 16 Ablations He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. ImageNet Experiments
  • 17. 17 Wall-clock time (with mask vs. without mask) He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. ImageNet Experiments
  • 18. 18 Mask sampling strategies (random vs. block vs. grid) He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. ImageNet Experiments
  • 19. 19 Training Schedules  Not observed saturation of linear probing accuracy even at 1600 epochs  MoCo v3 saturates at 300 epochs for ViT-L He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. ImageNet Experiments
  • 20. 20 Comparisons with self-supervised methods He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Comparisons with Previous Results
  • 21. 21 Comparisons with supervised pre-training He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Comparisons with Previous Results
  • 22. 22 MAE vs. MoCo v3 (w.r.t. # fine-tuned transformer blocks) He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Partial Fine-tuning
  • 23. 23 Object Detection and Segmentation He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Transfer Learning Experiments
  • 24. 24 Semantic Segmentation He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Transfer Learning Experiments
  • 25. 25 Pixels vs. Tokens He, Kaiming, et al. "Masked autoencoders are scalable vision learners." arXiv 2021. Transfer Learning Experiments
  • 26. 26 Appendix: Vision Transformers Sorted by Date https://github.com/DirtyHarryLYL/Transformer-in-Vision Sorted by Task https://github.com/Yangzhangcst/Transformer-in- Computer-Vision Sorted by Venue https://github.com/dk-liang/Awesome-Visual-Transformer

Editor's Notes
1. [Masking] A high masking ratio (e.g., 75%) largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation from visible neighboring patches.
[MAE encoder] The highly sparse input creates an opportunity for designing an efficient encoder.
[MAE decoder] The full set of tokens is handled by a lightweight decoder. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Positional embeddings are added to all tokens in this full set; without this, mask tokens would have no information about their location in the image. The decoder architecture can be flexibly designed in a manner that is independent of the encoder design.
  2. Thank you.