Masked Autoencoders
Are Scalable Vision Learners
Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”
14th November, 2021
PR12 Paper Review
JinWon Lee
Introduction
• Deep learning has witnessed an explosion of architectures of
continuously growing capability and capacity.
• Aided by the rapid gains in hardware, models today can easily overfit
one million images and begin to demand hundreds of millions of—
often publicly inaccessible—labeled images.
• This appetite for data has been successfully addressed in natural
language processing (NLP) by self-supervised pretraining.
Introduction
• The solutions, based on autoregressive language modeling in GPT and
masked autoencoding in BERT, are conceptually simple: they remove
a portion of the data and learn to predict the removed content.
• These methods now enable training of generalizable NLP models
containing over one hundred billion parameters.
Introduction
• The idea of masked autoencoders, a form of more general denoising
autoencoders, is natural and applicable in computer vision as well.
• However, despite significant interest in this idea following the success
of BERT, progress of autoencoding methods in vision lags behind NLP.
What makes masked autoencoding different
between vision and language?
• Until recently, architectures were different. In vision, convolutional
networks were dominant over the last decade.
• Convolutions typically operate on regular grids and it is not
straightforward to integrate ‘indicators’ such as mask tokens or
positional embeddings into convolutional networks.
• This architectural gap, however, has been addressed with the
introduction of Vision Transformers (ViT) and should no longer
present an obstacle.
What makes masked autoencoding different
between vision and language?
• Information density is different between language and vision.
• Languages are human-generated signals that are highly semantic and
information-dense.
• When training a model to predict only a few missing words per
sentence, this task appears to induce sophisticated language
understanding.
• Images, on the contrary, are natural signals with heavy spatial
redundancy—e.g., a missing patch can be recovered from neighboring
patches with little high-level understanding of parts, objects, and
scenes.
What makes masked autoencoding different
between vision and language?
• The autoencoder’s decoder, which maps the latent representation
back to the input, plays a different role between reconstructing text
and images.
• In vision, the decoder reconstructs pixels, hence its output is of a
lower semantic level than common recognition tasks.
• This is in contrast to language, where the decoder predicts missing
words that contain rich semantic information.
• While in BERT the decoder can be trivial (an MLP), the authors found that for
images, the decoder design plays a key role in determining the
semantic level of the learned latent representations.
Related Work
• Masked language modeling
▪ BERT and GPT are highly successful methods for pre-training in NLP.
▪ These methods hold out a portion of the input sequence and train models to
predict the missing content.
▪ These methods have been shown to scale excellently, and abundant evidence indicates that these pre-trained representations generalize well to various downstream tasks.
Related Work
• Autoencoding
▪ It has an encoder that maps an input to a latent representation and a decoder
that reconstructs the input.
▪ Denoising autoencoders (DAE) are a class of autoencoders that corrupt an
input signal and learn to reconstruct the original, uncorrupted signal.
▪ A series of methods can be thought of as a generalized DAE under different
corruptions, e.g., masking pixels or removing color channels.
Related Work
• Masked image encoding
▪ The pioneering work on DAE presents masking as a noise type.
▪ Context Encoder inpaints large missing regions using convolutional networks.
▪ Motivated by the success in NLP, related recent methods are based on
Transformers. iGPT operates on sequences of pixels and predicts unknown
pixels. The ViT paper studies masked patch prediction for self-supervised
learning. Most recently, BEiT proposes to predict discrete tokens.
BEiT: BERT Pre-Training of Image Transformers
Related Work
• Self-supervised learning (SSL)
▪ SSL approaches have seen significant interest in computer vision, often
focusing on different pretext tasks for pre-training.
▪ Recently, contrastive learning has been popular, which models image
similarity and dissimilarity (or only similarity) between two or more views.
▪ Contrastive and related methods strongly depend on data augmentation.
▪ Autoencoding pursues a conceptually different direction, and it exhibits
different behaviors.
Approach
• The proposed masked autoencoder (MAE) is a simple autoencoding
approach that reconstructs the original signal given its partial
observation.
• Like all autoencoders, MAE has an encoder that maps the observed
signal to a latent representation, and a decoder that reconstructs the
original signal from the latent representation.
• Unlike classical autoencoders, the authors adopt an asymmetric
design that allows the encoder to operate only on the partial,
observed signal (without mask tokens) and a lightweight decoder that
reconstructs the full signal from the latent representation and mask
tokens.
MAE Architecture
• During pre-training, a large random
subset of image patches (e.g., 75%)
is masked out.
• The encoder is applied to the small
subset of visible patches.
• Mask tokens are introduced after
the encoder, and the full set of
encoded patches and mask tokens
is processed by a small decoder
that reconstructs the original
image in pixels.
• After pre-training, the decoder is
discarded and the encoder is
applied to uncorrupted images to
produce representations for
recognition tasks.
Results
MAE Details
• Masking
▪ Following ViT, an image is divided into regular non-overlapping patches. Then
a subset of patches is sampled and masked.
▪ Sampling strategy is straightforward: random patches without replacement,
following a uniform distribution.
▪ Random sampling with a high masking ratio largely eliminates redundancy,
thus creating a task that cannot be easily solved by extrapolation from visible
neighboring patches.
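As a rough sketch of this uniform random sampling (PyTorch-style pseudocode, not the authors' released code; tensor shapes and the `mask_ratio` default below are illustrative assumptions):

import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Keep a uniform random subset of patch tokens (without replacement).

    patch_tokens: (B, num_patches, dim) embedded patches.
    Returns visible tokens, a binary mask in the original patch order
    (1 = masked), and the inverse permutation for restoring that order later.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))

    noise = torch.rand(B, N, device=patch_tokens.device)  # one score per patch
    shuffle_idx = torch.argsort(noise, dim=1)              # random permutation
    restore_idx = torch.argsort(shuffle_idx, dim=1)        # its inverse

    keep_idx = shuffle_idx[:, :num_keep]
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).repeat(1, 1, D))

    mask = torch.ones(B, N, device=patch_tokens.device)    # 1 = masked
    mask[:, :num_keep] = 0                                  # first entries are kept
    mask = torch.gather(mask, 1, restore_idx)               # back to original order
    return visible, mask, restore_idx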
MAE Details
• MAE encoder
▪ MAE encoder is a ViT but applied only on visible, unmasked patches.
▪ Just as in a standard ViT, MAE encoder embeds patches by a linear projection
with added positional embeddings.
▪ However, this encoder only operates on a small subset (e.g., 25%) of the full
set.
▪ This can reduce overall pre-training time by 3x or more and likewise reduce
memory consumption, enabling us to easily scale MAE to large models.
MAE Details
• MAE decoder
▪ The input to the MAE decoder is the full set of tokens consisting of (i)
encoded visible patches, and (ii) mask tokens.
▪ Each mask token is a shared, learned vector that indicates the presence of a
missing patch to be predicted.
▪ Positional embeddings are added to all tokens in this full set.
▪ The MAE decoder is only used during pre-training to perform the image
reconstruction task, so the decoder architecture can be flexibly designed in a
manner that is independent of the encoder design.
▪ MAE’s default decoder has <10% computation per token vs. the encoder. With this asymmetrical design, the full set of tokens is processed only by the lightweight decoder, which significantly reduces pre-training time.
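A hedged sketch of how the decoder input described above might be assembled; `mask_token`, `decoder_pos_embed`, `restore_idx`, and the 512-d width follow the slide and the masking sketch earlier, and are assumed names rather than the official implementation:

import torch
import torch.nn as nn

decoder_dim = 512  # default decoder width mentioned on the slides

# A single shared, learned vector stands in for every missing patch.
mask_token = nn.Parameter(torch.zeros(1, 1, decoder_dim))

def build_decoder_input(encoded_visible, restore_idx, decoder_pos_embed):
    """Append mask tokens, unshuffle tokens back to their original patch
    positions, and add positional embeddings (illustrative sketch).

    encoded_visible:   (B, num_visible, decoder_dim) encoder output projected
                       to the decoder width.
    restore_idx:       (B, num_patches) inverse of the random shuffle.
    decoder_pos_embed: (1, num_patches, decoder_dim) positional embeddings.
    """
    B, num_visible, D = encoded_visible.shape
    N = restore_idx.shape[1]

    masks = mask_token.repeat(B, N - num_visible, 1)
    full = torch.cat([encoded_visible, masks], dim=1)           # still shuffled
    full = torch.gather(
        full, 1, restore_idx.unsqueeze(-1).repeat(1, 1, D))     # original order
    return full + decoder_pos_embed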
MAE Details
• Reconstruction target
▪ MAE reconstructs the input by predicting the pixel values for each masked
patch.
▪ The loss function computes the mean squared error (MSE) between the
reconstructed and original images in the pixel space. The loss is computed only on masked patches, similar to BERT.
• Simple implementation
▪ First, generate a token for every input patch (by linear projection with an
added positional embedding). Next, randomly shuffle the list of tokens and
remove the last portion of the list, based on the masking ratio.
▪ After encoding, append a list of mask tokens to the list of encoded patches,
and unshuffle this full list (inverting the random shuffle operation) to align all
tokens with their targets.
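A minimal sketch of the reconstruction loss described above, assuming per-patch pixel targets and the binary mask (1 = masked) from the masking sketch; illustrative only, not the authors' code:

def masked_mse_loss(pred, target, mask):
    """Mean squared error in pixel space, averaged over masked patches only.

    pred, target: (B, num_patches, patch_pixels)
    mask:         (B, num_patches), 1 for masked (removed) patches.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # mean over pixels per patch
    return (per_patch * mask).sum() / mask.sum()     # mean over masked patches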
ImageNet Experiments
• The authors do self-supervised pre-training on the ImageNet-1K (IN1K)
training set.
• Then they do supervised training to evaluate the representations with
(i) end-to-end fine-tuning or (ii) linear probing.
• Baseline: ViT-Large
▪ ViT-Large (ViT-L/16) is used as the backbone in the ablation study.
▪ ViT-L is very big and tends to overfit.
▪ It is nontrivial to train supervised ViT-L from scratch and a good recipe with
strong regularization is needed.
Masking Ratio
• The optimal ratios are surprisingly high.
• 75% is good for both linear probing and
fine-tuning.
• This is in contrast with BERT (15%) and also much higher than the ratios in related works in computer vision (20%–50%).
• Reasoning-like behavior is linked to the
learning of useful representations.
• For linear probing, the accuracy increases
steadily until the sweet point, but for fine-
tuning, the results are less sensitive to the
ratios.
• All fine-tuning results are better than training from scratch (82.5%).
Decoder Design
• A sufficiently deep decoder is important for linear probing. This can be explained
by the gap between a pixel reconstruction task and a recognition task: the last
several layers in an autoencoder are more specialized for reconstruction, but are
less relevant for recognition.
• However, if fine-tuning is used, the last layers of the encoder can be tuned to
adapt to the recognition task. The decoder depth is less influential.
• Interestingly, MAE with a single-block decoder can perform strongly with fine-
tuning (84.8%). Note that a single Transformer block is the minimal requirement
to propagate information from visible tokens to mask tokens.
• Overall, the default MAE decoder is lightweight. It has 8 blocks and a width of 512-d.
It only has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d).
Mask Token
• If the encoder uses mask tokens, it
performs worse.
• In this case, there is a gap between
pre-training and deploying: this
encoder has a large portion of mask
tokens in its input in pretraining,
which does not exist in uncorrupted
images. This gap may degrade
accuracy in deployment.
Mask Token
• By skipping the mask token in the encoder,
training computation is greatly reduced.
• Note that the speedup can be >4x for a
masking ratio of 75%, partially because the
self-attention complexity is quadratic.
• In addition, memory is greatly reduced, which
can enable training even larger models or
speeding up more by large-batch training.
• The time and memory efficiency makes MAE
favorable for training very large models.
Reconstruction Target
• Using pixels with normalization improves accuracy. This per-patch normalization
enhances the contrast locally.
• In another variant, the authors perform PCA in the patch space and use the largest PCA
coefficients (96 here) as the target. Doing so degrades accuracy.
• The authors also compare an MAE variant that predicts tokens, the target used in BEiT.
Specifically for this variant, the DALL-E pre-trained dVAE is used as the tokenizer, following BEiT.
• This tokenization improves fine-tuning accuracy vs. unnormalized pixels, but has no
advantage vs. normalized pixels.
• The dVAE tokenizer requires one more pre-training stage, which may depend on extra
data (250M images). The dVAE encoder is a large convolutional network (40% FLOPs of
ViT-L) and adds nontrivial overhead. Using pixels does not suffer from these problems.
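The per-patch normalization variant can be illustrated as follows (a sketch; the function name and epsilon value are assumptions):

def normalized_pixel_target(target, eps=1e-6):
    """Per-patch normalized pixel target: standardize each patch by its own
    mean and variance, which enhances local contrast (sketch of the variant).

    target: (B, num_patches, patch_pixels) raw pixels of each patch.
    """
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    return (target - mean) / (var + eps) ** 0.5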
Data Augmentation
• MAE works well using cropping-only augmentation, either fixed-size or
random-size (both having random horizontal flipping).
• Adding color jittering degrades the results.
• Surprisingly, MAE behaves decently even if using no data augmentation
(only center-crop, no flipping). This property is dramatically different from
contrastive learning and related methods, which heavily rely on data
augmentation.
• In MAE, the role of data augmentation is mainly performed by random
masking. The masks are different for each iteration and so they generate
new training samples regardless of data augmentation.
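One possible torchvision expression of the cropping-only augmentation described above (the crop scale range is an illustrative assumption, not taken from the slides):

import torchvision.transforms as T

# Cropping-only pre-training augmentation: a random-size crop plus horizontal flip.
cropping_only = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])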
Mask Sampling Strategy
• MAE with block-wise masking works reasonably well at a ratio of 50%,
but degrades at a ratio of 75%.
• Grid-wise sampling regularly keeps one of every four patches. This is an easier task with lower training loss; the reconstruction is sharper, but the representation quality is lower.
• Simple random sampling works the best for MAE. It allows for a
higher masking ratio, which provides a greater speedup benefit.
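For contrast, the grid-wise sampling mentioned above, which keeps one of every four patches, could be sketched as follows (assumed helper, not from the paper):

import torch

def grid_mask(patches_per_side):
    """Regular grid masking: keep one patch in every 2x2 block (75% masked).

    Returns a (patches_per_side**2,) binary mask in row-major order, 1 = masked.
    """
    mask = torch.ones(patches_per_side, patches_per_side)
    mask[::2, ::2] = 0  # keep the top-left patch of each 2x2 block
    return mask.flatten()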
Training Schedule
• The accuracy improves steadily with longer
training. Indeed, saturation of linear probing accuracy has not been observed even at 1600 epochs.
• This behavior is unlike contrastive learning
methods, e.g., MoCo v3 saturates at 300
epochs for ViT-L.
• Note that the MAE encoder only sees 25% of
patches per epoch, while in contrastive
learning the encoder sees 200% (two crops) or
even more (multi-crop) patches per epoch.
Comparisons with Self-supervised Methods
• For ViT-B, all methods perform closely.
• For ViT-L, the gaps among methods are bigger,
suggesting that a challenge for bigger models is to
reduce overfitting.
• MAE can scale up easily and has shown steady
improvement from bigger models.
• By fine-tuning at 448 input size, MAE achieves 87.8%
accuracy, using only IN1K data. The previous best
accuracy, among all methods using only IN1K data,
is 87.1% (512 size), based on advanced networks.
• Comparing with BEiT, MAE is more accurate while
being simpler and faster.
Comparisons with Supervised Pre-training
• In the original ViT paper, ViT-L degrades
when trained on IN1K. The authors’ improved supervised recipe works better for training from scratch, but the accuracy saturates.
• MAE pre-training, using only IN1K, can
generalize better: the gain over training
from scratch is bigger for higher-capacity
models.
• It follows a trend similar to the JFT-300M
supervised pre-training. This comparison
shows that MAE can help scale up model
sizes.
Partial Fine-tuning
• Table 1 shows that linear probing and fine-tuning results are largely uncorrelated.
• Linear probing has been a popular protocol in the past few years; however, it
misses the opportunity of pursuing strong but non-linear features—which is
indeed a strength of deep learning.
• As a middle ground, the authors study a partial fine-tuning protocol: fine-tune the last several layers while freezing the others.
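A hedged sketch of this protocol for a ViT-style model, assuming the model exposes its Transformer blocks as `model.blocks` and a classification head as `model.head` (these attribute names are assumptions):

def partial_finetune(model, num_trainable_blocks):
    """Freeze the whole backbone, then unfreeze only the last few Transformer
    blocks and the classification head (sketch of partial fine-tuning).
    """
    for p in model.parameters():
        p.requires_grad = False
    if num_trainable_blocks > 0:
        for block in model.blocks[-num_trainable_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
    for p in model.head.parameters():
        p.requires_grad = True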
Partial Fine-tuning
• Notably, fine-tuning only one Transformer block
boosts the accuracy significantly from 73.5% to
81.0%. Moreover, fine-tuning only “half” of the last block (i.e., its MLP sub-block) gives 79.1%, much better than linear probing.
• Compared with MoCo v3, a contrastive method, MoCo v3 has higher linear probing accuracy than MAE; however, all of its partial fine-tuning results are worse than MAE’s.
• These results show that the MAE representations
are less linearly separable, but they are stronger
non-linear features and perform well when a non-
linear head is tuned. These observations suggest
that linear separability is not the sole metric for
evaluating representation quality.
Transfer Learning Experiments
• Object detection and instance segmentation
▪ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted for use with
FPN.
• Semantic segmentation
▪ Experiments on ADE20K use UperNet following the code in BEiT.
Transfer Learning Experiments
• Pixels vs. tokens
▪ While using dVAE tokens is better than using unnormalized pixels, it is
statistically similar to just using normalized pixels across all tasks and models.
It again shows that tokenization is not necessary for MAE.
Discussion and Conclusion
• Simple algorithms that scale well are the core of deep learning.
• In NLP, simple self-supervised learning methods enable benefits from
exponentially scaling models. In computer vision, practical pre-
training paradigms are dominantly supervised despite progress in
self-supervised learning.
• Self-supervised learning in vision may now be embarking on a similar
trajectory as in NLP.
Discussion and Conclusion
• On the other hand, images and languages are signals of a different
nature and this difference must be addressed carefully. Images are
merely recorded light without a semantic decomposition into the
visual analogue of words.
• MAE removes random patches that most likely do not form semantic segments, and it reconstructs pixels, which are not semantic entities. Nevertheless, MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics.
• The authors hypothesize that this behavior occurs by way of a rich
hidden representation inside the MAE.
Thank you

Mais conteúdo relacionado

Mais procurados

Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Bill Liu
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
changedaeoh
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoder
Jun Lang
 
Stable Diffusion path
Stable Diffusion pathStable Diffusion path
Stable Diffusion path
Vitaly Bondar
 
Transfer Learning and Fine-tuning Deep Neural Networks
 Transfer Learning and Fine-tuning Deep Neural Networks Transfer Learning and Fine-tuning Deep Neural Networks
Transfer Learning and Fine-tuning Deep Neural Networks
PyData
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
Daiki Tanaka
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
남주 김
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
Sangwoo Mo
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
Bushra Jbawi
 
Diffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesisDiffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesis
BeerenSahu
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
Universitat Politècnica de Catalunya
 
Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]
Dongmin Choi
 
Semantic Segmentation Methods using Deep Learning
Semantic Segmentation Methods using Deep LearningSemantic Segmentation Methods using Deep Learning
Semantic Segmentation Methods using Deep Learning
Sungjoon Choi
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
Jinwon Lee
 
Autoencoders in Deep Learning
Autoencoders in Deep LearningAutoencoders in Deep Learning
Autoencoders in Deep Learning
milad abbasi
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
Nuwan Sriyantha Bandara
 
[Paper review] contrastive language image pre-training, open ai, 2020
[Paper review] contrastive language image pre-training, open ai, 2020[Paper review] contrastive language image pre-training, open ai, 2020
[Paper review] contrastive language image pre-training, open ai, 2020
Seonghoon Jung
 
Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer Vision
Dongmin Choi
 
XGBoost & LightGBM
XGBoost & LightGBMXGBoost & LightGBM
XGBoost & LightGBM
Gabriel Cypriano Saca
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
MLReview
 

Mais procurados (20)

Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
 
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoder
 
Stable Diffusion path
Stable Diffusion pathStable Diffusion path
Stable Diffusion path
 
Transfer Learning and Fine-tuning Deep Neural Networks
 Transfer Learning and Fine-tuning Deep Neural Networks Transfer Learning and Fine-tuning Deep Neural Networks
Transfer Learning and Fine-tuning Deep Neural Networks
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
 
Diffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesisDiffusion models beat gans on image synthesis
Diffusion models beat gans on image synthesis
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]
 
Semantic Segmentation Methods using Deep Learning
Semantic Segmentation Methods using Deep LearningSemantic Segmentation Methods using Deep Learning
Semantic Segmentation Methods using Deep Learning
 
PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)PR-284: End-to-End Object Detection with Transformers(DETR)
PR-284: End-to-End Object Detection with Transformers(DETR)
 
Autoencoders in Deep Learning
Autoencoders in Deep LearningAutoencoders in Deep Learning
Autoencoders in Deep Learning
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
[Paper review] contrastive language image pre-training, open ai, 2020
[Paper review] contrastive language image pre-training, open ai, 2020[Paper review] contrastive language image pre-training, open ai, 2020
[Paper review] contrastive language image pre-training, open ai, 2020
 
Transformer in Computer Vision
Transformer in Computer VisionTransformer in Computer Vision
Transformer in Computer Vision
 
XGBoost & LightGBM
XGBoost & LightGBMXGBoost & LightGBM
XGBoost & LightGBM
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 

Semelhante a PR-355: Masked Autoencoders Are Scalable Vision Learners

jefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretrainingjefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretraining
cevesom156
 
multi modal transformers representation generation .pptx
multi modal transformers representation generation .pptxmulti modal transformers representation generation .pptx
multi modal transformers representation generation .pptx
siddharth1729
 
Image captioning
Image captioningImage captioning
Image captioning
Muhammad Zbeedat
 
Survey of Attention mechanism
Survey of Attention mechanismSurvey of Attention mechanism
Survey of Attention mechanism
SwatiNarkhede1
 
Survey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer VisionSurvey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer Vision
SwatiNarkhede1
 
Face Detection.pptx
Face Detection.pptxFace Detection.pptx
Face Detection.pptx
TorshaSett
 
Mirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image ProcessingMirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image Processing
MeetupDataScienceRoma
 
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
devismileyrockz
 
“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...
“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...
“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...
Edge AI and Vision Alliance
 
UNIT-4.pdf
UNIT-4.pdfUNIT-4.pdf
UNIT-4.pdf
NiharikaThakur32
 
UNIT-4.pdf
UNIT-4.pdfUNIT-4.pdf
UNIT-4.pdf
NiharikaThakur32
 
Deep learning seminar report
Deep learning seminar reportDeep learning seminar report
Deep learning seminar report
SKS
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
StampedeCon
 
UNIT-4.pptx
UNIT-4.pptxUNIT-4.pptx
UNIT-4.pptx
NiharikaThakur32
 
lec6a.ppt
lec6a.pptlec6a.ppt
lec6a.ppt
SaadMemon23
 
Intro to Deep Learning
Intro to Deep LearningIntro to Deep Learning
Intro to Deep Learning
Kushal Arora
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
CHENHuiMei
 
WT in IP.ppt
WT in IP.pptWT in IP.ppt
WT in IP.ppt
viveksingh19210115
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Sangmin Woo
 
An Introduction to Pre-training General Language Representations
An Introduction to Pre-training General Language RepresentationsAn Introduction to Pre-training General Language Representations
An Introduction to Pre-training General Language Representations
zperjaccico
 

Semelhante a PR-355: Masked Autoencoders Are Scalable Vision Learners (20)

jefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretrainingjefferson-mae Masked Autoencoders based Pretraining
jefferson-mae Masked Autoencoders based Pretraining
 
multi modal transformers representation generation .pptx
multi modal transformers representation generation .pptxmulti modal transformers representation generation .pptx
multi modal transformers representation generation .pptx
 
Image captioning
Image captioningImage captioning
Image captioning
 
Survey of Attention mechanism
Survey of Attention mechanismSurvey of Attention mechanism
Survey of Attention mechanism
 
Survey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer VisionSurvey of Attention mechanism & Use in Computer Vision
Survey of Attention mechanism & Use in Computer Vision
 
Face Detection.pptx
Face Detection.pptxFace Detection.pptx
Face Detection.pptx
 
Mirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image ProcessingMirko Lucchese - Deep Image Processing
Mirko Lucchese - Deep Image Processing
 
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
AUTOENCODER AND ITS TYPES , HOW ITS USED, APPLICATIONS , ADVANTAGES AND DISAD...
 
“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...
“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...
“High-fidelity Conversion of Floating-point Networks for Low-precision Infere...
 
UNIT-4.pdf
UNIT-4.pdfUNIT-4.pdf
UNIT-4.pdf
 
UNIT-4.pdf
UNIT-4.pdfUNIT-4.pdf
UNIT-4.pdf
 
Deep learning seminar report
Deep learning seminar reportDeep learning seminar report
Deep learning seminar report
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
UNIT-4.pptx
UNIT-4.pptxUNIT-4.pptx
UNIT-4.pptx
 
lec6a.ppt
lec6a.pptlec6a.ppt
lec6a.ppt
 
Intro to Deep Learning
Intro to Deep LearningIntro to Deep Learning
Intro to Deep Learning
 
深度學習在AOI的應用
深度學習在AOI的應用深度學習在AOI的應用
深度學習在AOI的應用
 
WT in IP.ppt
WT in IP.pptWT in IP.ppt
WT in IP.ppt
 
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
 
An Introduction to Pre-training General Language Representations
An Introduction to Pre-training General Language RepresentationsAn Introduction to Pre-training General Language Representations
An Introduction to Pre-training General Language Representations
 

Mais de Jinwon Lee

PR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020sPR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020s
Jinwon Lee
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
Jinwon Lee
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
Jinwon Lee
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
Jinwon Lee
 
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object DetectorPR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
Jinwon Lee
 
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
Jinwon Lee
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
Jinwon Lee
 
PR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object DetectionPR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object Detection
Jinwon Lee
 
PR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental ImprovementPR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental Improvement
Jinwon Lee
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
Jinwon Lee
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
Jinwon Lee
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Jinwon Lee
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
Jinwon Lee
 
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignPR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
Jinwon Lee
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
Jinwon Lee
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
Jinwon Lee
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
Jinwon Lee
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
Jinwon Lee
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 

Mais de Jinwon Lee (20)

PR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020sPR-366: A ConvNet for 2020s
PR-366: A ConvNet for 2020s
 
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme...
 
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi...
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
 
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object DetectorPR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector
 
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be...
 
PR243: Designing Network Design Spaces
PR243: Designing Network Design SpacesPR243: Designing Network Design Spaces
PR243: Designing Network Design Spaces
 
PR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object DetectionPR-217: EfficientDet: Scalable and Efficient Object Detection
PR-217: EfficientDet: Scalable and Efficient Object Detection
 
PR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental ImprovementPR-207: YOLOv3: An Incremental Improvement
PR-207: YOLOv3: An Incremental Improvement
 
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
PR-197: One ticket to win them all: generalizing lottery ticket initializatio...
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
 
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image RecognitionPR-155: Exploring Randomly Wired Neural Networks for Image Recognition
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition
 
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignPR-144: SqueezeNext: Hardware-Aware Neural Network Design
PR-144: SqueezeNext: Hardware-Aware Neural Network Design
 
PR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox DetectorPR-132: SSD: Single Shot MultiBox Detector
PR-132: SSD: Single Shot MultiBox Detector
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
 
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear BottlenecksPR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
PR-108: MobileNetV2: Inverted Residuals and Linear Bottlenecks
 
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksPR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
PR095: Modularity Matters: Learning Invariant Relational Reasoning Tasks
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 

Último

GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
flufftailshop
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 

Último (20)

GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 

PR-355: Masked Autoencoders Are Scalable Vision Learners

  • 1. Masked Autoencoders Are Scalable Vision Learners Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners” 14th November, 2021 PR12 Paper Review JinWon Lee
  • 2. Introduction • Deep learning has witnessed an explosion of architectures of continuously growing capability and capacity. • Aided by the rapid gains in hardware, models today can easily overfit one million images and begin to demand hundreds of millions of— often publicly inaccessible—labeled images. • This appetite for data has been successfully addressed in natural language processing (NLP) by self-supervised pretraining.
  • 3. Introduction • The solutions, based on autoregressive language modeling in GPT and masked autoencoding in BERT, are conceptually simple: they remove a portion of the data and learn to predict the removed content. • These methods now enable training of generalizable NLP models containing over one hundred billion parameters.
  • 4. Introduction • The idea of masked autoencoders, a form of more general denoising autoencoders, is natural and applicable in computer vision as well. • However, despite significant interest in this idea following the success of BERT, progress of autoencoding methods in vision lags behind NLP.
  • 5. What makes masked autoencoding different between vision and language? • Until recently, architectures were different. In vision, convolutional networks were dominant over the last decade. • Convolutions typically operate on regular grids and it is not straightforward to integrate ‘indicators’ such as mask tokens or positional embeddings into convolutional networks. • This architectural gap, however, has been addressed with the introduction of Vision Transformers (ViT) and should no longer present an obstacle.
  • 6. What makes masked autoencoding different between vision and language? • Information density is different between language and vision. • Languages are human-generated signals that are highly semantic and information-dense. • When training a model to predict only a few missing words per sentence, this task appears to induce sophisticated language understanding. • Images, on the contrary, are natural signals with heavy spatial redundancy—e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes.
  • 7. What makes masked autoencoding different between vision and language? • The autoencoder’s decoder, which maps the latent representation back to the input, plays a different role between reconstructing text and images. • In vision, the decoder reconstructs pixels, hence its output is of a lower semantic level than common recognition tasks. • This is in contrast to language, where the decoder predicts missing words that contain rich semantic information. • While in BERT the decoder can be trivial (an MLP), we found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representations.
  • 8. Related Work • Masked language modeling ▪ BERT and GPT, are highly successful methods for pre-training in NLP. ▪ These methods hold out a portion of the input sequence and train models to predict the missing content. ▪ These methods have been shown to scale excellently and a large abundance of evidence indicates that these pre-trained representations generalize well to various downstream tasks.
  • 9. Related Work • Autoencoding ▪ It has an encoder that maps an input to a latent representation and a decoder that reconstructs the input. ▪ Denoising autoencoders (DAE) are a class of autoencoders that corrupt an input signal and learn to reconstruct the original, uncorrupted signal. ▪ A series of methods can be thought of as a generalized DAE under different corruptions, e.g., masking pixels or removing color channels.
  • 10. Related Work • Masked image encoding ▪ The pioneering work of presents masking as a noise type in DAE. ▪ Context Encoder inpaints large missing regions using convolutional networks. ▪ Motivated by the success in NLP, related recent methods are based on Transformers. iGPT operates on sequences of pixels and predicts unknown pixels. The ViT paper studies masked patch prediction for self-supervised learning. Most recently, BEiT proposes to predict discrete tokens.
  • 11. BEiT: BERT Pre-Training of Image Transformers
  • 12. Related Work • Self-supervised learning(SSL) ▪ SSL approaches have seen significant interest in computer vision, often focusing on different pretext tasks for pre-training. ▪ Recently, contrastive learning has been popular, which models image similarity and dissimilarity (or only similarity) between two or more views. ▪ Contrastive and related methods strongly depend on data augmentation. ▪ Autoencoding pursues a conceptually different direction, and it exhibits different behaviors.
Approach
• The proposed masked autoencoder (MAE) is a simple autoencoding approach that reconstructs the original signal given its partial observation.
• Like all autoencoders, MAE has an encoder that maps the observed signal to a latent representation, and a decoder that reconstructs the original signal from the latent representation.
• Unlike classical autoencoders, the authors adopt an asymmetric design: the encoder operates only on the partial, observed signal (without mask tokens), and a lightweight decoder reconstructs the full signal from the latent representation and mask tokens.
MAE Architecture
• During pre-training, a large random subset of image patches (e.g., 75%) is masked out.
• The encoder is applied to the small subset of visible patches.
• Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels.
• After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images to produce representations for recognition tasks (the overall data flow is sketched below).
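To make the data flow concrete, a rough pre-training step is sketched below in PyTorch style. The mae object and its patchify / random_masking / forward_encoder / forward_decoder methods are hypothetical names for the stages described above, not the authors' released implementation.

import torch

def pretrain_step(mae, imgs, optimizer, mask_ratio=0.75):
    # mae: hypothetical module exposing the stages described above
    patches = mae.patchify(imgs)                                   # (B, N, patch_dim) pixel targets
    visible, mask, ids_restore = mae.random_masking(patches, mask_ratio)
    latent = mae.forward_encoder(visible)                          # encoder sees only ~25% of patches
    pred = mae.forward_decoder(latent, ids_restore)                # lightweight decoder, full token set
    loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()  # masked patches only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()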
MAE Details
• Masking
▪ Following ViT, an image is divided into regular non-overlapping patches. Then a subset of patches is sampled and the remaining ones are masked (i.e., removed).
▪ The sampling strategy is straightforward: random patches without replacement, following a uniform distribution (see the sketch below).
▪ Random sampling with a high masking ratio largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation from visible neighboring patches.
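A straightforward way to realize this sampling, assuming PyTorch tensors of shape (B, N, D), is to argsort per-sample uniform noise and keep the first (1 - mask_ratio) fraction of indices. This is a sketch, not necessarily the authors' exact code.

import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (B, N, D); keep a random (1 - mask_ratio) subset per sample
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)   # uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation (for unshuffling later)

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # binary mask in the original patch order: 1 = masked, 0 = visible
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore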
MAE Details
• MAE encoder
▪ The MAE encoder is a ViT, but applied only on visible, unmasked patches.
▪ Just as in a standard ViT, the MAE encoder embeds patches by a linear projection with added positional embeddings (illustrated below).
▪ However, this encoder only operates on a small subset (e.g., 25%) of the full set.
▪ This can reduce overall pre-training time by 3x or more and likewise reduce memory consumption, enabling MAE to scale easily to large models.
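For illustration, a minimal patch-embedding front end might look like the sketch below; the dimensions follow ViT-L/16, and the gather at the end is what restricts the encoder to the visible patches. This is an assumption-laden sketch, not the reference code.

import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    # ViT-style patchification: a stride-p convolution is equivalent to a linear
    # projection of each non-overlapping p x p patch
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, imgs, ids_keep):
        x = self.proj(imgs).flatten(2).transpose(1, 2)   # (B, N, embed_dim)
        x = x + self.pos_embed                            # add positional embeddings
        # the MAE encoder then processes only the visible subset (e.g., 25% of tokens)
        x = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        return x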
MAE Details
• MAE decoder
▪ The input to the MAE decoder is the full set of tokens, consisting of (i) encoded visible patches and (ii) mask tokens.
▪ Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted.
▪ Positional embeddings are added to all tokens in this full set (see the sketch below).
▪ The MAE decoder is only used during pre-training to perform the image reconstruction task, so the decoder architecture can be flexibly designed in a manner that is independent of the encoder design.
▪ MAE's default decoder has <10% computation per token vs. the encoder. With this asymmetric design, the full set of tokens is only processed by the lightweight decoder, which significantly reduces pre-training time.
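The decoder-side token handling could be sketched as follows. The widths, depth, and the use of nn.TransformerEncoderLayer are placeholders for "narrow and shallow blocks", not the exact architecture.

import torch
import torch.nn as nn

class MAEDecoderSketch(nn.Module):
    def __init__(self, enc_dim=1024, dec_dim=512, num_patches=196, patch_dim=16 * 16 * 3, depth=8):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))    # shared, learned vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dec_dim, nhead=16, dim_feedforward=4 * dec_dim,
                                       batch_first=True)
            for _ in range(depth)])
        self.pred = nn.Linear(dec_dim, patch_dim)                     # project back to pixels per patch

    def forward(self, latent, ids_restore):
        x = self.embed(latent)                                        # (B, N_visible, dec_dim)
        B, n_vis, D = x.shape
        n_mask = ids_restore.shape[1] - n_vis
        mask_tokens = self.mask_token.expand(B, n_mask, D)            # one per missing patch
        x = torch.cat([x, mask_tokens], dim=1)                        # full-length token list
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))  # unshuffle to patch order
        x = x + self.pos_embed                                        # positions for all tokens
        for blk in self.blocks:
            x = blk(x)
        return self.pred(x)                                           # (B, N, patch_dim)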
MAE Details
• Reconstruction target
▪ MAE reconstructs the input by predicting the pixel values for each masked patch.
▪ The loss function computes the mean squared error (MSE) between the reconstructed and original images in the pixel space. The loss is computed only on masked patches, similar to BERT (see the snippet below).
• Simple implementation
▪ First, generate a token for every input patch (by linear projection with an added positional embedding). Next, randomly shuffle the list of tokens and remove the last portion of the list, based on the masking ratio.
▪ After encoding, append a list of mask tokens to the list of encoded patches, and unshuffle this full list (inverting the random shuffle operation) to align all tokens with their targets.
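Putting the target and the mask together, the loss can be written as below; the patchified pixel targets and the mask convention (1 = masked) follow the sketches above.

import torch

def mae_loss(pred, target_patches, mask):
    # pred, target_patches: (B, N, patch_dim); mask: (B, N) with 1 = masked, 0 = visible
    per_patch_mse = ((pred - target_patches) ** 2).mean(dim=-1)   # MSE in pixel space
    return (per_patch_mse * mask).sum() / mask.sum()              # average over removed patches only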
ImageNet Experiments
• The authors do self-supervised pre-training on the ImageNet-1K (IN1K) training set.
• They then do supervised training to evaluate the representations with (i) end-to-end fine-tuning or (ii) linear probing (sketched below).
• Baseline: ViT-Large
▪ ViT-Large (ViT-L/16) is used as the backbone in the ablation study.
▪ ViT-L is very big and tends to overfit.
▪ It is nontrivial to train supervised ViT-L from scratch, and a good recipe with strong regularization is needed.
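The two evaluation protocols roughly correspond to the snippet below; the extra BatchNorm before the linear classifier is a common probing trick, and the feature dimension is an assumption for ViT-L.

import torch.nn as nn

def build_linear_probe(encoder, feat_dim=1024, num_classes=1000):
    # linear probing: freeze the pre-trained encoder, train only a linear classifier on top
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Sequential(nn.BatchNorm1d(feat_dim, affine=False),   # normalize the frozen features
                         nn.Linear(feat_dim, num_classes))
    return head

# End-to-end fine-tuning instead leaves requires_grad = True everywhere and relies on
# strong regularization, since a backbone as large as ViT-L easily overfits IN1K.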
Masking Ratio
• The optimal ratios are surprisingly high.
• 75% is good for both linear probing and fine-tuning.
• This is in contrast with BERT (15%) and also much higher than the ratios in related works in CV (20%~50%).
• Reasoning-like behavior is linked to the learning of useful representations.
• For linear probing, the accuracy increases steadily until the sweet point, but for fine-tuning, the results are less sensitive to the ratio.
• All fine-tuning results are better than training from scratch (82.5%).
Decoder Design
• A sufficiently deep decoder is important for linear probing. This can be explained by the gap between a pixel reconstruction task and a recognition task: the last several layers in an autoencoder are more specialized for reconstruction, but are less relevant for recognition.
• However, if fine-tuning is used, the last layers of the encoder can be tuned to adapt to the recognition task, and the decoder depth is less influential.
• Interestingly, MAE with a single-block decoder can perform strongly with fine-tuning (84.8%). Note that a single Transformer block is the minimal requirement to propagate information from visible tokens to mask tokens.
• Overall, the default MAE decoder is lightweight. It has 8 blocks and a width of 512-d, and only ~9% FLOPs per token vs. ViT-L (24 blocks, 1024-d); a rough check of this figure follows below.
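A back-of-the-envelope check of that per-token figure, assuming per-token Transformer cost scales roughly with depth x width^2 (attention over the sequence length and other terms are ignored, so this is only an order-of-magnitude estimate):

# rough per-token compute ratio of the default decoder vs. the ViT-L encoder
dec_blocks, dec_width = 8, 512
enc_blocks, enc_width = 24, 1024
ratio = (dec_blocks * dec_width ** 2) / (enc_blocks * enc_width ** 2)
print(f"decoder / encoder per-token cost ~ {ratio:.1%}")   # about 8.3%, roughly consistent with the ~9% above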
Mask Token
• If the encoder uses mask tokens, it performs worse.
• In this case, there is a gap between pre-training and deployment: the encoder sees a large portion of mask tokens in its input during pre-training, which do not exist in uncorrupted images. This gap may degrade accuracy in deployment.
Mask Token
• By skipping the mask token in the encoder, training computation is greatly reduced.
• Note that the speedup can be >4x for a masking ratio of 75%, partially because the self-attention complexity is quadratic (a rough estimate follows below).
• In addition, memory is greatly reduced, which can enable training even larger models or further speedups via large-batch training.
• The time and memory efficiency makes MAE favorable for training very large models.
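A rough way to see where the >4x comes from (an estimate only, ignoring the decoder and other overheads):

# with a 75% masking ratio the encoder keeps 25% of the tokens
keep = 0.25
linear_speedup = 1 / keep          # MLP / projection cost is linear in token count -> 4x
attention_speedup = 1 / keep ** 2  # self-attention cost is quadratic in token count -> 16x
print(linear_speedup, attention_speedup)
# the measured wall-clock speedup lands between these two extremes, hence "> 4x"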
Reconstruction Target
• Using pixels with normalization improves accuracy. This per-patch normalization enhances the contrast locally (see the sketch below).
• In another variant, the authors perform PCA in the patch space and use the largest PCA coefficients (96 here) as the target. Doing so degrades accuracy.
• The authors also compare an MAE variant that predicts tokens, the target used in BEiT. Specifically for this variant, the DALL·E pre-trained dVAE is used as the tokenizer, following BEiT.
• This tokenization improves fine-tuning accuracy vs. unnormalized pixels, but has no advantage vs. normalized pixels.
• The dVAE tokenizer requires one more pre-training stage, which may depend on extra data (250M images). The dVAE encoder is a large convolutional network (40% FLOPs of ViT-L) and adds nontrivial overhead. Using pixels does not suffer from these problems.
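The per-patch normalized target can be computed as sketched below; the epsilon is an assumption for numerical stability.

import torch

def normalized_pixel_target(patches, eps=1e-6):
    # standardize each patch by its own mean and variance, which locally enhances contrast
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps) ** 0.5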
Data Augmentation
• MAE works well using cropping-only augmentation, either fixed-size or random-size (both with random horizontal flipping); an illustrative pipeline is given below.
• Adding color jittering degrades the results.
• Surprisingly, MAE behaves decently even when using no data augmentation (only center-crop, no flipping). This property is dramatically different from contrastive learning and related methods, which heavily rely on data augmentation.
• In MAE, the role of data augmentation is mainly performed by random masking. The masks are different for each iteration, so they generate new training samples regardless of data augmentation.
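One plausible cropping-only pipeline, written with torchvision for illustration; the crop scale range and normalization statistics are assumptions, not the authors' exact recipe.

import torchvision.transforms as T

pretrain_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),   # random-size crop
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])
# no color jittering; random masking itself plays most of the augmentation role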
Mask Sampling Strategy
• MAE with block-wise masking works reasonably well at a ratio of 50%, but degrades at a ratio of 75%.
• Grid-wise sampling regularly keeps one of every four patches (sketched below); this is an easier task and has lower training loss. The reconstruction is sharper, but the representation quality is lower.
• Simple random sampling works best for MAE. It allows for a higher masking ratio, which provides a greater speedup benefit.
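Grid-wise sampling, for instance, can be written as a fixed pattern rather than random indices; the sketch below keeps the top-left patch of each 2x2 cell, which is a hypothetical but representative layout.

import torch

def grid_masking(patches, grid_h, grid_w):
    # regularly keep one of every four patches; 1 = masked in the returned mask
    B, N, D = patches.shape
    assert N == grid_h * grid_w
    keep = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    keep[0::2, 0::2] = True                                    # 25% of patches kept
    ids_keep = keep.flatten().nonzero(as_tuple=True)[0]
    visible = patches[:, ids_keep, :]
    mask = (~keep.flatten()).float().unsqueeze(0).expand(B, -1)
    return visible, mask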
Training Schedule
• The accuracy improves steadily with longer training. Indeed, saturation of linear probing accuracy has not been observed even at 1600 epochs.
• This behavior is unlike contrastive learning methods; e.g., MoCo v3 saturates at 300 epochs for ViT-L.
• Note that the MAE encoder only sees 25% of patches per epoch, while in contrastive learning the encoder sees 200% (two crops) or even more (multi-crop) patches per epoch.
Comparisons with Self-supervised Methods
• For ViT-B, all methods perform closely.
• For ViT-L, the gaps among methods are bigger, suggesting that a challenge for bigger models is to reduce overfitting.
• MAE can scale up easily and has shown steady improvement from bigger models.
• By fine-tuning with a 448 image size, MAE achieves 87.8% accuracy using only IN1K data. The previous best accuracy among all methods using only IN1K data is 87.1% (512 size), based on advanced networks.
• Compared with BEiT, MAE is more accurate while being simpler and faster.
Comparisons with Supervised Pre-training
• In the original ViT paper, ViT-L degrades when trained on IN1K. The authors' improved supervised recipe works better for training from scratch, but the accuracy saturates.
• MAE pre-training, using only IN1K, can generalize better: the gain over training from scratch is bigger for higher-capacity models.
• It follows a trend similar to the JFT-300M supervised pre-training. This comparison shows that MAE can help scale up model sizes.
Partial Fine-tuning
• Table 1 shows that linear probing and fine-tuning results are largely uncorrelated.
• Linear probing has been a popular protocol in the past few years; however, it misses the opportunity of pursuing strong but non-linear features, which is indeed a strength of deep learning.
• As a middle ground, the authors study a partial fine-tuning protocol: fine-tune the last several layers while freezing the others (see the sketch below).
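The protocol can be expressed as a small helper; the .blocks and .head attribute names assume a typical ViT implementation and are hypothetical here.

def set_partial_finetune(vit, num_trainable_blocks):
    # freeze everything, then unfreeze only the last few Transformer blocks and the head
    for p in vit.parameters():
        p.requires_grad = False
    for blk in vit.blocks[len(vit.blocks) - num_trainable_blocks:]:
        for p in blk.parameters():
            p.requires_grad = True
    for p in vit.head.parameters():
        p.requires_grad = True

# num_trainable_blocks = 0 reduces to linear probing (only the head is trained);
# setting it to the full depth recovers end-to-end fine-tuning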
Partial Fine-tuning
• Notably, fine-tuning only one Transformer block boosts the accuracy significantly from 73.5% to 81.0%. Moreover, if we fine-tune only "half" of the last block (i.e., its MLP sub-block), we get 79.1%, much better than linear probing.
• Compared with MoCo v3, a contrastive method, MoCo v3 has higher linear probing accuracy than MAE; however, all of its partial fine-tuning results are worse than MAE's.
• These results show that the MAE representations are less linearly separable, but they are stronger non-linear features and perform well when a non-linear head is tuned. These observations suggest that linear separability is not the sole metric for evaluating representation quality.
Transfer Learning Experiments
• Object detection and instance segmentation
▪ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted for use with FPN.
• Semantic segmentation
▪ Experiments on ADE20K use UperNet, following the code in BEiT.
Transfer Learning Experiments
• Pixels vs. tokens
▪ While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to just using normalized pixels across all tasks and models. This again shows that tokenization is not necessary for MAE.
Discussion and Conclusion
• Simple algorithms that scale well are the core of deep learning.
• In NLP, simple self-supervised learning methods enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised despite progress in self-supervised learning.
• Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP.
Discussion and Conclusion
• On the other hand, images and languages are signals of a different nature, and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words.
• MAE removes random patches that most likely do not form semantic segments, and it reconstructs pixels, which are not semantic entities. Nevertheless, MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics.
• The authors hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE.