1. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
Anonymous (ICLR 2021 under review)
Yonsei University Severance Hospital CCIDS
Choi Dongmin
2. Abstract
• Transformer
- the de-facto standard architecture for NLP tasks
• Convolutional Networks in Vision
- attention is applied in conjunction with CNNs, or replaces certain
CNN components while keeping their overall structure in place
• Transformer in Computer Vision
- a pure transformer can perform very well on image classification tasks
when applied directly to sequences of image patches
- achieves S.O.T.A. results with substantially fewer computational
resources when pre-trained on large datasets
3. Introduction
Vaswani et al. Attention Is All You Need. NIPS 2017
Transformer : self-attention-based architecture (e.g., BERT)
The dominant approach in NLP : pre-training on a large text corpus
and then fine-tuning on a smaller task-specific dataset
4. Introduction
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Wang et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. ECCV 2020
Self-attention in computer vision inspired by NLP : DETR, Axial-DeepLab
However, classic ResNet-like architectures are still S.O.T.A.
5. Introduction
• Applying a Transformer Directly to Images
- with the fewest possible modifications
- split the image into patches and provide the sequence of linear
embeddings of these patches as input (see the sketch after this list)
- image patches = tokens (words) in NLP
• Small Scale Training
- achieves accuracies below ResNets of comparable size
- Transformers lack some inductive biases inherent to CNNs
(such as translation equivariance and locality)
• Large Scale Training
- large-scale training trumps (surpasses) inductive bias
- excellent results when pre-trained at sufficient scale and transferred
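The patch-to-token step above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed shapes (224x224 RGB input, 16x16 patches), not the exact code from the paper.

import torch

def image_to_patch_tokens(img, patch_size=16):
    """Split an image batch into flattened patches ("visual words").

    img: (B, C, H, W) tensor; H and W are assumed divisible by patch_size.
    returns: (B, N, patch_size*patch_size*C) with N = (H/patch_size)*(W/patch_size).
    """
    B, C, H, W = img.shape
    # unfold extracts non-overlapping patch_size x patch_size blocks
    patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, N, p*p*C)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    return patches

# Example: a 224x224 RGB image yields 14*14 = 196 patch tokens of dimension 768
tokens = image_to_patch_tokens(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])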
6. Related Works
Transformer
Vaswani et al. Attention Is All You Need. NIPS 2017
Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL 2019
Radford et al. Improving language understanding with unsupervised learning. Technical Report 2018
- Standard model in NLP tasks
- Consists only of attention modules, without RNNs
- Encoder-decoder structure
- Requires large-scale datasets and high computational cost
- Pre-training and fine-tuning approaches : BERT & GPT
8. Method
Image x ∈ R^{H×W×C} → a sequence of flattened 2D patches x_p ∈ R^{N×(P²·C)}
A trainable linear projection E maps each flattened patch to D dimensions: x_p E ∈ R^{N×D}
(* because the Transformer uses a constant width, the model dimension D, through all of its layers)
Learnable position embedding E_pos ∈ R^{(N+1)×D}
(* to retain positional information)
Input to the L-layer Transformer encoder: z_0 = [x_class; x_p¹E; … ; x_p^N E] + E_pos
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L99-L111
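Below is a minimal sketch of the embedding step described above (linear projection E, prepended class token, added position embedding E_pos), in the spirit of the linked vit-pytorch code; the module name and hyperparameters (N = 196, D = 768) are illustrative assumptions.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Patches -> tokens: linear projection E, prepended class token, added E_pos."""

    def __init__(self, num_patches=196, patch_dim=768, dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                       # trainable projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # learnable [class] token
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos in R^{(N+1)xD}

    def forward(self, x_p):                  # x_p: (B, N, P*P*C) flattened patches
        B = x_p.shape[0]
        tokens = self.proj(x_p)              # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        z0 = torch.cat([cls, tokens], dim=1) + self.pos_emb   # z_0 = [x_class; x_p E] + E_pos
        return z0                            # (B, N+1, D), fed to the Transformer encoder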
14. Method
Hybrid Architecture
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Flattened intermediate feature maps of a ResNet are used as the
input sequence instead of raw image patches, similar to DETR
(see the sketch below)
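A minimal sketch of the hybrid variant under stated assumptions: a truncated torchvision ResNet provides a spatial feature map whose flattened positions become the input tokens. The choice of ResNet-50 cut after layer3 and the 1x1 projection are illustrative, not the paper's exact configuration.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridTokens(nn.Module):
    """ResNet feature-map positions as input tokens for the Transformer (hybrid variant)."""

    def __init__(self, dim=768):
        super().__init__()
        backbone = resnet50(weights=None)
        # keep everything up to (and including) layer3 -> output (B, 1024, H/16, W/16)
        self.stem = nn.Sequential(*list(backbone.children())[:-3])
        self.proj = nn.Conv2d(1024, dim, kernel_size=1)    # project channels to model width D

    def forward(self, img):                  # img: (B, 3, H, W)
        fmap = self.proj(self.stem(img))     # (B, D, h, w)
        return fmap.flatten(2).transpose(1, 2)   # (B, h*w, D) token sequence

tokens = HybridTokens()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 196, 768])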
15. Method
Fine-tuning and Higher Resolution
Carion et al. End-to-End Object Detection with Transformers. ECCV 2020
Remove the pre-trained prediction head and attach a zero-initialized
D × K feedforward layer (K = the number of downstream classes; see the sketch below)
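A minimal sketch of the head replacement for fine-tuning. The vit_encoder argument is a hypothetical module assumed to output a D-dimensional representation per image; the zero-initialized D × K linear head is the piece described above.

import torch.nn as nn

def attach_finetune_head(vit_encoder, dim=768, num_classes=10):
    """Replace the pre-trained prediction head with a zero-initialized D x K linear layer."""
    head = nn.Linear(dim, num_classes)       # K = num_classes downstream classes
    nn.init.zeros_(head.weight)              # zero-initialized, as described above
    nn.init.zeros_(head.bias)
    return nn.Sequential(vit_encoder, head)  # encoder output (B, D) -> logits (B, K)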
17. Experiments
• Training & Fine-tuning
< Pre-training >
- Adam with β1 = 0.9, β2 = 0.999 (see the optimizer sketch below)
- Batch size 4,096
- Weight decay 0.1 (high weight decay is useful for transfer models)
- Linear learning rate warmup and decay
< Fine-tuning >
- SGD with momentum, batch size 512
• Metrics
- Few-shot accuracy (for fast on-the-fly evaluation)
- Fine-tuning accuracy
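A minimal sketch of the pre-training optimizer setup listed above (Adam with β1 = 0.9, β2 = 0.999, weight decay 0.1, linear warmup followed by linear decay); the base learning rate, warmup length, and total step count below are illustrative assumptions.

import torch
from torch.optim.lr_scheduler import LambdaLR

def build_pretraining_optimizer(model, base_lr=1e-3, warmup_steps=10_000, total_steps=100_000):
    """Adam + weight decay 0.1 with linear warmup, then linear decay to zero."""
    opt = torch.optim.Adam(model.parameters(), lr=base_lr,
                           betas=(0.9, 0.999), weight_decay=0.1)

    def lr_lambda(step):
        if step < warmup_steps:              # linear warmup
            return step / max(1, warmup_steps)
        # linear decay over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return opt, LambdaLR(opt, lr_lambda)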
18. Experiments
• Comparison to State of the Art
Kolesnikov et al. Big Transfer (BiT): General Visual Representation Learning. ECCV 2020
Xie et al. Self-training with noisy student improves imagenet classification. CVPR 2020
* BiT-L : Big Transfer, which performs supervised transfer learning with large ResNets
* Noisy Student : a large EfficientNet trained using semi-supervised learning
22. Experiments
• Inspecting Vision Transformer
- the top principal components of the learned embedding filters resemble
plausible basis functions for a low-dimensional representation of the
fine structure within each patch (see the sketch below)
- mean attention distance is analogous to receptive field size in CNNs
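A minimal sketch of how the filter visualization above could be reproduced: PCA on the learned patch-embedding projection, with components reshaped back to patch-sized images. The weight layout (nn.Linear weight of shape D × P²·C, channel-first patch flattening) is an assumption, not taken from the paper.

import torch

def embedding_filter_components(proj_weight, patch_size=16, channels=3, k=28):
    """Top-k principal components of the learned patch-embedding filters.

    proj_weight: (D, P*P*C) weight of the linear projection E (nn.Linear layout).
    returns: (k, C, P, P) components, viewable as small patch-sized images.
    """
    # pca_lowrank centers the data and returns the leading right singular vectors
    _, _, v = torch.pca_lowrank(proj_weight, q=k)   # v: (P*P*C, k)
    return v.T.reshape(k, channels, patch_size, patch_size)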
23. Conclusion
• Application of Transformers to Image Recognition
- no image-specific inductive biases in the architecture
- interpret an image as a sequence of patches and process it with a standard
Transformer encoder
- this simple, yet scalable, strategy works surprisingly well
- matches or exceeds the S.O.T.A. while being relatively cheap to pre-train
• Many Challenges Remain
- other computer vision tasks, such as detection and segmentation
- further scaling ViT
24. Q&A
• ViT for Segmentation
• Fine-tuning on Grayscale Dataset