Visual Transformers
Kwanghee Choi (Jonas)
Table of Contents
● Preliminary
○ Key, Value, Query, Attention
○ Pooling
○ Multi-head Attention
○ Unsupervised Representation Learning
○ Syntactic Knowledge
● State-of-the-art Papers
○ Generative Pretraining from Pixels (ICML 2020)
○ An Image is Worth 16x16 Words (ICLR 2021)
○ End-to-End Object Detection with Transformers (ECCV 2020)
○ Additional Works
Key, Value, Query, Attention
● Problem: Given a set of data points (xᵢ, yᵢ), find the unknown y for a new query x.
● Simplest approach: ignore x and average all the yᵢ.
● A bit more complicated approach: Watson-Nadaraya Estimator (1964)
● Key-value pairs (xᵢ, yᵢ)
● Query x
● Attention ⍺
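Both estimators appeared as figures on the original slide; a reconstruction in standard notation, following the cited Smola tutorial (the kernel K, e.g. a Gaussian, is the only assumed ingredient):

```latex
% Simplest approach: ignore the query x and average all observed values.
\hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

% Watson-Nadaraya: weight each value y_i by how similar its key x_i is
% to the query x, measured by a kernel K.
\hat{y} = \sum_{i=1}^{n} \alpha(x, x_i)\, y_i,
\qquad
\alpha(x, x_i) = \frac{K(x, x_i)}{\sum_{j=1}^{n} K(x, x_j)}
```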
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
Pooling
● Nonlinearities ⍴, ɸ; learnable weight w
● Deep Sets (Zaheer et al. 2017)
○ Permutation invariant
● Word2Vec (Mikolov et al. 2013)
○ Embed each word in a sentence
● Attention Weighting (Wang et al. 2016)
○ The attention ⍺ depends on the query x, i.e. the context
● Iterative Attention Pooling (Yang et al. 2016)
○ Repeatedly update the internal state qₜ
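The pooling formulas were figures on the original slide; a reconstruction following the same tutorial, using the symbols above (X is the input set, q the query/state):

```latex
% Deep Sets: summing over the set makes f permutation invariant.
f(X) = \rho\Big( \sum_{x \in X} \phi(x) \Big)

% Attention weighting: the attention \alpha depends on a query q.
f(X, q) = \rho\Big( \sum_{x \in X} \alpha(x, q)\, \phi(x) \Big)

% Iterative attention pooling: repeatedly refine the internal state q_t.
q_{t+1} = \rho\Big( \sum_{x \in X} \alpha(x, q_t)\, \phi(x) \Big)
```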
Reference. Attention in Deep Learning (Alex Smola, ICML 2019 Tutorials, http://alex.smola.org/talks/ICML19-attention.pdf )
Multi-head Attention
● Attention module
○ Softmax acts as the attention function.
○ The dot product of Q and K acts as a similarity measure.
○ sqrt(dₖ): the standard deviation of the dot product when the entries of Q, K ~ N(0, 1)
● Multi-head Attention
○ A single head limits the ability to focus on specific positions.
○ Multiple heads give attention layers different representation subspaces.
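A minimal NumPy sketch of the mechanism above (per Vaswani et al.); the projection matrices W_q, W_k, W_v, W_o are assumed given:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # dot product as similarity
    return softmax(scores) @ V                      # softmax as attention weights

def multi_head(x, W_q, W_k, W_v, W_o, n_heads):
    """Split the model dimension into n_heads subspaces, attend in each, concat."""
    n, d = x.shape
    split = lambda M: M.reshape(n, n_heads, d // n_heads).transpose(1, 0, 2)
    heads = attention(split(x @ W_q), split(x @ W_k), split(x @ W_v))
    return heads.transpose(1, 0, 2).reshape(n, d) @ W_o
```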
Attention Is All You Need (Vaswani et al. NeurIPS 2017)
Unsupervised Representation Learning
● Input sequence x = (x₁, x₂, …)
● Autoregressive (AR)
○ e.g. ELMo, GPT
○ No bidirectional context.
○ ELMo: needs to train forward/backward contexts separately.
● Auto-Encoding (AE)
○ Corrupted input x’ = (x₁, x₂, …, [MASK], …)
○ e.g. BERT
○ Bidirectional self-attention
○ Input distribution differs from fine-tuning, due to the corruption
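The two objectives side by side, in the standard notation used in the XLNet discussion cited below (x’ is the corrupted input):

```latex
% Autoregressive (AR): left-to-right factorization, no bidirectional context.
\max_\theta \; \sum_{t} \log p_\theta(x_t \mid x_{<t})

% Auto-encoding (AE, BERT-style): reconstruct the masked tokens from the
% corrupted input x'; bidirectional, but x' is never seen at fine-tuning time.
\max_\theta \; \sum_{t \in \mathrm{masked}} \log p_\theta(x_t \mid x')
```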
Understanding XLNet https://www.borealisai.com/en/blog/understanding-xlnet
Syntactic Knowledge
● BERT representations are hierarchical rather than linear.
○ Open Sesame: Getting Inside BERT’s Linguistic Knowledge (Lin et al. ACLW 2019)
● BERT “naturally” learns some syntactic information, although it is not very similar to annotated linguistic resources.
○ Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT (Wu et al. ACL 2020)
A Primer in BERTology: What we know about how BERT works (Rogers et al. TACL 2020)
Generative Pretraining from Pixels
ICML 2020, OpenAI
Towards a general “image” model
● Just as a general language model can generate coherent text, Image GPT can generate coherent images.
● “Analysis by synthesis” suggests that a model that learns to generate images will also come to know about object categories.
● Generative sequence modeling is a universal unsupervised learning algorithm.
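A minimal sketch of the setup: an image becomes a 1-D token sequence, and pre-training is ordinary next-token prediction. The 9-bit (512-entry) color palette follows the paper; here `palette` is assumed to be a precomputed (512, 3) array of cluster centers:

```python
import numpy as np

def image_to_tokens(image, palette):
    """Flatten an (H, W, 3) image into a token sequence in raster order,
    quantizing each pixel to its nearest palette color."""
    pixels = image.reshape(-1, 3).astype(np.float32)            # raster order
    dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)                                     # one token per pixel

# Pre-training objective: the language-modeling loss applied to pixels,
#   L = -sum_t log p(token_t | token_<t),
# so a GPT-style decoder trains on images without any labels.
```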
Image GPT (https://openai.com/blog/image-gpt/)
Approach
Generative Pretraining from Pixels (Chen et al. ICML 2020)
What representation works best?
● In supervised pre-training, representation quality tends to increase
monotonically with depth, but with generative pre-training, it is not
obvious whether a task like pixel prediction is relevant to image
classification.
● Representations first improve as a function of depth, and then,
starting around the middle layer, begin to deteriorate.
○ In the first phase, each position gathers information from its surrounding context in
order to build a more global image representation.
○ In the second phase, this contextualized input is used to solve the conditional next
pixel prediction task.
○ This could resemble the behavior of encoder-decoder architectures, but learned
within a monolithic architecture via a pre-training objective.
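A hedged sketch of how “representation quality at a given depth” is typically probed: freeze the pre-trained model, average-pool one layer’s features, and fit a linear classifier. `model.features_at_layer` is a hypothetical accessor, not the paper’s API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_quality(model, layer, X_train, y_train, X_test, y_test):
    # Hypothetical: features_at_layer(x, layer) -> (seq_len, d) activations.
    f = lambda X: np.stack([model.features_at_layer(x, layer).mean(axis=0) for x in X])
    clf = LogisticRegression(max_iter=1000).fit(f(X_train), y_train)
    return clf.score(f(X_test), y_test)  # probe accuracy ~ representation quality
```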
Generative Pretraining from Pixels (Chen et al. ICML 2020)
Performance on CIFAR dataset
● We find that both increasing the
scale of our models and training for
more iterations result in better
generative performance, which
directly translates into better
feature quality.
● Generative models produce much
better features than BERT models
after pre-training, but BERT
models catch up after fine-tuning.
Generative Pretraining from Pixels (Chen et al. ICML 2020)
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale
ICLR 2021, Google
When do Transformers work?
● When trained on mid-sized datasets (e.g. ImageNet), Transformers yield modest accuracies, a few percent below ResNets of comparable size.
● However, large-scale training (14M-300M images) trumps the inductive biases of CNNs, such as translation invariance and locality.
● A naive application of self-attention to images would require each pixel to attend to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes.
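ViT’s patch embedding sidesteps exactly this scaling problem: a 224×224 image becomes 14×14 = 196 patch tokens rather than ~50K pixel tokens. A minimal NumPy sketch, with the projection W_e, class token, and position embeddings assumed given:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return x  # (N, patch*patch*C): each 16x16 patch is one "word"

def embed(image, W_e, cls_token, pos_emb, patch=16):
    """Linearly project patches, prepend the [class] token, add positions."""
    tokens = patchify(image, patch) @ W_e                  # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens])  # (N+1, D)
    return tokens + pos_emb                                # input to the Transformer
```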
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Model overview
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Performance
With self-supervised pre-training (masked patch prediction), our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
Interpreting the Results
● Positional embeddings
○ We speculate that learning to represent spatial relations at this resolution (14 x 14) is equally easy for the different strategies.
○ Closer patches tend to have more similar position embeddings.
○ Row-column structure & sinusoidal structure appear.
● Self-attention
○ “Attention distance” is analogous to “receptive field size”.
○ Highly localized attention may serve a similar function as early convolutional layers in CNNs.
○ The model attends to image regions that are semantically relevant for classification.
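A small sketch of the “attention distance” statistic, assuming an (N, N) attention matrix over patches whose 2-D grid coordinates are given as `coords` of shape (N, 2):

```python
import numpy as np

def attention_distance(attn, coords):
    """Mean spatial distance between each query patch and the patches it
    attends to, weighted by the attention weights (rows of attn sum to 1)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # (N, N)
    return (attn * d).sum(axis=-1).mean()
```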
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al. ICLR 2021)
End-to-End Object Detection
with Transformers
ECCV 2020, Facebook
End-to-end object detection
Object detection as a direct set prediction problem.
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Removing NMS
● A conventional CNN learns a 2D representation, supplemented with positional encodings
● 100 learned positional embeddings serve as object queries
● Global reasoning using pairwise relations
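Why this removes NMS: each object query is matched one-to-one to a ground-truth object (or to “no object”) with the Hungarian algorithm, so duplicate predictions are penalized during training rather than suppressed afterwards. A simplified matching sketch (DETR’s full cost also includes a GIoU term):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """cost[i, j] = cost of assigning prediction i to ground-truth j."""
    l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cls = -pred_logits[:, gt_labels]              # higher class score -> lower cost
    rows, cols = linear_sum_assignment(l1 + cls)  # optimal bipartite matching
    return rows, cols  # losses are applied on the matched pairs only
```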
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Encoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Decoder’s attention mechanism in action
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Object Detection
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Performance in Panoptic Segmentation
End-to-End Object Detection with Transformers (Carion et al. ECCV 2020)
Additional Works
Notable Extensions
● Training data-efficient image transformers & distillation through attention (Touvron et al. Arxiv 2021)
○ Adds a distillation token to ViT; using only the classification token doesn’t help much. (See the loss sketch after this list.)
○ Soft distillation (teacher model’s softmax output) and hard distillation (teacher model’s argmax, with label smoothing).
○ Surpasses SOTA yet again.
● DALL·E: Creating Images from Text (Ramesh et al. 2021)
○ A decoder-only transformer that receives both the text and the image as a single stream of tokens (text: 256, image: 1024) and models all of them autoregressively.
○ Creates images from text captions for a wide range of concepts expressible in natural language.
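A hedged sketch of the two distillation losses just described (the temperature is illustrative, and DeiT’s label smoothing is omitted):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def soft_distillation(student_logits, teacher_logits, tau=3.0):
    """KL divergence between temperature-softened teacher and student."""
    p_t, p_s = softmax(teacher_logits / tau), softmax(student_logits / tau)
    return (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean() * tau ** 2

def hard_distillation(student_logits, teacher_logits):
    """Cross-entropy against the teacher's argmax as a pseudo-label."""
    pseudo = teacher_logits.argmax(-1)
    logp_s = np.log(softmax(student_logits))
    return -logp_s[np.arange(len(pseudo)), pseudo].mean()
```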
Task-specific: Object Detection
● End-to-End Object Detection with Adaptive Clustering Transformer (Zheng et al. Arxiv 2020)
○ ACT clusters the query features adaptively using Locality-Sensitive Hashing (LSH) and approximates the query-key interaction with a prototype-key interaction.
○ ACT can replace the original self-attention module in DETR without degrading the performance of the pre-trained DETR model.
● Deformable DETR: Deformable Transformers for End-to-End Object Detection (Zhu et al. ICLR 2021)
○ Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs.
○ Deformable attention module: attends only to a few prominent feature-map pixels and aggregates multi-scale features.
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Object Detection
● UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (Dai et al. Arxiv 2020)
○ Proposes a pretext task named random query patch detection to pre-train DETR (UP-DETR) for object detection without supervision.
● Rethinking Transformer-based Set Prediction for Object Detection (Sun et al. Arxiv 2020)
○ Encoder-only DETR significantly accelerates training, especially for small-object detection, as it removes cross-attention.
○ Generates features for the transformer encoder with FCOS (Fully Convolutional One-Stage object detector) or R-CNN.
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Task-specific: Segmentation
● MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers (Wang et al. Arxiv 2020)
○ Infers masks and classes directly, without hand-coded priors like object boxes.
○ A dual-path transformer enables CNNs to read and write a global memory at any layer.
● End-to-End Video Instance Segmentation with Transformers (Wang et al. Arxiv 2020)
○ Three-dimensional (temporal, horizontal, and vertical) positional encoding
○ Instance sequence matching strategy: the loss is applied across different time steps
A Survey on Visual Transformer (Han et al. Arxiv 2021)
Additional Tasks
● Learning Joint Spatial-Temporal Transformations for Video
Inpainting (Zeng et al. ECCV 2020)
● End-to-End Dense Video Captioning with Masked Transformer (Zhou
et al. CVPR 2018)
● Hand-Transformer: Non-Autoregressive Structured Modeling for 3D
Hand Pose Estimation (Huang et al. ECCV 2020)
● Taming Transformers for High-Resolution Image Synthesis (Esser et
al. Arxiv 2020)
● Pre-Trained Image Processing Transformer (Chen et al. Arxiv 2020)
○ ImageNet pre-training for image denoising/super-resolution
A Survey on Visual Transformer (Han et al. Arxiv 2021)
