- Masked Autoencoders Are Scalable Vision Learners presents a new self-supervised learning method called Masked Autoencoder (MAE) for computer vision.
- MAE works by masking random patches of input images, encoding only the visible patches, and decoding to reconstruct the full image. This forces the model to learn visual representations from incomplete views of images (see the masking sketch after these bullets).
- Experiments on ImageNet show that MAE outperforms both supervised training from scratch and other self-supervised methods, and that it scales effectively to larger models. MAE representations also transfer well to downstream tasks such as object detection, instance segmentation, and semantic segmentation.
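To make the masking step concrete, here is a minimal, hedged sketch of MAE-style random masking in Python. It is an illustration under the paper's stated 75% masking ratio, not the official implementation; the array shapes and the `random_masking` helper are assumptions for the example.

```python
# Minimal sketch of MAE-style random masking (not the official implementation).
# An image is split into patches; a random 75% subset is dropped and only the
# visible 25% is passed to the encoder.
import numpy as np

def random_masking(patches: np.ndarray, mask_ratio: float = 0.75, seed: int = 0):
    """patches: (num_patches, patch_dim). Returns visible patches and indices."""
    rng = np.random.default_rng(seed)
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = rng.permutation(num_patches)
    keep_idx = np.sort(shuffled[:num_keep])   # indices of visible patches
    mask_idx = np.sort(shuffled[num_keep:])   # indices the decoder must reconstruct
    return patches[keep_idx], keep_idx, mask_idx

# Example: a 224x224 image with 16x16 patches gives 196 patches of dim 768.
patches = np.random.randn(196, 16 * 16 * 3)
visible, keep_idx, mask_idx = random_masking(patches)
print(visible.shape)  # (49, 768): only ~25% of patches go through the encoder
```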
Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAI (Lviv Startup Club)
Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAI
AI & BigData Online Day 2021
Website - https://aiconf.com.ua/
Youtube - https://www.youtube.com/startuplviv
FB - https://www.facebook.com/aiconf
Optical Character Recognition: from data driven to self-supervised learning (...) (LianWenJin1)
- The document discusses optical character recognition (OCR) and scene text recognition, focusing on recent trends from strongly supervised to weakly supervised and self-supervised learning methods.
- It introduces the SPTS method for single-point text spotting which only requires point annotations rather than bounding boxes, achieving end-to-end text spotting with weaker supervision.
- It also discusses using data synthesis and style transfer to generate additional training data for OCR, as well as weakly supervised and self-supervised learning methods that require less labeled data.
BERT is a deeply bidirectional, unsupervised language representation model pre-trained using only plain text. It is the first model to use a bidirectional Transformer for pre-training. BERT learns representations from both left and right contexts within text, unlike previous models like ELMo which use independently trained left-to-right and right-to-left LSTMs. BERT was pre-trained on two large text corpora using masked language modeling and next sentence prediction tasks. It establishes new state-of-the-art results on a wide range of natural language understanding benchmarks.
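As a rough illustration of the masked-language-modeling objective described above, the sketch below applies BERT's published 15%/80-10-10 corruption scheme to a toy token sequence. The token IDs and the `mask_tokens` helper are illustrative assumptions, not part of any real tokenizer.

```python
# Hedged sketch of BERT's masked-language-modeling corruption scheme:
# 15% of tokens are selected; of those, 80% become [MASK], 10% a random token,
# and 10% stay unchanged. Token IDs here are illustrative, not from a real vocab.
import random

def mask_tokens(token_ids, mask_id, vocab_size, select_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignore in loss
    for i, tok in enumerate(token_ids):
        if rng.random() < select_prob:
            labels[i] = tok                          # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

inputs, labels = mask_tokens([101, 2023, 2003, 1037, 3231, 102], mask_id=103, vocab_size=30522)
```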
The document summarizes the Transformer neural network model proposed in the paper "Attention is All You Need". The Transformer uses self-attention mechanisms rather than recurrent or convolutional layers. It achieves state-of-the-art results in machine translation by allowing the model to jointly attend to information from different representation subspaces. The key components of the Transformer include multi-head self-attention layers in the encoder and masked multi-head self-attention layers in the decoder. Self-attention allows the model to learn long-range dependencies in sequence data more effectively than RNNs.
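Since the summary above centers on self-attention, here is a minimal numpy sketch of scaled dot-product attention, the core operation of the Transformer: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The shapes and helper name are assumptions for the example.

```python
# A minimal numpy sketch of scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

Q = np.random.randn(4, 8)   # 4 query positions, d_k = 8
K = np.random.randn(6, 8)   # 6 key/value positions
V = np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```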
PR-231: A Simple Framework for Contrastive Learning of Visual Representations (Jinwon Lee)
The document presents SimCLR, a framework for contrastive learning of visual representations using simple data augmentation. Key aspects of SimCLR include random cropping and color distortion to generate positive pairs for the contrastive loss, a nonlinear projection head on top of the representation, and large batch sizes. Evaluations show that SimCLR learns representations that surpass prior self-supervised methods on downstream tasks and achieve state-of-the-art results with only view augmentation and a contrastive loss (sketched below).
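A hedged PyTorch sketch of the NT-Xent contrastive loss used in SimCLR follows; the temperature value and batch shapes are illustrative assumptions.

```python
# Hedged PyTorch sketch of SimCLR's NT-Xent contrastive loss. z1 and z2 are
# the projection-head outputs for two augmented views of the same batch.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / temperature                        # cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    # the positive for sample i is its other view: i+n (mod 2n)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)      # batch of 32, 128-d projections
loss = nt_xent_loss(z1, z2)
```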
1. The document discusses recent developments in transformer architectures in 2021. It covers large transformers with models of over 100 billion parameters, efficient transformers that aim to address the quadratic attention problem, and new modalities like image, audio and graph transformers.
2. Issues with large models include high costs of training, carbon emissions, potential biases, and static training data not reflecting changing social views. Efficient transformers use techniques like mixture of experts, linear attention approximations, and selective memory to improve scalability.
3. New modalities of transformers in 2021 include vision transformers applied to images and audio transformers for processing sound. Multimodal transformers aim to combine multiple modalities.
1) Transformers use self-attention to address problems with RNNs such as vanishing gradients and limited parallelization, combining ideas from CNNs and attention.
2) Transformers have encoder and decoder blocks: the encoder models the input and the decoder models the output. Variants drop the encoder (GPT) or the decoder (BERT) for language modeling.
3) GPT-3 is a large Transformer with 175B parameters that can perform many NLP tasks but still has safety and bias issues.
Attention Is All You Need.
With these simple words, the Deep Learning industry was forever changed. Transformers were initially introduced in the field of Natural Language Processing to enhance language translation, but they demonstrated astonishing results even outside language processing. In particular, they recently spread in the Computer Vision community, advancing the state-of-the-art on many vision tasks. But what are Transformers? What is the mechanism of self-attention, and do we really need it? How did they revolutionize Computer Vision? Will they ever replace convolutional neural networks?
These and many other questions will be answered during the talk.
In this tech talk, we will discuss:
- A piece of history: Why did we need a new architecture?
- What is self-attention, and where does this concept come from?
- The Transformer architecture and its mechanisms
- Vision Transformers: An Image is worth 16x16 words
- Video Understanding using Transformers: the space + time approach
- The scale and data problem: Is Attention what we really need?
- The future of Computer Vision through Transformers
Speaker: Davide Coccomini, Nicola Messina
Website: https://www.aicamp.ai/event/eventdetails/W2021101110
Vision Transformer (ViT) / An Image is Worth 16*16 Words: Transformers for Ima... (changedaeoh)
The document summarizes a research seminar presentation on using transformers for image recognition without convolutional inductive biases. It discusses how a pure transformer architecture, the Vision Transformer (ViT), can achieve state-of-the-art image classification performance when pretrained on large datasets. ViT works by splitting images into patches and processing the resulting sequence of patch embeddings with a standard Transformer. Experiments show ViT outperforms convolutional models in performance per unit of compute and can learn spatial representations without explicit inductive biases. While limited to classification, ViT shows potential for broader vision tasks if self-supervised pretraining and model extensions improve.
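The patch-splitting step is easy to show in code. Below is a hedged PyTorch sketch of a ViT-style patch embedding using ViT-Base-like dimensions (16x16 patches, 768-d embeddings); the class name and layer sizes are assumptions for illustration.

```python
# Minimal sketch of ViT-style patch embedding: cut the image into 16x16 patches,
# project each patch linearly, then prepend a learnable class token and add
# position embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # a stride-`patch` conv is equivalent to flatten + linear per patch
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # (B, 197, 768)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))   # ready for a Transformer encoder
```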
The document introduces autoencoders, neural networks that compress an input into a lower-dimensional code and then reconstruct the output from that code. It explains that autoencoders can be pre-trained with restricted Boltzmann machines, an unsupervised method, to minimize the reconstruction error. Autoencoders can be used for dimensionality reduction, for document retrieval by compressing documents into codes, and for data visualization by compressing high-dimensional data points into 2D and plotting each category in a different color.
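For the dimensionality-reduction use case, here is a minimal PyTorch autoencoder sketch with a 2-d bottleneck suitable for plotting. It uses ordinary gradient training rather than the RBM pre-training described above; the layer sizes and toy data are assumptions.

```python
# Minimal autoencoder: compress 784-d inputs to a 2-d code, reconstruct, and
# minimize the reconstruction error.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 2))
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784))
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, 784)                    # stand-in for a batch of flattened images
for _ in range(10):                        # toy training loop
    code = encoder(x)                      # 2-d bottleneck, usable for a scatter plot
    recon = decoder(code)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```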
A presentation about the development of the ideas from the autoencoder to the Stable Diffusion text-to-image model.
Models covered: autoencoder, VAE, VQ-VAE, VQ-GAN, latent diffusion, and stable diffusion.
Transfer Learning and Fine-tuning Deep Neural Networks (PyData)
This document outlines Anusua Trivedi's talk on transfer learning and fine-tuning deep neural networks. The talk covers traditional machine learning versus deep learning, using deep convolutional neural networks (DCNNs) for image analysis, transfer learning and fine-tuning DCNNs, recurrent neural networks (RNNs), and case studies applying these techniques to diabetic retinopathy prediction and fashion image caption generation.
[Paper Reading] Attention is All You Need (Daiki Tanaka)
The document summarizes the "Attention Is All You Need" paper, which introduced the Transformer model for natural language processing. The Transformer uses attention mechanisms rather than recurrent or convolutional layers, allowing for more parallelization. It achieved state-of-the-art results in machine translation tasks using techniques like multi-head attention, positional encoding, and beam search decoding. The paper demonstrated the Transformer's ability to draw global dependencies between input and output with constant computational complexity.
Generative Adversarial Networks (GANs) are a class of machine learning frameworks where two neural networks contest with each other in a game. A generator network generates new data instances, while a discriminator network evaluates them for authenticity, classifying them as real or generated. This adversarial process allows the generator to improve over time and generate highly realistic samples that can pass for real data. The document provides an overview of GANs and their variants, including DCGAN, InfoGAN, EBGAN, and ACGAN models. It also discusses techniques for training more stable GANs and escaping issues like mode collapse.
This document is a slide presentation on recent advances in deep learning. It discusses self-supervised learning, which involves using unlabeled data to learn representations by predicting structural information within the data. The presentation covers pretext tasks, invariance-based approaches, and generation-based approaches for self-supervised learning in computer vision and natural language processing. It provides examples of specific self-supervised methods like predicting image rotations, clustering representations to generate pseudo-labels, and masked language modeling.
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
Diffusion models beat GANs on image synthesis (BeerenSahu)
Diffusion models have recently been shown to produce higher quality images than GANs while also offering better diversity and being easier to scale and train. Specifically, a 2021 paper by OpenAI demonstrated that a diffusion model achieved an FID score of 2.97 on ImageNet 128x128, beating the previous state-of-the-art held by BigGAN. Diffusion models work by gradually adding noise to images in a forward process and then learning to remove noise in a backward denoising process, allowing them to generate diverse, high fidelity images.
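The forward noising process described above has a convenient closed form. The numpy sketch below uses the standard DDPM-style formulation x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the linear beta schedule and array shapes are illustrative assumptions.

```python
# Hedged sketch of the diffusion forward (noising) process in closed form.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)    # cumulative signal-retention factor
rng = np.random.default_rng(0)

def q_sample(x0, t):
    """Sample x_t given a clean image x0 at timestep t (0-indexed)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.random((64, 64, 3))           # stand-in image with values in [0, 1)
x_mid = q_sample(x0, t=500)            # partially noised
x_end = q_sample(x0, t=999)            # close to pure Gaussian noise
```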
The document discusses the Vision Transformer (ViT) model for computer vision tasks. It covers:
1. How ViT tokenizes images into patches and uses position embeddings to encode spatial relationships.
2. ViT prepends a learnable class token whose output embedding drives the class prediction, rather than using a decoder or pooled feature maps as in typical CNN classifiers.
3. The receptive field of ViT grows as the attention mechanism allows elements to attend to other distant elements in later layers.
4. Initial results showed ViT performance was comparable to CNNs when trained on large datasets but lagged CNNs trained on smaller datasets like ImageNet.
1) iMTFA is an incremental approach to few-shot instance segmentation that allows adding new classes without retraining.
2) It extends the MTFA baseline by training an instance feature extractor to generate discriminative embeddings for each instance, with the average embedding used as the class representative.
3) At inference, it predicts classes based on the cosine distance between ROI embeddings and stored class representatives, using class-agnostic box regression and mask prediction.
4) Experiments on COCO, VOC2007 and VOC2012 show iMTFA outperforms SOTA few-shot object detection and instance segmentation methods while enabling incremental class addition.
Semantic Segmentation Methods using Deep Learning (Sungjoon Choi)
This document discusses semantic segmentation, which is the task of assigning each pixel in an image to a semantic class. It introduces semantic segmentation and provides a leader board of top performing models. It then details the results of various semantic segmentation models on benchmark datasets, including PSPNet, DeepLab v3+, and DeepLab v3. The models are evaluated based on metrics like mean intersection over union.
PR-284: End-to-End Object Detection with Transformers (DETR) (Jinwon Lee)
This is the 284th paper review from the TensorFlow Korea paper-reading group PR12.
This time the paper is DETR (DEtection TRansformer) from Facebook.
It also sits at the very top of arxiv-sanity's top recent/last year list (http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all).
With ViT recently submitted to ICLR 2021, there has been much talk of Transformers replacing CNNs. This paper was presented at ECCV this year; although it still uses a CNN for feature extraction, I consider it an important work that proposes an effective Transformer-based approach to object detection. The paper points out that detection has long relied on heuristic, non-differentiable components such as anchor boxes and NMS (Non-Maximum Suppression), which has kept object detection, almost uniquely, from being solved end-to-end in the spirit of deep learning. As a remedy, it casts bounding-box prediction as a set prediction problem (no duplicates, order-invariant) and proposes an end-to-end Transformer-based algorithm. If you want the details of DETR, which needs neither anchor boxes nor NMS, please watch the video!
Video link: https://youtu.be/lXpBcW_I54U
Paper link: https://arxiv.org/abs/2005.12872
1. Autoencoders are unsupervised neural networks that are useful for dimensionality reduction and clustering. They compress the input into a latent-space representation then reconstruct the output from this representation.
2. Deep autoencoders stack multiple autoencoder layers to learn hierarchical representations of the data. Each layer is trained sequentially.
3. Variational autoencoders use probabilistic encoders and decoders to learn a Gaussian latent space, and can generate new samples from the learned data distribution (see the sketch below).
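A hedged PyTorch sketch of the VAE reparameterization trick described in the last point: the encoder predicts a mean and log-variance, a latent is sampled as z = mu + sigma * eps, and the loss combines a reconstruction term with a KL term toward N(0, I). The class name and layer sizes are assumptions for illustration.

```python
# Minimal VAE sketch with the reparameterization trick.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent)   # outputs [mu, log_var]
        self.dec = nn.Linear(latent, in_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
        recon = torch.sigmoid(self.dec(z))
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
        return F.binary_cross_entropy(recon, x, reduction="mean") + kl

vae = TinyVAE()
loss = vae(torch.rand(8, 784))
# new samples: torch.sigmoid(vae.dec(torch.randn(8, 16)))
```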
The Transformer is an established architecture in natural language processing built around a framework of self-attention within a deep learning approach.
This presentation was delivered under the mentorship of Mr. Mukunthan Tharmakulasingam (University of Surrey, UK), as a part of the ScholarX program from Sustainable Education Foundation.
The document discusses the application of transformers to computer vision tasks. It first introduces the standard transformer architecture and its use in natural language processing. It then summarizes recent works on applying transformers to object detection (DETR) and image classification (ViT). DETR proposes an end-to-end object detection method using a CNN-Transformer encoder-decoder architecture. Deformable DETR improves on DETR by incorporating deformable attention mechanisms. ViT represents images as sequences of patches and applies a standard Transformer encoder for image recognition, exceeding state-of-the-art models with less pre-training computation. While promising results have been achieved, challenges remain regarding model parameters and expanding transformer applications to other computer vision tasks.
This document summarizes the gradient boosting algorithms XGBoost and LightGBM. It covers decision trees, overfitting, regularization, feature engineering, parameter tuning, evaluation metrics, and comparisons between XGBoost and LightGBM. Key aspects discussed include XGBoost's and LightGBM's tolerance of outliers, non-standardized features, collinear features, and NaN values. Parameter tuning with RandomizedSearchCV and GridSearchCV, and ensembling models to optimize multiple metrics, are also covered (a tuning sketch follows).
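A hedged sketch of the tuning workflow mentioned above, using scikit-learn's RandomizedSearchCV around an XGBoost classifier. It assumes the `xgboost` package is installed; the parameter ranges and synthetic data are illustrative.

```python
# Randomized hyperparameter search over an XGBoost classifier (illustrative ranges).
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [100, 300, 500],
        "subsample": [0.7, 0.9, 1.0],
    },
    n_iter=10, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```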
This tutorial provides an overview of recent advances in deep generative models. It will cover three types of generative models: Markov models, latent variable models, and implicit models. The tutorial aims to give attendees a full understanding of the latest developments in generative modeling and how these models can be applied to high-dimensional data. Several challenges and open questions in the field will also be discussed. The tutorial is intended for the 2017 conference of the International Society for Bayesian Analysis.
jefferson-mae Masked Autoencoders based Pretraining (cevesom156)
1) Masked self-supervised pre-training (MAE) provides an effective way to pre-train vision models like ViT in a similar manner to masked language models.
2) MAE works by masking patches of images at a high ratio like 75%, encoding the visible patches, and predicting the masked patches with a lightweight decoder.
3) MAE achieves superior results compared to contrastive learning methods on downstream tasks, with either linear probing or end-to-end fine-tuning (a sketch of the masked-patch loss follows this list).
4) MAE can also be extended to videos by masking 3D spatiotemporal patches and works well with even higher masking ratios of 90%.
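Complementing the masking sketch earlier on this page, here is a hedged sketch of MAE's reconstruction objective, computed only on masked patches. Shapes and the helper name are assumptions; the paper also reports a variant that normalizes the pixel targets per patch.

```python
# Mean squared error over masked patches only, as in MAE's objective.
import torch

def mae_loss(pred, target, mask):
    """pred, target: (B, num_patches, patch_dim); mask: (B, num_patches),
    1 where the patch was masked and must be reconstructed."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (B, num_patches)
    return (per_patch * mask).sum() / mask.sum()      # average over masked patches only

pred = torch.randn(2, 196, 768)
target = torch.randn(2, 196, 768)
mask = (torch.rand(2, 196) < 0.75).float()            # ~75% of patches masked
loss = mae_loss(pred, target, mask)
```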
multi modal transformers representation generation .pptx (siddharth1729)
This presentation covers multimodal representation learning using multimodal transformer techniques such as VisualBert, VilBert, and MBit transformers, and discusses the pros and cons of various modeling approaches.
The document discusses image captioning using deep neural networks. It begins by noting that humans can describe images easily, while generating image captions with a computer program was long considered very difficult. Recent advances in deep learning, specifically convolutional neural networks (CNNs) to recognize objects in images and recurrent neural networks (RNNs) to generate captions, have enabled automated image captioning. The document discusses CNN and RNN architectures for image captioning and gives examples of pre-trained models that can be used, such as VGG-16.
This document discusses attention mechanisms, which focus on important parts of input data while ignoring other parts. Attention was first used in machine translation in 2014 and is now widely used in natural language processing, computer vision, speech recognition and other domains. Attention mechanisms help overcome drawbacks of traditional encoder-decoder models. There are several categories of attention based on the number of sequences, abstraction levels, positions and representations. Stand-alone self-attention models have been applied successfully to computer vision tasks like image classification and object detection, outperforming convolutional baselines.
Survey of Attention mechanism & Use in Computer Vision (SwatiNarkhede1)
This presentation contains the overview of Attention models. It also has information of the stand alone self attention model used for Computer Vision tasks.
FaceNet provides a unified embedding for face recognition, verification, and clustering tasks using a deep convolutional neural network. It was developed by Google researchers and achieved state-of-the-art results on benchmark datasets, cutting the error rate by 30% compared to previous work. The model uses a 22-layer CNN that maps face images to 128-dimensional embeddings, where distances between embeddings correspond to face similarity. It was trained with triplet loss to optimize the embeddings.
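The triplet loss mentioned above is compact enough to show directly. Below is a hedged PyTorch sketch with FaceNet-style 128-d unit-norm embeddings; the margin value and batch shapes are illustrative assumptions.

```python
# Triplet loss: pull an anchor toward a positive of the same identity and push
# it at least `margin` away from a negative of a different identity.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared L2, same identity
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared L2, different identity
    return F.relu(d_pos - d_neg + margin).mean()

# 128-d unit-norm embeddings, as in FaceNet
a = F.normalize(torch.randn(16, 128), dim=1)
p = F.normalize(torch.randn(16, 128), dim=1)
n = F.normalize(torch.randn(16, 128), dim=1)
loss = triplet_loss(a, p, n)
```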
This document summarizes a presentation on deep image processing and computer vision. It introduces common deep learning techniques like CNNs, autoencoders, variational autoencoders and generative adversarial networks. It then discusses applications including image classification using models like LeNet, AlexNet and VGG. It also covers face detection, segmentation, object detection algorithms like R-CNN, Fast R-CNN and Faster R-CNN. Additional topics include document automation using character recognition and graphical element analysis, as well as identity recognition using face detection. Real-world examples are provided for document processing, handwritten letter recognition and event pass verification.
AUTOENCODER AND ITS TYPES, HOW ITS USED, APPLICATIONS, ADVANTAGES AND DISAD... (devismileyrockz)
An autoencoder is an artificial neural network used for unsupervised learning tasks. It compresses the input into a lower-dimensional encoding and then reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, anomaly detection, and extracting meaningful features from data. They work by learning efficient data encodings in an unsupervised manner.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/08/high-fidelity-conversion-of-floating-point-networks-for-low-precision-inference-using-distillation-with-limited-data-a-presentation-from-imagination-technologies/
James Imber, Senior Research Engineer at Imagination Technologies, presents the “High-fidelity Conversion of Floating-point Networks for Low-precision Inference using Distillation with Limited Data” tutorial at the May 2021 Embedded Vision Summit.
When converting floating-point networks to low-precision equivalents for high-performance inference, the primary objective is to maximally compress the network whilst maintaining fidelity to the original, floating-point network. This is made particularly challenging when only a reduced or unlabelled dataset is available. Data may be limited for reasons of a commercial or legal nature: for example, companies may be unwilling to share valuable data and labels that represent a substantial investment of resources; or the collector of the original dataset may not be permitted to share it for data privacy reasons.
Imber presents a method based on distillation that allows high-fidelity, low-precision networks to be produced for a wide range of different network types, using the original trained network in place of a labeled dataset. The proposed approach is directly applicable across multiple domains (e.g. classification, segmentation and style transfer) and can be adapted to numerous network compression techniques.
- Autoencoders are unsupervised neural networks that compress input data into a latent space representation and then reconstruct the output from this representation. They aim to copy their input to their output with minimal loss of information.
- Autoencoders consist of an encoder that compresses the input into a latent space and a decoder that decompresses this latent space back into the original input space. The network is trained to minimize the reconstruction loss between the input and output.
- Autoencoders are commonly used for dimensionality reduction, feature extraction, denoising images, and generating new data similar to the training data distribution.
This document provides an overview of deep learning including definitions, architectures, types of deep learning networks, and applications. It defines deep learning as a branch of machine learning that uses neural networks with multiple hidden layers to perform feature extraction and transformation without being explicitly programmed. The main architectures discussed are deep neural networks, deep belief networks, and recurrent neural networks. The types of deep learning networks covered include feedforward neural networks, recurrent neural networks, convolutional neural networks, restricted Boltzmann machines, and autoencoders. Finally, the document discusses several applications of deep learning across industries such as self-driving cars, natural language processing, virtual assistants, and healthcare.
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017 (StampedeCon)
Words are no longer sufficient in delivering the search results users are looking for, particularly in relation to image search. Text and languages pose many challenges in describing visual details and providing the necessary context for optimal results. Machine Learning technology opens a new world of search innovation that has yet to be applied by businesses.
In this session, Mike Ranzinger of Shutterstock will share a technical presentation detailing his research on composition aware search. He will also demonstrate how the research led to the launch of AI technology allowing users to more precisely find the image they need within Shutterstock’s collection of more than 150 million images. While the company released a number of AI search enabled tools in 2016, this new technology allows users to search for items in an image and specify where they should be located within the image. The research identifies the networks that localize and describe regions of an image as well as the relationships between things. The goal of this research was to improve the future of search using visual data, contextual search functions, and AI. A combination of multiple machine learning technologies led to this breakthrough.
- Autoencoders are unsupervised neural networks that are used for dimensionality reduction and feature extraction. They compress the input into a latent-space representation and then reconstruct the output from this representation.
- The architecture of an autoencoder consists of an encoder that compresses the input into a latent space, a decoder that reconstructs the output from the latent space, and a reconstruction loss that is minimized during training.
- There are different types of autoencoders like undercomplete, convolutional, sparse, denoising, contractive, stacked, and deep autoencoders that apply additional constraints or have more complex architectures. Autoencoders can be used for tasks like image compression, anomaly detection, and feature learning.
Convolutional neural networks can be used for handwritten digit recognition. They employ replicated feature detectors with shared weights to achieve translation equivariance. Pooling layers provide some translation invariance while reducing the number of inputs to subsequent layers. The LeNet architecture developed by Yann LeCun used these techniques along with multiple hidden layers and achieved an error rate of around 1% on handwritten digit recognition. Dropout regularization helps convolutional neural networks generalize well when applied to large scale tasks like ImageNet classification by preventing complex co-adaptations between hidden units.
This document discusses using fully convolutional neural networks for defect inspection. It begins with an agenda that outlines image segmentation using FCNs and defect inspection. It then provides details on data preparation including labeling guidelines, data augmentation, and model setup using techniques like deconvolution layers and the U-Net architecture. Metrics for evaluating the model like Dice score and IoU are also covered. The document concludes with best practices for successful deep learning projects focusing on aspects like having a large reusable dataset, feasibility of the problem, potential payoff, and fault tolerance.
This document discusses two-dimensional wavelets for image processing. It explains that 2D wavelets can be constructed as separable products of 1D wavelets, using scaling functions and wavelet functions. The document provides examples of 2D Haar wavelets and discusses how a 2D wavelet decomposition breaks down the frequency content of an image into different subbands. It also summarizes applications of 2D wavelets such as image denoising, edge detection, and compression.
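As a quick illustration of the separable 2D decomposition described above, here is a short sketch using the PyWavelets package (assumed installed); the random image is a stand-in for real data.

```python
# Single-level 2D Haar wavelet decomposition: one approximation subband plus
# horizontal, vertical, and diagonal detail subbands.
import numpy as np
import pywt

image = np.random.rand(256, 256)                 # stand-in grayscale image
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")      # each subband is 128x128
print(cA.shape, cH.shape, cV.shape, cD.shape)

reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
print(np.allclose(image, reconstructed))         # perfect reconstruction: True
```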
An Introduction to Pre-training General Language Representations (zperjaccico)
This document provides an overview of pre-training general language representations. It discusses early methods like ELMo and GPT that used bidirectional and autoregressive language models. It then focuses on BERT, explaining its bidirectional transformer architecture and pre-training objectives of masked language modeling and next sentence prediction. The document outlines extensions to BERT like ALBERT which aims to reduce parameters. It also discusses Chinese models like ERNIE and MT-BERT which were adapted from BERT for the Chinese language.
Similar to PR-355: Masked Autoencoders Are Scalable Vision Learners (20)
#PR12 #PR366
Hello, this is the 366th paper review from the paper-reading group PR-12.
This year marks the tenth anniversary of AlexNet.
After AlexNet's meteoric arrival in 2012, the 2010s, when "solve a computer vision problem = use a CNN" was treated as a formula, have passed, and in the 2020s, beginning with the appearance of ViT, Transformer-based networks have threatened CNNs' position and already taken much of it.
Which road should CNNs take in the 2020s?
Is it really true that Transformers, with their weaker inductive bias, always beat CNNs when trained on large-scale data?
Under the title "A ConvNet for the 2020s", this paper proposes a new(?) architecture called ConvNeXt.
In truth there is little that is new: the authors take existing techniques and ideas applied in Transformers and port them to a CNN, and they report better accuracy and faster speed than Transformers.
Some controversy about the results has been brewing on Twitter; that discussion and the full details are covered in the video.
Thanks as always to everyone who watches, likes, comments, and subscribes :)
Paper link: https://arxiv.org/abs/2201.03545
Video link: https://youtu.be/Mw7IhO2uBGc
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme... (Jinwon Lee)
#PR12 #PR344
Hello, this is the 344th paper review from the TensorFlow Korea paper-reading group PR-12.
Today's paper, from USTC and MSRA, carries the striking title A Battle of Network Structures.
As the subtitle makes clear, it compares CNNs, Transformers, and MLPs for computer vision under identical conditions to see what characterizes each.
To experiment under the same conditions, the authors build a unified framework called SPACH and plug CNN, Transformer, and MLP modules into it.
One result of the experiments is that all three perform similarly when conditions are right, but MLPs overfit as model size grows; CNNs have better generalization capability, performing well even with little data compared to Transformers; and Transformers, with their larger model capacity, excel when data is plentiful and compute is abundant.
Another finding is that even models with a global receptive field, such as Transformers and MLPs, improve when paired with local models that perform local operations.
Building on these insights, the paper proposes a hybrid model combining CNN and Transformer and shows it can reach SOTA performance.
Personally I did not find any startling insight here, but I would call it a paper that nicely organizes the characteristics, strengths, and weaknesses of the three network families.
See the video for details. Thank you!
Video link: https://youtu.be/NVLMZZglx14
Paper link: https://arxiv.org/abs/2108.13002
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi... (Jinwon Lee)
The document summarizes a study on training Vision Transformers (ViTs) by exploring different combinations of data augmentation, regularization techniques, model sizes, and training dataset sizes. Some key findings include: 1) Models trained with extensive data augmentation on ImageNet-1k performed comparably to those trained on the larger ImageNet-21k dataset without augmentation. 2) Transfer learning from pre-trained models was more efficient and achieved better results than training models from scratch, even with extensive compute. 3) Models pre-trained on more data showed better transfer ability, indicating more data yields more generic representations.
PR-317: MLP-Mixer: An all-MLP Architecture for Vision (Jinwon Lee)
Can CNNs survive in the field of computer vision?
Hello, this is the 317th paper review from the TensorFlow Korea paper-reading group PR-12.
This time I reviewed MLP-Mixer: An all-MLP Architecture for Vision from Google Research, Brain Team.
The assault from attention was hard enough to fend off; now comes an assault from the MLP (multi-layer perceptron).
Using only MLPs for image classification, it is both accurate and fast.
Briefly, the architecture replaces the self-attention part of ViT (Vision Transformer) with MLPs.
Two MLP blocks are used: one mixes information across patches (tokens), and one operates within each patch (a sketch of one Mixer block follows this entry).
Although MLPs are used, as the paper itself notes, these operations can be viewed as a kind of convolution.
Even so, it reduces the quadratic complexity that Transformer-based networks inevitably carry down to linear, and it is impressive that such a very simple structure, with almost none of convolution's inductive bias, performs this well.
On the other hand, the need for large amounts of data and the MLP limitation of accepting only fixed-length inputs are, I think, real drawbacks.
I hope this work becomes an occasion for MLPs to receive renewed attention.
Similar concurrent works are also briefly introduced at the end.
Enjoy, and thank you!
Paper link: https://arxiv.org/abs/2105.01601
Video link: https://youtu.be/KQmZlxdnnuY
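A hedged PyTorch sketch of one Mixer block as described above: a token-mixing MLP across patches and a channel-mixing MLP within each patch, each with layer norm and a residual connection. The hidden sizes are illustrative assumptions.

```python
# One MLP-Mixer block: token mixing replaces self-attention; channel mixing
# plays the role of the Transformer's per-token feed-forward network.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches=196, dim=768, token_hidden=384, chan_hidden=3072):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                                  # x: (B, patches, dim)
        # mix across patches: transpose so Linear acts on the patch axis
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))            # mix within each patch

out = MixerBlock()(torch.randn(2, 196, 768))
```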
PR-297: Training data-efficient image transformers & distillation through att... (Jinwon Lee)
Hello, this is the 297th review from the TensorFlow Korea paper-reading group PR-12.
Only three papers remain until the end of PR-12 season 3.
Once season 3 ends, recruiting for new season 4 members will begin. Your interest and applications are welcome!
(The recruiting notice will be posted in the Facebook TensorFlow Korea group.)
The paper I reviewed today is Facebook's Training data-efficient image transformers & distillation through attention.
Since Google's ViT paper, interest in computer vision algorithms that use no convolutions at all and rely solely on attention has never been higher.
The DeiT model proposed in this paper uses the same architecture as ViT; whereas ViT could not reach strong performance with ImageNet data alone, DeiT combines improved training recipes with a new knowledge distillation method to outperform EfficientNet using only ImageNet data.
Will CNNs now slowly fade away? Will attention conquer computer vision as well?
Personally, I am convinced that attention-based CV papers will pour out for a while, and that surprising things can happen there.
CNNs have matured through a decade of research, while Transformers have only just been applied to CV, so I have even higher expectations; and because attention is the model family with the least inductive bias, I believe it can produce even more surprises.
OpenAI's recent DALL-E is a representative example. If you are curious about yet another transformation of the Transformer, please see the video below.
Video link: https://youtu.be/DjEvzeiWBTo
Paper link: https://arxiv.org/abs/2012.12877
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector (Jinwon Lee)
This is the 270th paper review from the TensorFlow Korea paper-reading group PR12.
This paper is PP-YOLO: An Effective and Efficient Implementation of Object Detector, from Baidu. By applying a variety of techniques to YOLOv3, it catches two birds with one stone(?): very high accuracy together with very high speed. The review takes a deeper look at the many tricks used in the paper. If you are interested in detection techniques such as deformable convolution, exponential moving average, DropBlock, IoU-aware prediction, grid sensitivity elimination, Matrix NMS, and CoordConv, the video and slides are worth a look!
Paper link: https://arxiv.org/abs/2007.12099
Video link: https://youtu.be/7v34cCE5H4k
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be... (Jinwon Lee)
This is the 258th paper review from the TensorFlow Korea paper-reading group PR12.
This paper is From ImageNet to Image Classification: Contextualizing Progress on Benchmarks, from MIT.
Anyone who works in deep learning knows ImageNet. This paper discusses the limitations and problems of ImageNet's labeling procedure and points out that evaluation based on top-1 accuracy can itself be problematic.
More than 20% of ImageNet images contain multiple objects, yet only one is accepted as the correct answer; due to limitations of the annotation pipeline, many images are labeled with a class different from what people actually perceive; and with more than 20 terrier breeds alone, many labels are hard to judge without expert knowledge. Through a range of experiments, combining quantitative analysis with human-in-the-loop evaluation, the paper examines how far current models have really come and which data-labeling problems must be solved to push performance higher. The paper is long but light on technical machinery, so it is an easy read; if you are curious about the details, see the video!
Paper link: https://arxiv.org/abs/2005.11295
Video link: https://youtu.be/CPMgX5ikL_8
This is the 243rd paper review from the TensorFlow Korea paper-reading group PR12.
This paper is Designing Network Design Spaces from Facebook AI Research, known as RegNet.
When designing a CNN, are bottleneck layers really beneficial? Do more layers always yield higher accuracy? When the width and height of the activation map are halved (stride 2 or pooling), the channel count is doubled, but is that the best choice? Might a network without bottleneck layers be better, is there a magic number of layers that gives peak performance, and when activations halve, might tripling the channels beat doubling them?
Rather than designing a single neural network well, this paper is about designing a good design space: a space inhabited by good networks that techniques such as AutoML can then search. Starting from a nearly unconstrained design space, it narrows toward a good design space through a human-in-the-loop process. The video below shows which design space produced RegNet, which outperforms EfficientNet, and which design choices we took for granted turn out to be mistaken.
Video link: https://youtu.be/bnbKQRae_u4
Paper link: https://arxiv.org/abs/2003.13678
PR-217: EfficientDet: Scalable and Efficient Object Detection (Jinwon Lee)
This is the 217th paper review from the TensorFlow Korea paper-reading group PR12.
This paper is EfficientDet from Google Brain. A follow-up to EfficientNet, it proposes an object-detection method that pursues both accuracy and efficiency. To that end it introduces a weighted bidirectional feature pyramid network (BiFPN) and a compound scaling method for detection similar to EfficientNet's; see the video for details.
Paper link: https://arxiv.org/abs/1911.09070
Video link: https://youtu.be/11jDC8uZL0E
PR-207: YOLOv3: An Incremental Improvement (Jinwon Lee)
YOLOv3 makes the following incremental improvements over previous versions of YOLO:
1. It predicts bounding boxes at three different scales to detect objects more accurately at a variety of sizes.
2. It uses Darknet-53 as its feature extractor, which provides better performance than ResNet while being faster to evaluate.
3. It predicts more bounding boxes overall (over 10,000) to detect objects more precisely, as compared to YOLOv2 which predicts around 800 boxes.
PR-197: One ticket to win them all: generalizing lottery ticket initializatio... (Jinwon Lee)
This is the 197th paper review from the TensorFlow Korea paper-reading group PR12.
(Only three papers remain to reach season 2's goal of 200.)
The paper I presented this time is One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers, from FAIR (Facebook AI Research).
How nice would it be if a single ticket could win first prize in every lottery?
Conventional network pruning fine-tunes the pruned network while reusing the weights learned before pruning,
because randomly re-initializing a pruned network and training it from scratch tends to perform poorly.
Last year's Lottery Ticket Hypothesis paper from MIT showed how to randomly initialize a pruned network so that it still reaches high performance,
publishing this initialization method and naming such an initialization a winning ticket.
But would a winning ticket still work with a different dataset or a different optimizer?
For example, could a winning ticket found on CIFAR10 also behave as a winning ticket on ImageNet?
This paper answers these questions through experiments and contains several insights about initialization.
For details, please see the presentation video!
Video link: https://youtu.be/YmTNpF2OOjA
Slides link: https://www.slideshare.net/JinwonLee9/pr197-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers
Paper link: https://arxiv.org/abs/1906.02773
PR-183: MixNet: Mixed Depthwise Convolutional Kernels (Jinwon Lee)
This is the 183rd paper review of the TensorFlow-KR paper-reading group PR12 (12PR).
The paper reviewed this time is MixNet, from Google Brain. Depthwise convolutions are widely used in efficiency-oriented CNNs; this paper proposes varying the depthwise convolution filter sizes to improve both accuracy and efficiency. See the video for details.
Paper link: https://arxiv.org/abs/1907.09595
Video link: https://youtu.be/252YxqpHzsg
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Jinwon Lee)
This is the 169th paper review of the TensorFlow-KR paper-reading group PR12.
The paper examined this time is EfficientNet from Google. Research on efficient neural networks has usually focused on small networks for edge devices with limited compute, such as mobile; this paper instead studies how to scale a network up more efficiently, since networks are generally grown larger and larger to raise accuracy. See the video for details (a small sketch of the compound scaling rule follows this entry).
Paper link: https://arxiv.org/abs/1905.11946
Video link: https://youtu.be/Vhz0quyvR7I
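As a small worked example of the compound scaling idea, the sketch below applies the coefficients reported in the EfficientNet paper (alpha=1.2, beta=1.1, gamma=1.15, found by grid search on the B0 baseline); the helper name and base values are assumptions for illustration.

```python
# Compound scaling: depth, width, and resolution grow as alpha^phi, beta^phi,
# gamma^phi with alpha * beta^2 * gamma^2 ~= 2, so each increment of phi
# roughly doubles FLOPs.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi, base_resolution=224):
    return {
        "depth_mult": alpha ** phi,
        "width_mult": beta ** phi,
        "resolution": round(base_resolution * gamma ** phi),
    }

for phi in range(4):        # roughly B0 through B3
    print(phi, compound_scale(phi))
```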
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition (Jinwon Lee)
The document discusses exploring randomly wired neural networks for image recognition. The authors define network generators based on random graph models like Erdos-Renyi, Barabasi-Albert, and Watts-Strogatz. These generators produce neural networks with random connectivity patterns. Experiments show that networks generated this way achieve competitive accuracy compared to hand-designed architectures, demonstrating that the generator design is important. The random wiring patterns provide performance comparable to networks from neural architecture search with fewer parameters and computations.
PR-144: SqueezeNext: Hardware-Aware Neural Network Design (Jinwon Lee)
This is the 144th paper review of the TensorFlow-KR paper-reading group PR12.
This review covers SqueezeNext, one of the representative efficient CNNs. SqueezeNet, the predecessor of SqueezeNext, is reviewed as well, along with NetScore, a paper on a metric for evaluating CNNs in which SqueezeNext ranked first.
Paper links:
SqueezeNext - https://arxiv.org/abs/1803.10615
SqueezeNet - https://arxiv.org/abs/1602.07360
NetScore - https://arxiv.org/abs/1806.05512
Video link: https://youtu.be/WReWeADJ3Pw
PR-132: SSD: Single Shot MultiBox DetectorJinwon Lee
SSD is a single-shot object detector that processes the entire image at once, rather than first proposing regions of interest. It uses a base VGG16 network with additional convolutional layers to predict bounding boxes and class probabilities at multiple feature-map scales simultaneously. SSD achieves state-of-the-art accuracy while running significantly faster than two-stage detectors like Faster R-CNN. It introduces techniques such as default boxes, hard negative mining, and data augmentation to address class imbalance and improve results on small objects. On PASCAL VOC 2007, SSD detects objects at 59 FPS with 74.3% mAP, comparable in accuracy to Faster R-CNN but much faster.
PR095: Modularity Matters: Learning Invariant Relational Reasoning TasksJinwon Lee
This is the 95th presentation video of the TensorFlow-KR paper-reading group.
Titled Modularity Matters, this paper proposes a method for solving visual relational reasoning tasks. It shows that conventional CNNs are weak at such tasks and presents a way to address this. I reviewed it because it is a topic I am interested in and because it comes from Professor Bengio's team.
Video link: https://youtu.be/dAGI3mlOmfw
Paper link: https://arxiv.org/abs/1806.06765
PR-355: Masked Autoencoders Are Scalable Vision Learners
1. Masked Autoencoders
Are Scalable Vision Learners
Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”
14th November, 2021
PR12 Paper Review
JinWon Lee
2. Introduction
• Deep learning has witnessed an explosion of architectures of
continuously growing capability and capacity.
• Aided by the rapid gains in hardware, models today can easily overfit
one million images and begin to demand hundreds of millions of—
often publicly inaccessible—labeled images.
• This appetite for data has been successfully addressed in natural
language processing (NLP) by self-supervised pretraining.
3. Introduction
• The solutions, based on autoregressive language modeling in GPT and
masked autoencoding in BERT, are conceptually simple: they remove
a portion of the data and learn to predict the removed content.
• These methods now enable training of generalizable NLP models
containing over one hundred billion parameters.
4. Introduction
• The idea of masked autoencoders, a form of more general denoising
autoencoders, is natural and applicable in computer vision as well.
• However, despite significant interest in this idea following the success
of BERT, progress of autoencoding methods in vision lags behind NLP.
5. What makes masked autoencoding different
between vision and language?
• Until recently, architectures were different. In vision, convolutional
networks were dominant over the last decade.
• Convolutions typically operate on regular grids and it is not
straightforward to integrate ‘indicators’ such as mask tokens or
positional embeddings into convolutional networks.
• This architectural gap, however, has been addressed with the
introduction of Vision Transformers (ViT) and should no longer
present an obstacle.
6. What makes masked autoencoding different
between vision and language?
• Information density is different between language and vision.
• Languages are human-generated signals that are highly semantic and
information-dense.
• When training a model to predict only a few missing words per
sentence, this task appears to induce sophisticated language
understanding.
• Images, on the contrary, are natural signals with heavy spatial
redundancy—e.g., a missing patch can be recovered from neighboring
patches with little high-level understanding of parts, objects, and
scenes.
7. What makes masked autoencoding different
between vision and language?
• The autoencoder’s decoder, which maps the latent representation
back to the input, plays a different role between reconstructing text
and images.
• In vision, the decoder reconstructs pixels, hence its output is of a
lower semantic level than common recognition tasks.
• This is in contrast to language, where the decoder predicts missing
words that contain rich semantic information.
• While in BERT the decoder can be trivial (an MLP), the authors found that for
images, the decoder design plays a key role in determining the
semantic level of the learned latent representations.
8. Related Work
• Masked language modeling
▪ BERT and GPT, are highly successful methods for pre-training in NLP.
▪ These methods hold out a portion of the input sequence and train models to
predict the missing content.
▪ These methods have been shown to scale excellently, and abundant
evidence indicates that these pre-trained representations generalize well
to various downstream tasks.
9. Related Work
• Autoencoding
▪ It has an encoder that maps an input to a latent representation and a decoder
that reconstructs the input.
▪ Denoising autoencoders (DAE) are a class of autoencoders that corrupt an
input signal and learn to reconstruct the original, uncorrupted signal.
▪ A series of methods can be thought of as a generalized DAE under different
corruptions, e.g., masking pixels or removing color channels.
10. Related Work
• Masked image encoding
▪ The pioneering work of Vincent et al. presents masking as a noise type in DAE.
▪ Context Encoder inpaints large missing regions using convolutional networks.
▪ Motivated by the success in NLP, related recent methods are based on
Transformers. iGPT operates on sequences of pixels and predicts unknown
pixels. The ViT paper studies masked patch prediction for self-supervised
learning. Most recently, BEiT proposes to predict discrete tokens.
12. Related Work
• Self-supervised learning(SSL)
▪ SSL approaches have seen significant interest in computer vision, often
focusing on different pretext tasks for pre-training.
▪ Recently, contrastive learning has been popular, which models image
similarity and dissimilarity (or only similarity) between two or more views.
▪ Contrastive and related methods strongly depend on data augmentation.
▪ Autoencoding pursues a conceptually different direction, and it exhibits
different behaviors.
13. Approach
• The proposed masked autoencoder (MAE) is a simple autoencoding
approach that reconstructs the original signal given its partial
observation.
• Like all autoencoders, MAE has an encoder that maps the observed
signal to a latent representation, and a decoder that reconstructs the
original signal from the latent representation.
• Unlike classical autoencoders, the authors adopt an asymmetric
design that allows the encoder to operate only on the partial,
observed signal (without mask tokens) and a lightweight decoder that
reconstructs the full signal from the latent representation and mask
tokens.
14. MAE Architecture
• During pre-training, a large random
subset of image patches (e.g., 75%)
is masked out.
• The encoder is applied to the small
subset of visible patches.
• Mask tokens are introduced after
the encoder, and the full set of
encoded patches and mask tokens
is processed by a small decoder
that reconstructs the original
image in pixels.
• After pre-training, the decoder is
discarded and the encoder is
applied to uncorrupted images to
produce representations for
recognition tasks.
16. MAE Details
• Masking
▪ Following ViT, an image is divided into regular non-overlapping patches. Then
a subset of patches is sampled and masked.
▪ The sampling strategy is straightforward: sample random patches without replacement,
following a uniform distribution.
▪ Random sampling with a high masking ratio largely eliminates redundancy,
thus creating a task that cannot be easily solved by extrapolation from visible
neighboring patches.
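To make this concrete, here is a minimal PyTorch sketch of patchifying an image and keeping a uniformly sampled subset of patches. The function names, tensor shapes, and the mask-ratio default are illustrative assumptions, not the paper's reference code.

    import torch

    def patchify(imgs, p=16):
        """Split images (B, C, H, W) into flattened non-overlapping p x p patches."""
        B, C, H, W = imgs.shape
        patches = imgs.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, (H // p) * (W // p), C * p * p)

    def random_masking(patch_tokens, mask_ratio=0.75):
        """Keep a random subset of patch tokens, sampled uniformly without replacement."""
        B, N, D = patch_tokens.shape
        num_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N)                   # one uniform score per patch
        shuffle_idx = noise.argsort(dim=1)         # a random permutation of the patches
        restore_idx = shuffle_idx.argsort(dim=1)   # inverse permutation, used to unshuffle
        keep_idx = shuffle_idx[:, :num_keep]       # the first num_keep patches stay visible
        visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        mask = torch.ones(B, N)                    # 0 = visible, 1 = masked
        mask[:, :num_keep] = 0
        mask = torch.gather(mask, 1, restore_idx)  # realign the mask with the original order
        return visible, mask, restore_idx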
17. MAE Details
• MAE encoder
▪ The MAE encoder is a ViT, but applied only on visible, unmasked patches.
▪ Just as in a standard ViT, the MAE encoder embeds patches by a linear projection
with added positional embeddings.
▪ However, this encoder only operates on a small subset (e.g., 25%) of the full
set.
▪ This can reduce overall pre-training time by 3x or more and likewise reduce
memory consumption, enabling us to easily scale MAE to large models.
18. MAE Details
• MAE decoder
▪ The input to the MAE decoder is the full set of tokens consisting of (i)
encoded visible patches, and (ii) mask tokens.
▪ Each mask token is a shared, learned vector that indicates the presence of a
missing patch to be predicted.
▪ Positional embeddings are added to all tokens in this full set.
▪ The MAE decoder is only used during pre-training to perform the image
reconstruction task, so the decoder architecture can be flexibly designed in a
manner that is independent of the encoder design.
▪ MAE's default decoder has <10% computation per token vs. the encoder. With
this asymmetrical design, the full set of tokens is processed only by the
lightweight decoder, which significantly reduces pre-training time.
19. MAE Details
• Reconstruction target
▪ MAE reconstructs the input by predicting the pixel values for each masked
patch.
▪ The loss function computes the mean squared error (MSE) between the
reconstructed and original images in the pixel space. The loss is computed only
on masked patches, similar to BERT.
• Simple implementation
▪ First, generate a token for every input patch (by linear projection with an
added positional embedding). Next, randomly shuffle the list of tokens and
remove the last portion of the list, based on the masking ratio.
▪ After encoding, append a list of mask tokens to the list of encoded patches,
and unshuffle this full list (inverting the random shuffle operation) to align all
tokens with their targets.
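Continuing the sketch above (and reusing the random_masking helper from it), the full forward pass, encode visible tokens, append mask tokens, unshuffle, reconstruct, masked MSE, might look roughly like this. The tiny Transformer stand-ins and all sizes are placeholders, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MAESketch(nn.Module):
        """Rough sketch of the MAE forward pass; not the reference implementation."""

        def __init__(self, num_patches=196, dim=768, dec_dim=512, patch_pixels=768):
            super().__init__()
            # stand-ins for the ViT encoder and the lightweight decoder
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
            self.enc_to_dec = nn.Linear(dim, dec_dim)
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))   # shared, learned
            self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
            self.decoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True), num_layers=1)
            self.to_pixels = nn.Linear(dec_dim, patch_pixels)

        def forward(self, patch_tokens, target_patches, mask_ratio=0.75):
            # encoder positional embeddings are omitted here for brevity
            B, N, _ = patch_tokens.shape
            visible, mask, restore_idx = random_masking(patch_tokens, mask_ratio)
            latent = self.enc_to_dec(self.encoder(visible))  # encoder sees visible tokens only
            # append shared mask tokens, then unshuffle back to the original patch order
            n_masked = N - latent.shape[1]
            full = torch.cat([latent, self.mask_token.expand(B, n_masked, -1)], dim=1)
            full = torch.gather(full, 1, restore_idx.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
            pred = self.to_pixels(self.decoder(full + self.dec_pos))
            # MSE in pixel space, averaged over masked patches only
            loss = ((pred - target_patches) ** 2).mean(dim=-1)
            return (loss * mask).sum() / mask.sum()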
20. ImageNet Experiments
• The authors do self-supervised pre-training on the ImageNet-1K(IN1K)
training set.
• Then they do supervised training to evaluate the representations with
(i) end-to-end fine-tuning or (ii) linear probing (a small sketch of linear probing follows after this slide).
• Baseline: ViT-Large
▪ ViT-Large (ViT-L/16) is used as the backbone in the ablation study.
▪ ViT-L is very big and tends to overfit.
▪ It is nontrivial to train a supervised ViT-L from scratch, and a good recipe with
strong regularization is needed.
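For reference, linear probing trains only a classifier on top of frozen features. A minimal sketch, assuming a 1024-d ViT-L feature; the extra non-affine BatchNorm before the classifier follows the paper's linear-probing recipe, everything else here is illustrative.

    import torch.nn as nn

    def linear_probe_head(encoder, feat_dim=1024, num_classes=1000):
        """Linear probing sketch: freeze the pre-trained encoder, train only a linear head."""
        for p in encoder.parameters():
            p.requires_grad = False        # the encoder stays frozen
        # an extra BatchNorm without affine parameters, then the linear classifier
        return nn.Sequential(nn.BatchNorm1d(feat_dim, affine=False),
                             nn.Linear(feat_dim, num_classes))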
21. Masking Ratio
• The optimal ratios are surprisingly high.
• 75% is good for both linear probing and
fine-tuning.
• This is in contrast with BERT (15%) and also
much higher than those in related works in
CV (20%~50%).
• Reasoning-like behavior is linked to the
learning of useful representations.
• For linear probing, the accuracy increases
steadily until the sweet point, but for fine-
tuning, the results are less sensitive to the
ratios.
• All fine-tuning results are better than
training from scratch (82.5%).
22. Decoder Design
• A sufficiently deep decoder is important for linear probing. This can be explained
by the gap between a pixel reconstruction task and a recognition task: the last
several layers in an autoencoder are more specialized for reconstruction, but are
less relevant for recognition.
• However, if fine-tuning is used, the last layers of the encoder can be tuned to
adapt to the recognition task. The decoder depth is less influential.
• Interestingly, MAE with a single-block decoder can perform strongly with fine-
tuning (84.8%). Note that a single Transformer block is the minimal requirement
to propagate information from visible tokens to mask tokens.
• Overall, the default MAE decoder is lightweight. It has 8 blocks and a width of 512-d.
It only has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d).
23. Mask Token
• If the encoder uses mask tokens, it
performs worse.
• In this case, there is a gap between
pre-training and deploying: this
encoder has a large portion of mask
tokens in its input in pretraining,
which does not exist in uncorrupted
images. This gap may degrade
accuracy in deployment.
24. Mask Token
• By skipping the mask token in the encoder,
training computation is greatly reduced.
• Note that the speedup can be >4x for a
masking ratio of 75%, partially because the
self-attention complexity is quadratic.
• In addition, memory is greatly reduced, which
can enable training even larger models or
speeding up more by large-batch training.
• The time and memory efficiency makes MAE
favorable for training very large models.
25. Reconstruction Target
• Using pixels with normalization improves accuracy. This per-patch normalization
enhances the contrast locally (a short sketch follows after this slide).
• In another variant, the authors perform PCA in the patch space and use the largest PCA
coefficients (96 here) as the target. Doing so degrades accuracy.
• The authors also compare an MAE variant that predicts tokens, the target used in BEiT.
Specifically for this variant, the DALLE pre-trained dVAE is used as the tokenizer, following
BEiT.
• This tokenization improves fine-tuning accuracy vs. unnormalized pixels, but has no
advantage vs. normalized pixels.
• The dVAE tokenizer requires one more pre-training stage, which may depend on extra
data (250M images). The dVAE encoder is a large convolutional network (40% FLOPs of
ViT-L) and adds nontrivial overhead. Using pixels does not suffer from these problems.
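The per-patch normalized target mentioned above can be sketched as follows; the epsilon value is an arbitrary choice for numerical stability, and the shapes are illustrative.

    def normalized_pixel_target(patches, eps=1e-6):
        """Normalize each patch by its own mean and std so the target emphasizes local contrast.

        patches: (batch, num_patches, patch_pixels) raw pixel values.
        """
        mean = patches.mean(dim=-1, keepdim=True)
        var = patches.var(dim=-1, keepdim=True)
        return (patches - mean) / (var + eps).sqrt()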
26. Data Augmentation
• MAE works well using cropping-only augmentation, either fixed-size or
random-size (both with random horizontal flipping); see the sketch after this slide.
• Adding color jittering degrades the results.
• Surprisingly, MAE behaves decently even if using no data augmentation
(only center-crop, no flipping). This property is dramatically different from
contrastive learning and related methods, which heavily rely on data
augmentation.
• In MAE, the role of data augmentation is mainly performed by random
masking. The masks are different for each iteration and so they generate
new training samples regardless of data augmentation.
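A cropping-only pipeline of this kind could be written with torchvision as below; the crop scale range and image size are illustrative assumptions, not values taken from the paper.

    from torchvision import transforms

    # cropping-only recipe: random-size crop plus horizontal flip, nothing else
    pretrain_aug = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.2, 1.0),
                                     interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])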
27. Masking Sampling Strategy
• MAE with block-wise masking works reasonably well at a ratio of 50%,
but degrades at a ratio of 75%.
• Grid-wise sampling regularly keeps one of every four patches; this is an easier
task and has lower training loss. The reconstruction is
sharper, but the representation quality is lower (a small sketch follows after this slide).
• Simple random sampling works the best for MAE. It allows for a
higher masking ratio, which provides a greater speedup benefit.
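For contrast with the random strategy sketched earlier, grid-wise sampling that keeps one of every four patches might look like this; it assumes a square patch grid and is illustrative only.

    import torch

    def grid_masking(patch_tokens, grid=14):
        """Grid-wise sampling sketch: keep one of every four patches in a regular pattern.

        Assumes a square grid of `grid` x `grid` patches (e.g., 14 x 14 for a 224 image
        with 16-pixel patches).
        """
        B, N, D = patch_tokens.shape
        rows = torch.arange(grid).unsqueeze(1).expand(grid, grid)
        cols = torch.arange(grid).unsqueeze(0).expand(grid, grid)
        keep = ((rows % 2 == 0) & (cols % 2 == 0)).flatten()   # 25% kept, regularly spaced
        return patch_tokens[:, keep, :]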
28. Training Schedule
• The accuracy improves steadily with longer
training. Indeed, saturation of linear probing
has not been observed even at 1600 epochs.
• This behavior is unlike contrastive learning
methods, e.g., MoCo v3 saturates at 300
epochs for ViT-L.
• Note that the MAE encoder only sees 25% of
patches per epoch, while in contrastive
learning the encoder sees 200% (two crop) or
even more (multi-crop) patches per epoch.
29. Comparisons with Self-supervised Methods
• For ViT-B, all methods perform closely.
• For ViT-L, the gaps among methods are bigger,
suggesting that a challenge for bigger models is to
reduce overfitting.
• MAE can scale up easily and has shown steady
improvement from bigger models.
• By fine-tuning with a 448 size, MAE achieves 87.8%
accuracy, using only IN1K data. The previous best
accuracy, among all methods using only IN1K data,
is 87.1% (512 size), based on advanced networks.
• Comparing with BEiT, MAE is more accurate while
being simpler and faster.
30. Comparisons with Supervised Pre-training
• In the original ViT paper, ViT-L degrades
when trained on IN1K. The authors'
improved supervised recipe works better
for training from scratch, but the accuracy
saturates.
• MAE pre-training, using only IN1K, can
generalize better: the gain over training
from scratch is bigger for higher-capacity
models.
• It follows a trend similar to the JFT-300M
supervised pre-training. This comparison
shows that MAE can help scale up model
sizes.
31. Partial Fine-tuning
• Table 1 shows that linear probing and fine-tuning results are largely uncorrelated.
• Linear probing has been a popular protocol in the past few years; however, it
misses the opportunity of pursuing strong but non-linear features—which is
indeed a strength of deep learning.
• As a middle ground, the authors study a partial fine-tuning protocol: fine-tune the last
several layers while freezing the others.
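A minimal sketch of this protocol, assuming a timm-style ViT that exposes a `.blocks` list and a `.head` classifier; these attribute names are assumptions, not the authors' code.

    def freeze_all_but_last_blocks(vit, num_tunable=1):
        """Partial fine-tuning sketch: freeze everything, then unfreeze the last
        `num_tunable` Transformer blocks and the classification head."""
        for p in vit.parameters():
            p.requires_grad = False
        for block in vit.blocks[-num_tunable:]:    # assumes a `.blocks` list, timm-style
            for p in block.parameters():
                p.requires_grad = True
        for p in vit.head.parameters():            # assumes a `.head` classifier
            p.requires_grad = True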
32. Partial Fine-tuning
• Notably, fine-tuning only one Transformer block
boosts the accuracy significantly, from 73.5% to
81.0%. Moreover, fine-tuning only “half” of
the last block (i.e., its MLP sub-block) gives
79.1%, much better than linear probing.
• Compared with MoCo v3, a contrastive
method, MoCo v3 has higher linear probing
accuracy than MAE; however, all of its partial fine-
tuning results are worse than MAE's.
• These results show that the MAE representations
are less linearly separable, but they are stronger
non-linear features and perform well when a non-
linear head is tuned. These observations suggest
that linear separability is not the sole metric for
evaluating representation quality.
33. Transfer Learning Experiments
• Object detection and instance segmentation
▪ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted for use with
FPN.
• Semantic segmentation
▪ Experiments on ADE20K use UperNet following the code in BEiT.
34. Transfer Learning Experiments
• Pixels vs. tokens
▪ While using dVAE tokens is better than using unnormalized pixels, it is
statistically similar to just using normalized pixels across all tasks and models.
It again shows that tokenization is not necessary for MAE.
35. Discussion and Conclusion
• Simple algorithms that scale well are the core of deep learning.
• In NLP, simple self-supervised learning methods enable benefits from
exponentially scaling models. In computer vision, practical pre-
training paradigms are dominantly supervised despite progress in
self-supervised learning.
• Self-supervised learning in vision may now be embarking on a similar
trajectory as in NLP.
36. Discussion and Conclusion
• On the other hand, images and languages are signals of a different
nature and this difference must be addressed carefully. Images are
merely recorded light without a semantic decomposition into the
visual analogue of words.
• MAE removes random patches that most likely do not form semantic
segments, and it reconstructs pixels, which are not semantic entities.
Nevertheless, MAE infers complex, holistic reconstructions, suggesting
it has learned numerous visual concepts, i.e., semantics.
• The authors hypothesize that this behavior occurs by way of a rich
hidden representation inside the MAE.