Visions of Scaling: An Empirical Study of Training and Regularization Methods for Vision Models
PR-373, NeurIPS 2021
Sunghoon Joo, VUNO Inc.


2022. 2. 27.

Contents
1. Research Background
2. Experimental Results & Methodology
3. Conclusions

1. Research Background (slide 3)
The performance of a vision model is determined by its architecture, its training methods, and its scaling strategy.
• The paper points out that new architectures trained with the latest training methodology are typically compared against older architectures trained with outdated training methodology.

1. Research Background (slide 4)
Objective: provide new perspectives and practical advice on scaling vision architectures.
• Apply modern training and regularization techniques to ResNet, with a large epoch budget.
• Re-scaled (depth, width, resolution) ResNet: ResNet-RS.

1. Research Background (slide 5)
Previous works
• Architecture
  • Hand-designed backbones: VGG, Inception, ResNet, ResNeXt
  • Architectures found with automated architecture search [67 (NASNet), 41 (AmoebaNet, 83.9%), 55 (EfficientNet-B7, 84.4%, 2019)]
  • Adapting self-attention to the visual domain: AA-ResNet-152 (79.1%, 2019), ViT-L/16 (87.76±0.03%, 2020), LambdaResNet200 (84.3%, 2021)

1. Research Background (slide 6)
Previous works
• Architecture - EfficientNet (EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML 2019; PR-169)
  • Compound scaling of depth, width, and resolution
  • Weaknesses this paper points out:
    • Small epoch budget
    • Baseline network (EfficientNet-B0) designed by optimizing for FLOPs
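
For reference, a minimal sketch of what compound scaling computes: a single coefficient phi jointly scales depth, width, and input resolution at fixed relative rates. The function name is illustrative; the alpha/beta/gamma defaults are the values reported in the EfficientNet paper.

```python
def compound_scale(phi: float, alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """EfficientNet-style compound scaling: one coefficient phi jointly scales
    network depth, width, and input resolution at fixed relative rates."""
    return alpha ** phi, beta ** phi, gamma ** phi  # depth, width, resolution multipliers

print(compound_scale(2.0))  # multipliers for an illustrative phi
```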

1. Research Background (slide 7)
Previous works
• Training / regularization methodology
  • Innovations in training (e.g. improved learning rate schedules)
  • Regularization methods: dropout, label smoothing, stochastic depth, DropBlock, data augmentation
• Regularization methods have become especially useful to prevent overfitting when training ever-larger models on limited data (e.g. the 1.2M ImageNet images).
• Label smoothing (figure from https://arxiv.org/pdf/2011.12562.pdf)
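
As a pointer to what label smoothing does in practice, here is a minimal NumPy sketch of a label-smoothed cross-entropy; the function name and toy batch are illustrative, and eps = 0.1 matches the value used later in the slides.

```python
import numpy as np

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy with label smoothing: the one-hot target is mixed with a
    uniform distribution, discouraging over-confident predictions."""
    num_classes = logits.shape[-1]
    # log-softmax of the logits
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # smoothed target: (1 - eps) on the true class, eps spread over all classes
    smooth = np.full_like(log_probs, eps / num_classes)
    smooth[np.arange(len(target)), target] += 1.0 - eps
    return -(smooth * log_probs).sum(axis=-1).mean()

# toy usage: batch of 2 samples, 5 classes
loss = label_smoothing_ce(np.random.randn(2, 5), np.array([0, 3]))
```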

1. Research Background (slide 8)
Previous works
• Additional training data: pre-training on large-scale datasets
  • 76.4 -> 79.2 (Revisiting Unreasonable Effectiveness of Data in Deep Learning Era, ICCV 2017)
  • NFNets, 89.2% (Brock, A. et al., High-Performance Large-Scale Image Recognition Without Normalization, 2021): ResNet-based, pre-trained on JFT-300M, batch normalization replaced with Adaptive Gradient Clipping (AGC)
  • ViT-L/16, 88.6% (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021): Transformer, pre-trained on ImageNet-21k (14M images, ~10x ImageNet) and JFT-300M

1. Research Background (slide 9)
Previous works
• Additional training data: semi-supervised learning
  • Noisy Student (EfficientNet-L2), 88.4%, CVPR (Self-training with Noisy Student improves ImageNet classification)
  • Meta Pseudo Labels (EfficientNet-L2), 90.2% (PR-336)
2. Experimental Results & Methodology

2. Experimental Results & Methodology (slide 11)
Additive Study of Improvements
• Top-1 accuracy of 82.2% (+3.2%) is achieved without any architecture change.
• Baseline ResNet-200: 256x256 input, 90 epochs, stepwise learning rate decay
• Hyperparameter selection: a held-out validation set comprising 2% of the ImageNet training set (minival-set)
• Increasing the training epochs (90 -> 350) is useful, but only once regularization has been applied.
Top-1 accuracy: ImageNet validation set, averaged over 2 runs. Improvements are grouped into Training / Regularization / Architecture.
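
A minimal sketch of the 2% held-out "minival" split used for hyperparameter selection, as described above; the function and file names are placeholders, not the authors' code.

```python
import random

def make_minival_split(train_image_paths, frac=0.02, seed=0):
    """Hold out ~2% of the ImageNet training images as a 'minival' set for
    hyperparameter selection, leaving the rest for training."""
    rng = random.Random(seed)
    paths = list(train_image_paths)
    rng.shuffle(paths)
    cut = int(len(paths) * frac)
    return paths[cut:], paths[:cut]  # (train split, minival split)

# usage with placeholder file names
train, minival = make_minival_split([f"img_{i}.jpeg" for i in range(1000)])
```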

2. Experimental Results & Methodology (slide 12)
Additive Study of Improvements
• Top-1 accuracy of 82.2% (+3.2%) is achieved without any architecture change.
Baseline ResNet-200: 256x256 input, 90 epochs, stepwise learning rate decay
Regularization and data augmentation
• Dropout: applied after the global average pooling layer
• Decreased weight decay: results differ depending on whether it is applied.
• Stochastic depth (G. Huang et al., ECCV 2016)
• RandAugment: random image transformations (e.g. translate, shear, color distortions)
• Momentum with a cosine learning rate schedule
Top-1 accuracy: ImageNet validation set, averaged over 2 runs.
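
A minimal sketch of stochastic depth as referenced above: during training the residual branch is randomly skipped. The wrapper and its default drop rate of 0.2 (the value slide 15 reports for large resolutions) are illustrative, not the ResNet-RS implementation.

```python
import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    """Residual block wrapper that randomly drops its residual branch during
    training (Huang et al., 2016); the block body is a placeholder."""
    def __init__(self, block: nn.Module, drop_rate: float = 0.2):
        super().__init__()
        self.block = block
        self.drop_rate = drop_rate

    def forward(self, x):
        if self.training and torch.rand(()) < self.drop_rate:
            return x                                  # skip the branch entirely
        out = self.block(x)
        if self.training:
            out = out / (1.0 - self.drop_rate)        # keep the expected output scale
        return x + out
```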

2. Experimental Results & Methodology (slide 13)
Additive Study of Improvements
• Top-1 accuracy of 82.2% (+3.2%) is achieved without any architecture change.
• Adding simple architecture changes (SE, ResNet-D) raises top-1 accuracy to 83.4% (+4.4%).
Baseline ResNet-200: 256x256 input, 90 epochs, stepwise learning rate decay
Architecture changes
• ResNet-D (PR-201)
• Squeeze-and-Excitation (Hu, J. et al., 2020)
Top-1 accuracy: ImageNet validation set, averaged over 2 runs.
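
For reference, a minimal Squeeze-and-Excitation block: global-average-pool the feature map, pass it through a small bottleneck, and rescale the channels. The reduction ratio of 4 is an illustrative choice, not necessarily the exact ResNet-RS setting.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel-attention block (Hu et al.): squeeze with global average pooling,
    then excite by gating each channel with a learned weight in (0, 1)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)  # squeeze: global average pooling
        s = torch.relu(self.fc1(s))
        s = torch.sigmoid(self.fc2(s))        # per-channel gates
        return x * s                          # excite: rescale the input channels
```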

2. Experimental Results & Methodology (slide 14)
Importance of decreasing weight decay when combining regularization methods
(RA: RandAugment, LS: Label Smoothing, DO: Dropout on FC, SD: Stochastic Depth)
• When only RA or LS is applied, the default weight decay (1e-4) does not need to be changed.
• When more regularization is added, performance does not improve unless the weight decay is lowered.
-> When several regularizers are combined, the overall amount of regularization must be tuned so the model is not over-regularized.
Top-1 accuracy: ImageNet validation set, averaged over 2 runs.

2. Experimental Results & Methodology (slide 15)
Extensive search on ImageNet over scaling strategies
• Width multipliers in [0.25, 0.5, 1.0, 1.5, 2.0], depths in [26, 50, 101, 200, 300, 350, 400], and resolutions in [128, 160, 224, 320, 448].
• 350 epochs; regularization is increased with model size in an effort to limit overfitting.
• Smaller models follow a power-law trend; at equal FLOPs, accuracy still depends on image resolution -> this motivates slow image resolution scaling.
Regularization used in the search (see the sketch below):
• RandAugment: magnitude 10 for filter multipliers in [0.25, 0.5] or image resolution in [64, 160]; 15 for image resolution in [224, 320]; 20 otherwise.
• Stochastic depth: drop rate of 0.2 for image resolutions 224 and above; not applied for filter multiplier 0.25 or images smaller than 224.
• All models use label smoothing of 0.1 and weight decay of 4e-5.
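
The regularization schedule above can be written as a small lookup; the function name is mine, and the rules simply transcribe the bullets on this slide.

```python
def regularization_for(width_mult: float, resolution: int):
    """Return (RandAugment magnitude, stochastic depth rate) for one search point.
    Label smoothing 0.1 and weight decay 4e-5 are fixed for all models."""
    if width_mult in (0.25, 0.5) or resolution <= 160:
        ra_magnitude = 10
    elif resolution <= 320:
        ra_magnitude = 15
    else:
        ra_magnitude = 20
    # stochastic depth only at resolution 224+ and filter multipliers above 0.25
    sd_rate = 0.2 if (resolution >= 224 and width_mult > 0.25) else 0.0
    return ra_magnitude, sd_rate
```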

2. Experimental Results & Methodology (slide 16)
Depth Scaling in Regimes Where Overfitting Can Occur
• Depth scaling improves performance under long-epoch training, while width scaling helps under short-epoch training.
• Width scaling tends to overfit: widening the model increases the parameter count much faster.
• The paper notes that previous works chose width scaling even in training regimes of around 40 epochs:
  • BiT (2019) scales the width of ResNet-152 with a 4x filter multiplier.
  • NFNet (2021) scales the width with a ~1.5x filter multiplier.
Search grid: image resolution [128, 160, 224, 320], depth [101, 200, 300, 400], width multiplier [1.0, 1.5, 2.0] (enumerated in the sketch below).
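
The grid on this slide can be enumerated directly; a small sketch under the assumption that each (resolution, depth, width) combination is trained independently for the stated 350 epochs.

```python
from itertools import product

resolutions = [128, 160, 224, 320]
depths = [101, 200, 300, 400]
width_multipliers = [1.0, 1.5, 2.0]

# every candidate model in this slice of the scaling-strategy search
configs = [dict(resolution=r, depth=d, width=w)
           for r, d, w in product(resolutions, depths, width_multipliers)]
print(len(configs))  # 48 configurations
```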

2. Experimental Results & Methodology (slide 17)
Comparing EfficientNets against ResNet-RS (re-scaled) on a speed-accuracy Pareto curve
• FLOPs do not reflect actual training and inference latency.
• On GPUs/TPUs, memory access cost has a larger impact on latency, and FLOPs fail to capture it.
• EfficientNet's depthwise convolutions operate on larger activations, so it pays a latency penalty on GPU/TPU relative to ResNet: more memory consumed -> more memory accesses -> higher latency.
• ResNet-RS-350 has 1.8x more FLOPs, yet 2.7x lower TPU-v3 latency.
• ResNet-RS-350 has 3.8x more parameters, yet uses 2.3x less memory.
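
A toy per-layer cost model (not the paper's measurement methodology) that illustrates why FLOPs understate latency: a depthwise convolution cuts multiply-adds by roughly the channel count, but the activation memory it must read and write is unchanged.

```python
def conv_cost(h, w, c_in, c_out, k, depthwise=False):
    """Rough cost of one convolution: multiply-adds vs. output activations."""
    groups = c_in if depthwise else 1
    flops = h * w * c_out * (c_in // groups) * k * k  # multiply-adds
    activations = h * w * c_out                       # values written to memory
    return flops, activations

dense = conv_cost(56, 56, 256, 256, 3)               # ~1.85e9 FLOPs, ~8.0e5 activations
dw = conv_cost(56, 56, 256, 256, 3, depthwise=True)  # ~7.2e6 FLOPs, ~8.0e5 activations
```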

2. Experimental Results & Methodology (slide 18)
Comparing EfficientNets against ResNet-RS (re-scaled) on a speed-accuracy Pareto curve
• The key to ResNet-RS is scaling image resolution more slowly than width and depth, which reduces the number of activations and, in turn, memory use.

2. Experimental Results & Methodology (slide 19)
Improving the scaling of EfficientNet
• The slow image resolution scaling strategy is applied to EfficientNets (EfficientNet-RS) without changing width or depth, improving their performance.
• This argues that EfficientNet's strategy of jointly increasing model depth, width, and resolution at a constant rate is sub-optimal.

2. Experimental Results & Methodology (slide 20)
ResNet-RS performance in a large-scale semi-supervised learning setup
• In the semi-supervised setup, ResNet-RS reaches high accuracy while delivering 4.7x - 5.5x faster inference than EfficientNet.
• ResNet-RS is trained on 1.3M labeled ImageNet images plus 130M pseudo-labeled images (as in Noisy Student).
• The 130M pseudo-labeled images are generated with the EfficientNet-L2 model (88.4%) from the Noisy Student paper.
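
A minimal sketch of the pseudo-labeling step described above, assuming a trained teacher model and a loader of unlabeled images; in the paper the teacher is the Noisy Student EfficientNet-L2 and the unlabeled pool has 130M images, neither of which is reproduced here.

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, device="cpu"):
    """Assign hard pseudo-labels to unlabeled images with a trained teacher."""
    teacher.eval()
    pairs = []
    for images in unlabeled_loader:
        images = images.to(device)
        labels = teacher(images).argmax(dim=1)  # teacher's predicted classes
        pairs.extend(zip(images.cpu(), labels.cpu()))
    return pairs  # (image, pseudo_label) pairs to mix with the labeled data
```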

2. Experimental Results & Methodology (slide 21)
Transfer Learning to Downstream Tasks with ResNet-RS
• Supervised learning with the re-scaled training recipe outperforms SimCLR/SimCLRv2 (self-supervised learning).
• This contradicts the notion that self-supervised learning learns more universal representations than supervised learning.
• Experimental settings chosen for a fair comparison with SimCLR:
  • ResNet-RS: 400 epochs with cosine learning rate decay, data augmentation (RandAugment), label smoothing, dropout, and decreased weight decay, but no stochastic depth or exponential moving average (EMA) of the weights.
  • SimCLR/SimCLRv2: longer training (800 epochs) with cosine learning rate decay, a tailored data augmentation strategy, a tuned temperature parameter in the contrastive loss, and a tuned weight decay.
  • Vanilla ResNet architecture pre-trained on ImageNet.
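
For orientation only, a generic fine-tuning sketch for the downstream transfer described above; torchvision's resnet50 checkpoint stands in for a ResNet-RS checkpoint, and the head size and optimizer settings are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V1")           # stand-in pretrained backbone
backbone.fc = nn.Linear(backbone.fc.in_features, 10)   # new head for a 10-class downstream task
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)
# fine-tune all parameters on the downstream dataset (or freeze the backbone for a linear probe)
```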

2. Experimental Results & Methodology (slide 22)
Revised 3D ResNet for Video Classification
• Kinetics-400 video classification task
• Results mirror image classification: large gains are obtained without changing the architecture.

3. Conclusions (slide 24)
• Scaling and training strategies alone can deliver performance gains on par with those obtained from complex architectural changes.
• An empirical study of regularization techniques and training strategies that achieves strong performance (e.g. +3.2% top-1 ImageNet accuracy, +4.0% top-1 Kinetics-400 accuracy) without changing the model architecture.
• A new network scaling strategy:
  • (1) Scale depth when overfitting can occur (scaling width can be preferable otherwise).
  • (2) Scale the image resolution more slowly.
• FLOPs do not reflect actual training and inference latency.
• "Our work suggests that the field has myopically overemphasized architectural innovations at the expense of experimental diligence, and we hope it encourages further scrutiny in maintaining consistent methodology for both proposed innovations and baselines alike."
