Modern CNN Architectures and Training Methods Boost Self-Supervised Visual Representation Learning
1. 2nd February 2020
PR12 Paper Review
Ho Seong Lee (hoya012)
Cognex Deep Learning Lab KR
CVPR 2019
2. Contents
• Introduction
• Self-Supervised Study Setup
• Architectures of CNN models
• Self-supervised techniques in this study
• Evaluation
• Datasets
• Experiments and Results
• Conclusion
3. Before We Start...
[PR-208] Unsupervised Visual Representation Learning Overview: Toward Self-Supervision
• Video Link: https://youtu.be/eDDHsbMgOJQ
• I highly recommend watching the video above (PR-208) before listening to this presentation!
4. Introduction
“Revisiting Self-Supervised Visual Representation Learning”, CVPR 2019
• Many pretext tasks for self-supervised learning have been studied
• But their performance is still lower than that of the supervised setting
• Other important aspects, such as the CNN architecture, have not received equal attention
5. “Revisiting Self-Supervised Visual Representation Learning”, CVPR 2019
• Other important aspects, such as the CNN architecture, have not received equal attention
• So the authors revisit previously proposed self-supervised models and conduct a large-scale study
Introduction
6. 3.1. Architectures of CNN models
• A large fraction of self-supervised visual representation learning approaches use AlexNet
• Instead, this study employs modern network architectures:
• ResNet50, with pre-logits of size 512*k
• RevNet (the Reversible ResNet), though without using G as in the RealNVP paper (see the coupling sketch below)
• VGG19 with batch normalization; the initial conv layer has 8*k channels and the FC layers have 512*k channels
Self-Supervised Study Setup
Why use an old-fashioned architecture?!
Reference: The Reversible Residual Network: Backpropagation Without Storing Activations, NIPS 2017
[Figure: ResNet vs. RevNet block diagrams]
Widening factor k, with k ∈ {4, 8, 12, 16}
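For intuition, here is a minimal sketch of the standard RevNet additive coupling, shown with both residual functions F and G for reference (hypothetical PyTorch code, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style additive coupling: split the input channel-wise into
    (x1, x2), then compute y1 = x1 + F(x2) and y2 = x2 + G(y1).
    The block is exactly invertible, so activations can be recomputed
    during backprop instead of being stored."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g  # small residual sub-networks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)  # split channels in half
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = torch.chunk(y, 2, dim=1)
        x2 = y2 - self.g(y1)  # undo the second coupling
        x1 = y1 - self.f(x2)  # undo the first coupling
        return torch.cat([x1, x2], dim=1)
```

Because the inverse exists in closed form, intermediate activations can be recomputed during the backward pass instead of stored, which is the memory saving RevNet trades extra compute for.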
7. 3.2. Self-supervised techniques in this study
• Four self-supervised techniques are used in the experiments (a sketch of the first follows below)
• Rotation
• Exemplar
• Jigsaw
• Relative Patch Location
Self-Supervised Study Setup
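As an illustration of the simplest of the four, here is a minimal sketch of the Rotation pretext task (hypothetical PyTorch code, not the authors' implementation): each image is rotated by 0/90/180/270 degrees and the network is trained to predict which rotation was applied.

```python
import torch

def rotation_pretext_batch(images: torch.Tensor):
    """Build a rotation-prediction batch from images of shape (B, C, H, W).

    Returns all four rotated copies of the batch together with rotation
    class labels in {0, 1, 2, 3} for 0/90/180/270 degrees."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))  # rotate k * 90 degrees
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Usage: feed x through the CNN plus a 4-way classification head and train
# with standard cross-entropy; no human labels are needed.
x, y = rotation_pretext_batch(torch.randn(8, 3, 224, 224))
```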
8. 3.3. Evaluation
• Follow the common protocol: train a linear logistic regression model to solve the multi-class classification task (a sketch follows below)
• Extract the representation from the frozen network at the pre-logit level
• Train the logistic regression with L-BFGS, except in Table 2
• For consistency and a fair comparison, Table 2 uses SGD with momentum and data augmentation
Self-Supervised Study Setup
[Table 2]
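A minimal sketch of this linear evaluation protocol (assuming pre-computed pre-logit features saved to hypothetical .npy files; scikit-learn's L-BFGS-based LogisticRegression stands in for the paper's exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-computed pre-logit features from the frozen network.
train_feats = np.load("train_prelogits.npy")  # shape (N, 512 * k)
train_labels = np.load("train_labels.npy")    # shape (N,)
val_feats = np.load("val_prelogits.npy")
val_labels = np.load("val_labels.npy")

# Multi-class linear logistic regression on frozen features (L-BFGS solver);
# the backbone is never updated, only this linear model is trained.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(train_feats, train_labels)
print("downstream accuracy:", clf.score(val_feats, val_labels))
```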
9. 3.4. Datasets
• ImageNet (train + validation)
• To avoid overfitting, the authors use their own validation split (50,000 random images from the training split) for all studies except in Table 2 (a split sketch follows below)
• All self-supervised models are trained on ImageNet (without labels)
• Places205 (validation only)
• Qualitatively different from ImageNet → a good candidate for evaluating how well the learned representations generalize to new, unseen data of a different nature
• Same procedure as for ImageNet regarding validation splits (random splitting)
Self-Supervised Study Setup
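A minimal sketch of carving out such a custom validation split (hypothetical torchvision/PyTorch code; the paper's exact sampling may differ):

```python
import torch
from torch.utils.data import random_split
from torchvision.datasets import ImageFolder

train_set = ImageFolder("imagenet/train")  # placeholder path

# Hold out 50,000 random training images as a custom validation split,
# keeping the official validation set untouched for the final comparison.
generator = torch.Generator().manual_seed(0)  # reproducible split
train_subset, custom_val = random_split(
    train_set, [len(train_set) - 50_000, 50_000], generator=generator
)
```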
10. 4.1. Evaluation on ImageNet and Places205
• Measure the representation quality produced by 6 different CNNs with various widening factors
• Increasing the number of channels improves the performance of self-supervised models
Experiments and Results
[Results table, annotated: widening factor; random-initialization baseline; variant without ReLU before the GAP layer]
11. 4.1. Evaluation on ImageNet and Places205
• Neither is the ranking of architectures consistent across different methods, nor is the ranking of methods consistent across architectures
• The ranking on Places205 is consistent with that on ImageNet → the learned representations generalize to a new dataset
• VGG19-BN consistently demonstrates the worst performance, even though it achieves performance similar to ResNet50 on standard vision benchmarks (fully supervised setting)
Experiments and Results
Best architecture per method:
Rotation → RevNet50
Exemplar → ResNet50 v1
Rel. Patch Loc. → ResNet50 v1
Jigsaw → ResNet50 v1
VGG19-BN → worst performance in all cases
12. 4.2. Comparison to prior work
• For consistency and a fair comparison, Table 2 uses SGD with momentum and data augmentation
• By selecting the right architecture, the authors significantly outperform previously reported results
Experiments and Results
[Table 2: this study vs. previously reported results]
13. 4.3. A linear model is adequate for evaluation
• Consider an alternative evaluation scenario: use an MLP to solve the evaluation task
• Add a single hidden layer with 1000 channels, plus ReLU and dropout, so the evaluation model becomes non-linear (see the sketch below)
• The MLP provides only a marginal improvement over the linear evaluation
Experiments and Results
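A minimal sketch of such an MLP evaluation head (hypothetical PyTorch code; the 1000-unit hidden layer, ReLU, and dropout follow the slide, the remaining details are assumed):

```python
import torch.nn as nn

def mlp_eval_head(feature_dim: int, num_classes: int, p_drop: float = 0.5) -> nn.Module:
    """Non-linear evaluation head: a single 1000-channel hidden layer with
    ReLU and dropout, trained on frozen pre-logit features; the backbone
    itself is never updated."""
    return nn.Sequential(
        nn.Linear(feature_dim, 1000),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(1000, num_classes),
    )

head = mlp_eval_head(feature_dim=2048, num_classes=1000)  # e.g. ResNet50 features
```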
14. 4.4. Better performance on the pretext task does not always translate to better representations
• Performance on the pretext task is a good proxy for representation quality, but not always a reliable one
Experiments and Results
15. 4.5. Skip-connections prevent degradation of representation quality towards the end of CNNs
• VGG19-BN representations get worse towards the end of the network, but ResNet and RevNet representations do not
• The model specializes to the pretext task and discards more general semantic features in the later layers
• Skip-connections preserve the information learned in intermediate layers
Experiments and Results
16. 4.6. Model width and representation size strongly influence the representation quality
• Check whether the increase in performance is due to increased network capacity, to the use of higher-dimensional representations, or to the interplay of both
• Disentangle the network width from the representation size (pre-logit channels); a decoupling sketch follows below
• Increasing the widening factor consistently boosts performance in both the full- and low-data regimes.
Experiments and Results
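A minimal sketch of one way to decouple the two factors (hypothetical PyTorch code; the paper's exact mechanism for varying the pre-logit size is not reproduced here):

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Separate the backbone width (widening factor k) from the
    representation size: a linear map projects backbone features to an
    independently chosen pre-logit dimension, which is what the linear
    evaluation model sees."""

    def __init__(self, backbone: nn.Module, backbone_dim: int, repr_dim: int):
        super().__init__()
        self.backbone = backbone
        self.to_prelogits = nn.Linear(backbone_dim, repr_dim)  # representation-size knob

    def forward(self, x):
        return self.to_prelogits(self.backbone(x))
```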
17. 4.7. SGD for training linear model takes long time to converge
• Previous works use short training schedules
• Investigate the importance of the SGD optimization schedule when training the logistic regression
• The first learning-rate decay has a large influence on the final accuracy (a schedule sketch follows below)
Experiments and Results
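A minimal sketch of a long step-decay schedule of this kind (hypothetical epoch counts and decay points, chosen only to illustrate the point about the first decay):

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

linear_model = torch.nn.Linear(2048, 1000)  # the linear evaluation model
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.1, momentum=0.9)

# Long schedule with late step decays: decaying too early (as in the short
# schedules of prior work) freezes progress before the model has converged,
# so the placement of the first decay dominates final accuracy.
scheduler = MultiStepLR(optimizer, milestones=[300, 400, 450], gamma=0.1)

for epoch in range(500):
    # ... one training epoch over the frozen features ...
    scheduler.step()
```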
18. The authors revisit previously proposed self-supervised models and conduct a large-scale study
• Architecture design choices from the fully-supervised setting do not necessarily translate to the self-supervised setting (e.g. VGG19-BN)
• Architectures with skip-connections achieve consistently good results, in contrast to AlexNet
• The widening factor of CNNs has a drastic effect on the performance of self-supervised techniques
• SGD training of the linear logistic regression requires a very long time to converge
• The ranking of architectures does not translate into a ranking of methods, and vice versa
Conclusion