This document summarizes key points from papers on using cyclical learning rates for training neural networks. It discusses how cyclical learning rates can help address underfitting and overfitting by varying the learning rate over the course of training. The summary provides guidance on choosing learning rate ranges and cycle parameters to efficiently train models while balancing accuracy and convergence. It also discusses how other hyperparameters like batch size, momentum, and weight decay interact with cyclical learning rates.
2. DISCIPLINED APPROACH PAPER
• A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size,
Momentum, and Weight Decay
• There is no Part 2
• https://arxiv.org/abs/1803.09820
• Collection of empirical observations spread throughout the paper
3. CONVERGENCE / TEST-VAL LOSS
• Observe box in top-left corner of Figure 1(a)
• Shows training loss (red) decreasing and validation loss
(blue) decreasing then increasing.
• Plot to left of validation loss minimum indicates underfitting.
• Plot to right of validation loss minimum indicates overfitting.
• Reaching the horizontal part of the test/validation loss curve (the minimum) is the goal of hyperparameter tuning.
4. UNDERFITTING
• Underfitting is indicated by continuously decreasing
test loss rather than horizontal plateau (Fig 3(a)).
• Steepness of test loss curve indicates how well the
model is learning (Fig 3(b)).
5. OVERFITTING
• Increasing Learning Rate moves the model from underfitting
to overfitting.
• Blue curve (Fig 4a) shows steepest fall – indication that this
will produce better final accuracy.
• Yellow curve (Fig 4a) shows overfitting with LR > 0.006.
• More overfitting examples – blue curves in bottom figs.
• Blue curve (Fig 4b) shows underfitting.
• Red curve (Fig 4b) shows overfitting.
6. CYCLIC LEARNING RATE (CLR)
• Motivation: underfitting if the LR is too low, overfitting if too high; finding a good fixed LR otherwise requires a grid search
• CLR
• Specify upper and lower bound for LR
• Specify step size == number of iterations or epochs used for each step
• Cycle consists of 2 steps – first step LR increases linearly from min to max, second step LR decreases linearly from max to min (see the sketch after this list).
• Other variants tried but no significant benefit observed.
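• A minimal sketch of the triangular policy described above, assuming per-iteration control of the LR; the function name and the example step size are illustrative (the 1e-4 / 6e-3 bounds echo values quoted later in these notes).

```python
import numpy as np

def triangular_clr(iteration, step_size, base_lr, max_lr):
    """Triangular cyclical LR: one cycle spans 2 * step_size iterations;
    the LR rises linearly from base_lr to max_lr over the first step and
    falls back linearly over the second step."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)  # 1 at cycle ends, 0 at peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Illustrative values: two full cycles with step_size = 2000 iterations.
lrs = [triangular_clr(i, step_size=2000, base_lr=1e-4, max_lr=6e-3)
       for i in range(8000)]
```
• PyTorch also ships a ready-made scheduler with the same triangular shape (torch.optim.lr_scheduler.CyclicLR); the manual function above is mainly to make the schedule explicit.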
7. CLR – CHOOSE MAX AND MIN LR
• LR upper bound == min value of LR that causes test / validation loss to increase (and accuracy to
decrease)
• LR lower bound, one of:
• Factor of 3 or 4 less than upper bound.
• Factor of 10 or 20 less than upper bound if only 1 cycle is used.
• Find experimentally using short test of ~1000 iterations, pick largest that allows convergence (see the range-test sketch after this list).
• Step size – if training becomes unstable because the LR is too high, increase the step size; larger step sizes allow a larger gap between the max and min LR bounds.
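• A minimal sketch of the short (~1000-iteration) test mentioned above, assuming a PyTorch-style loop; model, train_loader, and criterion are placeholders supplied by the caller, and sweeping the LR geometrically is one common way to run the test.

```python
import torch

def lr_range_test(model, train_loader, criterion, min_lr=1e-5, max_lr=1.0,
                  num_iters=1000, device="cpu"):
    """Sweep the LR geometrically from min_lr to max_lr over num_iters
    mini-batches and record the training loss.  The LR just before the loss
    blows up suggests the upper bound; the largest LR that still converges
    suggests the lower bound."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=min_lr, momentum=0.9)
    gamma = (max_lr / min_lr) ** (1.0 / num_iters)  # per-iteration multiplier
    lrs, losses, lr = [], [], min_lr
    data_iter = iter(train_loader)
    for _ in range(num_iters):
        try:
            xb, yb = next(data_iter)
        except StopIteration:            # restart the loader if it runs out
            data_iter = iter(train_loader)
            xb, yb = next(data_iter)
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= gamma                      # geometric increase of the LR
        for group in optimizer.param_groups:
            group["lr"] = lr
    return lrs, losses
```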
8. SUPER CONVERGENCE
• Super convergence – some networks remain stable under
high LR, so can be trained very quickly with CLR with high
upper bound.
• Fig 5a shows super convergence (orange curve) training
faster to higher accuracy using large LR than blue curve.
• 1-cycle policy – one cycle that is smaller than the total number of iterations/epochs, then the remaining iterations with the LR lowered by several orders of magnitude.
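• A minimal sketch of the 1-cycle shape described above; the fraction of the budget given to the cycle (cycle_frac) and the final divisor are illustrative choices, not values from the paper.

```python
def one_cycle_lr(iteration, total_iters, base_lr, max_lr,
                 cycle_frac=0.9, final_div=1000.0):
    """1-cycle policy: the LR rises linearly from base_lr to max_lr over the
    first half of the cycle, falls back to base_lr over the second half, then
    decays well below base_lr for the remaining iterations."""
    cycle_iters = int(total_iters * cycle_frac)
    half = cycle_iters // 2
    if iteration < half:                                   # ramp up
        return base_lr + (max_lr - base_lr) * iteration / half
    if iteration < cycle_iters:                            # ramp down
        return max_lr - (max_lr - base_lr) * (iteration - half) / half
    # final phase: linear decay from base_lr down to base_lr / final_div
    frac = (iteration - cycle_iters) / max(1, total_iters - cycle_iters)
    return base_lr - (base_lr - base_lr / final_div) * frac
```
• PyTorch's built-in torch.optim.lr_scheduler.OneCycleLR implements a similar policy (cosine-shaped by default) if a hand-rolled schedule is not needed.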
9. REGULARIZATION
• Many forms of regularization
• Large Learning Rate
• Small batch size
• Weight decay (aka L2 regularization)
• Dropout
• Need to balance different regularizers for each dataset and architecture.
• Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning
needs to be balanced with lower WD.
• General guidance: reducing other forms of regularization and training with a high LR makes training more efficient.
10. BATCH SIZE
• Larger batch sizes permit larger LR using 1cycle schedule.
• Larger batch size may increase training time, so tradeoff
required.
• Tradeoff – use batch size so number of epochs is optimum
for data/model.
• Batch size limited by GPU memory.
• Fig 6a shows validation accuracy for different batch sizes.
Larger batch sizes better but effect tapers off (BS=1024
blue curve very close to BS=512 red curve).
11. (CYCLIC) MOMENTUM
• Set momentum as large as possible without causing instability.
• Constant LR => use large constant momentum (0.9 – 0.99)
• Cyclic LR => decrease cyclic momentum as cyclic LR increases during the early-to-middle part of training (0.95 – 0.85); see the sketch after this list.
• Fig 8a – blue curve is constant momentum, red curve is
decreasing CM and yellow curve is increasing CM (with
increasing CLR).
• These observations also carry over to deep networks (Fig 8b).
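• A minimal sketch of the decreasing cyclical momentum described above, mirroring the triangular LR cycle; the 0.95/0.85 range is the one quoted above, while the function name is illustrative.

```python
import numpy as np

def triangular_momentum(iteration, step_size, max_mom=0.95, min_mom=0.85):
    """Cyclical momentum run inversely to the triangular LR cycle: momentum
    sits at max_mom while the LR is at its minimum and drops to min_mom as
    the LR peaks."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)  # 1 at LR min, 0 at LR max
    return min_mom + (max_mom - min_mom) * min(1.0, max(0.0, x))
```
• With a PyTorch SGD optimizer, both schedules can be applied each iteration by setting optimizer.param_groups[g]["lr"] and optimizer.param_groups[g]["momentum"] before the optimizer step.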
12. WEIGHT DECAY
• Cyclical WD not useful, should remain constant throughout
training.
• Value should be found by grid search (ok with early termination); see the sketch after this list.
• Fig 9a shows loss plots for different values of WD (with LR=5e-3, mom=0.95).
• Fig 9b shows equivalent accuracy plots.
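• A minimal sketch of the grid search suggested above; eval_fn is a placeholder the caller supplies (train briefly with the given weight decay and return the validation loss), and the candidate values are illustrative.

```python
def grid_search_weight_decay(eval_fn, candidates=(1e-5, 3e-5, 1e-4, 3e-4, 1e-3)):
    """Try each weight decay with a short, early-terminated training run and
    keep the value that gives the lowest validation loss."""
    results = {wd: eval_fn(wd) for wd in candidates}
    best_wd = min(results, key=results.get)
    return best_wd, results
```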
13. CYCLIC LEARNING RATE PAPER
• Cyclical Learning Rates for Training Neural Networks
• https://arxiv.org/abs/1506.01186
• Describes CLR in depth and describes results of training common networks with CLR.
14. CYCLIC LEARNING RATE
• Successor to
• Learning rate schedules – varying LR exponentially over training.
• Adaptive Learning Rates (RMSProp, ADAM, etc) – change LR
based on values of gradients.
• Based on observation that increasing LR has short-term
negative effect but long-term positive effect.
• Let LR vary between range of values.
• Triangular LR (Fig 2) is usually good enough but other variants
also possible.
• Accuracy plot (Fig 1) shows CLR (red curve) performing better than exponential LR.
15. ESTIMATING CLR PARAMETERS
• Step size
• Step size = 2 to 10 × the number of iterations per epoch (see the worked example after this list)
• Number of training iterations per epoch = number of training records / batch size
• Upper and lower bounds for LR
• Run model for few epochs with some bounds (1e-4 to 2e-1 for
example)
• Upper bound == where accuracy stops increasing, becomes ragged, or
falls (~ 6e-3).
• Lower bound
• Either 1/3 or 1/4 of upper bound (~ 2e-3)
• Point at which accuracy starts to increase (~ 1e-3)
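• A worked example of the arithmetic above; the dataset size and batch size are illustrative, while the ~6e-3 / ~2e-3 bounds echo the example values quoted in the bullets.

```python
num_records = 50_000                          # illustrative training set size
batch_size = 100
iters_per_epoch = num_records // batch_size   # 50,000 / 100 = 500 iterations
step_size = 5 * iters_per_epoch               # pick 2-10x; here 5x -> 2500 iterations

max_lr = 6e-3         # where accuracy stopped improving in the range test
base_lr = max_lr / 3  # lower bound at 1/3 of the upper bound, i.e. 2e-3
```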
16. LR FINDER USAGE
• LR Finder – first available in Fast.AI library.
• Upper bound – between 1e-3 and 1e-2 (10⁻³ and 10⁻²) where loss is decreasing fastest.
• Can also be found using lr.plot_loss_change() – minimum point (here 1e-2).
• Lower bound is about 1-2 orders of magnitude lower.
• LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder
• LR Finder (Pytorch) -- https://github.com/davidtvs/pytorch-lr-finder
• Keras example -- https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb
• Fast.AI example -- https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
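• A hedged usage sketch for the pytorch-lr-finder package linked above (pip install torch-lr-finder); the tiny model and synthetic data exist only to make the snippet self-contained, and the exact API (LRFinder, range_test, plot, reset) should be checked against the repo's README for the installed version.

```python
# Assumes the davidtvs/pytorch-lr-finder package linked above is installed;
# the model and data below are synthetic placeholders for illustration.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch_lr_finder import LRFinder

X = torch.randn(1024, 20)                      # synthetic features
y = torch.randint(0, 2, (1024,))               # synthetic binary labels
train_loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

lr_finder = LRFinder(model, optimizer, criterion, device="cpu")
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)  # sweep the LR
lr_finder.plot()    # upper bound: LR where the loss is falling fastest
lr_finder.reset()   # restore model and optimizer to their initial state
```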