LESLIE SMITH’S PAPERS
FOR DL JOURNAL CLUB
DISCIPLINED APPROACH PAPER
• A disciplined approach to neural network hyperparameters: Part 1 – Learning Rate, Batch Size,
Momentum, and Weight Decay
• There is no Part 2
• https://arxiv.org/abs/1803.09820
• Collection of empirical observations spread out through the paper
CONVERGENCE / TEST-VAL LOSS
• Observe box in top-left corner of Figure 1(a)
• Shows training loss (red) decreasing and validation loss
(blue) decreasing then increasing.
• The region to the left of the validation loss minimum indicates
underfitting.
• The region to the right of the validation loss minimum indicates
overfitting.
• Reaching the horizontal part of the test/validation loss curve
(the minimum) is the goal of hyperparameter tuning.
UNDERFITTING
• Underfitting is indicated by a continuously decreasing
test loss rather than a horizontal plateau (Fig 3(a)).
• The steepness of the test loss curve indicates how well the
model is learning (Fig 3(b)).
OVERFITTING
• Increasing the learning rate moves the model from underfitting
to overfitting.
• Blue curve (Fig 4a) shows the steepest fall – an indication that this
LR will produce better final accuracy.
• Yellow curve (Fig 4a) shows overfitting with LR > 0.006.
• More examples in the bottom figures:
• Blue curve (Fig 4b) shows underfitting.
• Red curve (Fig 4b) shows overfitting.
CYCLIC LEARNING RATE (CLR)
• Motivation: underfitting if the LR is too low, overfitting if too high; finding a good fixed LR requires a grid search
• CLR
• Specify upper and lower bound for LR
• Specify step size == number of iterations or epochs used for each step
• A cycle consists of 2 steps – in the first step the LR increases linearly from min to max, in the second it
decreases linearly from max to min (sketched below).
• Other variants tried but no significant benefit observed.
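To make the schedule concrete, here is a minimal sketch of the triangular policy in Python. The formula follows the CLR paper; the function and parameter names are my own:

import math

def triangular_lr(iteration, step_size, min_lr, max_lr):
    # Which cycle we are in (1-based); each cycle spans 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x goes 1 -> 0 -> 1 within a cycle, so (1 - x) goes 0 -> 1 -> 0.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)

# Two full cycles with 2000-iteration steps, LR cycling between 1e-3 and 6e-3.
lrs = [triangular_lr(i, step_size=2000, min_lr=1e-3, max_lr=6e-3)
       for i in range(8000)]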
CLR – CHOOSE MAX AND MIN LR
• LR upper bound == min value of LR that causes test / validation loss to increase (and accuracy to
decrease)
• LR lower bound, one of:
• A factor of 3 or 4 less than the upper bound.
• A factor of 10 or 20 less than the upper bound if only 1 cycle is used.
• Found experimentally with a short test of ~1000 iterations: pick the largest LR that allows convergence (see the sketch below).
• Step size – if the LR is too high, training becomes unstable; increase the step size to allow a larger difference between
the max and min LR bounds.
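A self-contained toy version of that ~1000-iteration test. Linear regression on synthetic data stands in for the real model; swap in your own model, data, and loss:

import torch
import torch.nn as nn

def converges(lr, num_iters=1000):
    """Train a small stand-in model at a fixed LR; report whether loss fell."""
    torch.manual_seed(0)
    X = torch.randn(512, 20)
    y = X @ torch.randn(20, 1)
    model = nn.Linear(20, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    initial = nn.functional.mse_loss(model(X), y).item()
    for _ in range(num_iters):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return torch.isfinite(loss).item() and loss.item() < initial

# Pick the largest candidate LR that still converges as the lower bound.
candidates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]
lower_bound = max(lr for lr in candidates if converges(lr))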
SUPER CONVERGENCE
• Super convergence – some networks remain stable under
high LR, so they can be trained very quickly using CLR with a
high upper bound.
• Fig 5a shows super convergence: the orange curve trains
faster to higher accuracy using a large LR than the blue curve.
• 1cycle policy – one cycle that is slightly shorter than the total
number of iterations/epochs, with the remaining iterations run
at an LR lowered by several orders of magnitude (sketched below).
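A sketch of the 1cycle schedule just described. The 90% cycle fraction and the final divisor are illustrative choices, not values from the paper:

def one_cycle_lr(iteration, total_iters, min_lr, max_lr, final_div=1000.0):
    cycle_iters = int(0.9 * total_iters)  # cycle slightly shorter than training
    step = max(1, cycle_iters // 2)
    if iteration < step:                  # first step: ramp min_lr -> max_lr
        return min_lr + (max_lr - min_lr) * iteration / step
    if iteration < cycle_iters:           # second step: ramp max_lr -> min_lr
        return max_lr - (max_lr - min_lr) * (iteration - step) / step
    # Tail: anneal from min_lr down to min_lr / final_div over the remainder.
    frac = (iteration - cycle_iters) / max(1, total_iters - cycle_iters)
    return min_lr * (1 - frac) + (min_lr / final_div) * frac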
REGULARIZATION
• Many forms of regularization
• Large Learning Rate
• Small batch size
• Weight decay (aka L2 regularization)
• Dropout
• Need to balance different regularizers for each dataset and architecture.
• Fig 5b (previous slide) shows tradeoff between weight decay (WD) and LR. Large LR for faster learning
needs to be balanced with lower WD.
• General guidance: reducing other forms of regularization and training with a high LR makes training efficient.
BATCH SIZE
• Larger batch sizes permit larger LR using 1cycle schedule.
• Larger batch size may increase training time, so tradeoff
required.
• Tradeoff – use batch size so number of epochs is optimum
for data/model.
• Batch size limited by GPU memory.
• Fig 6a shows validation accuracy for different batch sizes.
Larger batch sizes better but effect tapers off (BS=1024
blue curve very close to BS=512 red curve).
(CYCLIC) MOMENTUM
• Set momentum as large as possible without causing instability.
• Constant LR => use a large constant momentum (0.9 – 0.99).
• Cyclic LR => decrease the cyclic momentum as the cyclic LR increases
during the early-to-middle part of training (0.95 – 0.85); see the
OneCycleLR example below.
• Fig 8a – blue curve is constant momentum, red curve is
decreasing CM and yellow curve is increasing CM (with
increasing CLR).
• These observations also carry over to deep networks (Fig 8b).
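PyTorch ships a scheduler based on this policy: OneCycleLR cycles momentum inversely to the LR by default (max_momentum=0.95, base_momentum=0.85, cycle_momentum=True). The model and loop below are placeholders:

import torch

model = torch.nn.Linear(20, 1)   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)

# LR ramps up to max_lr then anneals down; momentum moves the opposite
# way (0.95 -> 0.85 -> 0.95) because cycle_momentum defaults to True.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=1000,
    base_momentum=0.85, max_momentum=0.95)

for step in range(1000):
    # ... forward pass, loss.backward(), optimizer.step() on a real batch ...
    optimizer.step()   # placeholder step so scheduler.step() is valid
    scheduler.step()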
WEIGHT DECAY
• Cyclical WD is not useful; WD should remain constant throughout
training.
• The value should be found by grid search (early termination is
OK); see the sketch below.
• Fig 9a shows loss plots for different values of WD (with LR=5e-3,
mom=0.95).
• Fig 9b shows the equivalent accuracy plots.
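A toy sketch of that grid search, where early termination simply means a deliberately short training budget per candidate. The LR and momentum match the Fig 9a settings; the synthetic data is a stand-in for your own model and validation set:

import torch
import torch.nn as nn

def short_run_val_loss(wd, num_iters=200):
    """Train briefly with the given weight decay; return validation loss."""
    torch.manual_seed(0)
    X, y = torch.randn(512, 20), torch.randn(512, 1)
    Xval, yval = torch.randn(128, 20), torch.randn(128, 1)
    model = nn.Linear(20, 1)
    opt = torch.optim.SGD(model.parameters(), lr=5e-3,
                          momentum=0.95, weight_decay=wd)
    for _ in range(num_iters):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        opt.step()
    with torch.no_grad():
        return nn.functional.mse_loss(model(Xval), yval).item()

best_wd = min([1e-4, 3e-4, 1e-3, 3e-3], key=short_run_val_loss)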
CYCLIC LEARNING RATE PAPER
• Cyclical Learning Rates for Training Neural Networks
• https://arxiv.org/abs/1506.01186
• Describes CLR in depth and describes results of training common networks with CLR.
CYCLIC LEARNING RATE
• Successor to:
• Learning rate schedules – varying the LR (e.g., exponentially) over training.
• Adaptive learning rates (RMSProp, Adam, etc.) – change the LR
based on the values of the gradients.
• Based on observation that increasing LR has short-term
negative effect but long-term positive effect.
• Let LR vary between range of values.
• Triangular LR (Fig 2) is usually good enough but other variants
also possible.
• Accuracy plot (Fig 1) shows CLR (red curve) outperforming an
exponential LR schedule.
ESTIMATING CLR PARAMETERS
• Step size
• Step size = 2 to 10 × the number of iterations per epoch (see the worked example after this list)
• Number of training iterations per epoch = number of training records /
batch size
• Upper and lower bounds for LR
• Run model for few epochs with some bounds (1e-4 to 2e-1 for
example)
• Upper bound == where accuracy stops increasing, becomes ragged, or
falls (~ 6e-3).
• Lower bound
• Either 1/3 or 1/4 of upper bound (~ 2e-3)
• Point at which accuracy starts to increase (~ 1e-3)
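A worked example of the step size arithmetic above, using CIFAR-10-like numbers (50,000 training records, batch size 100; these are assumed illustrative values):

num_records = 50_000
batch_size = 100
iters_per_epoch = num_records // batch_size   # 500 iterations per epoch
step_size_lo = 2 * iters_per_epoch            # 1,000 iterations
step_size_hi = 10 * iters_per_epoch           # 5,000 iterations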
LR FINDER USAGE
• LR Finder – first available in Fast.AI library.
• Upper bound – between 1e-3 and 1e-2 (10⁻³ and 10⁻²) where loss is
decreasing fastest.
• Can also be found using lr.plot_loss_change() – minimum point (here 1e-2).
• Lower bound is about 1–2 orders of magnitude lower.
• LR Finder (Keras) – https://github.com/surmenok/keras_lr_finder
• LR Finder (PyTorch) – https://github.com/davidtvs/pytorch-lr-finder
• Keras example – https://github.com/sujitpal/keras-tutorial-odsc2020/blob/master/02_03_exercise_2_solved.ipynb
• Fast.AI example – https://colab.research.google.com/github/fastai/fastbook/blob/master/16_accel_sgd.ipynb
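A usage sketch for the pytorch-lr-finder package linked above, following its README; the linear model and random data are placeholders so the snippet runs end to end:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch_lr_finder import LRFinder  # pip install torch-lr-finder

# Placeholder data and model; substitute your own.
X, y = torch.randn(1024, 20), torch.randint(0, 2, (1024,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)
model = torch.nn.Linear(20, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)  # start of the sweep

lr_finder = LRFinder(model, optimizer, criterion, device="cpu")
lr_finder.range_test(loader, end_lr=1, num_iter=100)  # LR sweep 1e-5 -> 1
lr_finder.plot()   # choose the upper bound where loss falls fastest
lr_finder.reset()  # restore model and optimizer to their initial state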