Optimization in Deep Learning
Jeremy Nixon
Overview
1. Challenges in Neural Network Optimization
2. Gradient Descent
3. Stochastic Gradient Descent
4. Momentum
a. Nesterov Momentum
5. RMSProp
6. Adam
Challenges in Neural Network Optimization
1. Training Time
a. Model complexity (depth, width) is important to accuracy
b. Training state-of-the-art models can take weeks on a GPU
2. Hyperparameter Tuning
a. Learning rate tuning is important to accuracy
3. Local Minima
Neural Net Refresh + Gradient Descent
[Diagram: x_train feeds a hidden layer (raw / ReLU) through weights w1; the hidden layer feeds a softmax output through weights w2]
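As a quick refresher, gradient descent repeatedly steps the weights against the gradient of the loss. A minimal NumPy sketch (the function names and the toy loss are illustrative, not from the slides):

    import numpy as np

    def gradient_descent(grad_fn, w, lr=0.01, n_steps=100):
        # Full-batch gradient descent: step against the loss gradient.
        for _ in range(n_steps):
            w = w - lr * grad_fn(w)
        return w

    # Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
    w_opt = gradient_descent(lambda w: 2 * w, np.array([3.0, -2.0]))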
Stochastic Gradient Descent
Dramatic Speedup
Sub-linear returns to more data in each batch
Crucial Learning Rate Hyperparameter
Schedule to reduce learning rate during training
SGD introduces noise to the gradient
Gradient will almost never fully converge to 0
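A minimal sketch of minibatch SGD with a multiplicative schedule that reduces the learning rate each epoch (the decay factor, batch size, and function names are assumptions; the slides don't fix a schedule):

    import numpy as np

    def sgd(grad_fn, w, X, y, lr=0.1, decay=0.99, batch_size=32, n_epochs=10):
        for epoch in range(n_epochs):
            idx = np.random.permutation(len(X))  # reshuffle the data each epoch
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                # Noisy gradient estimate from a small batch: much cheaper
                # than the full dataset, at the cost of gradient noise.
                w = w - lr * grad_fn(w, X[batch], y[batch])
            lr *= decay  # schedule: shrink the learning rate during training
        return w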
Stochastic Gradient Descent
[Plot: SGD on MNIST, 1 hidden layer, lr = 1.0 (a typical value is 0.01)]
Momentum
Dramatically Accelerates Learning
1. Initialize the learning rate and a momentum matrix the size of the weights
2. At each SGD iteration, collect the gradient
3. Update the momentum matrix: the momentum matrix times the momentum hyperparameter, minus the learning rate times the collected gradient (see the sketch below)
Notation: s = 0.9 is the momentum hyperparameter, t.layers[i].moment1 is layer i's momentum matrix, lr = 0.01 is the learning rate, and gradient is SGD's collected gradient.
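In code, steps 2 and 3 might look like the following (a sketch of the standard heavy-ball update; moment1 stands in for t.layers[i].moment1, and the defaults follow the slide's values):

    def momentum_step(w, moment1, gradient, s=0.9, lr=0.01):
        # Decay the old velocity by the momentum hyperparameter, then
        # take a learning-rate-sized step against the collected gradient.
        moment1 = s * moment1 - lr * gradient
        # Apply the velocity to the weights.
        return w + moment1, moment1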
[Plot: Momentum on MNIST, 2 hidden layers]
Intuition for Momentum
Automatically cancels out noise in the gradient
Amplifies small but consistent gradients
“Momentum” derives from the physical analogy [momentum = mass * velocity]
Assuming unit mass, the velocity vector is also the particle's momentum
Deals well with heavy curvature
Momentum Accelerates the Gradient
A gradient that consistently points in the same direction can drive the velocity up to lr / (1 - s). With s = 0.9, the step can max out at 10x the learning rate in the direction of the accumulated gradient.
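To see where the lr / (1 - s) bound comes from, assume a gradient g that stays constant across steps; the velocity then accumulates as a geometric series:

    v_t = s\,v_{t-1} - \mathrm{lr}\,g
    \quad\Longrightarrow\quad
    v_\infty = -\mathrm{lr}\,g \sum_{k=0}^{\infty} s^k = -\frac{\mathrm{lr}\,g}{1-s}

With s = 0.9 the series sums to 10, so the terminal step is ten times what a single SGD step would take.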
Asynchronous SGD is similar to Momentum
In distributed SGD, asynchronous updates let each worker update the parameters as soon as it finishes, instead of waiting for all workers
The effect is a weighted average of previous gradients applied to the current weights, much like momentum
Nesterov Momentum
Evaluate the gradient with the momentum step taken into account
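A sketch of the look-ahead evaluation (same notation and assumed defaults as the momentum sketch above):

    def nesterov_step(w, moment1, grad_fn, s=0.9, lr=0.01):
        # Look ahead: evaluate the gradient at the point the momentum
        # step would carry us to, rather than at the current weights.
        gradient = grad_fn(w + s * moment1)
        moment1 = s * moment1 - lr * gradient
        return w + moment1, moment1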
[Plot: Nesterov momentum on MNIST, 2 hidden layers]
Adaptive Learning Rate Algorithms
Adagrad (Duchi et al., 2011)
RMSProp (Hinton, 2012)
Adam (Kingma and Ba, 2014)
The idea is to auto-tune the learning rate, making the network less sensitive to this hyperparameter.
Adagrad
Shrinks the learning rate adaptively
Each weight's learning rate is scaled by the inverse square root of its accumulated squared gradient history
Notation: r = squared gradient history, g = gradient, theta = weights, epsilon = learning rate, delta = small constant for numerical stability.
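Putting the notation together, one Adagrad step might look like this (a sketch; the default values are assumptions):

    import numpy as np

    def adagrad_step(theta, r, g, epsilon=0.01, delta=1e-7):
        # Accumulate the full squared gradient history (it never decays).
        r = r + g * g
        # Scale each weight's step by the inverse square root of its history.
        theta = theta - epsilon * g / (delta + np.sqrt(r))
        return theta, r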
Intuition for Adagrad
Instead of setting a single global learning rate, have a different learning rate for every weight in the network
Parameters with the largest derivatives see a rapid decrease in their learning rate
Parameters with small derivatives see a small decrease in their learning rate
We get much more progress in the gently sloped directions of parameter space
Downside: accumulating gradients from the beginning of training leads to extremely small learning rates later on
Downside: doesn't deal well with differences between global and local structure
RMSProp
Keeps an exponentially weighted average of the squared gradient to scale the learning rate
Performs well in non-convex settings with differences between global and local structure
Can be combined with momentum / Nesterov momentum
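A sketch of one RMSProp step (same notation as Adagrad above; rho, the decay rate of the average, is an assumption, with 0.9 the value suggested in Hinton's lecture):

    import numpy as np

    def rmsprop_step(theta, r, g, epsilon=0.001, rho=0.9, delta=1e-6):
        # Exponentially weighted average of the squared gradient:
        # recent gradients dominate, and old history decays away.
        r = rho * r + (1 - rho) * g * g
        theta = theta - epsilon * g / np.sqrt(delta + r)
        return theta, r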
[Plots: RMSProp on MNIST, 1 hidden layer]
Adam
Short for “Adaptive Moments”
Exponentially weighted average of gradient for momentum (first moment)
Exponentially weighted average of the squared gradient for adapting the learning rate (second moment)
Bias correction for both moments to adjust the estimates early in training
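A sketch of one Adam step following the update in Kingma and Ba, 2014 (defaults are the paper's suggested hyperparameters; the variable names are mine):

    import numpy as np

    def adam_step(theta, m, v, g, t, epsilon=0.001,
                  beta1=0.9, beta2=0.999, delta=1e-8):
        # First moment: exponentially weighted average of the gradient.
        m = beta1 * m + (1 - beta1) * g
        # Second moment: exponentially weighted average of the squared gradient.
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero, so rescale them
        # early in training (t is the 1-indexed step count).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - epsilon * m_hat / (np.sqrt(v_hat) + delta)
        return theta, m, v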
[Plot: Adam on MNIST, 5 hidden layers]
Thank you!
Questions?
Bibliography
Adam - Kingma and Ba, 2014 - https://arxiv.org/abs/1412.6980
Adagrad - Duchi et al., 2011 - http://jmlr.org/papers/v12/duchi11a.html
RMSProp - Hinton, 2012 - http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Deep Learning Textbook - http://www.deeplearningbook.org/
