HYPERPARAMETER OPTIMIZATION WITH
APPROXIMATE GRADIENT
Fabian Pedregosa
Chaire Havas-Dauphine
Paris-Dauphine / École Normale Supérieure
HYPERPARAMETERS
Most machine learning models depend on at least one
hyperparameter to control for model complexity. Examples
include:
Amount of regularization.
Kernel parameters.
Architecture of a neural network.
Model parameters
Estimated using some (regularized) goodness of fit on the data.
Hyperparameters
Cannot be estimated using the same criteria as model parameters (overfitting).
HYPERPARAMETER SELECTION
Criteria for hyperparameter selection:
Optimize loss on unseen data: cross-validation.
Minimize risk estimator: SURE, AIC/BIC, etc.
Example: least squares with $\ell_2$ regularization.
$$\text{loss} = \sum_{i=1}^{n} \left(b_i - \langle a_i, X(\lambda)\rangle\right)^2$$
Costly evaluation function, non-convex.
Common methods: grid search, random search, SMBO.
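As a minimal sketch of why this evaluation function is costly, the snippet below treats the held-out loss as a function of $\lambda$ for ridge regression and scans it with a grid search. The synthetic data, the helper names `ridge_fit` and `validation_loss`, and the grid are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def ridge_fit(A, b, lam):
    """Closed-form minimizer of ||A x - b||^2 + lam * ||x||^2 (this plays the role of X(lam))."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)

def validation_loss(lam, A_train, b_train, A_val, b_val):
    """Squared loss on held-out data of the model fitted with regularization lam."""
    x = ridge_fit(A_train, b_train, lam)
    return float(np.sum((b_val - A_val @ x) ** 2))

# Synthetic data (assumption): noisy linear model with 20 features.
rng = np.random.default_rng(0)
x_true = rng.normal(size=20)
A_train = rng.normal(size=(30, 20))
A_val = rng.normal(size=(30, 20))
b_train = A_train @ x_true + rng.normal(size=30)
b_val = A_val @ x_true + rng.normal(size=30)

# Grid search: every candidate lam requires a full model fit.
grid = np.logspace(-3, 3, 13)
losses = [validation_loss(lam, A_train, b_train, A_val, b_val) for lam in grid]
best_lam = grid[int(np.argmin(losses))]
```

Each grid point costs one full fit, which is cheap here only because ridge regression has a closed-form solution; for iterative models every evaluation is a complete training run.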
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
Compute gradients with respect to hyperparameters
[Larsen 1996, 1998, Bengio 2000].
Hyperparameter optimization as nested or bi-level
optimization:
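$$\operatorname*{arg\,min}_{\lambda \in \mathcal{D}} \; f(\lambda) \triangleq \underbrace{g(X(\lambda), \lambda)}_{\text{loss on test set}}$$
$$\text{s.t.} \quad \underbrace{X(\lambda)}_{\text{model parameters}} \in \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \underbrace{h(x, \lambda)}_{\text{loss on train set}}$$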
GOAL: COMPUTE ∇f (λ)
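By chain rule,
$$\nabla f = \underbrace{\frac{\partial g}{\partial \lambda} + \frac{\partial g}{\partial X}}_{\text{known}} \cdot \underbrace{\frac{\partial X}{\partial \lambda}}_{\text{unknown}}$$
Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin et al. 2015].
Implicit differentiation [Larsen 1996, Bengio 2000]: formulate inner optimization as implicit equation.
$$X(\lambda) \in \arg\min_{x} h(x, \lambda) \iff \underbrace{\nabla_1 h(X(\lambda), \lambda) = 0}_{\text{implicit equation for } X}$$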
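The step from the implicit equation to the unknown term $\partial X / \partial \lambda$ can be sketched as follows. This assumes $h$ is twice differentiable and that $\nabla_1^2 h$ is invertible at $(X(\lambda), \lambda)$; $DX$ is shorthand introduced here for the Jacobian of $X$.

```latex
% Differentiate the implicit equation \nabla_1 h(X(\lambda), \lambda) = 0 with respect to \lambda:
\nabla_1^2 h \cdot DX + \nabla_{1,2}^2 h = 0
% Solve for the Jacobian of X (requires \nabla_1^2 h to be invertible):
\quad\Longrightarrow\quad
DX = -\left(\nabla_1^2 h\right)^{-1} \nabla_{1,2}^2 h
```

Substituting this expression for $DX$ into the chain rule removes the unknown term, at the price of a linear system in the Hessian $\nabla_1^2 h$.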
GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
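$$\nabla f = \nabla_2 g - \left(\nabla_{1,2}^2 h\right)^T \left(\nabla_1^2 h\right)^{-1} \nabla_1 g$$
Possible to compute gradient w.r.t. hyperparameters, given
Solution to the inner optimization $X(\lambda)$
Solution to linear system $\left(\nabla_1^2 h\right)^{-1} \nabla_1 g$
$\implies$ computationally expensive.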
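A small numerical check of the gradient formula, sketched on ridge regression. The choices $h(x, \lambda) = \tfrac12\|A_{tr}x - b_{tr}\|^2 + \tfrac12 e^{\lambda}\|x\|^2$ and $g(x, \lambda) = \tfrac12\|A_{val}x - b_{val}\|^2$, the exponential parametrization, the synthetic data and all function names are assumptions made for illustration.

```python
import numpy as np

def inner_solution(lam, A_tr, b_tr):
    """X(lam): exact minimizer of h(x, lam) = 0.5*||A_tr x - b_tr||^2 + 0.5*exp(lam)*||x||^2."""
    p = A_tr.shape[1]
    return np.linalg.solve(A_tr.T @ A_tr + np.exp(lam) * np.eye(p), A_tr.T @ b_tr)

def outer_loss(lam, A_tr, b_tr, A_val, b_val):
    """f(lam) = g(X(lam), lam) with g(x, lam) = 0.5*||A_val x - b_val||^2."""
    x = inner_solution(lam, A_tr, b_tr)
    return 0.5 * np.sum((A_val @ x - b_val) ** 2)

def hypergradient(lam, A_tr, b_tr, A_val, b_val):
    """grad f = grad_2 g - (hess_{1,2} h)^T (hess_1 h)^{-1} grad_1 g."""
    p = A_tr.shape[1]
    x = inner_solution(lam, A_tr, b_tr)
    grad1_g = A_val.T @ (A_val @ x - b_val)        # gradient of g in x
    grad2_g = 0.0                                   # g does not depend on lam directly
    hess1_h = A_tr.T @ A_tr + np.exp(lam) * np.eye(p)
    cross_h = np.exp(lam) * x                       # derivative of grad_1 h in lam
    q = np.linalg.solve(hess1_h, grad1_g)           # the linear system
    return grad2_g - cross_h @ q

# Synthetic data (assumption).
rng = np.random.default_rng(0)
x_true = rng.normal(size=10)
A_tr, A_val = rng.normal(size=(40, 10)), rng.normal(size=(40, 10))
b_tr = A_tr @ x_true + rng.normal(size=40)
b_val = A_val @ x_true + rng.normal(size=40)

lam, step = 1.0, 1e-5
analytic = hypergradient(lam, A_tr, b_tr, A_val, b_val)
numeric = (outer_loss(lam + step, A_tr, b_tr, A_val, b_val)
           - outer_loss(lam - step, A_tr, b_tr, A_val, b_val)) / (2 * step)
```

Both quantities in the formula are computed exactly here (one inner solve, one linear solve), which is what makes the exact gradient expensive for larger models.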
HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE
GRADIENT
Replace $X(\lambda)$ by an approximate solution of the inner optimization.
Approximately solve linear system.
Update $\lambda$ using $p_k \approx \nabla f$.
Tradeoff
Loose approximation: cheap iterations, might diverge.
Precise approximation: costly iterations, convergence to stationary point.
HOAG
At iteration $k = 1, 2, \ldots$ perform the following:
i) Solve the inner optimization problem up to tolerance $\varepsilon_k$, i.e. find $x_k \in \mathbb{R}^p$ such that
$$\|X(\lambda_k) - x_k\| \le \varepsilon_k.$$
ii) Solve the linear system up to tolerance $\varepsilon_k$. That is, find $q_k$ such that
$$\|\nabla_1^2 h(x_k, \lambda_k)\, q_k - \nabla_1 g(x_k, \lambda_k)\| \le \varepsilon_k.$$
iii) Compute approximate gradient $p_k$ as
$$p_k = \nabla_2 g(x_k, \lambda_k) - \nabla_{1,2}^2 h(x_k, \lambda_k)^T q_k.$$
iv) Update hyperparameters:
$$\lambda_{k+1} = P_{\mathcal{D}}\left(\lambda_k - \frac{1}{L}\, p_k\right).$$
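Steps i) to iv) can be sketched on a ridge regression problem as below. This is an illustration under stated assumptions, not the author's implementation: the losses $h(x, \lambda) = \tfrac12\|A_{tr}x - b_{tr}\|^2 + \tfrac12 e^{\lambda}\|x\|^2$ and $g(x, \lambda) = \tfrac12\|A_{val}x - b_{val}\|^2$, the gradient-descent and conjugate-gradient solvers, the box domain `bounds`, the hand-picked constant `L` and the exponential tolerance schedule are all choices made here.

```python
import numpy as np

def grad1_h(x, lam, A, b):
    """Gradient in x of h(x, lam) = 0.5*||A x - b||^2 + 0.5*exp(lam)*||x||^2."""
    return A.T @ (A @ x - b) + np.exp(lam) * x

def solve_inner(x, lam, A, b, eps, max_iter=10_000):
    """Step i): warm-started gradient descent on h(., lam).

    h is exp(lam)-strongly convex, so ||grad|| <= eps * exp(lam)
    certifies ||X(lam) - x|| <= eps.
    """
    mu = np.exp(lam)
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + mu)
    for _ in range(max_iter):
        grad = grad1_h(x, lam, A, b)
        if np.linalg.norm(grad) <= eps * mu:
            break
        x = x - step * grad
    return x

def solve_linear(H, rhs, q, eps, max_iter=1_000):
    """Step ii): warm-started conjugate gradient, stopped once ||H q - rhs|| <= eps."""
    r = rhs - H @ q
    d = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)
        q = q + alpha * d
        r_new = r - alpha * Hd
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return q

def hoag(A_tr, b_tr, A_val, b_val, lam0=0.0, n_iter=50, L=100.0, bounds=(-4.0, 4.0)):
    """Approximate-gradient loop; L and bounds are hand-picked assumptions."""
    dim = A_tr.shape[1]
    x, q, lam = np.zeros(dim), np.zeros(dim), lam0
    for k in range(1, n_iter + 1):
        eps = 0.1 * 0.9 ** k                                    # exponential tolerance decrease
        x = solve_inner(x, lam, A_tr, b_tr, eps)                # i)
        H = A_tr.T @ A_tr + np.exp(lam) * np.eye(dim)           # Hessian of h in x
        q = solve_linear(H, A_val.T @ (A_val @ x - b_val), q, eps)  # ii)
        p_k = 0.0 - (np.exp(lam) * x) @ q                       # iii), grad_2 g = 0 here
        lam = float(np.clip(lam - p_k / L, *bounds))            # iv), projection onto the box
    return lam, x

def exact_outer_loss(lam, A_tr, b_tr, A_val, b_val):
    """Validation loss at the exact inner solution, used only to inspect the result."""
    dim = A_tr.shape[1]
    x = np.linalg.solve(A_tr.T @ A_tr + np.exp(lam) * np.eye(dim), A_tr.T @ b_tr)
    return 0.5 * np.sum((A_val @ x - b_val) ** 2)

# Synthetic data (assumption), started from a deliberately over-regularized lam0.
rng = np.random.default_rng(0)
x_true = rng.normal(size=10)
A_tr, A_val = rng.normal(size=(40, 10)), rng.normal(size=(40, 10))
b_tr = A_tr @ x_true + rng.normal(size=40)
b_val = A_val @ x_true + rng.normal(size=40)
lam_hat, x_hat = hoag(A_tr, b_tr, A_val, b_val, lam0=3.0)
```

Warm-starting `x` and `q` from the previous outer iteration is what keeps early iterations cheap: with a loose tolerance only a few inner steps are needed before the hyperparameter is updated.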
ANALYSIS - GLOBAL CONVERGENCE
Assumptions:
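(A1). $\nabla g$ and $\nabla^2 h$ Lipschitz.
(A2). $\nabla_1^2 h(X(\lambda), \lambda)$ non-singular.
(A3). Domain $\mathcal{D}$ is bounded.
Corollary: If $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then $\lambda_k$ converges to a stationary point $\lambda^*$:
$$\langle \nabla f(\lambda^*), \alpha - \lambda^* \rangle \ge 0, \quad \forall \alpha \in \mathcal{D}$$
$\implies$ if $\lambda^*$ is in the interior of $\mathcal{D}$ then $\nabla f(\lambda^*) = 0$.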
EXPERIMENTS
How to choose tolerance $\varepsilon_k$?
Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1/k^2$, Cubic: $0.1/k^3$, Exponential: $0.1 \times 0.9^k$.
Approximate-gradient strategies achieve much faster
decrease in early iterations.
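The three schedules can be written down directly, and a quick numerical check suggests each one satisfies the summability condition $\sum_i \varepsilon_i < \infty$ from the convergence corollary. The function name `tolerance` and the truncation at 10,000 terms are assumptions of this sketch.

```python
import numpy as np

def tolerance(k, strategy):
    """Tolerance eps_k for the three decrease strategies listed on the slide."""
    if strategy == "quadratic":
        return 0.1 / k ** 2
    if strategy == "cubic":
        return 0.1 / k ** 3
    if strategy == "exponential":
        return 0.1 * 0.9 ** k
    raise ValueError(f"unknown strategy: {strategy}")

k = np.arange(1, 10_001, dtype=float)
partial_sums = {
    name: float(np.sum(tolerance(k, name)))
    for name in ("quadratic", "cubic", "exponential")
}
# The partial sums approach the finite limits 0.1 * pi^2 / 6 (quadratic),
# 0.1 * zeta(3) (cubic) and 0.1 * 0.9 / (1 - 0.9) = 0.9 (exponential).
```

The exponential schedule has the largest sum, so it stays loose for longer in early iterations, while all three tighten enough for the convergence result to apply.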
EXPERIMENTS I
Model: $\ell_2$-regularized logistic regression.
1 hyperparameter.
Datasets:
20news (18k × 130k)
real-sim (73k × 20k)
EXPERIMENTS II
Kernel ridge regression.
2 hyperparameters.
Parkinson dataset: 654 × 17
Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015].
784 × 10 hyperparameters
MNIST dataset: 60k × 784
CONCLUSION
Hyperparameter optimization with inexact gradient:
can update hyperparameters before model parameters
have fully converged.
independent of inner optimization algorithm.
convergence guarantees under smoothness
assumptions.
Open questions.
Non-smooth inner optimization (e.g. sparse models)?
Stochastic / online approximation?
REFERENCES
[Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of
hyperparameters." Neural computation 12.8 (2000): 1889-1900.
[J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random
search for hyper-parameter optimization." The Journal of Machine
Learning Research 13.1 (2012): 281-305.
[J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using Deep Neural Networks. (2015). http://arxiv.org/abs/1502.05700
[K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw Bayesian Optimization. arXiv Prepr. arXiv:1406.3896 1–12 (2014). http://arxiv.org/abs/1406.3896
[F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An evaluation of sequential model-based optimization for expensive blackbox functions.
REFERENCES 2
[M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv Prepr. arXiv:1309.2388 1–45 (2013). http://arxiv.org/abs/1309.2388
[J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318–326 (2012).
[M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput. 34, A1380–A1405 (2012).
EXPERIMENTS - COST FUNCTION
EXPERIMENTS
Comparison with other hyperparameter optimization
methods
Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
EXPERIMENTS
Comparison in terms of a validation loss.
Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.

Ai in communication electronicss[1].pptx
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 

Hyperparameter optimization with approximate gradient

  • 1. HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT Fabian Pedregosa Chaire Havas-Dauphine Paris-Dauphine / École Normale Supérieure
  • 2. HYPERPARAMETERS Most machine learning models depend on at least one hyperparameter to control for model complexity. Examples include: Amount of regularization. Kernel parameters. Architecture of a neural network. Model parameters: Estimated using some (regularized) goodness of fit on the data. Hyperparameters: Cannot be estimated using the same criteria as model parameters (overfitting).
  • 3. HYPERPARAMETER SELECTION Criteria for hyperparameter selection: Optimize loss on unseen data: cross-validation. Minimize risk estimator: SURE, AIC/BIC, etc. Example: least squares with $\ell_2$ regularization. loss $= \sum_{i=1}^{n} (b_i - a_i^\top X(\lambda))^2$. Costly evaluation function, non-convex. Common methods: grid search, random search, SMBO.
  • 4. GRADIENT-BASED HYPERPARAMETER OPTIMIZATION Compute gradients with respect to hyperparameters [Larsen 1996, 1998, Bengio 2000]. Hyperparameter optimization as nested or bi-level optimization: $\arg\min_{\lambda \in \mathcal{D}} f(\lambda) \triangleq g(X(\lambda), \lambda)$ (loss on test set) s.t. $X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} h(x, \lambda)$ (loss on train set), where $X(\lambda)$ are the model parameters.
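  • 5. GOAL: COMPUTE $\nabla f(\lambda)$ By chain rule, $\nabla f = \frac{\partial g}{\partial \lambda} + \frac{\partial g}{\partial X} \cdot \frac{\partial X}{\partial \lambda}$, where $\frac{\partial g}{\partial \lambda}$ and $\frac{\partial g}{\partial X}$ are known and $\frac{\partial X}{\partial \lambda}$ is unknown. Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin et al. 2015]. Implicit differentiation [Larsen 1996, Bengio 2000]: formulate inner optimization as implicit equation. $X(\lambda) \in \arg\min_x h(x, \lambda) \iff \nabla_1 h(X(\lambda), \lambda) = 0$ (implicit equation for $X$).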
  • 6. GRADIENT-BASED HYPERPARAMETER OPTIMIZATION $\nabla f = \nabla_2 g - (\nabla^2_{1,2} h)^\top (\nabla^2_1 h)^{-1} \nabla_1 g$. Possible to compute gradient w.r.t. hyperparameters, given: Solution to the inner optimization, $X(\lambda)$. Solution to the linear system, $(\nabla^2_1 h)^{-1} \nabla_1 g$. $\implies$ computationally expensive.
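
The formula on slide 6 can be checked numerically on a small problem. The sketch below is an illustration, not code from the talk: it assumes ridge regression as the inner problem, $h(x, \lambda) = \frac{1}{2}\|Ax - b\|^2 + \frac{1}{2} e^{\lambda} \|x\|^2$ (the $e^{\lambda}$ parametrization is an assumption made here to keep the regularization positive), and a least-squares validation loss $g$ on held-out data. All function names and the synthetic data are hypothetical. The implicit-differentiation gradient is compared against a central finite difference.

```python
import numpy as np

def inner_solution(A, b, lam):
    """X(lam): exact minimizer of h(x, lam) = 0.5*||Ax - b||^2 + 0.5*exp(lam)*||x||^2."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + np.exp(lam) * np.eye(p), A.T @ b)

def validation_loss(A, b, A_val, b_val, lam):
    """f(lam) = g(X(lam), lam) with g(x, lam) = 0.5*||A_val x - b_val||^2."""
    r = A_val @ inner_solution(A, b, lam) - b_val
    return 0.5 * r @ r

def hypergradient(A, b, A_val, b_val, lam):
    """grad f = grad_2 g - (grad^2_{1,2} h)^T (grad^2_1 h)^{-1} grad_1 g."""
    p = A.shape[1]
    x = inner_solution(A, b, lam)
    hess = A.T @ A + np.exp(lam) * np.eye(p)   # grad^2_1 h (Hessian in x)
    grad_g_x = A_val.T @ (A_val @ x - b_val)   # grad_1 g
    q = np.linalg.solve(hess, grad_g_x)        # solution of the linear system
    cross = np.exp(lam) * x                    # grad^2_{1,2} h (derivative of grad_1 h in lam)
    grad_g_lam = 0.0                           # g does not depend on lam directly
    return grad_g_lam - cross @ q

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(0)
A, A_val = rng.standard_normal((50, 5)), rng.standard_normal((30, 5))
x_true = rng.standard_normal(5)
b = A @ x_true + 0.5 * rng.standard_normal(50)
b_val = A_val @ x_true + 0.5 * rng.standard_normal(30)

lam = 0.5
grad = hypergradient(A, b, A_val, b_val, lam)
delta = 1e-5
fd = (validation_loss(A, b, A_val, b_val, lam + delta)
      - validation_loss(A, b, A_val, b_val, lam - delta)) / (2 * delta)
print(grad, fd)  # the two values should agree closely
```

Both the inner solve and the linear solve are exact here, which is the expensive setting the slide refers to; each costs a dense $p \times p$ factorization.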
  • 7. HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT Replace $X(\lambda)$ by an approximate solution of the inner optimization. Approximately solve the linear system. Update $\lambda$ using $p_k \approx \nabla f$. Tradeoff: Loose approximation: cheap iterations, might diverge. Precise approximation: costly iterations, convergence to a stationary point.
  • 8. HOAG At iteration $k = 1, 2, \ldots$ perform the following: i) Solve the inner optimization problem up to tolerance $\varepsilon_k$, i.e. find $x_k \in \mathbb{R}^p$ such that $\|X(\lambda_k) - x_k\| \le \varepsilon_k$. ii) Solve the linear system up to tolerance $\varepsilon_k$. That is, find $q_k$ such that $\|\nabla^2_1 h(x_k, \lambda_k)\, q_k - \nabla_1 g(x_k, \lambda_k)\| \le \varepsilon_k$. iii) Compute the approximate gradient $p_k$ as $p_k = \nabla_2 g(x_k, \lambda_k) - \nabla^2_{1,2} h(x_k, \lambda_k)^\top q_k$. iv) Update hyperparameters: $\lambda_{k+1} = P_{\mathcal{D}}\left(\lambda_k - \frac{1}{L} p_k\right)$.
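
The four steps of slide 8 can be sketched on a toy problem. The code below is a minimal sketch under stated assumptions, not the author's implementation: the inner problem is ridge regression with regularization $e^{\lambda}$, the inner solver is plain gradient descent, the linear system is solved by a hand-written conjugate gradient, the domain is an interval handled with `np.clip`, and the fixed outer step size stands in for $1/L$. All names (`solve_inner`, `solve_linear`, `hoag`) and the data are hypothetical.

```python
import numpy as np

def solve_inner(A, b, lam, x0, eps, max_iter=10_000):
    """Step i: gradient descent on h until ||x - X(lam)|| <= eps is certified."""
    reg = np.exp(lam)
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + reg)  # 1 / Lipschitz constant of grad_1 h
    x = x0.copy()
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b) + reg * x
        # h is reg-strongly convex in x, so ||x - X(lam)|| <= ||grad|| / reg.
        if np.linalg.norm(grad) <= eps * reg:
            break
        x -= step * grad
    return x

def solve_linear(matvec, rhs, q0, eps, max_iter=1_000):
    """Step ii: conjugate gradient until the residual ||H q - rhs|| <= eps."""
    q = q0.copy()
    r = rhs - matvec(q)
    d = r.copy()
    for _ in range(max_iter):
        rr = r @ r
        if np.sqrt(rr) <= eps:
            break
        Hd = matvec(d)
        alpha = rr / (d @ Hd)
        q += alpha * d
        r -= alpha * Hd
        d = r + (r @ r / rr) * d
    return q

def hoag(A, b, A_val, b_val, lam0, step, n_iter, tol, bounds=(-5.0, 8.0)):
    lam = lam0
    x = np.zeros(A.shape[1])  # warm-started across outer iterations
    q = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        eps, reg = tol(k), np.exp(lam)
        x = solve_inner(A, b, lam, x, eps)                       # step i
        grad_g_x = A_val.T @ (A_val @ x - b_val)                 # grad_1 g(x_k, lam_k)
        q = solve_linear(lambda v: A.T @ (A @ v) + reg * v,
                         grad_g_x, q, eps)                       # step ii
        p = 0.0 - (reg * x) @ q                                  # step iii (grad_2 g = 0 here)
        lam = float(np.clip(lam - step * p, *bounds))            # step iv: projected update
    return lam

# Synthetic data (assumption: illustrative only).
rng = np.random.default_rng(0)
x_true = rng.standard_normal(10)
A, A_val = rng.standard_normal((100, 10)), rng.standard_normal((100, 10))
b = A @ x_true + rng.standard_normal(100)
b_val = A_val @ x_true + rng.standard_normal(100)

def val_loss(lam):
    """Validation loss evaluated at the exact inner solution, for reporting only."""
    x = np.linalg.solve(A.T @ A + np.exp(lam) * np.eye(10), A.T @ b)
    return 0.5 * np.sum((A_val @ x - b_val) ** 2)

lam_hat = hoag(A, b, A_val, b_val, lam0=5.0, step=0.005, n_iter=50,
               tol=lambda k: 0.1 * 0.9 ** k)
print(lam_hat, val_loss(5.0), val_loss(lam_hat))
```

The inner iterate `x` and the linear-system iterate `q` are warm-started from the previous outer iteration, so early iterations with a loose tolerance stay cheap. The step size `0.005` is a hand-picked value for this toy problem, not a recommendation.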
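  • 9. ANALYSIS - GLOBAL CONVERGENCE Assumptions: (A1) $\nabla g$ and $\nabla^2 h$ are Lipschitz. (A2) $\nabla^2_1 h(X(\lambda), \lambda)$ is non-singular. (A3) The domain $\mathcal{D}$ is bounded. Corollary: If $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then $\lambda_k$ converges to a stationary point $\lambda^*$: $\langle \nabla f(\lambda^*), \alpha - \lambda^* \rangle \ge 0, \ \forall \alpha \in \mathcal{D}$ $\implies$ if $\lambda^*$ is in the interior of $\mathcal{D}$ then $\nabla f(\lambda^*) = 0$.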
  • 10. EXPERIMENTS How to choose the tolerance $\varepsilon_k$? Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1 / k^2$, Cubic: $0.1 / k^3$, Exponential: $0.1 \times 0.9^k$. Approximate-gradient strategies achieve much faster decrease in early iterations.
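
The three tolerance schedules from slide 10 are simple to write down, and all three satisfy the summability condition $\sum_i \varepsilon_i < \infty$ used in the convergence corollary. The snippet below is a small illustration; the function names are hypothetical. It prints the first few tolerances and a partial sum for each schedule.

```python
def quadratic(k):
    """eps_k = 0.1 / k^2"""
    return 0.1 / k ** 2

def cubic(k):
    """eps_k = 0.1 / k^3"""
    return 0.1 / k ** 3

def exponential(k):
    """eps_k = 0.1 * 0.9^k"""
    return 0.1 * 0.9 ** k

schedules = {"quadratic": quadratic, "cubic": cubic, "exponential": exponential}

# Partial sums over k = 1..1000; each series converges, so these stay bounded.
partial_sums = {
    name: sum(schedule(k) for k in range(1, 1001))
    for name, schedule in schedules.items()
}

for name, schedule in schedules.items():
    first = [schedule(k) for k in (1, 2, 3)]
    print(name, first, partial_sums[name])
```

The exponential schedule starts looser than the other two after the first iteration but eventually decays faster, which is one way to trade cheap early iterations against accuracy later on.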
  • 11. EXPERIMENTS I Model: $\ell_2$-regularized logistic regression. 1 hyperparameter. Datasets: 20news (18k × 130k), real-sim (73k × 20k).
  • 12. EXPERIMENTS II Kernel ridge regression. 2 hyperparameters. Parkinson dataset: 654 × 17. Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]. 784 × 10 hyperparameters. MNIST dataset: 60k × 784.
  • 13. CONCLUSION Hyperparameter optimization with inexact gradient: Can update hyperparameters before model parameters have fully converged. Independent of the inner optimization algorithm. Convergence guarantees under smoothness assumptions. Open questions: Non-smooth inner optimization (e.g. sparse models)? Stochastic / online approximation?
  • 14. REFERENCES [Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural computation 12.8 (2000): 1889-1900. [J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305. [J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using Deep Neural Networks. (2015). http://arxiv.org/abs/1502.05700 [K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw Bayesian Optimization. arXiv preprint arXiv:1406.3896, 1–12 (2014). http://arxiv.org/abs/1406.3896 [F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An evaluation of sequential model-based optimization for expensive blackbox functions.
  • 15. REFERENCES 2 [M. Schmidt et al., 2013] Schmidt, M., Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 1–45 (2013). http://arxiv.org/abs/1309.2388 [J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318–326 (2012). [M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput. 34, A1380–A1405 (2012).
  • 16. EXPERIMENTS - COST FUNCTION
  • 17. EXPERIMENTS Comparison with other hyperparameter optimization methods. Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
  • 18. EXPERIMENTS Comparison in terms of validation loss. Random = Random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.