Parallel Optimization in Machine Learning
Fabian Pedregosa
December 19, 2017 Huawei Paris Research Center
About me
• Engineer (2010-2012), Inria Saclay
(scikit-learn kickstart).
• PhD (2012-2015), Inria Saclay.
• Postdoc (2015-2016),
Dauphine–ENS–Inria Paris.
• Postdoc (2017-present), UC Berkeley
- ETH Zurich (Marie-Curie fellowship,
European Commission)
Hacker at heart ... trapped in a
researcher’s body.
Motivation
Computer ad in 1993 vs. computer ad in 2006.
What has changed?
2006: no mention of processor speed anymore.
Primary selling point: the number of cores.
40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• Multi-core architectures are here to stay.
Parallel algorithms needed to take advantage of modern CPUs.
Parallel optimization
Parallel algorithms can be divided into two large categories:
synchronous and asynchronous. Image credits: (Peng et al. 2016)
Synchronous methods
+ Easy to implement (mature software packages exist).
+ Well understood.
- Limited speedup due to synchronization costs.
Asynchronous methods
+ Faster, typically larger speedups.
- Not well understood; large gap between theory and practice.
- No mature software solutions.
Outline
Synchronous methods
• Synchronous (stochastic) gradient descent.
Asynchronous methods
• Asynchronous stochastic gradient descent (Hogwild) (Niu et al.
2011)
• Asynchronous variance-reduced stochastic methods (Leblond, P.,
and Lacoste-Julien 2017), (Pedregosa, Leblond, and
Lacoste-Julien 2017).
• Analysis of asynchronous methods.
• Codes and implementation aspects.
Leaving out many parallel synchronous methods: ADMM (Glowinski
and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro,
and Zhang 2014), to name a few.
Outline
Most of the following is joint work with Rémi Leblond and Simon
Lacoste-Julien
Rémi Leblond Simon Lacoste–Julien
Synchronous algorithms
Optimization for machine learning
A large part of the problems in machine learning can be framed as optimization problems of the form

   minimize_x f(x) := (1/n) ∑_{i=1}^{n} fi(x)

Gradient descent (Cauchy 1847). Descend along the steepest direction (−∇f(x)):

   x⁺ = x − γ∇f(x)

Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a random index i and descend along −∇fi(x):

   x⁺ = x − γ∇fi(x)

Image source: Francis Bach.
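To make the two updates concrete, here is a minimal NumPy sketch on a least-squares problem with fi(x) = ½(aiᵀx − bi)²; the data, step size and iteration counts are made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(100, 10)          # n = 100 samples, p = 10 features
b = rng.randn(100)
gamma = 0.01                    # step size (illustrative)

def full_gradient(x):
    # gradient of f(x) = 1/(2n) * ||Ax - b||^2
    return A.T @ (A @ x - b) / len(b)

def partial_gradient(x, i):
    # gradient of f_i(x) = 1/2 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

# Gradient descent: x+ = x - gamma * grad f(x)
x = np.zeros(10)
for _ in range(1000):
    x = x - gamma * full_gradient(x)

# Stochastic gradient descent: x+ = x - gamma * grad f_i(x), i sampled uniformly
x_sgd = np.zeros(10)
for _ in range(1000):
    i = rng.randint(len(b))
    x_sgd = x_sgd - gamma * partial_gradient(x_sgd, i)
```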
Parallel synchronous gradient descent
The computation of the gradient is distributed among k workers.
• Workers can be different computers, CPUs or GPUs.
• Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop.
Parallel synchronous gradient descent
1. Choose n1, . . . , nk that sum to n.
2. Distribute the computation of ∇f(x) among the k nodes:

   ∇f(x) = (1/n) ∑_{i=1}^{n} ∇fi(x)
         = (1/k) [ (1/n1) ∑_{i=1}^{n1} ∇fi(x)  (done by worker 1)  + · · · + (1/nk) ∑_{i=n−nk+1}^{n} ∇fi(x)  (done by worker k) ]

3. A master node performs the gradient descent update

   x⁺ = x − γ∇f(x)

+ Trivial parallelization, same analysis as gradient descent.
- Synchronization step at every iteration (step 3).
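A possible sketch of this scheme with Python's multiprocessing module, on the same synthetic least-squares problem as above; the block partition and all constants are illustrative, not the setup used in the talk.

```python
import numpy as np
from multiprocessing import Pool

rng = np.random.RandomState(0)
A, b = rng.randn(1000, 10), rng.randn(1000)
gamma, k = 0.01, 4                               # step size and number of workers

def block_gradient(args):
    # average gradient over one block of samples (the work done by one node)
    x, block = args
    Ab, bb = A[block], b[block]
    return Ab.T @ (Ab @ x - bb) / len(block)

if __name__ == "__main__":
    x = np.zeros(10)
    blocks = np.array_split(np.arange(len(b)), k)        # blocks of sizes n_1, ..., n_k
    with Pool(k) as pool:
        for _ in range(100):
            # each worker computes its partial average gradient
            partial = pool.map(block_gradient, [(x, blk) for blk in blocks])
            grad = np.mean(partial, axis=0)              # master node averages ...
            x = x - gamma * grad                         # ... and performs the update
```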
Parallel synchronous SGD
Can also be extended to stochastic gradient descent.
1. Select k samples i1, . . . , ik uniformly at random.
2. Compute ∇f_{i_t} in parallel on worker t.
3. Perform the (mini-batch) stochastic gradient descent update

   x⁺ = x − γ (1/k) ∑_{t=1}^{k} ∇f_{i_t}(x)

+ Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent.
+ This is the kind of parallelization implemented in deep learning libraries (TensorFlow, PyTorch, Theano, etc.).
- Synchronization step at every iteration (step 3).
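A corresponding sketch of the mini-batch variant, where each worker computes one per-sample gradient ∇f_{i_t} and the master averages them; again, the quadratic fi and all constants are placeholders.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.RandomState(0)
A, b = rng.randn(1000, 10), rng.randn(1000)
gamma, k = 0.01, 8                       # step size, mini-batch size = number of workers

def partial_gradient(x, i):
    # gradient of f_i(x) = 1/2 * (a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(10)
with ThreadPoolExecutor(max_workers=k) as ex:
    for _ in range(1000):
        idx = rng.randint(len(b), size=k)                          # i_1, ..., i_k
        grads = list(ex.map(lambda i: partial_gradient(x, i), idx))
        x = x - gamma * np.mean(grads, axis=0)                     # synchronous update
```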
Asynchronous algorithms
Asynchronous SGD
Synchronization is the bottleneck.
What if we just ignore it?
Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients.
In theory: convergence under very strong assumptions.
In practice: it just works.
Hogwild in more detail
Each core follows the same procedure
1. Read the current iterate x̂ from shared memory.
2. Sample i ∈ {1, . . . , n} uniformly at random.
3. Compute the partial gradient ∇fi(x̂).
4. Write the SGD update to shared memory: x = x − γ∇fi(x̂).
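A toy lock-free sketch of this loop with Python threads and a shared NumPy array (because of the GIL this is only a pedagogical illustration of the read/sample/compute/write pattern, not a faithful Hogwild implementation; data and step size are invented):

```python
import threading
import numpy as np

rng = np.random.RandomState(0)
A, b = rng.randn(1000, 10), rng.randn(1000)
gamma, n_cores, n_steps = 0.01, 4, 5000

x = np.zeros(10)                                 # shared vector of coefficients, no locks

def hogwild_worker(seed):
    local_rng = np.random.RandomState(seed)
    for _ in range(n_steps):
        x_hat = x.copy()                         # 1. read shared memory
        i = local_rng.randint(len(b))            # 2. sample i uniformly at random
        grad = (A[i] @ x_hat - b[i]) * A[i]      # 3. partial gradient at x_hat
        x[:] -= gamma * grad                     # 4. write the update, without locking

threads = [threading.Thread(target=hogwild_worker, args=(s,)) for s in range(n_cores)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```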
Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives (Emilie already mentioned some).
Looking for excitement? ...
analyze asynchronous methods!
Analysis of asynchronous methods
Simple things become counter-intuitive, e.g., how do we name the iterates?
The iterates will change depending on the speed of the processors.
Naming scheme in Hogwild
Simple, intuitive and wrong
Each time a core finishes writing to shared memory, increment the iteration counter.
⟺ x̂_t = (t + 1)-th successful update to shared memory.
The values of x̂_t and i_t are not determined until the iteration has finished.
⟹ x̂_t and i_t are not necessarily independent.
Unbiased gradient estimate
SGD-like algorithms crucially rely on the unbiased property
   Ei[∇fi(x)] = ∇f(x).

For synchronous algorithms, this follows from the uniform sampling of i:

   Ei[∇fi(x)] = ∑_{i=1}^{n} Prob(selecting i) ∇fi(x)  (uniform sampling)  = ∑_{i=1}^{n} (1/n) ∇fi(x) = ∇f(x)
A problematic example
This labeling scheme is incompatible with the unbiasedness assumption used in the proofs.
Illustration: a problem with two samples and two cores, f = ½(f1 + f2).
Computing ∇f1 is much more expensive than computing ∇f2.
Start at x0. Because of the random sampling there are 4 possible scenarios:
1. Core 1 selects f1, Core 2 selects f1 ⟹ x1 = x0 − γ∇f1(x0)
2. Core 1 selects f1, Core 2 selects f2 ⟹ x1 = x0 − γ∇f2(x0)
3. Core 1 selects f2, Core 2 selects f1 ⟹ x1 = x0 − γ∇f2(x0)
4. Core 1 selects f2, Core 2 selects f2 ⟹ x1 = x0 − γ∇f2(x0)
So we have

   Ei[∇fi] = ¼ ∇f1 + ¾ ∇f2 ≠ ½ ∇f1 + ½ ∇f2 !!
The Art of Naming Things
A new labeling scheme
A new way to name the iterates.
“After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment the counter each time we read the vector of coefficients from shared memory.
⟹ No dependency between i_t and the cost of computing ∇f_{i_t}.
Full analysis of Hogwild and other asynchronous methods in “Improved parallel stochastic optimization analysis for incremental methods”, Leblond, P., and Lacoste-Julien (submitted).
Asynchronous SAGA
The SAGA algorithm
Setting:

   minimize_x (1/n) ∑_{i=1}^{n} fi(x)

The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014). Select i ∈ {1, . . . , n} and compute (x⁺, α⁺) as

   x⁺ = x − γ(∇fi(x) − αi + ᾱ) ;   αi⁺ = ∇fi(x)

where ᾱ = (1/n) ∑_{j=1}^{n} αj is the average of the memory terms.
• Like SGD, the update is unbiased, i.e., Ei[∇fi(x) − αi + ᾱ] = ∇f(x).
• Unlike SGD, because of the memory terms α, the variance of the update → 0.
• Unlike SGD, it converges with a fixed step size (1/3L).
Super easy to use in scikit-learn (see the snippet below).
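For instance, a minimal scikit-learn snippet that uses the SAGA solver; the synthetic dataset and regularization value are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=100, random_state=0)

# solver="saga" selects the SAGA solver in scikit-learn
clf = LogisticRegression(solver="saga", penalty="l2", C=1.0, max_iter=100)
clf.fit(X, y)
print(clf.score(X, y))
```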
Sparse SAGA
Need for a sparse variant of SAGA
• Many large-scale datasets are sparse.
• For sparse datasets and generalized linear models (e.g., least
squares, logistic regression, etc.), partial gradients ∇fi are sparse
too.
• Asynchronous algorithms work best when updates are sparse.
SAGA update is inefficient for sparse data:

   x⁺ = x − γ( ∇fi(x) [sparse] − αi [sparse] + ᾱ [dense!] ) ;   αi⁺ = ∇fi(x)

[scikit-learn uses many tricks to make this efficient that we cannot use in the asynchronous version.]
Sparse SAGA
Sparse variant of SAGA. It relies on:
• a diagonal matrix Pi = projection onto the support of ∇fi,
• a diagonal matrix D defined as Dj,j = n / (number of i for which ∇_j fi is nonzero).

Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017):

   x⁺ = x − γ(∇fi(x) − αi + Pi D ᾱ) ;   αi⁺ = ∇fi(x)

• All operations are sparse; the cost per iteration is O(number of nonzeros in ∇fi).
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
• Crucial property: Ei[Pi D] = I.
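A sketch of one Sparse SAGA update, where supports[i] plays the role of Pi and the array d stores the diagonal of D; the data structures and names are assumptions made for the illustration, not the authors' implementation.

```python
import numpy as np

def sparse_saga_step(x, alpha, alpha_bar, i, grad_fi, supports, d, gamma):
    """One Sparse SAGA update; every operation touches only the support of grad f_i.

    x         : iterate (dense array, modified in place)
    alpha     : memory table, alpha[i] holds the last gradient of f_i on its support
    alpha_bar : running average of the memory terms (dense array)
    grad_fi   : function returning the partial gradient of f_i restricted to its support
    supports  : supports[i] = indices where grad f_i can be nonzero (defines P_i)
    d         : d[j] = n / (number of f_i whose gradient touches coordinate j), diagonal of D
    """
    S = supports[i]
    g = grad_fi(x, i)                                   # sparse partial gradient, shape (len(S),)
    # x+ = x - gamma * (grad f_i(x) - alpha_i + P_i D alpha_bar), restricted to S
    x[S] -= gamma * (g - alpha[i] + d[S] * alpha_bar[S])
    # update the memory term and its running average, on the support only
    alpha_bar[S] += (g - alpha[i]) / len(alpha)
    alpha[i] = g
    return x, alpha, alpha_bar
```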
Asynchronous SAGA (ASAGA)
• Each core runs an instance of Sparse SAGA.
• All cores update the same vector of coefficients and memory terms α, ᾱ in shared memory.
Theory: under standard assumptions (bounded delays), same convergence rate as the sequential version.
⟹ theoretical linear speedup with respect to the number of cores.
Experiments
• Improved convergence of variance-reduced methods wrt SGD.
• Significant improvement between 1 and 10 cores.
• Speedup is significant, but far from ideal.
Non-smooth problems
Composite objective
Previous methods assume the objective function is smooth.
They cannot be applied to the Lasso, Group Lasso, box constraints, etc.
Objective: minimize a composite objective function

   minimize_x (1/n) ∑_{i=1}^{n} fi(x) + ‖x‖₁

where fi is smooth (and ‖·‖₁ is not). For simplicity we take the nonsmooth term to be the ℓ1 norm, but this generalizes to any convex function for which we have access to the proximal operator.
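For reference, the proximal operator of γ‖·‖₁ is coordinate-wise soft-thresholding; a minimal NumPy version (the example vector is arbitrary):

```python
import numpy as np

def prox_l1(x, threshold):
    """prox of threshold * ||.||_1: soft-thresholding applied coordinate-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

# example: prox_{gamma * ||.||_1}(z) with gamma = 0.1
z = np.array([0.5, -0.05, 0.2, -0.3])
print(prox_l1(z, 0.1))    # [ 0.4 -0.   0.1 -0.2]
```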
(Prox)SAGA
The ProxSAGA update is inefficient:

   x⁺ = prox_{γh} [dense!] ( x − γ( ∇fi(x) [sparse] − αi [sparse] + ᾱ [dense!] ) ) ;   αi⁺ = ∇fi(x)

⟹ a sparse variant is needed as a prerequisite for a practical parallel method.
Sparse Proximal SAGA
Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:

   vi = ∇fi(x) − αi + D Pi ᾱ ;   x⁺ = prox_{γφi}(x − γvi) ;   αi⁺ = ∇fi(x)

where Pi and D are as in Sparse SAGA and φi := ∑_{j=1}^{d} (Pi D)j,j |xj|.
φi has two key properties: i) support of φi = support of ∇fi (sparse updates), and ii) Ei[φi] = ‖x‖₁ (unbiasedness).
Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
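A sketch of one Sparse Proximal SAGA update: since φi is a weighted ℓ1 norm supported on the support of ∇fi, its proximal operator is a coordinate-wise soft-thresholding with weights (Pi D)j,j. As before, the data structures and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sparse_prox_saga_step(x, alpha, alpha_bar, i, grad_fi, supports, d, gamma):
    """One Sparse Proximal SAGA update for the l1-regularized problem."""
    S = supports[i]
    g = grad_fi(x, i)                                   # partial gradient on its support S
    # v_i = grad f_i(x) - alpha_i + D P_i alpha_bar   (restricted to S)
    v = g - alpha[i] + d[S] * alpha_bar[S]
    # x+ = prox_{gamma * phi_i}(x - gamma * v_i): weighted soft-thresholding on S
    z = x[S] - gamma * v
    weights = gamma * d[S]                              # thresholds gamma * (P_i D)_{j,j}
    x[S] = np.sign(z) * np.maximum(np.abs(z) - weights, 0.0)
    # memory updates, on the support only
    alpha_bar[S] += (g - alpha[i]) / len(alpha)
    alpha[i] = g
    return x, alpha, alpha_bar
```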
Proximal Asynchronous SAGA (ProxASAGA)
Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α and ᾱ in shared memory.
All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading or writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm ⟹ theoretical linear speedup with respect to the number of cores.
Empirical results
ProxASAGA vs competing methods on 3 large-scale datasets,
ℓ1-regularized logistic regression
Dataset   | n           | p          | density  | L     | ∆
KDD 2010  | 19,264,097  | 1,163,024  | 10⁻⁶     | 28.12 | 0.15
KDD 2012  | 149,639,105 | 54,686,452 | 2 × 10⁻⁷ | 1.25  | 0.85
Criteo    | 45,840,617  | 1,000,000  | 4 × 10⁻⁵ | 1.25  | 0.89

[Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets, comparing ProxASAGA, AsySPCD and FISTA, each with 1 core and 10 cores.]
Empirical results - Speedup
Speedup = (time to reach 10⁻¹⁰ suboptimality on one core) / (time to reach the same suboptimality on k cores)

[Figure: time speedup vs. number of cores (2 to 20) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, compared against the ideal linear speedup.]

• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
Perspectives
• Scale above 20 cores.
• Asynchronous optimization on the GPU.
• Acceleration.
• Software development.
Codes
Code is on GitHub: https://github.com/fabianp/ProxASAGA. The computational code is C++ (it uses the atomic type) but is wrapped in Python.
A very efficient implementation of SAGA can be found in the scikit-learn and lightning (https://github.com/scikit-learn-contrib/lightning) libraries.
References
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d’équations simultanées”. In: Comp. Rend. Sci. Paris.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Glowinski, Roland and A Marroco (1975). “Sur l’approximation, par éléments finis d’ordre un, et la
résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. In:
Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique.
Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In:
Advances in Neural Information Processing Systems 27.
Leblond, Rémi, Fabian P., and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel SAGA”. In:
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics
(AISTATS 2017).
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Peng, Zhimin et al. (2016). “ARock: an algorithmic framework for asynchronous parallel coordinate
updates”. In: SIAM Journal on Scientific Computing.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math.
Statist.
Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-efficient distributed
optimization using an approximate newton-type method”. In: International conference on
machine learning.
Supervised Machine Learning
Data: n observations (ai, bi) ∈ R^p × R
Prediction function: h(a, x) ∈ R
Motivating examples:
• Linear prediction: h(a, x) = xᵀa
• Neural networks: h(a, x) = xmᵀ σ(xm−1 σ(· · · x2ᵀ σ(x1ᵀ a)))
[Figure: feed-forward network with an input layer, a hidden layer and an output layer.]
Minimize some distance (e.g., quadratic) between the prediction and the label:

   minimize_x (1/n) ∑_{i=1}^{n} ℓ(bi, h(ai, x)) =: (1/n) ∑_{i=1}^{n} fi(x)

where popular examples of ℓ are
• the squared loss, ℓ(bi, h(ai, x)) := (bi − h(ai, x))²
• the logistic (softmax) loss, ℓ(bi, h(ai, x)) := log(1 + exp(−bi h(ai, x)))
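As a quick illustration, the two losses written as Python functions for the linear prediction h(a, x) = xᵀa (function names chosen here for the example):

```python
import numpy as np

def squared_loss(x, a, b):
    # l(b, h(a, x)) = (b - a^T x)^2
    return (b - a @ x) ** 2

def logistic_loss(x, a, b):
    # l(b, h(a, x)) = log(1 + exp(-b * a^T x)), with label b in {-1, +1}
    return np.log1p(np.exp(-b * (a @ x)))
```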
Sparse Proximal SAGA
For step size γ = 1/(5L) and f μ-strongly convex (μ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have

   E‖xt − x*‖² ≤ (1 − (1/5) min{1/n, 1/κ})^t C₀ ,

with C₀ = ‖x₀ − x*‖² + (1/(5L²)) ∑_{i=1}^{n} ‖αi⁰ − ∇fi(x*)‖² and κ = L/μ (the condition number).
Implications
• Same convergence rate as SAGA, with cheaper updates.
• In the “big data regime” (n ≥ κ): rate in O(1/n).
• In the “ill-conditioned regime” (n ≤ κ): rate in O(1/κ).
• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
Convergence ProxASAGA
Suppose τ ≤ 1/(10√∆). Then:
• If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ).
• If κ < n, then using the step size γ = 1/(36nμ), ProxASAGA converges geometrically with rate factor Ω(1/n).
In both cases, the convergence rate is the same as for Sparse Proximal SAGA ⟹ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ.
If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to that of Sparse Proximal SAGA, making it adaptive to local strong convexity (knowledge of κ not required).
ASAGA algorithm
ProxASAGA algorithm
Atomic vs non-atomic
Mais conteúdo relacionado

Mais procurados

Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmKatsuki Ohto
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEHONGJOO LEE
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBOYoonho Lee
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVictor Pillac
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnGilles Louppe
 
010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian ProcessHa Phuong
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMarjan Sterjev
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data ScienceAlbert Bifet
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...Nesreen K. Ahmed
 
Sampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkSampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkNesreen K. Ahmed
 

Mais procurados (20)

Introduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithmIntroduction of "TrailBlazer" algorithm
Introduction of "TrailBlazer" algorithm
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 
Scalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven ApplicationsScalable Link Discovery for Modern Data-Driven Applications
Scalable Link Discovery for Modern Data-Driven Applications
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 
VRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRPVRP2013 - Comp Aspects VRP
VRP2013 - Comp Aspects VRP
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Europy17_dibernardo
Europy17_dibernardoEuropy17_dibernardo
Europy17_dibernardo
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 
010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process010_20160216_Variational Gaussian Process
010_20160216_Variational Gaussian Process
 
Multiclass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark ExamplesMulticlass Logistic Regression: Derivation and Apache Spark Examples
Multiclass Logistic Regression: Derivation and Apache Spark Examples
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
The Power of Motif Counting Theory, Algorithms, and Applications for Large Gr...
 
Sampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkSampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying Framework
 
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
 

Semelhante a Parallel Optimization in Machine Learning

Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsFabian Pedregosa
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues listsJames Wong
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
StacksqueueslistsFraboni Ec
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsYoung Alista
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsTony Nguyen
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsHarry Potter
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisXin-She Yang
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Xin-She Yang
 
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...Association for Computational Linguistics
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...Viach Kakovskyi
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
Differential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow ModellingDifferential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow ModellingSerge Hoogendoorn
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelineChenYiHuang5
 
Deep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxDeep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxFreefireGarena30
 

Semelhante a Parallel Optimization in Machine Learning (20)

Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
 
Chapter 1 - Introduction
Chapter 1 - IntroductionChapter 1 - Introduction
Chapter 1 - Introduction
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
Stacksqueueslists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical Analysis
 
Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms Nature-Inspired Optimization Algorithms
Nature-Inspired Optimization Algorithms
 
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
Jindřich Libovický - 2017 - Attention Strategies for Multi-Source Sequence-...
 
50120140503004
5012014050300450120140503004
50120140503004
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
XGBoostLSS - An extension of XGBoost to probabilistic forecasting, Alexander ...
 
How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...How to easily find the optimal solution without exhaustive search using Genet...
How to easily find the optimal solution without exhaustive search using Genet...
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
Differential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow ModellingDifferential game theory for Traffic Flow Modelling
Differential game theory for Traffic Flow Modelling
 
Paper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipelinePaper Study: Melding the data decision pipeline
Paper Study: Melding the data decision pipeline
 
Deep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptxDeep learning Unit1 BasicsAllllllll.pptx
Deep learning Unit1 BasicsAllllllll.pptx
 

Mais de Fabian Pedregosa

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Fabian Pedregosa
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimationFabian Pedregosa
 
Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator SplittingFabian Pedregosa
 
Sufficient decrease is all you need
Sufficient decrease is all you needSufficient decrease is all you need
Sufficient decrease is all you needFabian Pedregosa
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in pythonFabian Pedregosa
 

Mais de Fabian Pedregosa (8)

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimation
 
Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator Splitting
 
Sufficient decrease is all you need
Sufficient decrease is all you needSufficient decrease is all you need
Sufficient decrease is all you need
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
 

Último

GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxabhishekdhamu51
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...Lokesh Kothari
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 

Último (20)

GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 

Parallel Optimization in Machine Learning

  • 1. Parallel Optimization in Machine Learning Fabian Pedregosa December 19, 2017 Huawei Paris Research Center
  • 2. About me • Engineer (2010-2012), Inria Saclay (scikit-learn kickstart). • PhD (2012-2015), Inria Saclay. • Postdoc (2015-2016), Dauphine–ENS–Inria Paris. • Postdoc (2017-present), UC Berkeley - ETH Zurich (Marie-Curie fellowship, European Commission) Hacker at heart ... trapped in a researcher’s body. 1/32
  • 3. Motivation Computer add in 1993 Computer add in 2006 What has changed? 2/32
  • 4. Motivation Computer add in 1993 Computer add in 2006 What has changed? 2006 = no longer mentions to speed of processors. 2/32
  • 5. Motivation Computer add in 1993 Computer add in 2006 What has changed? 2006 = no longer mentions to speed of processors. Primary feature: number of cores. 2/32
  • 6. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. 3/32
  • 7. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. 3/32
  • 8. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. 3/32
  • 9. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. Parallel algorithms needed to take advantage of modern CPUs. 3/32
  • 10. Parallel optimization Parallel algorithms can be divided into two large categories: synchronous and asynchronous. Image credits: (Peng et al. 2016) Synchronous methods  Easy to implement (i.e., developed software packages).  Well understood.  Limited speedup due to synchronization costs. Asynchronous methods  Faster, typically larger speedups.  Not well understood, large gap between theory and practice.  No mature software solutions. 4/32
  • 11. Outline Synchronous methods • Synchronous (stochastic) gradient descent. Asynchronous methods • Asynchronous stochastic gradient descent (Hogwild) (Niu et al. 2011) • Asynchronous variance-reduced stochastic methods (Leblond, P., and Lacoste-Julien 2017), (Pedregosa, Leblond, and Lacoste-Julien 2017). • Analysis of asynchronous methods. • Codes and implementation aspects. Leaving out many parallel synchronous methods: ADMM (Glowinski and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro, and Zhang 2014), to name a few. 5/32
  • 12. Outline Most of the following is joint work with Rémi Leblond and Simon Lacoste-Julien Rémi Leblond Simon Lacoste–Julien 6/32
  • 14. Optimization for machine learning Large part of problems in machine learning can be framed as optimization problems of the form minimize x f(x) def = 1 n n∑ i=1 fi(x) Gradient descent (Cauchy 1847). Descend along steepest direction (−∇f(x)) x+ = x − γ∇f(x) Stochastic gradient descent (SGD) (Robbins and Monro 1951). Select a random index i and descent along − ∇fi(x): x+ = x − γ∇fi(x) images source: Francis Bach 7/32
  • 15. Parallel synchronous gradient descent Computation of gradient is distributed among k workers • Workers can be: different computers, CPUs or GPUs • Popular frameworks: Spark, Tensorflow, PyTorch, neHadoop. 8/32
  • 16. Parallel synchronous gradient descent 1. Choose n1, . . . nk that sum to n. 2. Distribute computation of ∇f(x) among k nodes ∇f(x) = 1 n ∑ i=1 ∇fi(x) = 1 k ( 1 n1 n1∑ i=1 ∇fi(x) done by worker 1 + . . . + 1 n1 nk∑ i=nk−1 ∇fi(x) done by worker k ) 3. Perform the gradient descent update by a master node x+ = x − γ∇f(x) 9/32
  • 17. Parallel synchronous gradient descent 1. Choose n1, . . . nk that sum to n. 2. Distribute computation of ∇f(x) among k nodes ∇f(x) = 1 n ∑ i=1 ∇fi(x) = 1 k ( 1 n1 n1∑ i=1 ∇fi(x) done by worker 1 + . . . + 1 n1 nk∑ i=nk−1 ∇fi(x) done by worker k ) 3. Perform the gradient descent update by a master node x+ = x − γ∇f(x)  Trivial parallelization, same analysis as gradient descent.  Synchronization step every iteration (3.). 9/32
  • 18. Parallel synchronous SGD Can also be extended to stochastic gradient descent. 1. Select k samples i0, . . . , ik uniformly at random. 2. Compute in parallel ∇fit on worker t 3. Perform the (mini-batch) stochastic gradient descent update x+ = x − γ 1 k k∑ t=1 ∇fit (x) 10/32
  • 19. Parallel synchronous SGD Can also be extended to stochastic gradient descent. 1. Select k samples i0, . . . , ik uniformly at random. 2. Compute in parallel ∇fit on worker t 3. Perform the (mini-batch) stochastic gradient descent update x+ = x − γ 1 k k∑ t=1 ∇fit (x)  Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent.  The kind of parallelization that is implemented in deep learning libraries (tensorflow, PyTorch, Thano, etc.).  Synchronization step every iteration (3.). 10/32
  • 21. Asynchronous SGD Synchronization is the bottleneck.  What if we just ignore it? 11/32
  • 22. Asynchronous SGD Synchronization is the bottleneck.  What if we just ignore it? Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients. In theory: convergence under very strong assumptions. In practice: just works. 11/32
  • 23. Hogwild in more detail Each core follows the same procedure 1. Read the information from shared memory ˆx. 2. Sample i ∈ {1, . . . , n} uniformly at random. 3. Compute partial gradient ∇fi(ˆx). 4. Write the SGD update to shared memory x = x − γ∇fi(ˆx). 12/32
  • 24. Hogwild is fast Hogwild can be very fast. But its still SGD... • With constant step size, bounces around the optimum. • With decreasing step size, slow convergence. • There are better alternatives (Emilie already mentioned some) 13/32
  • 25. Looking for excitement? ... analyze asynchronous methods!
  • 26. Analysis of asynchronous methods Simple things become counter-intuitive, e.g, how to name the iterates?  Iterates will change depending on the speed of processors 14/32
  • 27. Naming scheme in Hogwild Simple, intuitive and wrong Each time a core has finished writing to shared memory, increment iteration counter. ⇐⇒ ˆxt = (t + 1)-th succesfull update to shared memory. Value of ˆxt and it are not determined until the iteration has finished. =⇒ ˆxt and it are not necessarily independent. 15/32
  • 28. Unbiased gradient estimate SGD-like algorithms crucially rely on the unbiased property Ei[∇fi(x)] = ∇f(x). For synchronous algorithms, follows from the uniform sampling of i Ei[∇fi(x)] = n∑ i=1 Proba(selecting i)∇fi(x) uniform sampling = n∑ i=1 1 n ∇fi(x) = ∇f(x) 16/32
  • 29. A problematic example This labeling scheme is incompatible with unbiasedness assumption used in proofs. 17/32
  • 30. A problematic example This labeling scheme is incompatible with unbiasedness assumption used in proofs. Illustration: problem with two samples and two cores f = 1 2 (f1 + f2). Computing ∇f1 is much expensive than ∇f2. 17/32
  • 31. A problematic example This labeling scheme is incompatible with unbiasedness assumption used in proofs. Illustration: problem with two samples and two cores f = 1 2 (f1 + f2). Computing ∇f1 is much expensive than ∇f2. Start at x0. Because of the random sampling there are 4 possible scenarios: 1. Core 1 selects f1, Core 2 selects f1 =⇒ x1 = x0 − γ∇f1(x) 2. Core 1 selects f1, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x) 3. Core 1 selects f2, Core 2 selects f1 =⇒ x1 = x0 − γ∇f2(x) 4. Core 1 selects f2, Core 2 selects f2 =⇒ x1 = x0 − γ∇f2(x) So we have Ei [∇fi] = 1 4 f1 + 3 4 f2 ̸= 1 2 f1 + 1 2 f2 !! 17/32
  • 32. The Art of Naming Things
  • 33. A new labeling scheme  New way to name iterates. “After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment counter each time we read the vector of coefficients from shared memory. 18/32
  • 34. A new labeling scheme  New way to name iterates. “After read” labeling (Leblond, P., and Lacoste-Julien 2017). Increment counter each time we read the vector of coefficients from shared memory.  No dependency between it and the cost of computing ∇fit .  Full analysis of Hogwild and other asynchronous methods in “Improved parallel stochastic optimization analysis for incremental methods”, Leblond, P., and Lacoste-Julien (submitted). 18/32
  • 36. The SAGA algorithm Setting: minimize x 1 n n∑ i=1 fi(x) The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014). Select i ∈ {1, . . . , n} and compute (x+ , α+ ) as x+ = x − γ(∇fi(x) − αi + α) ; α+ i = ∇fi(x) • Like SGD, update is unbiased, i.e., Ei[∇fi(x) − αi + α)] = ∇f(x). • Unlike SGD, because of memory terms α, variance → 0. • Unlike SGD, converges with fixed step size (1/3L) 19/32
  • 37. The SAGA algorithm Setting: minimize x 1 n n∑ i=1 fi(x) The SAGA algorithm (Defazio, Bach, and Lacoste-Julien 2014). Select i ∈ {1, . . . , n} and compute (x+ , α+ ) as x+ = x − γ(∇fi(x) − αi + α) ; α+ i = ∇fi(x) • Like SGD, update is unbiased, i.e., Ei[∇fi(x) − αi + α)] = ∇f(x). • Unlike SGD, because of memory terms α, variance → 0. • Unlike SGD, converges with fixed step size (1/3L) Super easy to use in scikit-learn 19/32
  • 38. Sparse SAGA Need for a sparse variant of SAGA • A large part of large scale datasets are sparse. • For sparse datasets and generalized linear models (e.g., least squares, logistic regression, etc.), partial gradients ∇fi are sparse too. • Asynchronous algorithms work best when updates are sparse. SAGA update is inefficient for sparse data x+ = x − γ(∇fi(x) sparse − αi sparse + α dense! ) ; α+ i = ∇fi(x) [scikit-learn uses many tricks to make it efficient that we cannot use in asynchronous version] 20/32
  • 39. Sparse SAGA Sparse variant of SAGA. Relies on • Diagonal matrix Pi = projection onto the support of ∇fi • Diagonal matrix D defined as Dj,j = n/number of times ∇jfi is nonzero. Sparse SAGA algorithm (Leblond, P., and Lacoste-Julien 2017) x+ = x − γ(∇fi(x) − αi + PiDα) ; α+ i = ∇fi(x) • All operations are sparse, cost per iteration is O(nonzeros in ∇fi). • Same convergence properties than SAGA, but with cheaper iterations in presence of sparsity. • Crucial property: Ei[PiD] = I. 21/32
  • 40. Asynchronous SAGA (ASAGA) • Each core runs an instance of Sparse SAGA. • Updates the same vector of coefficients α, α. Theory: Under standard assumptions (bounded dalays), same convergence rate than sequential version. =⇒ theoretical linear speedup with respect to number of cores. 22/32
  • 41. Experiments • Improved convergence of variance-reduced methods wrt SGD. • Significant improvement between 1 and 10 cores. • Speedup is significant, but far from ideal. 23/32
  • 43. Composite objective Previous methods assume objective function is smooth. Cannot be applied to Lasso, Group Lasso, box constraints, etc. Objective: minimize composite objective function: minimize x 1 n n∑ i=1 fi(x) + ∥x∥1 where fi is smooth (and ∥ · ∥1 is not). For simplicity we consider the nonsmooth term to be ℓ1 norm, but this is general to any convex function for which we have access to its proximal operator. 24/32
  • 44. (Prox)SAGA The ProxSAGA update is inefficient x+ = proxγh dense! (x − γ(∇fi(x) sparse − αi sparse + α dense! )) ; α+ i = ∇fi(x) =⇒ a sparse variant is needed as a prerequisite for a practical parallel method. 25/32
  • 45. Sparse Proximal SAGA Sparse Proximal SAGA. (Pedregosa, Leblond, and Lacoste-Julien 2017) Extension of Sparse SAGA to composite optimization problems 26/32
• 46. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate vi = ∇fi(x) − αi + D Pi ᾱ ; 26/32
• 47. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate and a proximal step vi = ∇fi(x) − αi + D Pi ᾱ ; x+ = prox_{γφi}(x − γvi) ; α+_i = ∇fi(x) 26/32
• 48. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate and a proximal step vi = ∇fi(x) − αi + D Pi ᾱ ; x+ = prox_{γφi}(x − γvi) ; α+_i = ∇fi(x) where Pi, D are as in Sparse SAGA and φi(x) def= ∑_{j=1}^d (Pi D)_{j,j} |xj|. φi has two key properties: i) support of φi = support of ∇fi (sparse updates) and ii) Ei[φi(x)] = ∥x∥1 (unbiasedness) 26/32
• 49. Sparse Proximal SAGA Sparse Proximal SAGA (Pedregosa, Leblond, and Lacoste-Julien 2017). Extension of Sparse SAGA to composite optimization problems. Like SAGA, it relies on an unbiased gradient estimate and a proximal step vi = ∇fi(x) − αi + D Pi ᾱ ; x+ = prox_{γφi}(x − γvi) ; α+_i = ∇fi(x) where Pi, D are as in Sparse SAGA and φi(x) def= ∑_{j=1}^d (Pi D)_{j,j} |xj|. φi has two key properties: i) support of φi = support of ∇fi (sparse updates) and ii) Ei[φi(x)] = ∥x∥1 (unbiasedness) Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity. 26/32
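Putting the pieces together, here is a hedged sketch of a single Sparse Proximal SAGA step for the squared-loss, ℓ1-regularized case (my variable names; not the paper's C++ code). Since the weights (Pi D)j,j vanish outside the support of ∇fi, the prox of γφi reduces to soft-thresholding on that support, so the whole step stays sparse.

```python
# One Sparse Proximal SAGA step (illustrative sketch; squared loss + l1 penalty).
# A: SciPy CSR data matrix, b: targets; alpha, alpha_bar, D as in Sparse SAGA.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_prox_saga_step(A, b, x, alpha, alpha_bar, D, i, gamma):
    n = A.shape[0]
    row = A.getrow(i)
    S = row.indices                                    # support of grad f_i
    grad_S = (row.dot(x)[0] - b[i]) * row.data         # partial gradient on S
    v_S = grad_S - alpha[i, S] + D[S] * alpha_bar[S]   # unbiased estimate v_i on S
    # prox of gamma*phi_i = weighted soft-thresholding with weights gamma*(P_i D)_jj;
    # the weights are zero outside S, so those coordinates are left untouched.
    x[S] = soft_threshold(x[S] - gamma * v_S, gamma * D[S])
    alpha_bar[S] += (grad_S - alpha[i, S]) / n
    alpha[i, S] = grad_S
```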
• 50. Proximal Asynchronous SAGA (ProxASAGA) Each core runs Sparse Proximal SAGA asynchronously, without locks, and updates x, α and ᾱ in shared memory. • All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading/writing. Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm =⇒ theoretical linear speedup with respect to the number of cores. 27/32
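As an access-pattern illustration only (not ProxASAGA itself): the sketch below has several Python threads performing lock-free, coordinate-wise updates to a shared NumPy array. CPython's GIL serializes the work, so this shows inconsistent reads/writes without vector-level locks rather than any speedup; the implementation from the paper is C++ with atomic types.

```python
# Pattern-only sketch of lock-free shared-memory updates with Python threads.
import threading
import numpy as np

p, n_threads, steps = 1_000, 4, 50_000
targets = np.random.default_rng(0).normal(size=p)
x = np.zeros(p)                     # shared iterate, read and written without locks

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        j = rng.integers(p)                    # sparse update: a single coordinate
        x[j] -= 0.1 * (x[j] - targets[j])      # inconsistent read-modify-write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("max error:", np.max(np.abs(x - targets)))
```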
• 51. Empirical results ProxASAGA vs competing methods on 3 large-scale datasets, ℓ1-regularized logistic regression:

Dataset   | n           | p          | density   | L     | ∆
KDD 2010  | 19,264,097  | 1,163,024  | 10^-6     | 28.12 | 0.15
KDD 2012  | 149,639,105 | 54,686,452 | 2 × 10^-7 | 1.25  | 0.85
Criteo    | 45,840,617  | 1,000,000  | 4 × 10^-5 | 1.25  | 0.89

[Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, each run on 1 and 10 cores.] 28/32
• 52. Empirical results - Speedup Speedup = (time to 10^-10 suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: time speedup vs. number of cores (1-20) on the KDD10, KDD12 and Criteo datasets; curves for Ideal, ProxASAGA, AsySPCD and FISTA.] 29/32
• 53. Empirical results - Speedup Speedup = (time to 10^-10 suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: as above.] • ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture. 29/32
• 54. Empirical results - Speedup Speedup = (time to 10^-10 suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: as above.] • ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture. • As predicted by the theory, there is a high correlation between the degree of sparsity and the speedup. 29/32
  • 55. Perspectives • Scale above 20 cores. • Asynchronous optimization on the GPU. • Acceleration. • Software development. 30/32
• 56. Codes • Code is on GitHub: https://github.com/fabianp/ProxASAGA. The computational code is C++ (using atomic types) but is wrapped in Python. A very efficient implementation of SAGA can be found in the scikit-learn and lightning (https://github.com/scikit-learn-contrib/lightning) libraries. 31/32
• 57. References Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d’équations simultanées”. In: Comp. Rend. Sci. Paris. Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems. Glowinski, Roland and A. Marroco (1975). “Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires”. In: Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique. Jaggi, Martin et al. (2014). “Communication-Efficient Distributed Dual Coordinate Ascent”. In: Advances in Neural Information Processing Systems 27. Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). Niu, Feng et al. (2011). “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems. 31/32
• 58. Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30. Peng, Zhimin et al. (2016). “ARock: an algorithmic framework for asynchronous parallel coordinate updates”. In: SIAM Journal on Scientific Computing. Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist. Shamir, Ohad, Nati Srebro, and Tong Zhang (2014). “Communication-efficient distributed optimization using an approximate Newton-type method”. In: International Conference on Machine Learning. 32/32
• 59. Supervised Machine Learning Data: n observations (ai, bi) ∈ Rp × R Prediction function: h(a, x) ∈ R Motivating examples: • Linear prediction: h(a, x) = x^T a • Neural networks: h(a, x) = x_m^T σ(x_{m-1}^T σ(· · · x_2^T σ(x_1^T a) · · · )) [Diagram: feedforward neural network with an input layer (a1, . . . , a5), one hidden layer, and an output layer.]
• 60. Supervised Machine Learning Data: n observations (ai, bi) ∈ Rp × R Prediction function: h(a, x) ∈ R Motivating examples: • Linear prediction: h(a, x) = x^T a • Neural networks: h(a, x) = x_m^T σ(x_{m-1}^T σ(· · · x_2^T σ(x_1^T a) · · · )) Minimize some distance (e.g., quadratic) between the prediction and the observed value: minimize_x (1/n) ∑_{i=1}^n ℓ(bi, h(ai, x)) (notation) = (1/n) ∑_{i=1}^n fi(x), where popular examples of ℓ are • Squared loss: ℓ(bi, h(ai, x)) def= (bi − h(ai, x))^2 • Logistic (softmax): ℓ(bi, h(ai, x)) def= log(1 + exp(−bi h(ai, x)))
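A short numerical illustration of this finite-sum objective for the linear prediction case (synthetic data; all names are mine):

```python
# Finite-sum objective (1/n) sum_i l(b_i, h(a_i, x)) for a linear model,
# with the squared and logistic losses from the slide. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
A = rng.normal(size=(n, p))
b = rng.choice([-1.0, 1.0], size=n)
x = rng.normal(size=p)

pred = A @ x                                      # h(a_i, x) = x^T a_i
squared = np.mean((b - pred) ** 2)                # (1/n) sum (b_i - h(a_i, x))^2
logistic = np.mean(np.log1p(np.exp(-b * pred)))   # (1/n) sum log(1 + exp(-b_i h(a_i, x)))
print("squared loss:", squared, "logistic loss:", logistic)
```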
• 61. Sparse Proximal SAGA For step size γ = 1/(5L) and f µ-strongly convex (µ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have E∥xt − x∗∥^2 ≤ (1 − (1/5) min{1/n, 1/κ})^t C0, with C0 = ∥x0 − x∗∥^2 + (1/(5L^2)) ∑_{i=1}^n ∥α0_i − ∇fi(x∗)∥^2 and κ = L/µ (the condition number). Implications • Same convergence rate as SAGA, with cheaper updates. • In the “big data regime” (n ≥ κ): rate factor in O(1/n). • In the “ill-conditioned regime” (n ≤ κ): rate factor in O(1/κ). • Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
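As a back-of-the-envelope check on what this rate means in practice (illustrative problem sizes, ignoring the constant C0), the number of iterations needed to reach a suboptimality ε scales like 5 max(n, κ) log(1/ε):

```python
# Iterations t such that (1 - min(1/n, 1/kappa)/5)^t <= eps (constant C0 ignored).
import math

def iters_to_eps(n, kappa, eps=1e-10):
    rho = min(1.0 / n, 1.0 / kappa) / 5.0        # per-iteration rate factor
    return math.ceil(math.log(eps) / math.log(1.0 - rho))

print(iters_to_eps(n=1_000_000, kappa=10_000))   # big data regime: about 5*n*log(1/eps)
print(iters_to_eps(n=10_000, kappa=1_000_000))   # ill-conditioned: about 5*kappa*log(1/eps)
```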
• 62. Convergence ProxASAGA Suppose τ ≤ 1/(10√∆). Then: • If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ). • If κ < n, then using the step size γ = 1/(36nµ), ProxASAGA converges geometrically with rate factor Ω(1/n). In both cases the convergence rate is the same as Sparse Proximal SAGA =⇒ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ. If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to that of Sparse Proximal SAGA, making the method adaptive to local strong convexity (knowledge of κ not required).