Asynchronous Stochastic Optimization
New Analysis and Algorithms
Fabian Pedregosa
May 25, 2018. University of Washington
Where I Come From
ML / Optimization / Software guy.
Engineer (2010–2012): first contact with ML, developing the ML library scikit-learn.
ML and Neuroscience (2012–2015): PhD applying ML to neuroscience.
ML and Optimization (2015–): stochastic / parallel / constrained / hyperparameter optimization.
Outline
Goal: Review recent work in asynchronous parallel optimization for machine learning [1, 2].
1. Asynchronous parallel optimization, Asynchronous SGD.
2. Asynchronous variance-reduced optimization.
3. Analysis of asynchronous methods: What we can prove.

[1] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). "Improved asynchronous parallel optimization analysis for stochastic incremental methods". In: to appear in Journal of Machine Learning Research.
[2] Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization". In: Advances in Neural Information Processing Systems 30 (NIPS).
1. Asynchronous Optimization
40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• At the same time, the number of cores increases exponentially.
Parallel algorithms are needed to take advantage of modern CPUs.
Parallel Optimization: Not a new topic
• Most of the principles and methods are already in (Bertsekas and Tsitsiklis, 1989).
• For linear systems it can be traced even earlier (Arrow and Hurwicz, 1958).
Asynchronous vs Synchronous methods
Synchronous methods
• Wait for the slowest worker.
• Limited speedup due to synchronization cost.
Asynchronous methods
• Workers receive work as needed.
• Minimize idle time.
• Challenging analysis.
[Figure: timelines of 4 workers; synchronous execution leaves workers idle between barriers, asynchronous execution keeps them busy.]
Optimization for machine learning
Many problems in machine learning can be framed as

    minimize over x ∈ ℝᵖ:  f(x) := (1/n) ∑ᵢ fᵢ(x)

Gradient descent (Cauchy, 1847). Descend along the steepest direction:
    x⁺ = x − γ ∇f(x)
Stochastic gradient descent (SGD) (Robbins and Monro, 1951). Select a random i, descend along −∇fᵢ(x):
    x⁺ = x − γ ∇fᵢ(x)
Figure source: Francis Bach
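To make the two update rules concrete, below is a minimal sketch of gradient descent vs. SGD on a toy least-squares objective; the data, step size and loop length are illustrative choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))           # rows are the data points a_i
b = A @ np.ones(10)                          # targets, so the solution is x* = ones(10)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2      # conservative step size for this problem

x_gd, x_sgd = np.zeros(10), np.zeros(10)
for _ in range(1000):
    # gradient descent: full gradient of f(x) = (1/n) sum_i 0.5 (a_i^T x - b_i)^2
    x_gd -= gamma * A.T @ (A @ x_gd - b) / len(b)
    # SGD: gradient of a single randomly selected f_i
    i = rng.integers(len(b))
    x_sgd -= gamma * (A[i] @ x_sgd - b[i]) * A[i]

print(np.linalg.norm(x_gd - 1), np.linalg.norm(x_sgd - 1))
```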
Example: Asynchronous SGD (Tsitsiklis, Bertsekas, and Athans, 1986)
Recent revival due to applications in machine learning (Niu et al., 2011; Dean et al., 2012). Other names: Downpour SGD, Hogwild.

Problem: minimize over x:  f(x) := (1/n) ∑ᵢ fᵢ(x)

General Algorithm
All workers do in parallel:
1. Read the information in shared memory (x̂).
2. Sample i ∈ {1, . . . , n} and compute ∇fᵢ(x̂).
3. Perform the SGD update on shared memory: x = x − γ ∇fᵢ(x̂).

x and x̂ might be different.
Asynchronous SGD
• The write is performed with an old version of the coefficients.
• The update requires a lock on the vector of coefficients.
Hogwild! (Niu et al., 2011): Lock-free Async. SGD

Algorithm 1 Hogwild
1: loop
2:   x̂ = inconsistent read of x
3:   Sample i uniformly in {1, ..., n}
4:   Let Sᵢ be fᵢ's support
5:   [δx]_{Sᵢ} := −γ ∇fᵢ(x̂)
6:   for v in Sᵢ do
7:     [x]ᵥ ← [x]ᵥ + [δx]ᵥ   // atomic
8:   end for
9: end loop

• All read/write operations to shared memory are inconsistent, i.e., no vector-level locks while updating shared memory.
• Key assumption: sparse gradients (|Sᵢ| ≪ dimension).
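A rough Python sketch of this loop for sparse logistic regression is below. It follows the structure of the algorithm (inconsistent reads, updates restricted to the support Sᵢ, no vector-level lock), but the data, step size and worker function are illustrative; CPython threads do not give true parallelism, so this only shows the memory-access pattern.

```python
import numpy as np
import scipy.sparse as sp
from threading import Thread

def hogwild_worker(x, A, b, step_size, n_updates, seed):
    rng = np.random.default_rng(seed)
    n_samples = A.shape[0]
    for _ in range(n_updates):
        i = rng.integers(n_samples)
        row = A.getrow(i)                        # sparse row a_i, support S_i
        z = row.data @ x[row.indices]            # inconsistent read of x on S_i
        g = 1.0 / (1.0 + np.exp(-z)) - b[i]      # derivative of the logistic loss
        for idx, val in zip(row.indices, row.data):
            x[idx] -= step_size * g * val        # coordinate-wise, lock-free write

rng = np.random.default_rng(0)
A = sp.random(1000, 50, density=0.05, format="csr", random_state=0)
b = rng.integers(0, 2, size=1000).astype(float)
x = np.zeros(50)                                 # shared coefficient vector
threads = [Thread(target=hogwild_worker, args=(x, A, b, 0.1, 2000, t))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```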
Hogwild: when does it converge?
Sparse ∇fᵢ: is this a reasonable assumption?
• If fᵢ(x) = ϕ(aᵢᵀx) then ∇fᵢ(x) = aᵢ ϕ′(aᵢᵀx).
• Gradients are sparse whenever the data aᵢ is sparse.
• This is the case for generalized linear models (least squares, logistic regression, linear SVMs, etc.).
In this class of models, Hogwild enjoys almost linear speedups.
Figure 1: Speedup of Hogwild. Image source: (Niu et al., 2011)
Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives.
2. Asynchronous (Proximal) SAGA
Variance-reduced Stochastic Optimization
Problem: Finite sum

    minimize over x ∈ ℝᵖ:  (1/n) ∑ᵢ fᵢ(x),  where n < ∞

The SAGA algorithm (Defazio, Bach, and Lacoste-Julien, 2014)
Sample uniformly i ∈ {1, . . . , n} and compute (x⁺, α⁺) as
    x⁺ = x − γ (∇fᵢ(x) − αᵢ + ᾱ) ;   αᵢ⁺ = ∇fᵢ(x),
where the term in parentheses is the variance-reduced gradient estimate and ᾱ = (1/n) ∑ⱼ αⱼ is the average of the memory terms.
This variance-reduction technique is known under different names, e.g., control variates in Monte Carlo methods.
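A compact sketch of the sequential SAGA iteration, keeping a table of past gradients αᵢ and their running average, is shown below; the least-squares toy problem and all names are illustrative, not the talk's code.

```python
import numpy as np

def saga(grad_i, x0, n_samples, step_size, n_iter, rng=None):
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    alpha = np.array([grad_i(i, x) for i in range(n_samples)])  # gradient memory
    alpha_bar = alpha.mean(axis=0)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        g = grad_i(i, x)
        # unbiased, variance-reduced gradient estimate
        x -= step_size * (g - alpha[i] + alpha_bar)
        # update the memory term and its running average
        alpha_bar += (g - alpha[i]) / n_samples
        alpha[i] = g
    return x

# toy usage: least squares (1/n) sum_i 0.5 (a_i^T x - b_i)^2
A = np.random.default_rng(1).standard_normal((200, 10))
b = A @ np.ones(10)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
L = np.max(np.sum(A ** 2, axis=1))          # per-sample smoothness constant
x_hat = saga(grad_i, np.zeros(10), 200, 1.0 / (3 * L), 5000)
```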
The SAGA Algorithm
Theory: linear (i.e., exponential) convergence on strongly convex problems.
Practical algorithm: converges with a fixed step size 1/(3L).
[Figure: function suboptimality vs. time for SAGA, SGD with constant step size, and SGD with decreasing step size.]
Already used in scikit-learn.
Asynchronous SAGA
Motivation: Can we design an asynchronous version of SAGA?
The SAGA update is inefficient (without tricks) for sparse gradients:
    x⁺ = x − γ( ∇fᵢ(x) [sparse] − αᵢ [sparse] + ᾱ [dense!] )
Need for a sparse variant of SAGA
• Many large-scale datasets are sparse.
• Asynchronous algorithms work best when updates are sparse.
Sparse SAGA
We can get away with "sparsifying" the gradient estimate.
• Let Pᵢ be the projection onto support(∇fᵢ).
• Let Dᵢ = Pᵢ ((1/n) ∑ⱼ Pⱼ)⁻¹ (all these matrices are diagonal).
• Crucial property: Eᵢ[Dᵢ] = I.

Sparse SAGA algorithm [3]
Sample uniformly i ∈ {1, . . . , n} and compute (x⁺, α⁺) as
    x⁺ = x − γ( ∇fᵢ(x) − αᵢ + Dᵢ ᾱ ) ;   αᵢ⁺ = ∇fᵢ(x)

[3] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). "ASAGA: Asynchronous Parallel SAGA". In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
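A small numeric check of the rescaling matrices Dᵢ on a toy sparse design, verifying the crucial property Eᵢ[Dᵢ] = I; the dataset is synthetic and the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(50, 8, density=0.4, format="csr", random_state=0)
n, p = A.shape

# P_i: 0/1 diagonal marking the support of grad f_i (the nonzeros of row i)
P = (A.toarray() != 0).astype(float)          # row i holds diag(P_i)
p_avg = P.mean(axis=0)                        # diagonal of (1/n) sum_i P_i
assert np.all(p_avg > 0), "every feature must appear in some sample"

D = P / p_avg                                 # row i holds diag(D_i)
print(np.allclose(D.mean(axis=0), 1.0))       # E_i[D_i] = I  ->  True
```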
Sparse SAGA
• All operations are sparse; the cost per iteration is O(number of nonzeros in ∇fᵢ).
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
Proximal Sparse SAGA
Problem: Composite finite sum

    minimize over x ∈ ℝᵖ:  (1/n) ∑ᵢ fᵢ(x) + g(x),  where

• g is potentially nonsmooth (think λ‖·‖₁ or an indicator function) but we have access to prox_{γg}(x) = argmin_z { γ g(z) + ½ ‖x − z‖² }.
• For some g, the proximal operator is available in closed form. Examples: ℓ₁ norm (soft thresholding), indicator function (projection).
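The two closed-form examples mentioned above can be sketched in a few lines; `prox_l1` and `prox_box` are hypothetical helper names used only for illustration.

```python
import numpy as np

def prox_l1(x, step):
    """prox of step * ||.||_1: elementwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

def prox_box(x, lo=-1.0, hi=1.0):
    """prox of the indicator of the box [lo, hi]^p: Euclidean projection."""
    return np.clip(x, lo, hi)

x = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(prox_l1(x, step=0.5))   # [-1.5, 0., 0., 0., 2.5]
print(prox_box(x))            # [-1., -0.3, 0., 0.5, 1.]
```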
Sparse Proximal SAGA
We can extend Sparse SAGA to incorporate the proximal term.
• Assume g is separable: g(x) = ∑ⱼ gⱼ(xⱼ).
• Let ϕᵢ(x) = ∑ⱼ (Dᵢ)ⱼⱼ gⱼ(xⱼ).
• Crucial properties: Eᵢ[Dᵢ] = I and Eᵢ[ϕᵢ] = g.

Sparse Proximal SAGA algorithm [4]
Sample uniformly i ∈ {1, . . . , n} and compute (x⁺, α⁺) as
    x⁺ = prox_{γϕᵢ}( x − γ( ∇fᵢ(x) − αᵢ + Dᵢ ᾱ ) ) ;   αᵢ⁺ = ∇fᵢ(x)

[4] Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization". In: Advances in Neural Information Processing Systems 30 (NIPS).
Sparse Proximal SAGA
As SAGA, linear convergence under strong convexity.

Theorem
For step size γ = 1/(5L), with f L-smooth and µ-strongly convex (µ > 0), at iteration t we have
    E‖xₜ − x*‖² ≤ (1 − (1/5) min{1/n, µ/L})ᵗ C₀ ,
with C₀ = ‖x₀ − x*‖² + (1/(5L²)) ∑ᵢ ‖αᵢ⁰ − ∇fᵢ(x*)‖².

Implications
• Same convergence rate as SAGA, with cheaper updates in the presence of sparsity.
• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
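As a back-of-the-envelope reading of the theorem, the sketch below plugs made-up values of n, L and µ into the contraction factor ρ = 1 − (1/5) min{1/n, µ/L} to estimate how many iterations reach a given suboptimality.

```python
import math

def iterations_to_tol(n, L, mu, tol=1e-10, C0=1.0):
    rho = 1.0 - 0.2 * min(1.0 / n, mu / L)        # contraction factor from the theorem
    return math.ceil(math.log(tol / C0) / math.log(rho))

# in the "big data" regime (n >= L/mu) the 1/n term dominates the rate
print(iterations_to_tol(n=1_000_000, L=1.0, mu=1e-4))
```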
Asynchronous Proximal SAGA
ProxASAGA (Pedregosa, Leblond, and Lacoste-Julien, 2017)
All workers do in parallel:
1. Read the information in shared memory (x̂, α̂, and the average ᾱ, all read inconsistently).
2. Sample i and compute ∇fᵢ(x̂).
3. Perform the Sparse Proximal SAGA update on shared memory:
    x = prox_{γϕᵢ}( x − γ( ∇fᵢ(x̂) − α̂ᵢ + Dᵢ ᾱ ) ) ;   αᵢ = ∇fᵢ(x̂)

• As in Hogwild!, reads and writes are inconsistent.
• Same convergence rate as the sequential version under sparsity of the gradients (delays ≤ 1/(10 √sparsity)).
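To tie the pieces together, here is a schematic single worker for a ProxASAGA-style loop on ℓ₁-regularized logistic regression: an inconsistent read on the support of ∇fᵢ, the Sparse Proximal SAGA step with the Dᵢ rescaling, soft thresholding for the rescaled ℓ₁ prox, and a lock-free write. This is an illustrative sketch (the threading itself is omitted; see the Hogwild sketch above), not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp

def proxasaga_worker(shared, A, b, lam, step, d_diag, n_updates, rng):
    x, alpha, alpha_bar = shared["x"], shared["alpha"], shared["alpha_bar"]
    n = A.shape[0]
    for _ in range(n_updates):
        i = rng.integers(n)
        row = A.getrow(i)
        S = row.indices                              # support of grad f_i
        x_hat = x[S].copy()                          # inconsistent read
        z = row.data @ x_hat
        g = (1.0 / (1.0 + np.exp(-z)) - b[i]) * row.data   # grad f_i on S
        v = x_hat - step * (g - alpha[i, S] + d_diag[S] * alpha_bar[S])
        # prox of the rescaled L1 term phi_i on S: soft thresholding
        x[S] = np.sign(v) * np.maximum(np.abs(v) - step * lam * d_diag[S], 0.0)
        alpha_bar[S] += (g - alpha[i, S]) / n        # lock-free (non-atomic) write
        alpha[i, S] = g

rng = np.random.default_rng(0)
A = sp.random(500, 40, density=0.1, format="csr", random_state=0)
b = rng.integers(0, 2, size=500).astype(float)
p_avg = np.maximum((A.toarray() != 0).mean(axis=0), 1e-12)
shared = {"x": np.zeros(40), "alpha": np.zeros((500, 40)), "alpha_bar": np.zeros(40)}
proxasaga_worker(shared, A, b, lam=0.01, step=0.01,
                 d_diag=1.0 / p_avg, n_updates=2000, rng=rng)
```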
Empirical Results
ProxASAGA vs. competing methods on 3 large-scale datasets, ℓ₁-regularized logistic regression.

Dataset      n            p           density    L      ∆
KDD 2010     19,264,097   1,163,024   10⁻⁶       28.12  0.15
KDD 2012     149,639,105  54,686,452  2 × 10⁻⁷   1.25   0.85
Criteo       45,840,617   1,000,000   4 × 10⁻⁵   1.25   0.89

[Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, each with 1 and 10 cores.]
Empirical Results - Speedup

    Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores)

[Figure: time speedup vs. number of cores (1–20) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, compared with the ideal linear speedup.]

• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
3. Analysis or The Art of Naming
Analysis
Active Research Topic
• Lock-free Asynchronous SGD: Hogwild! (Niu et al., 2011)
• Stochastic Approximation (Duchi, Chaturapruek, and Ré, 2015)
• Nonconvex losses (De Sa et al., 2015; Lian et al., 2015)
• Variance-reduced stochastic methods (Reddi et al., 2015)

Claim #1
There are fundamental flaws in these analyses.
Analysis
Analyzing an optimization algorithm requires proving progress from one iterate to the next.
How do we define an iterate?

Asynchronous SGD
All workers do in parallel:
1. Read the information in shared memory (x̂).
2. Sample i and compute ∇fᵢ(x̂).
3. Perform the SGD update on shared memory: x = x − γ ∇fᵢ(x̂).
Naming Scheme and Unbiasedness Assumption

"After Write" Labeling (Niu et al., 2011)
Each time a worker finishes writing to shared memory, increment the iteration counter.
⇐⇒ x̂ₜ = (t + 1)-th successful update to shared memory.

Unbiasedness Assumption
Asynchronous SGD-like algorithms crucially rely on the unbiasedness property
    Eᵢ[∇fᵢ(x)] = ∇f(x).

Issue
The naming scheme and the unbiasedness assumption are incompatible.
A Problematic Example
Problem: minimize over x:  ½(f₁(x) + f₂(x)), with 2 workers.
Suppose ∇f₁ takes less time to compute than ∇f₂. What is Eᵢ₀[∇fᵢ₀(x̂₀)]?

Under the "after write" labeling, the gradient labeled ∇fᵢ₀(x̂₀) is the one whose write finishes first, and ∇f₁ finishes first whenever at least one worker samples f₁. The four equally likely draws:

    worker 1   worker 2   first completed gradient
    f₁         f₁         ∇f₁(x̂₀)
    f₁         f₂         ∇f₁(x̂₀)
    f₂         f₁         ∇f₁(x̂₀)
    f₂         f₂         ∇f₂(x̂₀)

In all, Eᵢ₀[∇fᵢ₀(x̂₀)] = ¾ ∇f₁(x̂₀) + ¼ ∇f₂(x̂₀) ≠ ∇f(x̂₀).

• This scheme does not satisfy the crucial unbiasedness condition.
• Can we fix it?
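A tiny simulation of this example (assuming, as above, that ∇f₁ always finishes before ∇f₂ when both are being computed) reproduces the ¾ / ¼ split:

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100_000
draws = rng.integers(1, 3, size=(n_runs, 2))   # each worker samples f_1 or f_2
first_write = draws.min(axis=1)                # the faster f_1 wins whenever sampled
print((first_write == 1).mean())               # ~0.75: biased toward grad f_1
```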
A New Labeling Scheme

"After read" labeling scheme
Each time a worker finishes reading from shared memory, increment the iteration counter.
⇐⇒ x̂ₜ = (t + 1)-th successful read from shared memory.

No dependency between the sampled index iₜ and the cost of computing its gradient.
Full analysis of Hogwild, Asynchronous SVRG and Asynchronous SAGA in [5].

[5] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). "Improved asynchronous parallel optimization analysis for stochastic incremental methods". In: to appear in Journal of Machine Learning Research.
Convergence results – preliminaries
Some notation.
• ∆ = max_{j ∈ {1,...,d}} |{i : j ∈ supp(∇fᵢ)}| / n, the maximum fraction of gradients whose support contains a given coordinate. We always have 1/n ≤ ∆ ≤ 1.
• τ = number of updates between the time the vector of coefficients is read from shared memory and the time the update is finished.
A rigorous analysis of Hogwild (Niu et al., 2011)
• Inconsistent reads.
• Unlike (Niu et al., 2011), allows for inconsistent writes.
• Unlike (Niu et al., 2011; Mania et al., 2017), no global bound on the gradient.
Main result for Hogwild (handwaving)
Let f be µ-strongly convex and L-smooth and assume (for simplicity) √∆ ≤ µ/L. Then Hogwild converges at the same rate as SGD with step size γ = a/L, with
    a ≤ min{ 1/(5(1 + 2τ√∆)) , L/(µ∆) }.
=⇒ theoretical linear speedup.
Main result for ASAGA
Main result for ASAGA (handwaving)
Let f be µ-strongly convex and L-smooth and assume (for simplicity) √∆ ≤ µ/L. Then ASAGA converges at the same rate as SAGA with step size γ = a/L, with
    a ≤ 1/(32(1 + τ√∆)).
=⇒ theoretical linear speedup, step size independent of µ.
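Reading the bound numerically, the admissible step size shrinks with the delay τ and the sparsity measure ∆; the values below are made up for illustration.

```python
def asaga_step_size(L, tau, delta):
    a = 1.0 / (32.0 * (1.0 + tau * delta ** 0.5))
    return a / L                                   # gamma = a / L

print(asaga_step_size(L=1.0, tau=10, delta=0.01))  # sparse problem:  1/64  = 0.015625
print(asaga_step_size(L=1.0, tau=10, delta=1.0))   # dense problem:   1/352 ~ 0.0028
```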
Perspectives
• Better scalability ⇐⇒ communication efficiency.
• Tighter analysis with better constants / step size independent of ∆.
• Large gap between theory and practice.
• Interplay with generalization and momentum.
Thanks for your attention!
References
Arrow, Kenneth Joseph and Leonid Hurwicz (1958). Decentralization and computation in resource allocation. Stanford University, Department of Economics.
Bertsekas, Dimitri P. and John N. Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Athena Scientific.
Cauchy, Augustin (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". In: Comp. Rend. Sci. Paris.
De Sa, Christopher M et al. (2015). "Taming the wild: A unified analysis of Hogwild-style algorithms". In: Advances in Neural Information Processing Systems.
Dean, Jeffrey et al. (2012). "Large scale distributed deep networks". In: Advances in Neural Information Processing Systems.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives". In: Advances in Neural Information Processing Systems.
Duchi, John C, Sorathan Chaturapruek, and Christopher Ré (2015). "Asynchronous stochastic convex optimization". In: arXiv preprint arXiv:1508.00882.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). "ASAGA: Asynchronous Parallel SAGA". In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2018). "Improved asynchronous parallel optimization analysis for stochastic incremental methods". In: to appear in Journal of Machine Learning Research.
Lian, Xiangru et al. (2015). "Asynchronous parallel stochastic gradient for nonconvex optimization". In: Advances in Neural Information Processing Systems.
Mania, Horia et al. (2017). "Perturbed iterate analysis for asynchronous stochastic optimization". In: SIAM Journal on Optimization.
Niu, Feng et al. (2011). "Hogwild: A lock-free approach to parallelizing stochastic gradient descent". In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization". In: Advances in Neural Information Processing Systems 30 (NIPS).
Reddi, Sashank J et al. (2015). "On variance reduction in stochastic gradient descent and its asynchronous variants". In: Advances in Neural Information Processing Systems.
Robbins, Herbert and Sutton Monro (1951). "A Stochastic Approximation Method". In: Ann. Math. Statist.
Tsitsiklis, John, Dimitri Bertsekas, and Michael Athans (1986). "Distributed asynchronous deterministic and stochastic gradient optimization algorithms". In: IEEE Transactions on Automatic Control.
Supervised Machine Learning
Data: n observations (aᵢ, bᵢ) ∈ ℝᵖ × ℝ
Prediction function: h(a, x) ∈ ℝ
Motivating examples:
• Linear prediction: h(a, x) = xᵀa
• Neural networks: h(a, x) = xₘᵀ σ(xₘ₋₁ σ(· · · x₂ᵀ σ(x₁ᵀ a)))
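A toy sketch of the two prediction functions, with σ taken to be a ReLU in the neural-network case (the slide leaves σ unspecified); the shapes and names are illustrative.

```python
import numpy as np

def linear_predict(a, x):
    return x @ a

def mlp_predict(a, weights):
    """weights = [x_1, ..., x_m]; h(a) = x_m^T sigma(... sigma(x_1^T a))."""
    z = a
    for W in weights[:-1]:
        z = np.maximum(W.T @ z, 0.0)          # sigma = ReLU
    return weights[-1].T @ z

a = np.ones(4)
print(linear_predict(a, np.arange(4.0)))      # 6.0
rng = np.random.default_rng(0)
print(mlp_predict(a, [rng.standard_normal((4, 8)), rng.standard_normal(8)]))
```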
Sparse Proximal SAGA
For step size γ = 1/(5L), with ∇f Lipschitz with constant L and f µ-strongly convex (µ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have
    E‖xₜ − x*‖² ≤ (1 − (1/5) min{1/n, 1/κ})ᵗ C₀ ,
with C₀ = ‖x₀ − x*‖² + (1/(5L²)) ∑ᵢ ‖αᵢ⁰ − ∇fᵢ(x*)‖² and κ = L/µ (the condition number).

Implications
• Same convergence rate as SAGA, with cheaper updates.
• In the "big data regime" (n ≥ κ): rate in O(1/n).
• In the "ill-conditioned regime" (n ≤ κ): rate in O(1/κ).
[Appendix slides: ASAGA algorithm, ProxASAGA algorithm, Atomic vs non-atomic.]
Mais conteúdo relacionado

Mais procurados

Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant StepSAUVIK BISWAS
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBOYoonho Lee
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsYoonho Lee
 
從 VAE 走向深度學習新理論
從 VAE 走向深度學習新理論從 VAE 走向深度學習新理論
從 VAE 走向深度學習新理論岳華 杜
 
A Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian NonparametricsA Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian NonparametricsJulyan Arbel
 
Continuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGDContinuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGDValentin De Bortoli
 
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...Kohei Hayashi
 
On First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsOn First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsYoonho Lee
 
A Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian NonparametricsA Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian NonparametricsJulyan Arbel
 
Macrocanonical models for texture synthesis
Macrocanonical models for texture synthesisMacrocanonical models for texture synthesis
Macrocanonical models for texture synthesisValentin De Bortoli
 
Lec09- AI
Lec09- AILec09- AI
Lec09- AIdrmbalu
 
A discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functionsA discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functionsLARCA UPC
 
RBM from Scratch
RBM from Scratch RBM from Scratch
RBM from Scratch Hadi Sinaee
 
New Insights and Perspectives on the Natural Gradient Method
New Insights and Perspectives on the Natural Gradient MethodNew Insights and Perspectives on the Natural Gradient Method
New Insights and Perspectives on the Natural Gradient MethodYoonho Lee
 
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML
 

Mais procurados (20)

Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant Step
 
Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 
從 VAE 走向深度學習新理論
從 VAE 走向深度學習新理論從 VAE 走向深度學習新理論
從 VAE 走向深度學習新理論
 
A Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian NonparametricsA Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian Nonparametrics
 
Continuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGDContinuous and Discrete-Time Analysis of SGD
Continuous and Discrete-Time Analysis of SGD
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
Deep Learning Opening Workshop - Deep ReLU Networks Viewed as a Statistical M...
 
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
 
On First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsOn First-Order Meta-Learning Algorithms
On First-Order Meta-Learning Algorithms
 
A Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian NonparametricsA Gentle Introduction to Bayesian Nonparametrics
A Gentle Introduction to Bayesian Nonparametrics
 
Macrocanonical models for texture synthesis
Macrocanonical models for texture synthesisMacrocanonical models for texture synthesis
Macrocanonical models for texture synthesis
 
Lec09- AI
Lec09- AILec09- AI
Lec09- AI
 
A discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functionsA discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functions
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
RBM from Scratch
RBM from Scratch RBM from Scratch
RBM from Scratch
 
New Insights and Perspectives on the Natural Gradient Method
New Insights and Perspectives on the Natural Gradient MethodNew Insights and Perspectives on the Natural Gradient Method
New Insights and Perspectives on the Natural Gradient Method
 
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
 
Wg qcolorable
Wg qcolorableWg qcolorable
Wg qcolorable
 

Semelhante a Asynchronous Stochastic Optimization, New Analysis and Algorithms

block-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfblock-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfJunghyun Lee
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisXin-She Yang
 
Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"tuxette
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber SecurityAltoros
 
Regression on gaussian symbols
Regression on gaussian symbolsRegression on gaussian symbols
Regression on gaussian symbolsAxel de Romblay
 
20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrisonComputer Science Club
 
Auto encoders in Deep Learning
Auto encoders in Deep LearningAuto encoders in Deep Learning
Auto encoders in Deep LearningShajun Nisha
 
Cuckoo Search Algorithm: An Introduction
Cuckoo Search Algorithm: An IntroductionCuckoo Search Algorithm: An Introduction
Cuckoo Search Algorithm: An IntroductionXin-She Yang
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues listsJames Wong
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
StacksqueueslistsFraboni Ec
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsYoung Alista
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsTony Nguyen
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsHarry Potter
 
Learning for Optimization: EDAs, probabilistic modelling, or ...
Learning for Optimization: EDAs, probabilistic modelling, or ...Learning for Optimization: EDAs, probabilistic modelling, or ...
Learning for Optimization: EDAs, probabilistic modelling, or ...butest
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer PerceptronsESCOM
 
My PhD defence
My PhD defenceMy PhD defence
My PhD defenceJialin LIU
 

Semelhante a Asynchronous Stochastic Optimization, New Analysis and Algorithms (20)

block-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfblock-mdp-masters-defense.pdf
block-mdp-masters-defense.pdf
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
CLIM Program: Remote Sensing Workshop, Optimization for Distributed Data Syst...
 
Metaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical AnalysisMetaheuristic Algorithms: A Critical Analysis
Metaheuristic Algorithms: A Critical Analysis
 
Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"Reading revue of "Inferring Multiple Graphical Structures"
Reading revue of "Inferring Multiple Graphical Structures"
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber Security
 
Regression on gaussian symbols
Regression on gaussian symbolsRegression on gaussian symbols
Regression on gaussian symbols
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the...
 
20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison
 
Auto encoders in Deep Learning
Auto encoders in Deep LearningAuto encoders in Deep Learning
Auto encoders in Deep Learning
 
Cuckoo Search Algorithm: An Introduction
Cuckoo Search Algorithm: An IntroductionCuckoo Search Algorithm: An Introduction
Cuckoo Search Algorithm: An Introduction
 
Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
Stacksqueueslists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Learning for Optimization: EDAs, probabilistic modelling, or ...
Learning for Optimization: EDAs, probabilistic modelling, or ...Learning for Optimization: EDAs, probabilistic modelling, or ...
Learning for Optimization: EDAs, probabilistic modelling, or ...
 
Multi-Layer Perceptrons
Multi-Layer PerceptronsMulti-Layer Perceptrons
Multi-Layer Perceptrons
 
My PhD defence
My PhD defenceMy PhD defence
My PhD defence
 

Mais de Fabian Pedregosa

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Fabian Pedregosa
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimationFabian Pedregosa
 
Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator SplittingFabian Pedregosa
 
Sufficient decrease is all you need
Sufficient decrease is all you needSufficient decrease is all you need
Sufficient decrease is all you needFabian Pedregosa
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientFabian Pedregosa
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in pythonFabian Pedregosa
 

Mais de Fabian Pedregosa (10)

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimation
 
Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator Splitting
 
Sufficient decrease is all you need
Sufficient decrease is all you needSufficient decrease is all you need
Sufficient decrease is all you need
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 

Último

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 

Último (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 

Asynchronous Stochastic Optimization, New Analysis and Algorithms

  • 1. Asynchronous Stochastic Optimization New Analysis and Algorithms Fabian Pedregosa May 25, 2018. University of Washington
  • 2. Where I Come From ML/Optimization/Software Guy Engineer (2010–2012) First contact with ML: develop ML library (scikit-learn). ML and NeuroScience (2012–2015) PhD applying ML to neuroscience. ML and Optimization (2015–) Stochastic / Parallel / Constrained / Hyperparameter optimization. 1/33
  • 3. Outline Goal: Review recent work in asynchronous parallel optimization for machine learning1,2. 1. Asynchronous parallel optimization, Asynchronous SGD. 2. Asynchronous variance-reduced optimization. 3. Analysis of asynchronous methods: What we can prove. 1 R´emi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research. 2 Fabian Pedregosa, R´emi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS). 2/33
  • 5. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. 3/33
  • 6. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • At the same time, the number of cores increases exponentially. 3/33
  • 7. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • At the same time, the number of cores increases exponentially. 3/33
  • 8. 40 years of CPU trends • Speed of CPUs has stagnated since 2005. • At the same time, the number of cores increases exponentially. Parallel algorithms needed to take advantage of modern CPUs. 3/33
  • 9. Parallel Optimization: Not a new topic • Most of the principles and methods already in (Bertsekas and Tsitsiklis, 1989). • For linear systems it can be traced even earlier (Arrow and Hurwicz, 1958). 4/33
  • 10. Asynchronous vs Synchronous methods Synchronous methods • Wait for slowest worker. • Limited speedup due to synchronization cost. Asynchronous methods • Workers receive work as needed. • Minimize idle time. • Challenging analysis. t0 t1 t2 Worker 4 Worker 3 Worker 2 Worker 1 idle idle idle idle idle idle t0 t1t2t3 t4 t5t6 t7 t8 Worker 4 Worker 3 Worker 2 Worker 1 Time 5/33
  • 11. Optimization for machine learning Many problems in machine learning can be framed as minimize x∈Rp f (x) def = 1 n n i=1 fi (x) Gradient descent (Cauchy, 1847). Descend along steepest direction x+ = x − γ f (x) Stochastic gradient descent (SGD) (Robbins and Monro, 1951). Select random i, descent along − fi (x): x+ = x − γ fi (x) Figure source: Francis Bach 6/33
  • 12. Example: Asynchronous SGD (Tsitsiklis, Bertsekas, and Athans, 1986) Recent revival due to applications in machine learning, (Niu et al., 2011; Dean et al., 2012). Other names: Downpour SGD, Hogwild. Problem: minimize x f (x) def = 1 n n i=1 fi (x) General Algorithm All workers do in parallel: 1. Read the information in shared memory (ˆx). 2. Sample i ∈ {1, . . . , n} and compute fi (ˆx). 3. Perform SGD update on shared memory x = x − γ fi (ˆx). 7/33
  • 13. Example: Asynchronous SGD (Tsitsiklis, Bertsekas, and Athans, 1986) Recent revival due to applications in machine learning, (Niu et al., 2011; Dean et al., 2012). Other names: Downpour SGD, Hogwild. Problem: minimize x f (x) def = 1 n n i=1 fi (x) General Algorithm All workers do in parallel: 1. Read the information in shared memory (ˆx). 2. Sample i ∈ {1, . . . , n} and compute fi (ˆx). 3. Perform SGD update on shared memory x = x − γ fi (ˆx). x and ˆx might be different. 7/33
  • 14. Asynchronous SGD • Write is performed with old version of coefficients. • Update requires a lock on the vector of coefficients. 8/33
  • 15. Hogwild! (Niu et al., 2011): Lock-free Async. SGD Algorithm 1 Hogwild 1: loop 2: ˆx = inconsistent read of x 3: Sample i uniformly in {1, ..., n} 4: Let Si be fi ’s support 5: [δx]Si := −γ fi (ˆx) 6: for v in Si do 7: [x]v ← [x]v + [δx]v // atomic 8: end for 9: end loop • All read/write operations to shared memory are inconsistent, i.e., no vector-level locks while updating shared memory. • Key assumption. Sparse gradients (|Si | dimension). 9/33
  • 16. Hogwild: when does it converge? Sparse fi . Is this a reasonable assumption? • If fi (x) = ϕ(aT i x) then fi (x) = ai ϕ (aT i x). • Gradients are sparse whenever data ai is sparse. • This is the case for generalized linear models (least squares, logistic regression, linear SVMs, etc.). In this class of models, Hogwild enjoys almost linear speedups. Figure 1: Speedup of Hogwild. Image source: (Niu et al., 2011) 10/33
  • 17. Hogwild is fast Hogwild can be very fast. But its still SGD... • With constant step size, bounces around the optimum. • With decreasing step size, slow convergence. • There are better alternatives 11/33
  • 19. Variance-reduced Stochastic Optimization Problem: Finite sum minimize x∈Rp 1 n n i=1 fi (x) , where n < ∞ 12/33
  • 20. Variance-reduced Stochastic Optimization Problem: Finite sum minimize x∈Rp 1 n n i=1 fi (x) , where n < ∞ The SAGA algorithm (Defazio, Bach, and Lacoste-Julien, 2014) Sample uniformly i ∈ {1, . . . , n} and compute (x+, α+) as x+ = x − γ ( fi (x) − αi + α) gradient estimate ; α+ i = fi (x) Variance-reduction technique known under different names, e.g., control variates in Monte Carlo methods. 12/33
  • 21. The SAGA Algorithm Theory: Linear (i.e., exponential convergence) on strongly convex problems. Practical algorithm: converges with a fixed step-size 1/(3L). 0 20 40 60 80 100 Time 10 12 10 10 10 8 10 6 10 4 10 2 100 functionsuboptimality SAGA SGD constant step size SGD decreasing step size 13/33
  • 22. The SAGA Algorithm Theory: Linear (i.e., exponential convergence) on strongly convex problems. Practical algorithm: converges with a fixed step-size 1/(3L). 0 20 40 60 80 100 Time 10 12 10 10 10 8 10 6 10 4 10 2 100 functionsuboptimality SAGA SGD constant step size SGD decreasing step size Already used in scikit-learn 13/33
  • 23. Asynchronous SAGA Motivation: Can we design asynchronous version of SAGA? 14/33
  • 24. Asynchronous SAGA Motivation: Can we design asynchronous version of SAGA? SAGA update is inefficient (without tricks) for sparse gradients. x+ = x − γ( fi (x) sparse − αi sparse + α dense! ) ; Need for a sparse variant of SAGA • Many large scale datasets are sparse. • Asynchronous algorithms work best when updates are sparse. 14/33
  • 25. Sparse SAGA We can get away with “sparsifying” the gradient estimate. 3 R´emi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: synchronous parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). 15/33
  • 26. Sparse SAGA We can get away with “sparsifying” the gradient estimate. • Let Pi be the projection onto support( fi ) • Let Di = Pi /(1 n n i=1 Pi ) • Crucial property: Ei [Di ] = I 3 R´emi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: synchronous parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). 15/33
  • 27. Sparse SAGA We can get away with “sparsifying” the gradient estimate. • Let Pi be the projection onto support( fi ) • Let Di = Pi /(1 n n i=1 Pi ) • Crucial property: Ei [Di ] = I Sparse SAGA algorithm3 Sample uniformly i ∈ {1, . . . , n} and compute (x+, α+) as x+ = x − γ( fi (x) − αi + Di α) ; α+ i = fi (x) 3 R´emi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: synchronous parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). 15/33
  • 28. Sparse SAGA • All operations are sparse, cost per iteration is O(—nonzeros in fi —). • Same convergence properties than SAGA, but with cheaper iterations in presence of sparsity. 16/33
  • 29. Sparse SAGA • All operations are sparse, cost per iteration is O(—nonzeros in fi —). • Same convergence properties than SAGA, but with cheaper iterations in presence of sparsity. 16/33
  • 30. Proximal Sparse SAGA Problem: Composite finite sum minimize x∈Rp 1 n n i=1 fi (x) + g(x) , where • g is potentially nonsmooth (think λ · 1 or indicator) but we have access to proxγg (x) = arg minz g(z) + 1 2 x − z 2. • For some g, its proximal operator is available in closed form. Examples: 1 norm (soft thresholding), indicator function (projection). 17/33
  • 31. Sparse Proximal SAGA We can extend Sparse SAGA to incorporate one proximal term. • Assume g separable: g(x) = p j=1 gj (xj ) • Let ϕi = d j (Di )j,j gj (xj ) • Crucial property: Ei [Di ] = I, Ei [ϕi ] = h 4 Fabian Pedregosa, R´emi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS). 18/33
  • 32. Sparse Proximal SAGA We can extend Sparse SAGA to incorporate one proximal term. • Assume g separable: g(x) = p j=1 gj (xj ) • Let ϕi = d j (Di )j,j gj (xj ) • Crucial property: Ei [Di ] = I, Ei [ϕi ] = h Sparse SAGA algorithm4 Sample uniformly i ∈ {1, . . . , n} and compute (x+, α+) as x+ = proxγϕi (x − γ( fi (x) − αi + Di α)) ; α+ i = fi (x) 4 Fabian Pedregosa, R´emi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS). 18/33
  • 33. Sparse Proximal SAGA As SAGA, linear convergence under strong convexity. Theorem For step size γ = 1/(5L) and f L-smooth and µ-strongly convex (µ > 0), at iteration t we have E‖x_t − x*‖² ≤ (1 − (1/5) min{1/n, µ/L})^t C_0, with C_0 = ‖x_0 − x*‖² + (1/(5L²)) Σ_{i=1}^n ‖α_i⁰ − ∇f_i(x*)‖². Implications • Same convergence rate as SAGA, with cheaper updates in the presence of sparsity. • Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence. 19/33
  • 34. Asynchronous Proximal SAGA ProxASAGA (Pedregosa, Leblond, and Lacoste-Julien, 2017) 1. Read the information in shared memory (x̂, α̂, ᾱ̂). 2. Sample i and compute ∇f_i(x̂). 3. Perform the Sparse Proximal SAGA update on shared memory: x = prox_{γφ_i}(x − γ(∇f_i(x̂) − α̂_i + D_i ᾱ̂)); α_i = ∇f_i(x̂). 20/33
  • 35. Asynchronous Proximal SAGA ProxASAGA (Pedregosa, Leblond, and Lacoste-Julien, 2017) 1. Read the information in shared memory (x̂, α̂, ᾱ̂). 2. Sample i and compute ∇f_i(x̂). 3. Perform the Sparse Proximal SAGA update on shared memory: x = prox_{γφ_i}(x − γ(∇f_i(x̂) − α̂_i + D_i ᾱ̂)); α_i = ∇f_i(x̂). • As in Hogwild!, reads and writes are inconsistent. • Same convergence rate as the sequential version under sparsity of the gradients (delays ≤ 1/(10√sparsity)). 20/33
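  As an illustration of the read/compute/write pattern only, here is a thread-based sketch on a shared NumPy array. It reuses helpers in the spirit of the sketches above (here grad_sparse(i, x_s) takes the coefficients restricted to supp(∇f_i), which is all ∇f_i depends on for generalized linear models). Because of Python's GIL this will not show real speedups, and it is not the authors' implementation, which is compiled (C++/Cython):

```python
import threading
import numpy as np

def worker(x, alpha, alpha_bar, supports, d_diag, grad_sparse, gamma, lam,
           n_updates, seed):
    """Lock-free ProxASAGA-style loop: inconsistent reads, sparse writes."""
    rng = np.random.default_rng(seed)
    n = len(supports)
    for _ in range(n_updates):
        i = rng.integers(n)
        s = supports[i]
        x_hat = x[s].copy()                     # inconsistent read of shared x
        g = grad_sparse(i, x_hat)
        z = x_hat - gamma * (g - alpha[i, s] + d_diag[s] * alpha_bar[s])
        new = np.sign(z) * np.maximum(np.abs(z) - gamma * lam * d_diag[s], 0.0)
        x[s] += new - x_hat                     # lock-free additive write on the support
        alpha_bar[s] += (g - alpha[i, s]) / n
        alpha[i, s] = g

def run_async(x, alpha, alpha_bar, supports, d_diag, grad_sparse, gamma, lam,
              n_updates=10_000, n_threads=4):
    threads = [threading.Thread(target=worker,
                                args=(x, alpha, alpha_bar, supports, d_diag,
                                      grad_sparse, gamma, lam, n_updates, t))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```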
  • 36. Empirical Results ProxASAGA vs. competing methods on 3 large-scale datasets, ℓ₁-regularized logistic regression.
  Dataset     n            p           density    L      Δ
  KDD 2010    19,264,097   1,163,024   10⁻⁶       28.12  0.15
  KDD 2012    149,639,105  54,686,452  2 × 10⁻⁷   1.25   0.85
  Criteo      45,840,617   1,000,000   4 × 10⁻⁵   1.25   0.89
  [Figure: objective minus optimum vs. time (in minutes) on the KDD10, KDD12 and Criteo datasets, for ProxASAGA, AsySPCD and FISTA, each with 1 and 10 cores.] 21/33
  • 37. Empirical Results - Speedup Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: time speedup vs. number of cores (2–20) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, against the ideal linear speedup.] 22/33
  • 38. Empirical Results - Speedup Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: time speedup vs. number of cores (2–20) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, against the ideal linear speedup.] • ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture. 22/33
  • 39. Empirical Results - Speedup Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores). [Figure: time speedup vs. number of cores (2–20) on the KDD10, KDD12 and Criteo datasets for ProxASAGA, AsySPCD and FISTA, against the ideal linear speedup.] • ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture. • As predicted by the theory, there is a high correlation between the degree of sparsity and the speedup. 22/33
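  A trivial helper for the speedup computation used in these plots (the dictionary of timings in the comment is made up purely for illustration; real values would come from measured runs):

```python
def speedup(times_by_cores, baseline_cores=1):
    """times_by_cores: {k: wall-clock time to reach the target suboptimality
    with k cores}.  Returns {k: time(baseline) / time(k)}."""
    t_base = times_by_cores[baseline_cores]
    return {k: t_base / t for k, t in sorted(times_by_cores.items())}

# Example with made-up timings (minutes): speedup({1: 120.0, 10: 15.0, 20: 11.0})
```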
  • 40. 3. Analysis or The Art of Naming
  • 41. Analysis Active Research Topic • Lock-free Asynchronous SGD: Hogwild! (Niu et al., 2011) • Stochastic Approximation (Duchi, Chaturapruek, and Ré, 2015) • Nonconvex losses (De Sa et al., 2015; Lian et al., 2015) • Variance-reduced stochastic methods (Reddi et al., 2015) 23/33
  • 42. Analysis Active Research Topic • Lock-free Asynchronous SGD: Hogwild! (Niu et al., 2011) • Stochastic Approximation (Duchi, Chaturapruek, and Ré, 2015) • Nonconvex losses (De Sa et al., 2015; Lian et al., 2015) • Variance-reduced stochastic methods (Reddi et al., 2015) Claim #1 There are fundamental flaws in these analyses. 23/33
  • 43. Analysis Analysis of optimization algorithms requires proving progress from one iterate to the next. How to define an iterate? 24/33
  • 44. Analysis Analysis of optimization algorithms requires proving progress from one iterate to the next. How to define an iterate? Asynchronous SGD All workers do in parallel: 1. Read the information in shared memory (x̂). 2. Sample i and compute ∇f_i(x̂). 3. Perform the SGD update on shared memory x = x − γ∇f_i(x̂). 24/33
  • 45. Naming Scheme and Unbiasedness Assumption “After Write” Labeling (Niu et al., 2011) Each time a worker has finished writing to shared memory, increment iteration counter. ⟺ x̂_t = the (t + 1)-th successful update to shared memory. 25/33
  • 46. Naming Scheme and Unbiasedness Assumption “After Write” Labeling (Niu et al., 2011) Each time a worker has finished writing to shared memory, increment iteration counter. ⟺ x̂_t = the (t + 1)-th successful update to shared memory. Unbiasedness Assumption Asynchronous SGD-like algorithms crucially rely on the unbiasedness property E_i[∇f_i(x)] = ∇f(x). 25/33
  • 47. Naming Scheme and Unbiasedness Assumption “After Write” Labeling (Niu et al., 2011) Each time a worker has finished writing to shared memory, increment iteration counter. ⟺ x̂_t = the (t + 1)-th successful update to shared memory. Unbiasedness Assumption Asynchronous SGD-like algorithms crucially rely on the unbiasedness property E_i[∇f_i(x)] = ∇f(x). Issue The naming scheme and the unbiasedness assumption are incompatible. 25/33
  • 48. A Problematic Example Problem: minimize_x ½(f1(x) + f2(x)) with 2 workers. Suppose computing ∇f1 takes less time than computing ∇f2. What is E_{i₀}[∇f_{i₀}(x̂₀)]? 26/33
  • 49-52. A Problematic Example Problem: minimize_x ½(f1(x) + f2(x)) with 2 workers. Suppose computing ∇f1 takes less time than computing ∇f2. What is E_{i₀}[∇f_{i₀}(x̂₀)]? [Diagrams: the four equally likely assignments of f1 and f2 to the two workers. Since ∇f1 finishes first, the first gradient written to shared memory is ∇f1(x̂₀) in three of the four cases, and ∇f2(x̂₀) only when both workers sample f2.] 26/33
  • 53. A Problematic Example Problem: minimize_x ½(f1(x) + f2(x)) with 2 workers. Suppose computing ∇f1 takes less time than computing ∇f2. What is E_{i₀}[∇f_{i₀}(x̂₀)]? [Diagrams: the four equally likely scenarios, as above.] In all, E_{i₀}[∇f_{i₀}(x̂₀)] = (3/4)∇f1(x̂₀) + (1/4)∇f2(x̂₀) ≠ ∇f(x̂₀). 26/33
  • 54. A Problematic Example Problem: minimize_x ½(f1(x) + f2(x)) with 2 workers. Suppose computing ∇f1 takes less time than computing ∇f2. What is E_{i₀}[∇f_{i₀}(x̂₀)]? [Diagrams: the four equally likely scenarios, as above.] In all, E_{i₀}[∇f_{i₀}(x̂₀)] = (3/4)∇f1(x̂₀) + (1/4)∇f2(x̂₀) ≠ ∇f(x̂₀). • This scheme does not satisfy the crucial unbiasedness condition. • Can we fix it? 26/33
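  A quick Monte Carlo check of this computation, under the idealized assumption that ∇f1 always finishes before ∇f2 when both start together (an illustration, not an experiment from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
draws = rng.integers(1, 3, size=(trials, 2))   # each worker samples i in {1, 2}
# grad f1 is faster, so the first gradient written to shared memory is grad f1
# unless *both* workers happened to sample f2.
first = np.where((draws == 1).any(axis=1), 1, 2)
print((first == 1).mean())                     # ~0.75 rather than 0.5: biased towards f1
```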
  • 55. A New Labeling Scheme After read labeling scheme Each time a worker has finished reading from shared memory, increment iteration counter. ⟺ x̂_t = the (t + 1)-th successful read from shared memory. 5 Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research. 27/33
  • 56. A New Labeling Scheme After read labeling scheme Each time a worker has finished reading from shared memory, increment iteration counter. ⟺ x̂_t = the (t + 1)-th successful read from shared memory. No dependency between i_t and the cost of computing ∇f_{i_t}. Full analysis of Hogwild, Asynchronous SVRG and Asynchronous SAGA in⁵. 5 Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research. 27/33
  • 57. Convergence results – preliminaries Some notation. • Δ = max_{j=1,...,p} |{i : j ∈ supp(∇f_i)}| / n, the largest fraction of gradients whose support contains a given coordinate. We always have 1/n ≤ Δ ≤ 1. • τ = number of updates that occur between the moment the vector of coefficients is read from shared memory and the moment the corresponding update is written. 28/33
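  For generalized linear models, supp(∇f_i) is the support of the data point a_i, so Δ can be read directly off the data matrix; a small SciPy sketch (assuming a sparse n × p design matrix):

```python
import numpy as np
import scipy.sparse as sp

def delta(A):
    """A: sparse n x p data matrix.  Delta = max_j |{i : A_ij != 0}| / n,
    the largest fraction of examples touching a single feature."""
    A = sp.csc_matrix(A)
    nnz_per_feature = np.diff(A.indptr)   # CSC column pointers give per-column counts
    return nnz_per_feature.max() / A.shape[0]
```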
  • 58. A rigorous analysis of Hogwild (Niu et al., 2011) • Inconsistent reads. • Unlike (Niu et al., 2011), allows for inconsistent writes. • Unlike (Niu et al., 2011; Mania et al., 2017), no global bound on the gradient. Main result for Hogwild (handwaving) Let f be µ-strongly convex and L-smooth and assume (for simplicity) √Δ ≤ µ/L. Then Hogwild converges with the same rate as SGD with step size γ = a/L with a ≤ min{1/(5(1 + 2τ√Δ)), L/(µΔ)}. ⟹ theoretical linear speedup. 29/33
  • 59. Main result for ASAGA Main result for ASAGA (handwaving) Let f be µ-strongly convex and L-smooth and assume (for simplicity) √Δ ≤ µ/L. Then ASAGA converges with the same rate as SAGA with step size γ = a/L with a ≤ 1/(32(1 + τ√Δ)). ⟹ theoretical linear speedup, step size independent of µ. 30/33
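  Two tiny helpers that evaluate the step-size bounds as quoted on these slides (the constants 5, 2 and 32 are taken from the slides; in the simplified regime only the first term of the Hogwild minimum is used):

```python
import math

def hogwild_step_size(L, tau, delta):
    """gamma = a / L with a = 1 / (5 * (1 + 2 * tau * sqrt(delta)))."""
    return 1.0 / (5 * (1 + 2 * tau * math.sqrt(delta)) * L)

def asaga_step_size(L, tau, delta):
    """gamma = a / L with a = 1 / (32 * (1 + tau * sqrt(delta)))."""
    return 1.0 / (32 * (1 + tau * math.sqrt(delta)) * L)
```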
  • 60. Perspectives • Better scalability ⟺ communication efficiency. • Tighter analysis with better constants / step size independent of Δ. • Large gap between theory and practice. • Interplay with generalization and momentum. Thanks for your attention! 31/33
  • 61. References Arrow, Kenneth Joseph and Leonid Hurwicz (1958). Decentralization and computation in resource allocation. Stanford University, Department of Economics. Bertsekas, Dimitri P. and John N. Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Athena Scientific. Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d'équations simultanées”. In: Comp. Rend. Sci. Paris. De Sa, Christopher M et al. (2015). “Taming the wild: A unified analysis of Hogwild-style algorithms”. In: Advances in Neural Information Processing Systems. Dean, Jeffrey et al. (2012). “Large scale distributed deep networks”. In: Advances in Neural Information Processing Systems. Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems. Duchi, John C, Sorathan Chaturapruek, and Christopher Ré (2015). “Asynchronous stochastic convex optimization”. In: arXiv preprint arXiv:1508.00882. 31/33
  • 62. Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). — (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research. Lian, Xiangru et al. (2015). “Asynchronous parallel stochastic gradient for nonconvex optimization”. In: Advances in Neural Information Processing Systems. Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In: SIAM Journal on Optimization. Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems. Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS). Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its asynchronous variants”. In: Advances in Neural Information Processing Systems. Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist. 32/33
  • 63. Tsitsiklis, John, Dimitri Bertsekas, and Michael Athans (1986). “Distributed asynchronous deterministic and stochastic gradient optimization algorithms”. In: IEEE Transactions on Automatic Control. 33/33
  • 64. Supervised Machine Learning Data: n observations (a_i, b_i) ∈ R^p × R. Prediction function: h(a, x) ∈ R. Motivating examples: • Linear prediction: h(a, x) = xᵀa • Neural networks: h(a, x) = x_mᵀ σ(x_{m−1}ᵀ σ(··· x_2ᵀ σ(x_1ᵀ a)))
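  These two prediction functions in a minimal NumPy sketch (σ = tanh is chosen arbitrarily as the nonlinearity; shapes are illustrative):

```python
import numpy as np

def h_linear(a, x):
    """Linear prediction h(a, x) = x^T a."""
    return x @ a

def h_neural_net(a, x1, x2, sigma=np.tanh):
    """Two-layer example h(a, (x1, x2)) = x2^T sigma(x1^T a),
    with x1 of shape (p, m) and x2 of shape (m,)."""
    return x2 @ sigma(x1.T @ a)
```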
  • 65. Sparse Proximal SAGA For step size γ = 1/(5L), with f having an L-Lipschitz gradient and being µ-strongly convex (µ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have E‖x_t − x*‖² ≤ (1 − (1/5) min{1/n, 1/κ})^t C_0, with C_0 = ‖x_0 − x*‖² + (1/(5L²)) Σ_{i=1}^n ‖α_i⁰ − ∇f_i(x*)‖² and κ = L/µ (condition number). Implications • Same convergence rate as SAGA, with cheaper updates. • In the “big data” regime (n ≥ κ): rate in O(1/n). • In the “ill-conditioned” regime (n ≤ κ): rate in O(1/κ).