This document provides an overview of asynchronous stochastic optimization methods and algorithms. It discusses asynchronous parallel stochastic gradient descent (SGD) and how it can minimize idle time. It also introduces asynchronous variance-reduced optimization methods like asynchronous SAGA that provide faster convergence than SGD. The document analyzes the convergence properties of asynchronous optimization methods and presents empirical results demonstrating the speedups achieved by asynchronous proximal SAGA (ProxASAGA) on large datasets.
2. Where I Come From
ML/Optimization/Software Guy
• Engineer (2010–2012): first contact with ML, developing an ML library (scikit-learn).
• ML and Neuroscience (2012–2015): PhD applying ML to neuroscience.
• ML and Optimization (2015–): stochastic / parallel / constrained / hyperparameter optimization.
3. Outline
Goal: Review recent work in asynchronous parallel optimization for machine learning [1, 2].
1. Asynchronous parallel optimization, asynchronous SGD.
2. Asynchronous variance-reduced optimization.
3. Analysis of asynchronous methods: what we can prove.

[1] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
[2] Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
8. 40 years of CPU trends
• Speed of CPUs has stagnated since 2005.
• At the same time, the number of cores increases exponentially.
⇒ Parallel algorithms needed to take advantage of modern CPUs.
9. Parallel Optimization: Not a new topic
• Most of the principles and methods already in (Bertsekas and Tsitsiklis, 1989).
• For linear systems it can be traced even earlier (Arrow and Hurwicz, 1958).
10. Asynchronous vs Synchronous methods
Synchronous methods
• Wait for slowest worker.
• Limited speedup due to synchronization cost.
Asynchronous methods
• Workers receive work as needed.
• Minimize idle time.
• Challenging analysis.
[Figure: timelines of four workers; the synchronous workers sit idle between synchronization points, while the asynchronous workers update continuously.]
11. Optimization for machine learning
Many problems in machine learning can be framed as

    $\min_{x \in \mathbb{R}^p} f(x) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x)$

Gradient descent (Cauchy, 1847). Descend along the steepest direction:

    $x^+ = x - \gamma \nabla f(x)$

Stochastic gradient descent (SGD) (Robbins and Monro, 1951). Select a random $i$, descend along $-\nabla f_i(x)$:

    $x^+ = x - \gamma \nabla f_i(x)$

[Figure source: Francis Bach]
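To make the two updates concrete, here is a minimal NumPy sketch of both on a toy least-squares finite sum; the data, loss, and step size are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)  # toy data (assumption)
gamma = 0.01

def grad_fi(x, i):
    """Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Gradient of f(x) = (1/n) * sum_i f_i(x)."""
    return A.T @ (A @ x - b) / len(b)

x_gd, x_sgd = np.zeros(5), np.zeros(5)
for t in range(1000):
    x_gd = x_gd - gamma * full_grad(x_gd)        # gradient descent step
    i = rng.integers(len(b))
    x_sgd = x_sgd - gamma * grad_fi(x_sgd, i)    # SGD step on a random f_i
```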
13. Example: Asynchronous SGD (Tsitsiklis, Bertsekas, and Athans, 1986)
Recent revival due to applications in machine learning (Niu et al., 2011; Dean et al., 2012). Other names: Downpour SGD, Hogwild.

Problem: $\min_x f(x) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x)$

General Algorithm
All workers do in parallel:
1. Read the information in shared memory ($\hat{x}$).
2. Sample $i \in \{1, \dots, n\}$ and compute $\nabla f_i(\hat{x})$.
3. Perform SGD update on shared memory: $x = x - \gamma \nabla f_i(\hat{x})$.

$x$ and $\hat{x}$ might be different.
14. Asynchronous SGD
• The write is performed with an old version of the coefficients.
• The update requires a lock on the vector of coefficients.
15. Hogwild! (Niu et al., 2011): Lock-free Async. SGD
Algorithm 1 Hogwild
1: loop
2:   $\hat{x}$ = inconsistent read of $x$
3:   Sample $i$ uniformly in $\{1, \dots, n\}$
4:   Let $S_i$ be $f_i$'s support
5:   $[\delta x]_{S_i} := -\gamma \nabla f_i(\hat{x})$
6:   for $v$ in $S_i$ do
7:     $[x]_v \leftarrow [x]_v + [\delta x]_v$  // atomic
8:   end for
9: end loop

• All read/write operations to shared memory are inconsistent, i.e., no vector-level locks while updating shared memory.
• Key assumption: sparse gradients ($|S_i| \ll$ dimension).
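Below is a rough Python sketch of this lock-free pattern with threads and a shared NumPy array. It is illustrative only: the helper names are assumptions, CPython threads do not give a real parallel speedup, and the per-coordinate writes are not truly atomic as they would be in a C/C++ implementation.

```python
import threading
import numpy as np

def hogwild(grad_fi, supports, x, n, gamma=0.01, n_workers=4, steps=10_000, seed=0):
    """Lock-free asynchronous SGD sketch: every worker reads and writes the shared
    vector x without any vector-level lock."""
    def worker(wid):
        rng = np.random.default_rng(seed + wid)
        for _ in range(steps):
            x_hat = x.copy()            # inconsistent read of shared memory
            i = rng.integers(n)         # sample a component uniformly
            g = grad_fi(x_hat, i)
            for v in supports[i]:       # update only the support of f_i
                x[v] -= gamma * g[v]    # per-coordinate write, no lock
    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```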
16. Hogwild: when does it converge?
Sparse $\nabla f_i$. Is this a reasonable assumption?
• If $f_i(x) = \varphi(a_i^T x)$, then $\nabla f_i(x) = \varphi'(a_i^T x)\, a_i$.
• Gradients are sparse whenever the data $a_i$ is sparse.
• This is the case for generalized linear models (least squares, logistic regression, linear SVMs, etc.).
In this class of models, Hogwild enjoys almost linear speedups.
[Figure 1: Speedup of Hogwild. Image source: (Niu et al., 2011)]
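To see why such gradients are sparse, here is a small sketch for the logistic loss; the function name and signature are illustrative, not from any library.

```python
import numpy as np

def logistic_grad_fi(x, a_i, b_i):
    """Gradient of f_i(x) = log(1 + exp(-b_i * a_i^T x)) for a single data row a_i.
    It equals a scalar times a_i, so it is nonzero only where a_i is nonzero."""
    scalar = -b_i / (1.0 + np.exp(b_i * np.dot(a_i, x)))
    return scalar * a_i
```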
17. Hogwild is fast
Hogwild can be very fast. But it's still SGD...
• With a constant step size, it bounces around the optimum.
• With a decreasing step size, convergence is slow.
• There are better alternatives.
20. Variance-reduced Stochastic Optimization
Problem: Finite sum

    $\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} f_i(x)$,  where $n < \infty$

The SAGA algorithm (Defazio, Bach, and Lacoste-Julien, 2014)
Sample uniformly $i \in \{1, \dots, n\}$ and compute $(x^+, \alpha^+)$ as

    $x^+ = x - \gamma \underbrace{(\nabla f_i(x) - \alpha_i + \bar{\alpha})}_{\text{gradient estimate}}$ ;  $\alpha_i^+ = \nabla f_i(x)$,

where $\bar{\alpha} = \frac{1}{n}\sum_{j=1}^{n} \alpha_j$ is the average of the stored gradients.
The variance-reduction technique is known under different names, e.g., control variates in Monte Carlo methods.
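A minimal (dense) NumPy sketch of the SAGA loop, assuming a user-supplied grad_fi(x, i); all names are illustrative.

```python
import numpy as np

def saga(grad_fi, n, p, gamma, n_steps, seed=0):
    """Plain SAGA sketch: keep a table of past gradients and their running average."""
    rng = np.random.default_rng(seed)
    x = np.zeros(p)
    alpha = np.zeros((n, p))           # memory: last gradient seen for each f_i
    alpha_bar = alpha.mean(axis=0)     # average of the stored gradients
    for _ in range(n_steps):
        i = rng.integers(n)
        g = grad_fi(x, i)
        x = x - gamma * (g - alpha[i] + alpha_bar)  # variance-reduced estimate
        alpha_bar += (g - alpha[i]) / n             # keep the average consistent
        alpha[i] = g                                # refresh the memory for index i
    return x
```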
22. The SAGA Algorithm
Theory: Linear (i.e., exponential) convergence on strongly convex problems.
Practical algorithm: converges with a fixed step size $1/(3L)$.
[Figure: function suboptimality vs. time for SAGA, SGD with constant step size, and SGD with decreasing step size.]
Already used in scikit-learn.
24. Asynchronous SAGA
Motivation: Can we design an asynchronous version of SAGA?
The SAGA update is inefficient (without tricks) for sparse gradients:

    $x^+ = x - \gamma\big(\underbrace{\nabla f_i(x)}_{\text{sparse}} - \underbrace{\alpha_i}_{\text{sparse}} + \underbrace{\bar{\alpha}}_{\text{dense!}}\big)$

Need for a sparse variant of SAGA
• Many large-scale datasets are sparse.
• Asynchronous algorithms work best when updates are sparse.
27. Sparse SAGA
We can get away with “sparsifying” the gradient estimate.
• Let $P_i$ be the projection onto $\mathrm{supp}(\nabla f_i)$.
• Let $D_i = P_i \,/\, \big(\frac{1}{n}\sum_{i=1}^{n} P_i\big)$.
• Crucial property: $\mathbb{E}_i[D_i] = I$.

Sparse SAGA algorithm [3]
Sample uniformly $i \in \{1, \dots, n\}$ and compute $(x^+, \alpha^+)$ as

    $x^+ = x - \gamma(\nabla f_i(x) - \alpha_i + D_i \bar{\alpha})$ ;  $\alpha_i^+ = \nabla f_i(x)$

[3] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
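A possible NumPy sketch of one Sparse SAGA step, assuming gradients are returned as dense arrays that vanish outside their support and that d_inv[j] = n / |{k : j in supp(grad f_k)}| has been precomputed (the diagonal of $D_i$ on the support, so that $\mathbb{E}_i[D_i] = I$); all names are illustrative.

```python
import numpy as np

def sparse_saga_step(x, alpha, alpha_bar, i, grad_fi, support, d_inv, gamma, n):
    """One Sparse SAGA update: every read and write touches only supp(grad f_i)."""
    S = support[i]                     # indices where grad f_i can be nonzero
    g = grad_fi(x, i)                  # assumed zero outside S
    # D_i re-weights the dense average alpha_bar so that E_i[D_i alpha_bar] = alpha_bar
    x[S] -= gamma * (g[S] - alpha[i][S] + d_inv[S] * alpha_bar[S])
    alpha_bar[S] += (g[S] - alpha[i][S]) / n   # alpha_i changes only on S
    alpha[i] = g                               # memorize the fresh gradient
    return x, alpha, alpha_bar
```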
28. Sparse SAGA
• All operations are sparse: the cost per iteration is $O(|\mathrm{supp}(\nabla f_i)|)$.
• Same convergence properties as SAGA, but with cheaper iterations in the presence of sparsity.
30. Proximal Sparse SAGA
Problem: Composite finite sum

    $\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} f_i(x) + g(x)$,  where

• $g$ is potentially nonsmooth (think $\lambda\|\cdot\|_1$ or an indicator function), but we have access to $\mathrm{prox}_{\gamma g}(x) = \arg\min_z \gamma g(z) + \frac{1}{2}\|x - z\|^2$.
• For some $g$, the proximal operator is available in closed form. Examples: $\ell_1$ norm (soft thresholding), indicator function (projection).
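Both examples above have one-line closed forms; here is a small NumPy sketch (the function names are mine, not from the paper):

```python
import numpy as np

def prox_l1(x, gamma_lam):
    """Proximal operator of gamma * lam * ||.||_1: elementwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma_lam, 0.0)

def prox_box(x, lower, upper):
    """Proximal operator of the indicator of the box [lower, upper]^p: projection."""
    return np.clip(x, lower, upper)
```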
32. Sparse Proximal SAGA
We can extend Sparse SAGA to incorporate a proximal term.
• Assume $g$ is separable: $g(x) = \sum_{j=1}^{p} g_j(x_j)$.
• Let $\varphi_i = \sum_{j} (D_i)_{j,j}\, g_j(x_j)$.
• Crucial properties: $\mathbb{E}_i[D_i] = I$ and $\mathbb{E}_i[\varphi_i] = g$.

Sparse Proximal SAGA algorithm [4]
Sample uniformly $i \in \{1, \dots, n\}$ and compute $(x^+, \alpha^+)$ as

    $x^+ = \mathrm{prox}_{\gamma \varphi_i}\big(x - \gamma(\nabla f_i(x) - \alpha_i + D_i \bar{\alpha})\big)$ ;  $\alpha_i^+ = \nabla f_i(x)$

[4] Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
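For $g = \lambda\|\cdot\|_1$, $\mathrm{prox}_{\gamma\varphi_i}$ reduces to soft thresholding on $\mathrm{supp}(\nabla f_i)$ with per-coordinate thresholds scaled by $(D_i)_{j,j}$, and to the identity elsewhere, so the whole update stays sparse. A minimal sketch under these assumptions (names are illustrative):

```python
import numpy as np

def prox_phi_i(v, S, d_inv, lam, gamma):
    """prox of gamma * phi_i for g = lam * ||.||_1: weighted soft thresholding on the
    support S of grad f_i, identity on the other coordinates."""
    out = v.copy()
    out[S] = np.sign(v[S]) * np.maximum(np.abs(v[S]) - gamma * lam * d_inv[S], 0.0)
    return out
```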
33. Sparse Proximal SAGA
As SAGA, linear convergence under strong convexity.
Theorem
For step size $\gamma = \frac{1}{5L}$ and $f$ $L$-smooth and $\mu$-strongly convex ($\mu > 0$), at iteration $t$ we have

    $\mathbb{E}\,\|x_t - x^*\|^2 \le \big(1 - \tfrac{1}{5}\min\{\tfrac{1}{n}, \tfrac{\mu}{L}\}\big)^t\, C_0$,

with $C_0 = \|x_0 - x^*\|^2 + \frac{1}{5L^2}\sum_{i=1}^{n}\|\alpha_i^0 - \nabla f_i(x^*)\|^2$.

Implications
• Same convergence rate as SAGA, with cheaper updates in the presence of sparsity.
• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.
35. Asynchronous Proximal SAGA
ProxASAGA (Pedregosa, Leblond, and Lacoste-Julien, 2017)
1. Read the information in shared memory ($\hat{x}$, $\hat{\alpha}_i$, $\hat{\bar{\alpha}}$).
2. Sample $i$ and compute $\nabla f_i(\hat{x})$.
3. Perform the Sparse Proximal SAGA update on shared memory:

    $x = \mathrm{prox}_{\gamma \varphi_i}\big(x - \gamma(\nabla f_i(\hat{x}) - \hat{\alpha}_i + D_i \hat{\bar{\alpha}})\big)$ ;  $\alpha_i = \nabla f_i(\hat{x})$

• As in Hogwild!, reads and writes are inconsistent.
• Same convergence rate as the sequential version under sparsity of the gradients (delays $\le \frac{1}{10\sqrt{\text{sparsity}}}$).
39. Empirical Results - Speedup

    Speedup = (time to $10^{-10}$ suboptimality on one core) / (time to the same suboptimality on $k$ cores)

[Figure: time speedup vs. number of cores (1 to 20) on the KDD10, KDD12 and Criteo datasets, comparing Ideal, ProxASAGA, AsySPCD and FISTA.]

• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by theory, there is a high correlation between the degree of sparsity and the speedup.
42. Analysis
Active Research Topic
• Lock-free Asynchronous SGD: Hogwild! (Niu et al., 2011)
• Stochastic Approximation (Duchi, Chaturapruek, and Ré, 2015)
• Nonconvex losses (De Sa et al., 2015; Lian et al., 2015)
• Variance-reduced stochastic methods (Reddi et al., 2015)
Claim #1
There are fundamental flaws in these analyses.
44. Analysis
The analysis of an optimization algorithm requires proving progress from one iterate to the next.
How do we define an iterate?
Asynchronous SGD
All workers do in parallel:
1. Read the information in shared memory ($\hat{x}$).
2. Sample $i$ and compute $\nabla f_i(\hat{x})$.
3. Perform SGD update on shared memory: $x = x - \gamma \nabla f_i(\hat{x})$.
47. Naming Scheme and Unbiasedness Assumption
“After Write” Labeling (Niu et al., 2011)
Each time a worker has finished writing to shared memory, increment the iteration counter.
⇐⇒ $\hat{x}_t$ = $(t+1)$-th successful update to shared memory.

Unbiasedness Assumption
Asynchronous SGD-like algorithms crucially rely on the unbiasedness property

    $\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$.

Issue
The naming scheme and the unbiasedness assumption are incompatible.
54. A Problematic Example
Problem: $\min_x \frac{1}{2}(f_1(x) + f_2(x))$ with 2 workers.
Suppose $\nabla f_1$ takes less time to compute than $\nabla f_2$. What is $\mathbb{E}_{i_0}[\nabla f_{i_0}(\hat{x}_0)]$?

Each worker samples its index independently and uniformly. Since $\nabla f_1$ finishes first, the first update written to shared memory is $\nabla f_1(\hat{x}_0)$ in three of the four equally likely assignments of indices to the two workers, and $\nabla f_2(\hat{x}_0)$ only when both workers sample $f_2$.

In all, $\mathbb{E}_{i_0}[\nabla f_{i_0}(\hat{x}_0)] = \frac{3}{4}\nabla f_1(\hat{x}_0) + \frac{1}{4}\nabla f_2(\hat{x}_0) \ne \nabla f(\hat{x}_0)$.

• This scheme does not satisfy the crucial unbiasedness condition.
• Can we fix it?
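The 3/4 vs. 1/4 split is easy to reproduce numerically; below is a small simulation under the simplifying assumption that the worker computing $\nabla f_1$ always finishes first.

```python
import random

def first_update(trials=100_000, seed=0):
    """Simulate which gradient is written to shared memory first when f_1 is faster."""
    rng = random.Random(seed)
    counts = {1: 0, 2: 0}
    for _ in range(trials):
        i_worker1 = rng.choice([1, 2])   # each worker samples independently
        i_worker2 = rng.choice([1, 2])
        # grad f_1 is faster, so the first write is f_1 unless both workers drew f_2
        first = 1 if 1 in (i_worker1, i_worker2) else 2
        counts[first] += 1
    return {k: v / trials for k, v in counts.items()}

print(first_update())  # approximately {1: 0.75, 2: 0.25}
```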
56. A New Labeling Scheme
“After read” labeling scheme
Each time a worker has finished reading from shared memory, increment the iteration counter.
⇐⇒ $\hat{x}_t$ = $(t+1)$-th successful read from shared memory.

No dependency between $i_t$ and the cost of computing $\nabla f_{i_t}$.
Full analysis of Hogwild, Asynchronous SVRG, and Asynchronous SAGA in [5].

[5] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
57. Convergence results – preliminaries
Some notation.
• $\Delta = \max_{j \in \{1,\dots,d\}} |\{i : j \in \mathrm{supp}(\nabla f_i)\}| / n$, a measure of gradient sparsity. We always have $1/n \le \Delta \le 1$.
• $\tau$ = number of updates between the time the vector of coefficients is read from memory and the time the update is finished.
58. A rigorous analysis of Hogwild (Niu et al., 2011)
• Inconsistent reads.
• Unlike (Niu et al., 2011), allows for inconsistent writes.
• Unlike (Niu et al., 2011; Mania et al., 2017), no global bound on the gradient.

Main result for Hogwild (handwaving)
Let $f$ be $\mu$-strongly convex and $L$-smooth and assume (for simplicity) $\sqrt{\Delta} \le \frac{\mu}{L}$. Then Hogwild converges with the same rate as SGD with step size $\gamma = \frac{a}{L}$, with

    $a \le \min\Big\{\frac{1}{5(1 + 2\tau\sqrt{\Delta})},\; \frac{L}{\mu\Delta}\Big\}$.

⇒ theoretical linear speedup.
59. Main result for ASAGA
Main result for ASAGA (handwaving)
Let $f$ be $\mu$-strongly convex and $L$-smooth and assume (for simplicity) $\sqrt{\Delta} \le \frac{\mu}{L}$. Then ASAGA converges with the same rate as SAGA with step size $\gamma = \frac{a}{L}$, with

    $a \le \frac{1}{32(1 + \tau\sqrt{\Delta})}$.

⇒ theoretical linear speedup, step size independent of $\mu$.
60. Perspectives
• Better scalability ⇐⇒ communication efficiency.
• Tighter analysis with better constants / step size independent of $\Delta$.
• Large gap between theory and practice.
• Interplay with generalization and momentum.
Thanks for your attention!
61. References
Arrow, Kenneth Joseph and Leonid Hurwicz (1958). Decentralization and computation in resource allocation. Stanford University, Department of Economics.
Bertsekas, Dimitri P. and John N. Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Athena Scientific.
Cauchy, Augustin (1847). “Méthode générale pour la résolution des systèmes d'équations simultanées”. In: Comp. Rend. Sci. Paris.
De Sa, Christopher M et al. (2015). “Taming the wild: A unified analysis of Hogwild-style algorithms”. In: Advances in Neural Information Processing Systems.
Dean, Jeffrey et al. (2012). “Large scale distributed deep networks”. In: Advances in Neural Information Processing Systems.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in Neural Information Processing Systems.
Duchi, John C, Sorathan Chaturapruek, and Christopher Ré (2015). “Asynchronous stochastic convex optimization”. In: arXiv preprint arXiv:1508.00882.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: Asynchronous Parallel SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017).
— (2018). “Improved asynchronous parallel optimization analysis for stochastic incremental methods”. In: to appear in Journal of Machine Learning Research.
Lian, Xiangru et al. (2015). “Asynchronous parallel stochastic gradient for nonconvex optimization”. In: Advances in Neural Information Processing Systems.
Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In: SIAM Journal on Optimization.
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”. In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural Information Processing Systems 30 (NIPS).
Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its asynchronous variants”. In: Advances in Neural Information Processing Systems.
Robbins, Herbert and Sutton Monro (1951). “A Stochastic Approximation Method”. In: Ann. Math. Statist.
Tsitsiklis, John, Dimitri Bertsekas, and Michael Athans (1986). “Distributed asynchronous deterministic and stochastic gradient optimization algorithms”. In: IEEE Transactions on Automatic Control.
64. Supervised Machine Learning
Data: $n$ observations $(a_i, b_i) \in \mathbb{R}^p \times \mathbb{R}$
Prediction function: $h(a, x) \in \mathbb{R}$
Motivating examples:
• Linear prediction: $h(a, x) = x^T a$
• Neural networks: $h(a, x) = x_m^T \sigma(x_{m-1}^T \sigma(\cdots x_2^T \sigma(x_1^T a)))$
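As an illustration, a minimal NumPy sketch of these two prediction functions (the activation choice and names are assumptions):

```python
import numpy as np

def h_linear(a, x):
    """Linear prediction h(a, x) = x^T a."""
    return x @ a

def h_mlp(a, weights):
    """Multilayer perceptron in the spirit of the slide: weights = [x_1, ..., x_m]."""
    z = a
    for W in weights[:-1]:
        z = np.tanh(W.T @ z)   # sigma = tanh, one possible choice
    return weights[-1].T @ z
```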
65. Sparse Proximal SAGA
For step size $\gamma = \frac{1}{5L}$, with $f$ gradient-$L$-Lipschitz and $\mu$-strongly convex ($\mu > 0$), Sparse Proximal SAGA converges geometrically in expectation. At iteration $t$ we have

    $\mathbb{E}\,\|x_t - x^*\|^2 \le \big(1 - \tfrac{1}{5}\min\{\tfrac{1}{n}, \tfrac{1}{\kappa}\}\big)^t\, C_0$,

with $C_0 = \|x_0 - x^*\|^2 + \frac{1}{5L^2}\sum_{i=1}^{n}\|\alpha_i^0 - \nabla f_i(x^*)\|^2$ and $\kappa = \frac{L}{\mu}$ (the condition number).

Implications
• Same convergence rate as SAGA, with cheaper updates.
• In the “big data” regime ($n \ge \kappa$): rate in $O(1/n)$.
• In the “ill-conditioned” regime ($n \le \kappa$): rate in $O(1/\kappa)$.