2. Paper to Read
• Scaling Up Coordinate Descent Algorithms for
Large L1 Regularization Problems
– by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin
• Parallel computation of coordinate descent
– e.g., [Bradley+ 11] Parallel Coordinate Descent for
L1-Regularized Loss Minimization (ICML 2011)
9. Step 1: Select
• Selects a set 𝐽 of coordinates
• The selection criterion differs across variants of CD
techniques
– cyclic CD (CCD)
– stochastic CD (SCD)
• selection of a singleton
– fully greedy CD
• 𝐽 = {1, … , 𝑘}
– Shotgun [Bradley+ 11]
• selects a random subset of a given size (see the
sketch below)
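A minimal sketch of a Shotgun-style Select step in C (the paper's implementation language is C with OpenMP, but this routine and its names are my own illustration, not the authors' code): it draws 𝑃 distinct coordinates out of 𝑘 with a partial Fisher-Yates shuffle.

```c
#include <stdlib.h>

/* Select step sketch (Shotgun-style): draw P distinct coordinates out
 * of k via a partial Fisher-Yates shuffle. idx must have room for k
 * entries; on return, idx[0..P-1] is the selected set J. */
void select_random_subset(int *idx, int k, int P) {
    for (int j = 0; j < k; j++)
        idx[j] = j;
    for (int t = 0; t < P; t++) {
        int r = t + rand() % (k - t);   /* rand() used for brevity only */
        int tmp = idx[t]; idx[t] = idx[r]; idx[r] = tmp;
    }
}
```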
10. Step 2: Propose
• The Propose step computes a proposed increment 𝛿ⱼ for
each 𝑗 ∈ 𝐽
– this step does not actually change the weights
• In Step 2, we maintain a vector 𝝋 ∈ ℝᵏ, where 𝝋ⱼ is a
proxy for the objective function evaluated at 𝒘 + 𝛿ⱼ𝒆ⱼ
– update 𝝋ⱼ whenever a new proposal is calculated for 𝑗
– 𝝋 is not necessary if the algorithm accepts all
proposals (a sketch follows below)
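A minimal sketch of the Propose step, recording 𝜹 and the proxy vector 𝝋 without ever writing to 𝒘; the two callbacks are hypothetical stand-ins for the approximate minimization described on later slides.

```c
/* Propose step sketch: for each selected coordinate j, record a
 * proposed increment delta[j] and a proxy phi[j] for the objective
 * at w + delta[j]*e_j. The weights w are read but never written. */
typedef double (*propose_fn)(int j, const double *w);          /* hypothetical */
typedef double (*proxy_fn)(int j, const double *w, double d);  /* hypothetical */

void propose_step(const int *J, int P, const double *w,
                  double *delta, double *phi,
                  propose_fn propose, proxy_fn proxy) {
    for (int t = 0; t < P; t++) {
        int j = J[t];
        delta[j] = propose(j, w);
        phi[j]   = proxy(j, w, delta[j]);  /* skip if all proposals accepted */
    }
}
```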
11. Step 3: Accept
• In the Accept step, the algorithm accepts 𝐽′ ⊆ 𝐽
– [Bradley+ 11] show that correlations among features can
lead to divergence if too many coordinates are updated at
once
• In CCD, SCD, and Shotgun, all proposals are accepted
– No need to calculate 𝝋
– No need to calculate 𝝋
12. Step 4: Update
• In the Update step, the algorithm updates the weights
according to the accepted set 𝐽′
– the product 𝑿𝒘 is maintained and updated incrementally
(see the sketch below)
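A sketch of the Update step under the assumption of a dense, column-major 𝑿 (the data layout and names are mine): applying an accepted increment to 𝒘ⱼ also refreshes the cached product 𝑿𝒘 in O(𝑛) per coordinate, avoiding a full recomputation.

```c
#include <stddef.h>

/* Update step sketch: apply the accepted increments J' and keep the
 * cached product Xw consistent. X is assumed dense, column-major
 * (column j starts at X + j*n); accepted[] holds the m indices in J'. */
void update_step(double *w, double *Xw, const double *X, int n,
                 const int *accepted, const double *delta, int m) {
    for (int t = 0; t < m; t++) {
        int j = accepted[t];
        w[j] += delta[j];
        const double *col = X + (size_t)j * n;
        for (int i = 0; i < n; i++)
            Xw[i] += delta[j] * col[i];   /* Xw tracks the new w */
    }
}
```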
13. Approximate Minimization (1/2)
• Propose step calculates a proposed increment
𝛿ⱼ for each 𝑗 ∈ 𝐽
𝛿ⱼ = argmin_𝛿 𝐹(𝒘 + 𝛿𝒆ⱼ) + 𝜆|𝒘ⱼ + 𝛿|
where 𝐹(𝒘) = (1/𝑛) Σᵢ₌₁ⁿ ℓ(𝒚ᵢ, (𝑿𝒘)ᵢ)
• For a general loss function, there is no
closed-form solution along a given coordinate.
– Thus, consider approximate minimization
14. Approximate Minimization (2/2)
• Well-known minimizer (e.g., [Yuan and Lin 10])
𝛿 = −𝜓(𝒘ⱼ; (𝛻ⱼ𝐹(𝒘) − 𝜆)/𝛽, (𝛻ⱼ𝐹(𝒘) + 𝜆)/𝛽)
where 𝜓(𝑥; 𝑎, 𝑏) = 𝑎 if 𝑥 < 𝑎; 𝑏 if 𝑥 > 𝑏; 𝑥 otherwise
For squared loss 𝛽 = 1; for logistic loss 𝛽 = 1/4.
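This minimizer translates directly into C; a small sketch (function names are mine, not the paper's). Note that when 𝒘ⱼ falls inside the interval, 𝛿 = −𝒘ⱼ, i.e., the coordinate is set exactly to zero, which is how the L1 penalty produces sparsity.

```c
/* psi(x; a, b): clip x into the interval [a, b]. */
static double psi(double x, double a, double b) {
    if (x < a) return a;
    if (x > b) return b;
    return x;
}

/* Proposed increment along coordinate j: grad_j is the gradient of F
 * at w along j; beta = 1 for squared loss, beta = 0.25 for logistic. */
static double propose_delta(double w_j, double grad_j,
                            double lambda, double beta) {
    return -psi(w_j, (grad_j - lambda) / beta, (grad_j + lambda) / beta);
}
```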
17. Algorithms (conventional)
• SHOTGUN [Bradley+ 11]
– Select step: random subset of the columns
– Accept step: accepts every proposal
• No need to compute a proxy for the objective
– convergence is guaranteed only if the # of coordinates selected
is at most 𝑃* = 𝑘 / (2𝜌) (*1)
• GREEDY
– Select step: all coordinates
– Propose step: each thread generates proposals for some subset
of the coordinates using the approximation
– Accept step: accepts only the single best proposal among all
threads (see the sketch below)
(*1) 𝜌 is the largest eigenvalue (spectral radius) of 𝑿ᵀ𝑿
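A sketch of how GREEDY's Accept step could pick the single best proposal with OpenMP (an assumed implementation, not the authors' code): each thread finds the local argmin of the proxy 𝝋 over its chunk, and the local winners are merged in a critical section.

```c
#include <float.h>

/* GREEDY Accept sketch: return the index j minimizing the proxy
 * phi[j], so J' = { j }. Each thread scans its chunk of coordinates,
 * then the per-thread winners are merged in a critical section. */
int accept_single_best(const double *phi, int k) {
    int best = -1;
    double best_phi = DBL_MAX;
    #pragma omp parallel
    {
        int local = -1;
        double local_phi = DBL_MAX;
        #pragma omp for nowait
        for (int j = 0; j < k; j++)
            if (phi[j] < local_phi) { local_phi = phi[j]; local = j; }
        #pragma omp critical
        if (local_phi < best_phi) { best_phi = local_phi; best = local; }
    }
    return best;
}
```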
19. Algorithms (proposed)
• THREAD-GREEDY
– Select step: random set of coordinates (?)
– Propose step: each thread generates proposals for some subset of the
coordinates using the approximation
– Accept step: Each thread accepts the best of the proposals
– no proof for convergence (however, empirical results are encouraging)
• COLORING
– Preprocessing: structurally independent features are identified via
partial distance-2 coloring (see the sketch below)
– Select step: a random color is selected
– Accept step: accepts every proposal
• since features of the same color are structurally independent
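A greedy sketch of the preprocessing idea (assuming a dense 𝑿 for simplicity; the paper works on sparse data with a proper partial distance-2 coloring algorithm): two columns conflict when they share a nonzero row, i.e., they are distance 2 apart in the column-row bipartite graph, and columns of the same color can safely be updated together.

```c
#include <stddef.h>

/* Do columns a and b of X (dense, column-major, n rows) share a
 * nonzero row? If so they must receive different colors. */
static int columns_conflict(const double *X, int n, int a, int b) {
    const double *ca = X + (size_t)a * n, *cb = X + (size_t)b * n;
    for (int i = 0; i < n; i++)
        if (ca[i] != 0.0 && cb[i] != 0.0) return 1;
    return 0;
}

/* Greedy coloring sketch: assign each column the smallest color not
 * used by any conflicting, already-colored column. Returns the number
 * of colors used. O(k^2 * n) on dense data - illustration only. */
int color_columns(const double *X, int n, int k, int *color) {
    int ncolors = 0;
    for (int j = 0; j < k; j++) {
        int c = 0, retry = 1;
        while (retry) {                /* find the smallest free color */
            retry = 0;
            for (int a = 0; a < j; a++)
                if (color[a] == c && columns_conflict(X, n, a, j)) {
                    c++; retry = 1; break;
                }
        }
        color[j] = c;
        if (c + 1 > ncolors) ncolors = c + 1;
    }
    return ncolors;
}
```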
20. Implementation and Platform
• Implementation
– gcc with OpenMP
• -O3 -fopenmp flags
• parallel for pragma
• static scheduling
– Given n iterations and p threads, each thread gets n/p iterations
• Platform
– AMD Opteron (Magny-Cours)
• with 48 cores (12 cores x 4 sockets)
– 256GB Memory
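A minimal sketch of this pattern; the pragma, flags, and scheduling are the ones listed above, while the loop body (squared-loss coordinate gradients over a dense column-major 𝑿) is my assumption for illustration.

```c
#include <stddef.h>
#include <omp.h>

/* One parallel-for with static scheduling: each of the p threads gets
 * a contiguous chunk of roughly k/p iterations. Here each iteration
 * computes the squared-loss coordinate gradient
 *   grad[j] = (1/n) * sum_i X[i][j] * (Xw[i] - y[i]).
 * Compile with: gcc -O3 -fopenmp */
void coordinate_gradients(int n, int k, const double *X,
                          const double *Xw, const double *y,
                          double *grad) {
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < k; j++) {
        const double *col = X + (size_t)j * n;   /* column j of X */
        double g = 0.0;
        for (int i = 0; i < n; i++)
            g += col[i] * (Xw[i] - y[i]);
        grad[j] = g / n;
    }
}
```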
24. Summary
• Presented GenCD, a generic framework for
expressing parallel coordinate descent
– Select, Propose, Accept, Update
• Performed convergence and scalability tests for the
four algorithms
– but the authors do not favor any of these algorithms
over the others
• The condition for convergence of the THREAD-
GREEDY algorithm is an open question
25. References
• [Yuan and Lin 10] G. Yuan, C. Lin, “A Comparison of Optimization Methods
and Software for Large-scale L1-regularized Linear Classification”, Journal
of Machine Learning Research, vol. 11, pp. 3183–3234, 2010.
• [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, C. Guestrin, “Parallel
Coordinate Descent for L1-Regularized Loss Minimization”, In Proc. ICML
’11, 2011.