3. Introduction
❑ Problems to be addressed in high-dimensional data:
1) Predictive performance
2) Interpretability
3) Highly correlated variables
Sparsity assumption: the number of nonzero coefficients $\beta_j$ and/or interactions $\theta_{jk}$ is very small.
5. Introduction
❑ Shrinkage methods based on regularization
$$\hat{\beta} = \arg\min_{\beta} \; l(\beta) + \lambda \cdot \begin{cases} \|\beta\|_1, & \text{Lasso} \\ \|\beta\|_2^2, & \text{Ridge} \end{cases}$$
where $l(\beta)$ is the loss function w.r.t. $\beta$, e.g., the squared, logistic, or hinge loss.
1) Ridge: prevents overfitting but performs no variable selection
2) Lasso: performs variable selection but tends to select only one variable from each group of correlated variables (contrasted in the sketch below)
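A minimal sketch contrasting the two penalties on synthetic data with one pair of nearly identical columns; the data, seed, and regularization strengths are illustrative assumptions, and scikit-learn's Ridge/Lasso stand in for a generic loss $l(\beta)$.

```python
# Minimal sketch (illustrative data/strengths): Ridge shrinks but keeps both
# of two nearly identical columns, while Lasso tends to zero one of them out.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # columns 0 and 1 nearly identical
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # both correlated coefficients survive, shrunken
lasso = Lasso(alpha=0.1).fit(X, y)   # sparse fit often drops one of the pair
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```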
6. Group selection
❑ Group Lasso (Yuan et al., 2006)
Coefficients are organized into $K$ groups (known in advance), $g_1, g_2, \dots, g_K \subseteq \{1, 2, \dots, p\}$, disjoint; the Group-Lasso penalty is then
$$\lambda \sum_{k=1}^{K} d_k \|\beta_{g_k}\|_2, \quad \text{where } \|\beta_{g_k}\|_2 = \sqrt{\textstyle\sum_{i \in g_k} \beta_i^2}$$
❑ Properties:
1) Group size = 1 -> LASSO
2) Convex penalty
3) Encourages selecting or removing an entire group (penalty evaluated in the sketch below)
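A minimal sketch that just evaluates the Group-Lasso penalty for pre-specified disjoint groups; the groups, coefficients, and the common weight choice $d_k = \sqrt{|g_k|}$ are illustrative assumptions.

```python
# Minimal sketch: lambda * sum_k d_k * ||beta_{g_k}||_2 over disjoint groups,
# with the common weight choice d_k = sqrt(|g_k|) (all values illustrative).
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[list(g)]) for g in groups)

beta = np.array([0.5, -0.2, 0.0, 0.0, 1.0])
print(group_lasso_penalty(beta, groups=[(0, 1), (2, 3), (4,)], lam=0.1))
```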
How to do group selection without prior knowledge of group structures?
7. Group selection: automatic feature group
❑ Elastic Net (Zou et al., 2005)
A linear combination of the ridge and LASSO penalties for group selection via the penalty:
$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2$$
❑ Properties:
1) The $L_1$ term leads to a sparse solution
2) The $L_2$ term encourages highly correlated variables to receive similar (averaged) coefficients; the penalty is evaluated in the sketch below
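A minimal sketch evaluating the mixture penalty as written above; the $\alpha$ here is the slides' mixing weight (scikit-learn's ElasticNet expresses the same mixture via l1_ratio), and all values are illustrative assumptions.

```python
# Minimal sketch: alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2,
# where alpha is the L1/L2 mixing weight from the slide.
import numpy as np

def elastic_net_penalty(beta, alpha):
    return alpha * np.abs(beta).sum() + (1.0 - alpha) * np.square(beta).sum()

print(elastic_net_penalty(np.array([0.5, -0.5, 0.0]), alpha=0.7))
```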
8. Group selection: automatic feature group (cont.)
❑ OSCAR (Bondell et al., 2008)
A combination of the LASSO penalty and an $L_\infty$ penalty for each pair of variables:
$$\sum_{j=1}^{p} |\beta_j| + c \sum_{j<k} \max\{|\beta_j|, |\beta_k|\}$$
❑ Properties:
1) Encourages equality of coefficients (see the sketch below)
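A minimal sketch evaluating the OSCAR penalty; the constant $c$ and the coefficients are illustrative assumptions.

```python
# Minimal sketch: L1 term plus c * sum_{j<k} max(|beta_j|, |beta_k|);
# the pairwise L-infinity term pulls coefficient magnitudes toward equality.
import numpy as np
from itertools import combinations

def oscar_penalty(beta, c):
    l1 = np.abs(beta).sum()
    pair_max = sum(max(abs(beta[j]), abs(beta[k]))
                   for j, k in combinations(range(len(beta)), 2))
    return l1 + c * pair_max

print(oscar_penalty(np.array([1.0, 1.0, -0.5]), c=0.2))
```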
9. Group selection: automatic feature group (cont.)
❑ Fused LASSO (Friedman et al., 2007)
A lasso term plus a fused penalty:
$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|$$
❑ Properties:
1) Encourages sparsity in the differences of coefficients (see the sketch below)
2) Introduced to account for 1-d correlation of predictors
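A minimal sketch evaluating the fused-lasso penalty on an ordered coefficient vector; the mixing weight and values are illustrative assumptions.

```python
# Minimal sketch: lasso term plus absolute differences of adjacent
# coefficients, reflecting the assumed 1-d ordering of predictors.
import numpy as np

def fused_lasso_penalty(beta, alpha):
    return alpha * np.abs(beta).sum() + (1.0 - alpha) * np.abs(np.diff(beta)).sum()

print(fused_lasso_penalty(np.array([0.0, 1.0, 1.0, 0.0]), alpha=0.5))
```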
10. Group selection: automatic feature group (cont.)
❑ HORSE (Friedman et al., 2007)
Extension of fused LASSO:
$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j<k} |\beta_j - \beta_k|$$
❑ Properties:
1) Encourages sparsity in the differences of coefficients
2) Applies the fused penalty to all pairs of variables (see the sketch below)
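A minimal sketch of the all-pairs variant; everything beyond the formula's structure is an illustrative assumption.

```python
# Minimal sketch: same lasso term, but the fused differences now run over
# every pair j < k rather than only adjacent coefficients.
import numpy as np
from itertools import combinations

def all_pairs_fused_penalty(beta, alpha):
    fused = sum(abs(beta[j] - beta[k])
                for j, k in combinations(range(len(beta)), 2))
    return alpha * np.abs(beta).sum() + (1.0 - alpha) * fused

print(all_pairs_fused_penalty(np.array([1.0, 1.0, 0.0]), alpha=0.5))
```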
12. Hierarchy selection
❑ SHIM (Choi et al., 2010)
Simply reparameterize the coefficients of the 2-way interaction model:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{j<k} \theta_{jk} x_{ij} x_{ik} + \varepsilon$$
which becomes:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{j<k} \gamma_{jk} \beta_j \beta_k x_{ij} x_{ik} + \varepsilon$$
❑ Properties:
1) Satisfies “strong hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_k \neq 0$); see the sketch below
2) But non-convex; an alternating minimization strategy is used for optimization
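A minimal sketch of the reparameterized predictor, showing why the strong hierarchy holds: with $\theta_{jk} = \gamma_{jk}\beta_j\beta_k$, a zero main effect removes every interaction that involves it (all values illustrative).

```python
# Minimal sketch: y = beta0 + X @ beta + sum_{j<k} gamma_jk*beta_j*beta_k*x_j*x_k.
# Here beta[1] = 0, so every interaction involving feature 1 vanishes.
import numpy as np

def shim_predict(X, beta0, beta, gamma):
    yhat = beta0 + X @ beta
    p = X.shape[1]
    for j in range(p):
        for k in range(j + 1, p):
            yhat += gamma[j, k] * beta[j] * beta[k] * X[:, j] * X[:, k]
    return yhat

X = np.random.default_rng(0).normal(size=(4, 3))
print(shim_predict(X, 0.0, np.array([1.0, 0.0, 0.5]), np.ones((3, 3))))
```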
13. Hierarchy selection
❑ Composite Absolute Penalties (CAP) (Zhao et al., 2009)
Use overlapping group selection to induce hierarchy selection.
Consider X1, X2. Hierarchy X1 -> X2 can be induced by:
$$T(\beta) = \|(\beta_1, \beta_2)\|_{\gamma_1} + \|\beta_2\|_{\gamma_2}$$
14. Hierarchy selection
❑ Composite Absolute Penalties (Zhao et al., 2009)
Hierarchically structured sparsity for the 2-way interaction model can be obtained by:
$$T(\beta, \theta) = \sum_{j<k} \left\{ |\theta_{jk}| + \|(\beta_j, \beta_k, \theta_{jk})\|_{\gamma_{jk}} \right\}$$
[Figure: hierarchy graph linking main effects $\beta_j$, $\beta_k$, $\beta_l$ to interactions $\theta_{jk}$, $\theta_{kl}$]
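A minimal sketch evaluating this overlapping-group penalty with each group norm taken as $L_2$; the choice $\gamma_{jk} = 2$ and all values are illustrative assumptions.

```python
# Minimal sketch: sum over pairs j<k of |theta_jk| plus the overlapping
# group norm ||(beta_j, beta_k, theta_jk)||_2 that ties each interaction
# to its two parent main effects.
import numpy as np
from itertools import combinations

def cap_hierarchy_penalty(beta, theta):
    total = 0.0
    for j, k in combinations(range(len(beta)), 2):
        total += abs(theta[j, k]) + np.linalg.norm([beta[j], beta[k], theta[j, k]])
    return total

beta = np.array([1.0, 0.0, 0.5])
theta = np.zeros((3, 3))
theta[0, 2] = 0.3
print(cap_hierarchy_penalty(beta, theta))
```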
15. Hierarchy selection
❑ Hierarchical interaction LASSO (Bien et al., 2013)
Add convex constraints to the lasso to produce sparse interaction models that induce hierarchy conditions. Start with the following:
$$\min_{\beta, \theta} \; l(\beta, \theta) + \lambda \|\beta\|_1 + \frac{\lambda}{2} \|\theta\|_1 \quad \text{s.t.} \quad \theta = \theta^T, \quad \|\theta_j\|_1 \leq |\beta_j|$$
❑ Properties:
1) Automatically satisfies “strong hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_k \neq 0$); the constraints are checked in the sketch below
2) But non-convex
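A minimal sketch checking the two constraints on synthetic values: symmetry of $\theta$ and the per-variable budget $\|\theta_j\|_1 \leq |\beta_j|$, under which a zero main effect forbids all of its interactions.

```python
# Minimal sketch: feasibility check for theta = theta^T and
# ||theta_j||_1 <= |beta_j| (row-wise L1 budget per main effect).
import numpy as np

def strong_hierarchy_feasible(beta, theta, tol=1e-12):
    symmetric = np.allclose(theta, theta.T, atol=tol)
    budget_ok = np.all(np.abs(theta).sum(axis=1) <= np.abs(beta) + tol)
    return symmetric and budget_ok

beta = np.array([1.0, 0.0])
theta = np.array([[0.0, 0.4],
                  [0.4, 0.0]])
# Infeasible: theta_12 != 0 while beta_2 = 0 leaves its row no budget.
print(strong_hierarchy_feasible(beta, theta))  # False
```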
16. Hierarchy selection
❑ Hierarchical interaction LASSO (Bien et al., 2013)
Convex relaxation: replace $\beta$ by $\beta^+ - \beta^-$ ($\beta^+, \beta^- \geq 0$); then:
$$\min_{\beta^+, \beta^-, \theta} \; l(\beta^+ - \beta^-, \theta) + \lambda \mathbf{1}^T (\beta^+ + \beta^-) + \frac{\lambda}{2} \|\theta\|_1$$
$$\text{s.t.} \quad \theta = \theta^T, \quad \|\theta_j\|_1 \leq \beta_j^+ + \beta_j^-, \quad \beta_j^+, \beta_j^- \geq 0$$
❑ Properties:
1) Still satisfies “strong hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_k \neq 0$)
2) Equivalent to $\lambda \sum_{j=1}^{p} \max(|\beta_j|, \|\theta_j\|_1) + \frac{\lambda}{2} \|\theta\|_1$ (evaluated in the sketch below)
3) Optimization is a bit hard due to the symmetry constraint, but ADMM can be used
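A minimal sketch evaluating the equivalent penalty in property 2); the coefficient values and $\lambda$ are illustrative assumptions.

```python
# Minimal sketch: lam * sum_j max(|beta_j|, ||theta_j||_1) + (lam/2)*||theta||_1,
# where ||theta_j||_1 is the L1 norm of row j of the interaction matrix.
import numpy as np

def hier_lasso_penalty(beta, theta, lam):
    row_l1 = np.abs(theta).sum(axis=1)
    return (lam * np.maximum(np.abs(beta), row_l1).sum()
            + 0.5 * lam * np.abs(theta).sum())

beta = np.array([1.0, 0.2])
theta = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
print(hier_lasso_penalty(beta, theta, lam=0.1))
```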
17. Hierarchy selection
❑ Hierarchical interaction LASSO (Bien et al., 2013)
Removing the symmetry constraint, then:
$$\min_{\beta^+, \beta^-, \theta} \; l(\beta^+ - \beta^-, \theta) + \lambda \mathbf{1}^T (\beta^+ + \beta^-) + \frac{\lambda}{2} \|\theta\|_1$$
$$\text{s.t.} \quad \|\theta_j\|_1 \leq \beta_j^+ + \beta_j^-, \quad \beta_j^+, \beta_j^- \geq 0$$
❑ Properties:
1) Now only satisfies “weak hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ or $\beta_k \neq 0$)
2) Convex
3) Optimization is easy because the $\beta_j^+, \beta_j^-$ variables separate across $j$ (proximal operator; see the sketch below)
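A minimal sketch of the separable building block behind property 3): the proximal operator of $t\|\cdot\|_1$ is coordinate-wise soft-thresholding, shown here on illustrative values.

```python
# Minimal sketch: prox of t*||.||_1 shrinks each coordinate toward zero
# by t and zeroes out anything smaller than t.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

print(soft_threshold(np.array([0.8, -0.3, 0.05]), t=0.1))  # [ 0.7 -0.2  0. ]
```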
18. Hierarchy selection
❑ VANISH (Radchenko & James, 2010)
1) Linear model: $Y = \sum_{j=1}^{p} \beta_j X_j + \sum_{j<k} \theta_{jk} X_j \circ X_k + \varepsilon$
2) Nonlinear: $Y = \sum_{j=1}^{p} f_j + \sum_{j<k} f_{jk} + \varepsilon$
3) Penalty: $P(f) = \lambda_1 \sum_{j=1}^{p} \left( \|f_j\|^2 + \sum_{k \neq j} \|f_{jk}\|^2 \right)^{1/2} + \lambda_2 \sum_{j<k} \|f_{jk}\|$
Remark: if $f_j = \beta_j X_j$, $j = 1, \dots, p$, and $X$ is normalized, then the penalty becomes:
$$P(\beta, \theta) = \lambda_1 \sum_{j=1}^{p} \|(\beta_j, \theta_j)\|_2 + \lambda_2 \sum_{j<k} |\theta_{jk}|$$
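A minimal sketch evaluating the linear-case penalty from the remark, where $\theta_j$ collects all interactions involving variable $j$; the values and tuning parameters are illustrative assumptions.

```python
# Minimal sketch: lam1 * sum_j ||(beta_j, theta_j)||_2 + lam2 * sum_{j<k} |theta_jk|,
# with theta_j taken as row j of a symmetric interaction matrix.
import numpy as np
from itertools import combinations

def vanish_penalty(beta, theta, lam1, lam2):
    p = len(beta)
    group = sum(np.sqrt(beta[j] ** 2 + np.square(theta[j]).sum()) for j in range(p))
    pairs = sum(abs(theta[j, k]) for j, k in combinations(range(p), 2))
    return lam1 * group + lam2 * pairs

beta = np.array([1.0, 0.0, 0.5])
theta = np.zeros((3, 3))
theta[0, 1] = theta[1, 0] = 0.2
print(vanish_penalty(beta, theta, lam1=0.1, lam2=0.05))
```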
19. Hierarchy selection
❑ GRESH (She et al., 2013)
Proposed a general model covering the previously mentioned regularizers, of the following form:
$$\min_{[\beta, \theta]} \; l(\beta, \theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \sum_{j=1}^{p} \|(\beta_j, z(\theta_j))\|_q \quad \text{s.t.} \quad \theta^T = \theta$$
❑ Remark:
1) If $z(\theta_j) = \theta_j^T$ and $q = 2$, then it becomes VANISH
2) If $z(\theta_j) = \|\theta_j\|_1$ and $q = \infty$, then it becomes HiLASSO
Both special cases are evaluated in the sketch below.
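A minimal sketch of the general penalty with $z$ and $q$ as the knobs from the remark; all numerical values are illustrative assumptions.

```python
# Minimal sketch: lam1*||theta||_1 + lam2 * sum_j ||(beta_j, z(theta_j))||_q.
# z = identity with q = 2 matches VANISH's group term; z = L1 norm with
# q = inf gives max(|beta_j|, ||theta_j||_1), the HiLASSO form.
import numpy as np

def gresh_penalty(beta, theta, lam1, lam2, z=lambda tj: tj, q=2):
    groups = sum(
        np.linalg.norm(np.append(np.atleast_1d(z(theta[j])), beta[j]), ord=q)
        for j in range(len(beta)))
    return lam1 * np.abs(theta).sum() + lam2 * groups

beta = np.array([1.0, 0.0])
theta = np.array([[0.0, 0.3],
                  [0.3, 0.0]])
print(gresh_penalty(beta, theta, 0.05, 0.1))                   # VANISH-like
print(gresh_penalty(beta, theta, 0.05, 0.1,
                    z=lambda tj: np.abs(tj).sum(), q=np.inf))  # HiLASSO-like
```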