3. Introduction
❑ Problems to be addressed in high-dimensional data:
1) Predictive performance
2) Interpretability
3) Highly correlated variables
Sparsity assumption: the number of nonzero coefficients $\beta_j$ and/or interactions $\theta_{jk}$ is very small.
5. Introduction
❑ Shrinkage methods based on regularization
$$\hat{\beta} = \arg\min_{\beta} \; l(\beta) + \lambda \cdot \begin{cases} \|\beta\|_1, & \text{Lasso} \\ \|\beta\|_2^2, & \text{Ridge} \end{cases}$$
where $l(\beta)$ is the loss function w.r.t. $\beta$, e.g., the squared, logistic, or hinge loss.
1) Ridge: prevents overfitting but performs no variable selection
2) Lasso: performs variable selection but tends to select only one variable from each group of correlated variables (contrasted in the sketch below)
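A minimal sketch contrasting the two penalties on synthetic data with one pair of nearly identical columns; the data, seed, and regularization strengths are illustrative assumptions, and scikit-learn's Ridge/Lasso stand in for a generic loss $l(\beta)$.

```python
# Minimal sketch (illustrative data/strengths): Ridge shrinks but keeps both
# of two nearly identical columns, while Lasso tends to zero one of them out.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # columns 0 and 1 nearly identical
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # both correlated coefficients survive, shrunken
lasso = Lasso(alpha=0.1).fit(X, y)   # sparse fit often drops one of the pair
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```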
6. Group selection
❑ Group Lasso (Yuan et al., 2006)
Coefficients are organized into $K$ groups (known in advance), $g_1, g_2, \dots, g_K \subseteq \{1, 2, \dots, p\}$, disjoint; the Group-Lasso penalty is then
$$\lambda \sum_{k=1}^{K} d_k \|\beta_{g_k}\|_2, \quad \text{where } \|\beta_{g_k}\|_2 = \sqrt{\textstyle\sum_{i \in g_k} \beta_i^2}$$
❑ Properties:
1) Group size = 1 -> LASSO
2) Convex penalty
3) Encourages selecting or removing an entire group (penalty evaluated in the sketch below)
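A minimal sketch that just evaluates the Group-Lasso penalty for pre-specified disjoint groups; the groups, coefficients, and the common weight choice $d_k = \sqrt{|g_k|}$ are illustrative assumptions.

```python
# Minimal sketch: lambda * sum_k d_k * ||beta_{g_k}||_2 over disjoint groups,
# with the common weight choice d_k = sqrt(|g_k|) (all values illustrative).
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[list(g)]) for g in groups)

beta = np.array([0.5, -0.2, 0.0, 0.0, 1.0])
print(group_lasso_penalty(beta, groups=[(0, 1), (2, 3), (4,)], lam=0.1))
```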
How to do group selection without prior knowledge of group structures?
7. Group selection: automatic feature group
❑ Elastic Net (Zou et al., 2005)
A linear combination of the ridge and LASSO penalties for group selection via the penalty:
$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2$$
❑ Properties:
1) The $L_1$ term leads to a sparse solution
2) The $L_2$ term encourages highly correlated variables to receive similar (averaged) coefficients; the penalty is evaluated in the sketch below
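A minimal sketch evaluating the mixture penalty as written above; the $\alpha$ here is the slides' mixing weight (scikit-learn's ElasticNet expresses the same mixture via l1_ratio), and all values are illustrative assumptions.

```python
# Minimal sketch: alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2,
# where alpha is the L1/L2 mixing weight from the slide.
import numpy as np

def elastic_net_penalty(beta, alpha):
    return alpha * np.abs(beta).sum() + (1.0 - alpha) * np.square(beta).sum()

print(elastic_net_penalty(np.array([0.5, -0.5, 0.0]), alpha=0.7))
```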
8. Group selection: automatic feature group (cont.)
❑ OSCAR (Bondell et al., 2008)
A combination of the LASSO penalty and an $L_\infty$ penalty for each pair of variables:
$$\sum_{j=1}^{p} |\beta_j| + c \sum_{j<k} \max\{|\beta_j|, |\beta_k|\}$$
❑ Properties:
1) Encourages equality of coefficients (see the sketch below)
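A minimal sketch evaluating the OSCAR penalty; the constant $c$ and the coefficients are illustrative assumptions.

```python
# Minimal sketch: L1 term plus c * sum_{j<k} max(|beta_j|, |beta_k|);
# the pairwise L-infinity term pulls coefficient magnitudes toward equality.
import numpy as np
from itertools import combinations

def oscar_penalty(beta, c):
    l1 = np.abs(beta).sum()
    pair_max = sum(max(abs(beta[j]), abs(beta[k]))
                   for j, k in combinations(range(len(beta)), 2))
    return l1 + c * pair_max

print(oscar_penalty(np.array([1.0, 1.0, -0.5]), c=0.2))
```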
9. Group selection: automatic feature group (cont.)
❑ Fused LASSO (Friedman et al., 2007)
A lasso term plus a fused penalty:
$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|$$
❑ Properties:
1) Encourages sparsity in the differences of coefficients (see the sketch below)
2) Introduced to account for 1-d correlation of predictors
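A minimal sketch evaluating the fused-lasso penalty on an ordered coefficient vector; the mixing weight and values are illustrative assumptions.

```python
# Minimal sketch: lasso term plus absolute differences of adjacent
# coefficients, reflecting the assumed 1-d ordering of predictors.
import numpy as np

def fused_lasso_penalty(beta, alpha):
    return alpha * np.abs(beta).sum() + (1.0 - alpha) * np.abs(np.diff(beta)).sum()

print(fused_lasso_penalty(np.array([0.0, 1.0, 1.0, 0.0]), alpha=0.5))
```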
10. Group selection: automatic feature group (cont.)
❑ HORSE (Friedman et al., 2007)
Extension of fused LASSO:
$$\alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j<k} |\beta_j - \beta_k|$$
❑ Properties:
1) Encourages sparsity in the differences of coefficients
2) Applies the fused penalty to all pairs of variables (see the sketch below)
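A minimal sketch of the all-pairs variant; everything beyond the formula's structure is an illustrative assumption.

```python
# Minimal sketch: same lasso term, but the fused differences now run over
# every pair j < k rather than only adjacent coefficients.
import numpy as np
from itertools import combinations

def all_pairs_fused_penalty(beta, alpha):
    fused = sum(abs(beta[j] - beta[k])
                for j, k in combinations(range(len(beta)), 2))
    return alpha * np.abs(beta).sum() + (1.0 - alpha) * fused

print(all_pairs_fused_penalty(np.array([1.0, 1.0, 0.0]), alpha=0.5))
```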
12. Hierarchy selection
❑ SHIM (Choi et al., 2010)
Simply reparameterize the coefficients of the 2-way interaction model:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{j<k} \theta_{jk} x_{ij} x_{ik} + \varepsilon$$
which becomes:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{j<k} \gamma_{jk} \beta_j \beta_k x_{ij} x_{ik} + \varepsilon$$
❑ Properties:
1) Satisfies “strong hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_k \neq 0$); see the sketch below
2) But non-convex; an alternating minimization strategy is used for optimization
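A minimal sketch of the reparameterized predictor, showing why the strong hierarchy holds: with $\theta_{jk} = \gamma_{jk}\beta_j\beta_k$, a zero main effect removes every interaction that involves it (all values illustrative).

```python
# Minimal sketch: y = beta0 + X @ beta + sum_{j<k} gamma_jk*beta_j*beta_k*x_j*x_k.
# Here beta[1] = 0, so every interaction involving feature 1 vanishes.
import numpy as np

def shim_predict(X, beta0, beta, gamma):
    yhat = beta0 + X @ beta
    p = X.shape[1]
    for j in range(p):
        for k in range(j + 1, p):
            yhat += gamma[j, k] * beta[j] * beta[k] * X[:, j] * X[:, k]
    return yhat

X = np.random.default_rng(0).normal(size=(4, 3))
print(shim_predict(X, 0.0, np.array([1.0, 0.0, 0.5]), np.ones((3, 3))))
```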
13. Hierarchy selection
❑ Composite Absolute Penalties (CAP) (Zhao et al., 2009)
Use overlapping group selection to induce hierarchy selection.
Consider X1, X2. Hierarchy X1 -> X2 can be induced by:
$$T(\beta) = \|(\beta_1, \beta_2)\|_{\gamma_1} + \|\beta_2\|_{\gamma_2}$$
14. Hierarchy selection
❑ Composite Absolute Penalties (Zhao et al., 2009)
Hierarchically structured sparsity for the 2-way interaction model can be obtained by:
$$T(\beta, \theta) = \sum_{j<k} \left\{ |\theta_{jk}| + \|(\beta_j, \beta_k, \theta_{jk})\|_{\gamma_{jk}} \right\}$$
[Figure: hierarchy graph linking main effects $\beta_j$, $\beta_k$, $\beta_l$ to interactions $\theta_{jk}$, $\theta_{kl}$]
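A minimal sketch evaluating this overlapping-group penalty with each group norm taken as $L_2$; the choice $\gamma_{jk} = 2$ and all values are illustrative assumptions.

```python
# Minimal sketch: sum over pairs j<k of |theta_jk| plus the overlapping
# group norm ||(beta_j, beta_k, theta_jk)||_2 that ties each interaction
# to its two parent main effects.
import numpy as np
from itertools import combinations

def cap_hierarchy_penalty(beta, theta):
    total = 0.0
    for j, k in combinations(range(len(beta)), 2):
        total += abs(theta[j, k]) + np.linalg.norm([beta[j], beta[k], theta[j, k]])
    return total

beta = np.array([1.0, 0.0, 0.5])
theta = np.zeros((3, 3))
theta[0, 2] = 0.3
print(cap_hierarchy_penalty(beta, theta))
```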
15. Hierarchy selection
❑ Hierarchical interaction LASSO (Bien et al., 2013)
Add convex constraints to the lasso to produce sparse interaction models that induce hierarchy conditions. Start with the following:
$$\min_{\beta, \theta} \; l(\beta, \theta) + \lambda \|\beta\|_1 + \frac{\lambda}{2} \|\theta\|_1 \quad \text{s.t.} \quad \theta = \theta^T, \quad \|\theta_j\|_1 \leq |\beta_j|$$
❑ Properties:
1) Automatically satisfies “strong hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_k \neq 0$); the constraints are checked in the sketch below
2) But non-convex
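A minimal sketch checking the two constraints on synthetic values: symmetry of $\theta$ and the per-variable budget $\|\theta_j\|_1 \leq |\beta_j|$, under which a zero main effect forbids all of its interactions.

```python
# Minimal sketch: feasibility check for theta = theta^T and
# ||theta_j||_1 <= |beta_j| (row-wise L1 budget per main effect).
import numpy as np

def strong_hierarchy_feasible(beta, theta, tol=1e-12):
    symmetric = np.allclose(theta, theta.T, atol=tol)
    budget_ok = np.all(np.abs(theta).sum(axis=1) <= np.abs(beta) + tol)
    return symmetric and budget_ok

beta = np.array([1.0, 0.0])
theta = np.array([[0.0, 0.4],
                  [0.4, 0.0]])
# Infeasible: theta_12 != 0 while beta_2 = 0 leaves its row no budget.
print(strong_hierarchy_feasible(beta, theta))  # False
```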
16. Hierarchy selection
❑ Hierarchical interaction LASSO (Bien et al., 2013)
Convex relaxation: replace $\beta$ by $\beta^+ - \beta^-$ ($\beta^+, \beta^- \geq 0$); then:
$$\min_{\beta^+, \beta^-, \theta} \; l(\beta^+ - \beta^-, \theta) + \lambda \mathbf{1}^T (\beta^+ + \beta^-) + \frac{\lambda}{2} \|\theta\|_1$$
$$\text{s.t.} \quad \theta = \theta^T, \quad \|\theta_j\|_1 \leq \beta_j^+ + \beta_j^-, \quad \beta_j^+, \beta_j^- \geq 0$$
❑ Properties:
1) Still satisfies “strong hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_k \neq 0$)
2) Equivalent to $\lambda \sum_{j=1}^{p} \max(|\beta_j|, \|\theta_j\|_1) + \frac{\lambda}{2} \|\theta\|_1$ (evaluated in the sketch below)
3) Optimization is a bit hard due to the symmetry constraint, but ADMM can be used
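A minimal sketch evaluating the equivalent penalty in property 2); the coefficient values and $\lambda$ are illustrative assumptions.

```python
# Minimal sketch: lam * sum_j max(|beta_j|, ||theta_j||_1) + (lam/2)*||theta||_1,
# where ||theta_j||_1 is the L1 norm of row j of the interaction matrix.
import numpy as np

def hier_lasso_penalty(beta, theta, lam):
    row_l1 = np.abs(theta).sum(axis=1)
    return (lam * np.maximum(np.abs(beta), row_l1).sum()
            + 0.5 * lam * np.abs(theta).sum())

beta = np.array([1.0, 0.2])
theta = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
print(hier_lasso_penalty(beta, theta, lam=0.1))
```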
17. Hierarchy selection
❑ Hierarchical interaction LASSO (Bien et al., 2013)
Removing the symmetry constraint, then:
$$\min_{\beta^+, \beta^-, \theta} \; l(\beta^+ - \beta^-, \theta) + \lambda \mathbf{1}^T (\beta^+ + \beta^-) + \frac{\lambda}{2} \|\theta\|_1$$
$$\text{s.t.} \quad \|\theta_j\|_1 \leq \beta_j^+ + \beta_j^-, \quad \beta_j^+, \beta_j^- \geq 0$$
❑ Properties:
1) Now only satisfies “weak hierarchy” ($\theta_{jk} \neq 0 \Rightarrow \beta_j \neq 0$ or $\beta_k \neq 0$)
2) Convex
3) Optimization is easy because the $\beta_j^+, \beta_j^-$ variables separate across $j$ (proximal operator; see the sketch below)
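A minimal sketch of the separable building block behind property 3): the proximal operator of $t\|\cdot\|_1$ is coordinate-wise soft-thresholding, shown here on illustrative values.

```python
# Minimal sketch: prox of t*||.||_1 shrinks each coordinate toward zero
# by t and zeroes out anything smaller than t.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

print(soft_threshold(np.array([0.8, -0.3, 0.05]), t=0.1))  # [ 0.7 -0.2  0. ]
```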
18. Hierarchy selection
❑ VANISH (Radchenko & James, 2010)
1) Linear model: $Y = \sum_{j=1}^{p} \beta_j X_j + \sum_{j<k} \theta_{jk} X_j \circ X_k + \varepsilon$
2) Nonlinear: $Y = \sum_{j=1}^{p} f_j + \sum_{j<k} f_{jk} + \varepsilon$
3) Penalty: $P(f) = \lambda_1 \sum_{j=1}^{p} \left( \|f_j\|^2 + \sum_{k \neq j} \|f_{jk}\|^2 \right)^{1/2} + \lambda_2 \sum_{j<k} \|f_{jk}\|$
Remark: if $f_j = \beta_j X_j$, $j = 1, \dots, p$, and $X$ is normalized, then the penalty becomes:
$$P(\beta, \theta) = \lambda_1 \sum_{j=1}^{p} \|(\beta_j, \theta_j)\|_2 + \lambda_2 \sum_{j<k} |\theta_{jk}|$$
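A minimal sketch evaluating the linear-case penalty from the remark, where $\theta_j$ collects all interactions involving variable $j$; the values and tuning parameters are illustrative assumptions.

```python
# Minimal sketch: lam1 * sum_j ||(beta_j, theta_j)||_2 + lam2 * sum_{j<k} |theta_jk|,
# with theta_j taken as row j of a symmetric interaction matrix.
import numpy as np
from itertools import combinations

def vanish_penalty(beta, theta, lam1, lam2):
    p = len(beta)
    group = sum(np.sqrt(beta[j] ** 2 + np.square(theta[j]).sum()) for j in range(p))
    pairs = sum(abs(theta[j, k]) for j, k in combinations(range(p), 2))
    return lam1 * group + lam2 * pairs

beta = np.array([1.0, 0.0, 0.5])
theta = np.zeros((3, 3))
theta[0, 1] = theta[1, 0] = 0.2
print(vanish_penalty(beta, theta, lam1=0.1, lam2=0.05))
```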
19. Hierarchy selection
❑ GRESH (She et al., 2013)
Proposed a general model covering the previously mentioned regularizers, of the following form:
$$\min_{[\beta, \theta]} \; l(\beta, \theta) + \lambda_1 \|\theta\|_1 + \lambda_2 \sum_{j=1}^{p} \|(\beta_j, z(\theta_j))\|_q \quad \text{s.t.} \quad \theta^T = \theta$$
❑ Remark:
1) If $z(\theta_j) = \theta_j^T$ and $q = 2$, then it becomes VANISH
2) If $z(\theta_j) = \|\theta_j\|_1$ and $q = \infty$, then it becomes HiLASSO
Both special cases are evaluated in the sketch below.
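A minimal sketch of the general penalty with $z$ and $q$ as the knobs from the remark; all numerical values are illustrative assumptions.

```python
# Minimal sketch: lam1*||theta||_1 + lam2 * sum_j ||(beta_j, z(theta_j))||_q.
# z = identity with q = 2 matches VANISH's group term; z = L1 norm with
# q = inf gives max(|beta_j|, ||theta_j||_1), the HiLASSO form.
import numpy as np

def gresh_penalty(beta, theta, lam1, lam2, z=lambda tj: tj, q=2):
    groups = sum(
        np.linalg.norm(np.append(np.atleast_1d(z(theta[j])), beta[j]), ord=q)
        for j in range(len(beta)))
    return lam1 * np.abs(theta).sum() + lam2 * groups

beta = np.array([1.0, 0.0])
theta = np.array([[0.0, 0.3],
                  [0.3, 0.0]])
print(gresh_penalty(beta, theta, 0.05, 0.1))                   # VANISH-like
print(gresh_penalty(beta, theta, 0.05, 0.1,
                    z=lambda tj: np.abs(tj).sum(), q=np.inf))  # HiLASSO-like
```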