The document presents the cooperative-Lasso, a regularization method for variable selection in regression that assumes sign-coherent group structure. It begins by introducing generalized linear models and the group Lasso estimator. It then notes two limitations of the group Lasso: it does not allow for single zeros within groups, and it does not enforce sign coherence within groups. The cooperative-Lasso is introduced as a penalty that assumes groups will have either all non-positive, non-negative, or null parameters. Examples of applications that could benefit from sign coherence between variables within groups are given.
Sparsity with sign-coherent groups of variables via the cooperative-Lasso
Julien Chiquet¹, Yves Grandvalet², Camille Charbonnier¹
¹ Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
² Heudiasyc, CNRS & Université de Technologie de Compiègne
SSB – 29 March 2011
arXiv preprint.
http://arxiv.org/abs/1103.2697
R-package scoop.
http://stat.genopole.cnrs.fr/logiciels/scoop
Notations

Let
- $Y$ be the output random variable,
- $X = (X^1, \dots, X^p)$ be the input random variables, where $X^j$ is the $j$th predictor.

The data
Given a sample $\{(y_i, x_i),\ i = 1, \dots, n\}$ of i.i.d. realizations of $(Y, X)$, denote
- $y = (y_1, \dots, y_n)$ the response vector,
- $x^j = (x^j_1, \dots, x^j_n)$ the vector of data for the $j$th predictor,
- $X$ the $n \times p$ design matrix whose $j$th column is $x^j$,
- $\mathcal{D} = \{i : (y_i, x_i) \in \text{training set}\}$,
- $\mathcal{T} = \{i : (y_i, x_i) \in \text{test set}\}$.
Generalized linear models

Suppose $Y$ depends linearly on $X$ through a function $g$:
$$\mathbb{E}(Y) = g(X\beta).$$
We predict a response $y_i$ by $\hat y_i = g(x_i\hat\beta)$ for any $i \in \mathcal{T}$ by solving
$$\hat\beta = \arg\max_\beta \ell_{\mathcal{D}}(\beta) = \arg\min_\beta \sum_{i\in\mathcal{D}} L_g(y_i, x_i\beta),$$
where $L_g$ is a loss function depending on the function $g$. Typically,
- if $Y$ is Gaussian and $g = \mathrm{Id}$ (OLS),
  $$L_g(y, x\beta) = (y - x\beta)^2,$$
- if $Y$ is binary and $g : t \mapsto g(t) = (1 + e^{-t})^{-1}$ (logistic regression),
  $$L_g(y, x\beta) = -\left(y \cdot x\beta - \log\left(1 + e^{x\beta}\right)\right),$$
- or any negative log-likelihood of an exponential family distribution.
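As a concrete reading of these two losses, here is a minimal R sketch (ours, not code from the scoop package), where `eta` stands for the linear predictor $x\beta$:

```r
## Minimal sketch of the two losses above (not from the scoop package).
## eta is the linear predictor x %*% beta.
loss_gaussian <- function(y, eta) (y - eta)^2
loss_logistic <- function(y, eta) -(y * eta - log(1 + exp(eta)))

loss_gaussian(1.2, 0.8)  # (1.2 - 0.8)^2 = 0.16
loss_logistic(1, 0.8)    # log(1 + exp(-0.8)) ~ 0.371
```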
Estimation and selection at the group level

1. Structure: the set $\mathcal{I} = \{1, \dots, p\}$ splits into a known partition
$$\mathcal{I} = \bigcup_{k=1}^K \mathcal{G}_k, \quad \text{with } \mathcal{G}_k \cap \mathcal{G}_\ell = \emptyset \text{ for } k \neq \ell.$$
2. Sparsity: the support $S$ of $\beta$ has few entries:
$$S = \{i : \beta_i \neq 0\}, \quad \text{such that } |S| \ll p.$$

The group-Lasso estimator (Grandvalet and Canu '98, Bakin '99, Yuan and Lin '06)
$$\hat\beta^{\mathrm{group}} = \arg\min_{\beta\in\mathbb{R}^p} -\ell_{\mathcal{D}}(\beta) + \lambda \sum_{k=1}^K w_k \left\|\beta_{\mathcal{G}_k}\right\|,$$
where $\lambda \ge 0$ controls the overall amount of penalty, and $w_k > 0$ adapts the penalty between groups (dropped hereafter).
Toy example: the prostate dataset

Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients:
- lcavol: log(cancer volume)
- lweight: log(prostate weight)
- age: age
- lbph: log(benign prostatic hyperplasia amount)
- svi: seminal vesicle invasion
- lcp: log(capsular penetration)
- gleason: Gleason score
- pgg45: percentage of Gleason scores 4 or 5

Figure: Lasso regularization path (coefficients versus $\lambda$ on a log scale).
Figure: hierarchical clustering of the eight predictors (dendrogram).
Figure: group-Lasso regularization path (coefficients versus $\lambda$ on a log scale).
Application to splice site detection

Predict splice site status (0/1) from a sequence of 7 bases and their interactions:
- order 0: 7 factors with 4 levels,
- order 1: $\binom{7}{2}$ factors with $4^2$ levels,
- order 2: $\binom{7}{3}$ factors with $4^3$ levels;
using dummy coding for each factor, we form groups.

Figure: information content at each position of the sequence.
Figure: selected groups by interaction order (labels g49, g45, g61, g44, g54, g42, g4, g18, g5).

L. Meier, S. van de Geer, P. Bühlmann, 2008. The group-Lasso for logistic regression, JRSS Series B.
Group-Lasso limitations

1. Not a single zero may occur within a group of non-zeros.
   Strong group sparsity (Huang and Zhang, '10 arXiv) establishes the conditions under which the group-Lasso outperforms the Lasso, and conversely.
2. No sign-coherence within groups.
   Sign-coherence is required when groups gather consonant variables, e.g., groups defined by clusters of positively correlated variables.

The cooperative-Lasso
A penalty which assumes a sign-coherent group structure, that is to say, groups which gather either
- non-positive,
- non-negative,
- or null parameters.
Motivation: multiple network inference

Figure: three experiments, each leading to a separately inferred network.

A group is a set of corresponding edges across tasks (e.g., the red or blue ones): sign-coherence matters!

J. Chiquet, Y. Grandvalet, C. Ambroise, 2010. Inferring multiple graphical structures, Statistics and Computing.
Motivation: joint segmentation of aCGH profiles

$$\min_{\beta\in\mathbb{R}^{n\times p}} \left\|\beta - Y\right\|^2 \quad \text{s.t.} \quad \sum_{i=1}^p \left\|\beta_i - \beta_{i-1}\right\| < s,$$
where
- $Y$ is an $n \times p$ matrix of $n$ profiles of size $p$,
- $\beta_i$ is a size-$n$ vector with the $i$th probes of the $n$ profiles,
- a group gathers every position $i$ across profiles.
Sign-coherence may avoid inconsistent variations across profiles.

Figure: log-ratio (CNVs) versus position on the chromosome.

K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
Outline
- Definition
- Resolution
- Consistency
- Model selection
- Simulation studies
- Sibling probe sets and gene selection
The cooperative-Lasso estimator

Definition
$$\hat\beta^{\mathrm{coop}} = \arg\min_{\beta\in\mathbb{R}^p} J(\beta), \quad \text{with } J(\beta) = -\ell_{\mathcal{D}}(\beta) + \lambda \left\|\beta\right\|_{\mathrm{coop}},$$
where, for any $v \in \mathbb{R}^p$,
$$\left\|v\right\|_{\mathrm{coop}} = \left\|v^+\right\|_{\mathrm{group}} + \left\|v^-\right\|_{\mathrm{group}} = \sum_{k=1}^K \left( \left\|v^+_{\mathcal{G}_k}\right\| + \left\|v^-_{\mathcal{G}_k}\right\| \right),$$
and
$$v^+ = (v_1^+, \dots, v_p^+),\ v_j^+ = \max(0, v_j), \qquad v^- = (v_1^-, \dots, v_p^-),\ v_j^- = \max(0, -v_j).$$
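To make the definition concrete, here is a minimal R sketch contrasting the group norm and the coop-norm (the list-of-indices `groups` interface is our convention for this note, not the scoop API):

```r
## Minimal sketch contrasting the group-Lasso norm and the coop-norm.
group_norm <- function(v, groups)
  sum(sapply(groups, function(g) sqrt(sum(v[g]^2))))

coop_norm <- function(v, groups) {
  vplus  <- pmax(v, 0)    # v^+
  vminus <- pmax(-v, 0)   # v^-
  group_norm(vplus, groups) + group_norm(vminus, groups)
}

v <- c(1, -1, 2, 2)
groups <- list(1:2, 3:4)
group_norm(v, groups)  # sqrt(2) + sqrt(8) ~ 4.243
coop_norm(v, groups)   # (1 + 1) + sqrt(8)  ~ 4.828
```

Note how the sign-incoherent first group pays $\|(1,0)\| + \|(0,1)\| = 2$ instead of $\|(1,-1)\| = \sqrt{2}$: the coop-norm charges extra for mixing signs within a group.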
Convex analysis

Supporting hyperplane: a hyperplane supports a set iff
- the set is contained in one half-space,
- the set has at least one point on the hyperplane.

Figure: illustration in the $(\beta_1, \beta_2)$ plane.

There are supporting hyperplanes at all points of a convex set: they generalize tangents.
Convex analysis

Dual cone and subgradient: subgradients generalize normals.

Figure: illustration in the $(\beta_1, \beta_2)$ plane.

$g$ is a subgradient at $x$ when the vector $(g, -1)$ is normal to a supporting hyperplane at this point. The subdifferential at $x$ is the set of all subgradients at $x$.
Optimality conditions

Theorem
A necessary and sufficient condition for the optimality of $\beta$ is that the null vector $0$ belongs to the subdifferential of the convex function $J$:
$$0 \in \partial_\beta J(\beta) = \left\{v \in \mathbb{R}^p : v = -\nabla_\beta \ell_{\mathcal{D}}(\beta) + \lambda\theta\right\},$$
where $\theta \in \mathbb{R}^p$ belongs to the subdifferential of the coop-norm.

Define
$$\varphi_j(v) = \left\|\left(\operatorname{sign}(v_j)\, v\right)^+\right\|;$$
then $\theta$ is such that
$$\forall k \in \{1, \dots, K\},\ \forall j \in S_k(\beta), \quad \theta_j = \frac{\beta_j}{\varphi_j(\beta_{\mathcal{G}_k})},$$
$$\forall k \in \{1, \dots, K\},\ \forall j \in S_k^c(\beta), \quad \varphi_j(\theta_{\mathcal{G}_k}) \le 1.$$
We derive a subset algorithm to solve this problem (which you can enjoy in the paper and the package).
Linear regression with orthonormal design

Consider
$$\hat\beta = \arg\min_\beta \left\{ \frac12 \left\|y - X\beta\right\|^2 + \lambda\Omega(\beta) \right\},$$
with $X^\top X = I$. Hence $(x^j)^\top(X\beta - y) = \beta_j - \hat\beta_j^{\mathrm{ols}}$, and
$$\hat\beta = \arg\min_\beta \left\{ \frac12 \left\|\beta - \hat\beta^{\mathrm{ols}}\right\|^2 + \lambda\Omega(\beta) \right\}.$$
We may find a closed form of $\hat\beta$ for, e.g.,
1. $\Omega(\beta) = \|\beta\|_{\mathrm{lasso}}$,
2. $\Omega(\beta) = \|\beta\|_{\mathrm{group}}$,
3. $\Omega(\beta) = \|\beta\|_{\mathrm{coop}}$.
For the Lasso: $\forall j \in \{1, \dots, p\}$,
$$\hat\beta_j^{\mathrm{lasso}} = \left(1 - \frac{\lambda}{|\hat\beta_j^{\mathrm{ols}}|}\right)_+ \hat\beta_j^{\mathrm{ols}}, \qquad |\hat\beta_j^{\mathrm{lasso}}| = \left(|\hat\beta_j^{\mathrm{ols}}| - \lambda\right)_+.$$

Fig.: Lasso as a function of the OLS coefficients.
For the group-Lasso: $\forall k \in \{1, \dots, K\},\ \forall j \in \mathcal{G}_k$,
$$\hat\beta_j^{\mathrm{group}} = \left(1 - \frac{\lambda}{\|\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}}\|}\right)_+ \hat\beta_j^{\mathrm{ols}}, \qquad \left\|\hat\beta_{\mathcal{G}_k}^{\mathrm{group}}\right\| = \left(\left\|\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}}\right\| - \lambda\right)_+.$$

Fig.: Group-Lasso as a function of the OLS coefficients.
For the coop-Lasso: $\forall k \in \{1, \dots, K\},\ \forall j \in \mathcal{G}_k$,
$$\hat\beta_j^{\mathrm{coop}} = \left(1 - \frac{\lambda}{\varphi_j(\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}})}\right)_+ \hat\beta_j^{\mathrm{ols}}, \qquad \varphi_j\left(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}\right) = \left(\varphi_j\left(\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}}\right) - \lambda\right)_+.$$

Fig.: Coop-Lasso as a function of the OLS coefficients.
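These three shrinkage rules translate directly into code; a minimal R sketch under the orthonormal-design assumption (ours, not the scoop implementation):

```r
## Shrinkage rules under orthonormal design (a sketch, not the scoop code).
## beta_ols: OLS coefficients; groups: list of index vectors; lambda >= 0.
prox_lasso <- function(beta_ols, lambda)
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)

prox_group <- function(beta_ols, groups, lambda) {
  out <- beta_ols
  for (g in groups) {
    nrm <- sqrt(sum(beta_ols[g]^2))
    out[g] <- max(1 - lambda / nrm, 0) * beta_ols[g]
  }
  out
}

prox_coop <- function(beta_ols, groups, lambda) {
  out <- beta_ols
  for (g in groups) {
    ## phi_j(v) = ||(sign(v_j) v)^+||: norm of the same-sign part of the group
    pos <- sqrt(sum(pmax(beta_ols[g], 0)^2))
    neg <- sqrt(sum(pmax(-beta_ols[g], 0)^2))
    phi <- ifelse(beta_ols[g] >= 0, pos, neg)
    out[g] <- pmax(1 - lambda / phi, 0) * beta_ols[g]
  }
  out
}

b <- c(1, -0.2, 0.8, 0.6)          # group 1 is sign-incoherent
prox_group(b, list(1:2, 3:4), 0.5) # shrinks each group as a whole
prox_coop (b, list(1:2, 3:4), 0.5) # shrinks + and - parts separately
```

On this example the coop rule zeroes the lone negative coefficient of the first group while keeping its positive one, whereas the group-Lasso shrinks the group as a whole.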
Linear regression setup

Technical assumptions
(A1) $X$ and $Y$ have finite fourth-order moments:
$$\mathbb{E}\|X\|^4 < \infty, \qquad \mathbb{E}|Y|^4 < \infty,$$
(A2) the covariance matrix $\Psi = \mathbb{E} XX^\top \in \mathbb{R}^{p\times p}$ is invertible,
(A3) for every $k = 1, \dots, K$: if $\|(\beta_{\mathcal{G}_k})^+\| > 0$ and $\|(\beta_{\mathcal{G}_k})^-\| > 0$, then for every $j \in \mathcal{G}_k$, $\beta_j \neq 0$.
(All sign-coherent groups are either included in or excluded from the true support.)
Irrepresentability condition

Define $S_k = S \cap \mathcal{G}_k$, the support within a group, and
$$[D(\beta)]_{jj} = \left\|\left[\operatorname{sign}(\beta_j)\, \beta_{\mathcal{G}_k}\right]^+\right\|^{-1},$$
where $\mathcal{G}_k$ is the group containing $j$. Assume there exists $\eta > 0$ such that:
(A4) for every group $\mathcal{G}_k$ including at least one null coefficient,
$$\max\left( \left\|\left(\Psi_{S_k^c S}\, \Psi_{SS}^{-1} D(\beta_S)\beta_S\right)^+\right\|, \left\|\left(\Psi_{S_k^c S}\, \Psi_{SS}^{-1} D(\beta_S)\beta_S\right)^-\right\| \right) \le 1 - \eta,$$
(A5) for every group $\mathcal{G}_k$ intersecting the support and including either positive or negative coefficients, let $\nu_k$ be the sign of these coefficients ($\nu_k = 1$ if $\|(\beta_{\mathcal{G}_k})^+\| > 0$ and $\nu_k = -1$ if $\|(\beta_{\mathcal{G}_k})^-\| > 0$):
$$\nu_k\, \Psi_{S_k^c S}\, \Psi_{SS}^{-1} D(\beta_S)\beta_S \succeq 0,$$
where $\succeq$ denotes componentwise inequality.
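For a given $\Psi$ and $\beta$, (A4) can be checked numerically. A minimal R sketch of such a check, transcribing the display above under our reading that $S_k^c$ indexes the null coefficients of group $\mathcal{G}_k$:

```r
## Numerical check of (A4); a sketch transcribing the display above.
check_A4 <- function(Psi, beta, groups, eta = 0.05) {
  S <- which(beta != 0)
  ## [D(beta_S)]_jj = 1 / ||(sign(beta_j) * beta_Gk)^+|| for j in S
  d <- sapply(S, function(j) {
    g <- groups[[which(sapply(groups, function(gg) j %in% gg))]]
    1 / sqrt(sum(pmax(sign(beta[j]) * beta[g], 0)^2))
  })
  rhs <- solve(Psi[S, S], d * beta[S])   # Psi_SS^{-1} D(beta_S) beta_S
  all(sapply(groups, function(g) {
    zk <- setdiff(g, S)                  # null coefficients of the group
    if (length(zk) == 0) return(TRUE)    # fully supported group: skip
    u <- Psi[zk, S, drop = FALSE] %*% rhs
    max(sqrt(sum(pmax(u, 0)^2)), sqrt(sum(pmax(-u, 0)^2))) <= 1 - eta
  }))
}
```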
Consistency results

Theorem
If assumptions (A1)–(A5) are satisfied for some $\eta > 0$, then for every sequence $\lambda_n$ such that $\lambda_n = \lambda_0 n^{-\gamma}$, $\gamma \in (0, 1/2)$,
$$\hat\beta^{\mathrm{coop}} \xrightarrow{\;P\;} \beta \quad \text{and} \quad P\left(S(\hat\beta^{\mathrm{coop}}) = S\right) \to 1.$$
Asymptotically, the cooperative-Lasso is unbiased and enjoys exact support recovery (even when there are irrelevant variables within a group).
Sketch of the proof

1. Construct an artificial estimator $\tilde\beta_S$ restricted to the true support $S$ and extend it with 0 coefficients on $S^c$.
2. Consider the event $E_n$ on which $\tilde\beta$ satisfies the original optimality conditions. On $E_n$, $\tilde\beta_S = \hat\beta_S^{\mathrm{coop}}$ and $\hat\beta_{S^c}^{\mathrm{coop}} = 0$, by uniqueness.
3. We need to prove that $\lim_{n\to\infty} P(E_n) = 1$.
4. Derive the asymptotic distribution of the derivative of the loss function, $X^\top(y - X\tilde\beta)$, from
   - the CLT applied to second-order moments,
   - the optimality conditions on $\tilde\beta_S$.
   The right choice of $\lambda_n$ provides convergence in probability.
5. Assumptions (A4)–(A5) state that the limits in probability satisfy the optimality constraints with strict inequalities.
6. As a result, the optimality conditions are satisfied (with weak inequalities) with probability tending to one.
Illustration

Generate data $y = X\beta + \sigma\varepsilon$, with
- $\beta = (1, 1, -1, -1, 0, 0, 0, 0)$,
- $\mathcal{G} = \{\{1,2\}, \{3,4\}, \{5,6\}, \{7,8\}\}$,
- $\sigma = 0.1$, $R^2 \approx 0.99$, $n = 20$;
the irrepresentability condition holds for the coop-Lasso but not for the group-Lasso. Results are averaged over 100 simulations.

Figure: group-Lasso coefficient paths versus $\log_{10}(\lambda)$, 50% coverage intervals (upper/lower quartiles).
Figure: coop-Lasso coefficient paths versus $\log_{10}(\lambda)$, 50% coverage intervals (upper/lower quartiles).
Optimism of the training error

The training error:
$$\mathrm{err} = \frac{1}{|\mathcal{D}|} \sum_{i\in\mathcal{D}} L(y_i, x_i\hat\beta).$$
The test error ("extra-sample" error):
$$\mathrm{Err}_{\mathrm{ex}} = \mathbb{E}_{X,Y}\left[L(Y, X\hat\beta) \mid \mathcal{D}\right].$$
The "in-sample" error:
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{|\mathcal{D}|} \sum_{i\in\mathcal{D}} \mathbb{E}_{Y}\left[L(Y_i, x_i\hat\beta) \mid \mathcal{D}\right].$$

Definition (Optimism)
$$\mathrm{Err}_{\mathrm{in}} = \mathrm{err} + \text{"optimism"}.$$
Cp statistics

For squared-error loss (and some other losses),
$$\mathrm{Err}_{\mathrm{in}} = \mathrm{err} + \frac{2}{|\mathcal{D}|} \sum_{i\in\mathcal{D}} \mathrm{cov}(\hat y_i, y_i).$$
"The amount by which err underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater the covariance will be, thereby increasing the optimism." (ESLII, 5th printing)

Mallows' Cp statistic
For a linear regression fit $\hat y_i$ with $p$ inputs, $\sum_{i\in\mathcal{D}} \mathrm{cov}(\hat y_i, y_i) = p\sigma^2$:
$$C_p = \mathrm{err} + 2 \cdot \frac{\mathrm{df}}{|\mathcal{D}|}\, \hat\sigma^2, \quad \text{with } \mathrm{df} = p.$$
Generalized degrees of freedom

Let $\hat y(\lambda) = X\hat\beta(\lambda)$ be the predicted values for a penalized estimator.

Proposition (Efron ('04) + Stein's lemma ('81))
$$\mathrm{df}(\lambda) \doteq \frac{1}{\sigma^2} \sum_{i\in\mathcal{D}} \mathrm{cov}(\hat y_i(\lambda), y_i) = \mathbb{E}_y\left[\operatorname{tr}\left(\frac{\partial \hat y_\lambda}{\partial y}\right)\right].$$
For the Lasso, Zou et al. ('07) show that
$$\mathrm{df}^{\mathrm{lasso}}(\lambda) = \left\|\hat\beta^{\mathrm{lasso}}(\lambda)\right\|_0.$$
Assuming $X^\top X = I$, Yuan and Lin ('06) show for the group-Lasso that the trace term equals
$$\mathrm{df}^{\mathrm{group}}(\lambda) = \sum_{k=1}^K \mathbf{1}\left\{\left\|\hat\beta_{\mathcal{G}_k}^{\mathrm{group}}(\lambda)\right\| > 0\right\} \left(1 + (p_k - 1)\frac{\|\hat\beta_{\mathcal{G}_k}^{\mathrm{group}}(\lambda)\|}{\|\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}}\|}\right).$$
Approximated degrees of freedom for the coop-Lasso

Proposition
Assuming that the data are generated according to a linear regression model and that $X$ is orthonormal, the following expression of $\widehat{\mathrm{df}}^{\mathrm{coop}}(\lambda)$ is an unbiased estimate of $\mathrm{df}(\lambda)$:
$$\widehat{\mathrm{df}}^{\mathrm{coop}}(\lambda) = \sum_{k=1}^K \left[ \mathbf{1}\left\{\left\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^+\right\| > 0\right\} \left(1 + (p_k^+ - 1)\frac{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^+\|}{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}})^+\|}\right) + \mathbf{1}\left\{\left\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^-\right\| > 0\right\} \left(1 + (p_k^- - 1)\frac{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^-\|}{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}})^-\|}\right) \right],$$
where $p_k^+$ and $p_k^-$ are respectively the number of positive and negative entries in $\hat\beta_{\mathcal{G}_k}^{\mathrm{ols}}$.
The same proposition, with a ridge reference estimate $\hat\beta^{\mathrm{ridge}}(\gamma)$ in place of the OLS estimate:
$$\widehat{\mathrm{df}}^{\mathrm{coop}}(\lambda) = \sum_{k=1}^K \left[ \mathbf{1}\left\{\left\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^+\right\| > 0\right\} \left(1 + \frac{p_k^+ - 1}{1+\gamma}\, \frac{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^+\|}{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{ridge}}(\gamma))^+\|}\right) + \mathbf{1}\left\{\left\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^-\right\| > 0\right\} \left(1 + \frac{p_k^- - 1}{1+\gamma}\, \frac{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{coop}}(\lambda))^-\|}{\|(\hat\beta_{\mathcal{G}_k}^{\mathrm{ridge}}(\gamma))^-\|}\right) \right],$$
where $p_k^+$ and $p_k^-$ are respectively the number of positive and negative entries in $\hat\beta_{\mathcal{G}_k}^{\mathrm{ridge}}(\gamma)$.
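A minimal R transcription of the OLS-based estimate above (a sketch: `beta_coop` and `beta_ols` are full coefficient vectors and `groups` a list of index vectors, conventions of this note):

```r
## df estimate for the coop-Lasso under orthonormal design (a sketch).
df_coop <- function(beta_coop, beta_ols, groups) {
  half <- function(bc, ref) {       # contribution of one sign, e.g. "+"
    nc <- sqrt(sum(pmax(bc, 0)^2))  # ||(beta_coop_Gk)^+||
    if (nc == 0) return(0)          # indicator 1{... > 0}
    pk <- sum(ref > 0)              # p_k^+: positive entries of reference
    1 + (pk - 1) * nc / sqrt(sum(pmax(ref, 0)^2))
  }
  sum(sapply(groups, function(g)    # "-" part handled by flipping signs
    half(beta_coop[g], beta_ols[g]) + half(-beta_coop[g], -beta_ols[g])))
}
```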
Approximated information criteria

Following Zou et al., we extend the $C_p$ statistic to an "approximated" AIC,
$$\mathrm{AIC}(\lambda) = \frac{\|y - \hat y(\lambda)\|^2}{\sigma^2} + 2\, \widetilde{\mathrm{df}}(\lambda),$$
and from the AIC, it is a (small) step to the BIC:
$$\mathrm{BIC}(\lambda) = \frac{\|y - \hat y(\lambda)\|^2}{\sigma^2} + \log(n)\, \widetilde{\mathrm{df}}(\lambda).$$
K-fold cross-validation works well but is computationally intensive. It is required when the linear regression setup does not hold.
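Given residual sums of squares and df estimates along a $\lambda$ grid, both criteria are one-liners; a sketch in R (function and argument names are ours):

```r
## Approximated AIC/BIC along a penalty path (a sketch).
## rss: vector of ||y - yhat(lambda)||^2, dfs: matching df estimates.
ic_path <- function(rss, dfs, sigma2, n)
  data.frame(aic = rss / sigma2 + 2 * dfs,
             bic = rss / sigma2 + log(n) * dfs)
## e.g. pick lambda at which.min(ic_path(rss, dfs, s2, n)$bic)
```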
Breiman's setup

Simulation setting: a wave-like vector of parameters $\beta$:
- $p = 90$ variables partitioned into $K = 10$ groups of size $p_k = 9$,
- 3 (partially) active groups, 6 groups of zeros,
- in active groups, $\beta_j \propto (h - |5 - j|)_+$ with $h = 1, \dots, 5$.

Figure: $\beta$ for $h = 1, \dots, 5$; each active group has $|S_k| = 2h - 1$ non-zero coefficients ($|S_k| = 1, 3, 5, 7, 9$).
Example of solution paths and signal recovery with the BIC choice
The signal is generated as follows:
- $y = X\beta + \sigma\varepsilon$, with $\sigma = 1$ and $n = 30$ to $500$,
- $X \sim \mathcal{N}(0, \Psi)$ with $\Psi_{ij} = \rho^{|i-j|}$ ($\rho = 0.4$ in the example),
- the magnitude of $\beta$ is chosen so that $R^2 \approx 0.75$.

Remark: the covariance structure is purposely disconnected from the group structure, so none of the support recovery conditions is fulfilled.
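A minimal R sketch of one draw from this design (a reconstruction: the identity of the three active groups and the exact $R^2$ scaling are our assumptions), matching the one-shot sample below:

```r
## One draw from Breiman's setup (a reconstruction; the identity of the
## active groups and the exact R^2 scaling are assumptions).
set.seed(42)
p <- 90; K <- 10; pk <- 9; h <- 3; rho <- 0.4; sigma <- 1; n <- 120
beta <- rep(0, p)
for (k in 1:3) {                      # first 3 groups taken as active
  j <- 1:pk
  beta[(k - 1) * pk + j] <- pmax(h - abs(5 - j), 0)  # wave, clipped at 0
}
Psi <- rho^abs(outer(1:p, 1:p, "-"))  # Psi_ij = rho^|i-j|
## scale beta so that R^2 = var(X beta) / (var(X beta) + sigma^2) ~ 0.75
beta <- beta * sqrt(3 / drop(t(beta) %*% Psi %*% beta))
X <- matrix(rnorm(n * p), n, p) %*% chol(Psi)
y <- drop(X %*% beta + sigma * rnorm(n))
```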
One-shot sample with $n = 120$:
Figure: Lasso. Solution path versus $\log_{10}(\lambda)$ (left); true versus estimated signal (right).
Figure: group-Lasso. Solution path versus $\log_{10}(\lambda)$ (left); true versus estimated signal (right).
Figure: coop-Lasso. Solution path versus $\log_{10}(\lambda)$ (left); true versus estimated signal (right).
Breiman's setup: errors as a function of the sample size $n$

Figure: prediction error (left) and sign error (right) versus $n$ for the Lasso, group-Lasso, and coop-Lasso; $h = 3$, $|S_k| = 5$ (favoring the Lasso).
Figure: prediction error (left) and sign error (right) versus $n$ for the Lasso, group-Lasso, and coop-Lasso; $h = 4$, $|S_k| = 7$ (intermediate).