This document is a presentation on regression analysis using the LASSO (Least Absolute Shrinkage and Selection Operator). It opens with an introduction, defines key terms such as the ordinary least squares (OLS) estimates, and reviews standard improvement techniques such as subset selection and ridge regression. The bulk of the presentation covers the LASSO itself: its definition, motivation, behavior in the orthonormal design case, a worked example, and algorithms for finding LASSO solutions. It concludes with simulation results. The presenter's goal is to explain the LASSO method for regression shrinkage and variable selection.
Reading the Lasso 1996 paper by Robert Tibshirani
1. READING SEMINAR ON CLASSICS
Regression Shrinkage and Selection via the LASSO
By Robert Tibshirani
Presented by Ulcinaite Agne
November 4, 2012
8. Table of Contents
1 Introduction
2 OLS estimates
   OLS critics
   Standard improving techniques
3 LASSO
   Definition
   Motivation for LASSO
   Orthonormal design case
   Function forms
   Example of prostate cancer
   Prediction error and estimation of t
4 Algorithm for finding LASSO solutions
5 Simulation
6 Conclusions
9. Introduction
The Article
Regression Shrinkage and Selection via the LASSO by Robert Tibshirani.
Published in 1996 in the Journal of the Royal Statistical Society, Series B (Methodological), Vol. 58, No. 1.
11. OLS estimates
We consider the usual regression situation. The data are $(x_i, y_i)$, $i = 1, \dots, N$, where $x_i = (x_{i1}, \dots, x_{ip})^T$ and $y_i$ are the regressors and the response for the $i$th observation.
The ordinary least squares (OLS) estimates minimize the residual sum of squares (RSS):
$$\mathrm{RSS} = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$
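As a quick illustration (not part of the original slides), the OLS estimates can be computed with a standard least-squares solver. The data below are synthetic and all names are illustrative.

```python
import numpy as np

# Synthetic data: N observations, p regressors (illustrative only).
rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=N)

# OLS: minimize the RSS, with a column of ones for the intercept beta_0.
X1 = np.column_stack([np.ones(N), X])
beta_ols, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("OLS estimates (intercept first):", np.round(beta_ols, 2))
```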
14. OLS critics
Two reasons why data analysts are often not satisfied with OLS estimates:
Prediction accuracy: OLS estimates have low bias but large variance.
Interpretation: with a large number of predictors, we would often prefer a smaller subset that exhibits the strongest effects.
17. Standard improving techniques
Subset selection: small changes in the data can result in very different models.
Ridge regression:
$$\hat{\beta}^{\mathrm{ridge}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_j \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_j \beta_j^2 \le t$$
Ridge regression does not set any of the coefficients to 0, and hence does not give an easily interpretable model (a sketch follows below).
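As a hedged aside (not from the slides): scikit-learn's Ridge solves the penalized form, minimizing $\|y - X\beta\|^2 + \alpha \|\beta\|^2$, which is equivalent to the bound-$t$ form above for a suitable correspondence between $\alpha$ and $t$. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.normal(size=100)

# Larger alpha corresponds to a smaller bound t in the constrained form.
ridge = Ridge(alpha=1.0).fit(X, y)
print(np.round(ridge.coef_, 2))  # shrunk, but none exactly 0
```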
22. Definition
We consider the same data as in the OLS estimation case: $(x_i, y_i)$, $i = 1, \dots, N$, where $x_i = (x_{i1}, \dots, x_{ip})^T$.
The LASSO (Least Absolute Shrinkage and Selection Operator) estimate $(\hat{\alpha}, \hat{\beta})$ is defined by
$$(\hat{\alpha}, \hat{\beta}) = \operatorname*{argmin}_{\alpha,\beta} \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_j \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_j |\beta_j| \le t$$
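In practice the LASSO is usually fitted in its penalized (Lagrangian) form; scikit-learn's Lasso minimizes $\frac{1}{2N}\|y - \alpha - X\beta\|^2 + \lambda \sum_j |\beta_j|$, and each penalty value corresponds to some bound $t$ in the constrained form above. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.normal(size=100)

# Penalized form: each penalty value corresponds to some bound t.
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))  # some coefficients are exactly 0
```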
25. Definition
The amount of shrinkage applied to the estimates is controlled by the parameter $t \ge 0$.
Let $\hat{\beta}_j^o$ be the full least squares estimates and let $t_0 = \sum_j |\hat{\beta}_j^o|$.
Values $t < t_0$ will shrink the solutions towards 0, making some coefficients exactly equal to 0.
For example, taking $t = t_0/2$ has an effect roughly similar to finding the best subset of size $p/2$.
27. Motivation for LASSO
The LASSO came from the proposal of Breiman (1993). Breiman's non-negative garotte minimizes
$$\sum_{i=1}^{N} \Big( y_i - \alpha - \sum_j c_j \hat{\beta}_j^o x_{ij} \Big)^2 \quad \text{subject to} \quad c_j \ge 0, \;\; \sum_j c_j \le t$$
32. Orthonormal design case
Let X be the $n \times p$ design matrix with $ij$th entry $x_{ij}$, and suppose that $X^T X = I$.
The solution of the previous minimization problem is the soft-thresholded OLS estimate
$$\hat{\beta}_j = \mathrm{sign}(\hat{\beta}_j^o)\,(|\hat{\beta}_j^o| - \gamma)^+$$
Compare with the other methods (a numerical sketch follows below):
Best subset selection (of size k): keeps the k largest coefficients $|\hat{\beta}_j^o|$ and sets the rest to 0.
Ridge regression solutions: $\hat{\beta}_j^o / (1 + \gamma)$
Garotte estimates: $(1 - \gamma/\hat{\beta}_j^{o\,2})^+ \, \hat{\beta}_j^o$
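All four rules act coordinate-wise on the OLS estimates in the orthonormal case, so they are easy to compare numerically; a minimal sketch (the coefficient vector is illustrative):

```python
import numpy as np

def lasso_soft(b, g):    # sign(b) * (|b| - g)_+
    return np.sign(b) * np.maximum(np.abs(b) - g, 0.0)

def ridge_shrink(b, g):  # b / (1 + g)
    return b / (1.0 + g)

def garotte(b, g):       # (1 - g / b^2)_+ * b
    return np.maximum(1.0 - g / b**2, 0.0) * b

def best_subset(b, k):   # keep the k largest |b_j|, zero the rest
    out = np.zeros_like(b)
    keep = np.argsort(np.abs(b))[-k:]
    out[keep] = b[keep]
    return out

b = np.array([2.5, -1.2, 0.4, -0.1, 3.0])  # pretend OLS estimates
for f in (lasso_soft, ridge_shrink, garotte):
    print(f.__name__, np.round(f(b, 0.5), 2))
print("best_subset", best_subset(b, 2))
```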
34. Function forms
[Figure: form of the coefficient functions for (a) subset regression, (b) ridge regression, (c) the LASSO, (d) the garotte]
35. Estimation picture for (a) the LASSO and (b) ridge regression
[Figure not reproduced]
38. Example of prostate cancer
Data examined: from a study by Stamey (1989). A linear model is fitted to log(prostate specific antigen), lpsa.
The factors:
log(cancer volume), lcavol
log(prostate weight), lweight
age
log(benign prostatic hyperplasia amount), lbph
seminal vesicle invasion, svi
log(capsular penetration), lcp
Gleason score, gleason
percentage of Gleason scores 4 or 5, pgg45
39. Statistics of the example
Estimated coefficients and test error results for the different subset and shrinkage methods applied to the prostate data. Blank entries correspond to omitted variables. [Table not reproduced]
45. Prediction error and estimation of t
Methods for the estimation of the LASSO parameter t:
Cross-validation
Generalized cross-validation
An analytical unbiased estimate of risk
Strictly speaking, the first two methods apply to the 'X-random' case, and the third applies to the X-fixed case.
48. Prediction error and estimation of t
Suppose that
$$Y = \eta(X) + \varepsilon$$
where $E(\varepsilon) = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2$. The mean-squared error (ME) and prediction error (PE) are
$$\mathrm{ME} = E\{\hat{\eta}(X) - \eta(X)\}^2$$
$$\mathrm{PE} = E\{Y - \hat{\eta}(X)\}^2 = \mathrm{ME} + \sigma^2$$
53. Cross-validation
The prediction error (PE) is estimated by fivefold cross-validation. The LASSO is indexed in terms of the normalized parameter $s = t / \sum_j |\hat{\beta}_j^o|$, and PE is estimated over a grid of values of s from 0 to 1 inclusive (a sketch of the procedure follows this list):
Create a 5-fold partition of the dataset.
For each fold, all but one of the chunks are used for training and the remaining chunk for testing.
Repeat 5 times so that each chunk is used once for testing.
The value $\hat{s}$ yielding the lowest estimated PE is selected.
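A minimal sketch of this selection loop, with the caveat that scikit-learn indexes the LASSO by the penalty value rather than the normalized parameter s, so the grid below is over penalties (the selection logic is the same):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.normal(size=100)

alphas = np.logspace(-3, 0, 20)            # grid standing in for s
kf = KFold(n_splits=5, shuffle=True, random_state=0)
pe = []
for a in alphas:
    errs = []
    for tr, te in kf.split(X):             # 4 chunks train, 1 chunk tests
        fit = Lasso(alpha=a).fit(X[tr], y[tr])
        errs.append(np.mean((y[te] - fit.predict(X[te])) ** 2))
    pe.append(np.mean(errs))               # estimated PE at this grid point
print("selected penalty:", alphas[int(np.argmin(pe))])
```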
54. Generalized cross-validation
The constraint $\sum_j |\beta_j| \le t$ is rewritten as $\sum_j \beta_j^2 / |\beta_j| \le t$. The constrained solution $\tilde{\beta}$ can then be expressed as a ridge regression estimator
$$\tilde{\beta} = (X^T X + \lambda W^-)^{-1} X^T y$$
where $W = \mathrm{diag}(|\tilde{\beta}_j|)$ and $W^-$ denotes a generalized inverse. The number of effective parameters in the constrained fit $\tilde{\beta}$ may be approximated by
$$p(t) = \mathrm{tr}\{X (X^T X + \lambda W^-)^{-1} X^T\}$$
The generalized cross-validation style statistic is
$$\mathrm{GCV}(t) = \frac{1}{N} \frac{\mathrm{RSS}(t)}{\{1 - p(t)/N\}^2}$$
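A minimal numpy sketch of the statistic, assuming the LASSO fit $\tilde{\beta}$ and its matching multiplier $\lambda$ for the bound t are already available (both are assumptions of this sketch, not supplied by the slide):

```python
import numpy as np

def gcv(X, y, beta_tilde, lam):
    """GCV(t) for a LASSO fit beta_tilde with Lagrange multiplier lam
    (lam assumed known and paired with the bound t)."""
    N = len(y)
    # W^- : generalized inverse of diag(|beta_j|); zero entries stay zero.
    w_inv = np.array([1.0 / abs(b) if abs(b) > 1e-10 else 0.0
                      for b in beta_tilde])
    M = np.linalg.inv(X.T @ X + lam * np.diag(w_inv))
    p_t = np.trace(X @ M @ X.T)            # effective number of parameters
    rss = np.sum((y - X @ beta_tilde) ** 2)
    return rss / (N * (1.0 - p_t / N) ** 2)
```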
55. Unbiased estimate of risk
This method is based on Stein's (1981) unbiased estimate of risk.
Denote the estimated standard error of $\hat{\beta}_j^o$ by $\hat{\tau} = \hat{\sigma}/\sqrt{N}$, where $\hat{\sigma}^2 = \sum_i (y_i - \hat{y}_i)^2 / (N - p)$. Then the formula
$$R\{\hat{\beta}(\gamma)\} \approx \hat{\tau}^2 \Big\{ p - 2\,\#(j;\, |\hat{\beta}_j^o/\hat{\tau}| < \gamma) + \sum_{j=1}^{p} \max(|\hat{\beta}_j^o/\hat{\tau}|, \gamma)^2 \Big\}$$
is derived as an approximately unbiased estimate of the risk. Hence an estimate of $\gamma$ can be obtained as the minimizer of $R\{\hat{\beta}(\gamma)\}$:
$$\hat{\gamma} = \operatorname*{argmin}_{\gamma \ge 0} \big[ R\{\hat{\beta}(\gamma)\} \big]$$
From this we obtain an estimate of the LASSO parameter t:
$$\hat{t} = \sum_j (|\hat{\beta}_j^o| - \hat{\gamma})^+$$
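A minimal sketch of this recipe on synthetic data, following the slide's formulas literally; note the assumption flagged in the comments about which scale $\hat{\gamma}$ lives on when forming $\hat{t}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.normal(size=N)

b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = np.sum((y - X @ b_ols) ** 2) / (N - p)
tau = np.sqrt(sigma2 / N)                  # estimated s.e. of beta_j^o
z = np.abs(b_ols) / tau                    # standardized coefficients

def risk(gamma):                           # the slide's approximate risk
    return tau**2 * (p - 2 * np.sum(z < gamma)
                     + np.sum(np.maximum(z, gamma) ** 2))

gammas = np.linspace(0.0, z.max(), 500)
g_hat = gammas[np.argmin([risk(g) for g in gammas])]
# Assumption: gamma is compared with |beta/tau| above, so we convert it
# back to the coefficient scale (g_hat * tau) before forming t_hat.
t_hat = np.sum(np.maximum(np.abs(b_ols) - g_hat * tau, 0.0))
print("gamma_hat:", round(g_hat, 3), " t_hat:", round(t_hat, 3))
```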
59. Algorithm for finding LASSO solutions
Fix $t \ge 0$. The problem of minimizing
$$g(\beta) = \sum_{i=1}^{N} \Big( y_i - \sum_j \beta_j x_{ij} \Big)^2$$
subject to $\sum_j |\beta_j| \le t$ can be seen as a least squares problem with $2^p$ inequality constraints, one for each possible sign pattern of the $\beta_j$.
Denote by G the $m \times p$ matrix corresponding to the m linear inequality constraints on the p-vector $\beta$; for our problem, $m = 2^p$.
The equality set E corresponds to those constraints which are exactly met.
65. Algorithm for finding LASSO solutions
Outline of the algorithm (a generic-solver sketch follows below):
1. Start with $E = \{i_0\}$, where $\delta_{i_0} = \mathrm{sign}(\hat{\beta}^o)$.
2. Find $\hat{\beta}$ to minimize $g(\beta)$ subject to $G_E \beta \le t\mathbf{1}$.
3. While $\sum_j |\hat{\beta}_j| > t$:
4. add $i$ to the set E, where $\delta_i = \mathrm{sign}(\hat{\beta})$, and find $\hat{\beta}$ to minimize $g(\beta) = \sum_{i=1}^{N} \big( y_i - \sum_j \beta_j x_{ij} \big)^2$ subject to $G_E \beta \le t\mathbf{1}$.
This procedure must converge in a finite number of steps, since one element is added to the set E at each step and there is a total of $2^p$ elements.
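For intuition, the same constrained least-squares problem can be handed to a generic solver; this is a hedged stand-in for, not an implementation of, the active-set procedure above (SLSQP copes with the nonsmooth constraint only approximately):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.normal(size=100)

def lasso_bound(X, y, t):
    """Minimize ||y - X b||^2 subject to sum_j |b_j| <= t."""
    cons = {"type": "ineq", "fun": lambda b: t - np.sum(np.abs(b))}
    res = minimize(lambda b: np.sum((y - X @ b) ** 2),
                   x0=np.zeros(X.shape[1]), method="SLSQP",
                   constraints=[cons])
    return res.x

print(np.round(lasso_bound(X, y, t=3.0), 2))
```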
72. Least angle regression algorithm (Efron 2004)
Least Angle Regression Algorithm (a path-tracing sketch follows this list):
1. Standardize the predictors to have mean zero and unit norm. Start with the residual $r = y - \bar{y}$ and $\beta_1 = \dots = \beta_p = 0$.
2. Find the predictor $x_j$ most correlated with r.
3. Move $\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r \rangle$, until some other competitor $x_k$ has as much correlation with the current residual as does $x_j$.
4. Move $\beta_j$ and $\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(x_j, x_k)$, until some other competitor $x_l$ has as much correlation with the current residual.
5. If a non-zero coefficient hits zero, drop its variable from the active set of variables and recompute the current joint least squares direction.
6. Continue in this way until all p predictors have been entered. After min(N − 1, p) steps, we arrive at the full least-squares solution.
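scikit-learn exposes this path algorithm directly; with method='lasso' the drop step (5) is included, so the returned path is the LASSO path:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0]) + rng.normal(size=100)

# coefs[:, k] holds the coefficients at the kth step of the path.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which variables entered:", active)
print("final (least-squares) coefficients:", np.round(coefs[:, -1], 2))
```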
75. Simulation
In the example, 50 data sets consisting of 20 observations from the model
$$y = \beta^T x + \sigma\varepsilon$$
were simulated, where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$ and $\varepsilon$ is standard normal.
[Table: mean-squared errors over 200 simulations from the model]
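A hedged re-creation of this setup; the slide does not specify the predictor distribution, the noise scale $\sigma$, or the penalty used, so all three are assumed below:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
sigma = 3.0                              # assumed noise scale
mses = []
for _ in range(50):                      # 50 data sets of 20 observations
    Xs = rng.normal(size=(20, 8))        # assumed i.i.d. N(0,1) predictors
    ys = Xs @ beta + sigma * rng.normal(size=20)
    fit = Lasso(alpha=0.5).fit(Xs, ys)   # penalty chosen arbitrarily
    mses.append(np.mean((fit.coef_ - beta) ** 2))
print("mean coefficient MSE over 50 data sets:", round(float(np.mean(mses)), 3))
```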
76. Simulation
[Tables: most frequent models selected by the LASSO and most frequent models selected by subset regression]
78. Conclusions
The LASSO is a worthy competitor to subset selection and ridge regression.
Performance in different scenarios:
Small number of large effects: subset selection does best, the LASSO not quite as well, and ridge regression quite poorly.
Small to moderate number of moderate-size effects: the LASSO does best, followed by ridge regression and then subset selection.
Large number of small effects: ridge regression does best, followed by the LASSO and then subset selection.
79. References
Robert Tibshirani (1996). Regression Shrinkage and Selection via the LASSO. Journal of the Royal Statistical Society, Series B 58(1), 267–288.
Trevor Hastie, Robert Tibshirani, Jerome Friedman (2008). The Elements of Statistical Learning. Springer-Verlag, 57–73.
Abhimanyu Das, David Kempe. Algorithms for Subset Selection in Linear Regression.
Yizao Wang (2007). A Note on the LASSO in Model Selection.