Ridge regression, lasso and elastic net
1. 2/13/2014
Ridge Regression, LASSO and Elastic Net
A talk given at the NYC Open Data meetup; find more at www.nycopendata.com
Yunting Sun
Google Inc
2.
Overview
· Linear Regression
· Ordinary Least Square
· Ridge Regression
· LASSO
· Elastic Net
· Examples
· Exercises
Note: make sure that you have installed the elasticnet package
library(MASS)
library(elasticnet)
3.
Linear Regression
· n observations, each with one response variable and p predictors:
Y = (y_1, ..., y_n) is n × 1, X = (X_1, ..., X_p) is n × p, and x = (x_1, ..., x_p)
· We want to find a linear combination of the predictors
- use y = x^T β to predict y
- describe the actual relationship between x and y
· Examples
- find the relationship between pressure and the boiling point of water
- use GDP to predict the interest rate (the accuracy of the prediction is important but the actual relationship may not matter)
4.
Quality of an estimator
· Suppose y = x^T β + ε with ε ~ N(0, 1), where β is the true value and β̂ is an estimator.
· Prediction error at x_0, the difference between the actual response and the model prediction:
EPE(x_0) = E[(y − x_0^T β̂)^2 | x = x_0]
EPE(x_0) = 1 + [Bias(x_0^T β̂)]^2 + Var(x_0^T β̂)
where Bias(x_0^T β̂) = E(x_0^T β̂) − x_0^T β.
· The second and third terms make up the mean squared error of x_0^T β̂ in estimating x_0^T β.
· How to estimate prediction error?
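As a sanity check, the decomposition follows by adding and subtracting E(x_0^T β̂) inside the square (a short sketch, using ε ~ N(0, 1) independent of β̂):

```latex
\begin{aligned}
\mathrm{EPE}(x_0) &= E\big[(y - x_0^T\hat{\beta})^2 \mid x = x_0\big]
  = E\big[(\varepsilon + x_0^T\beta - x_0^T\hat{\beta})^2\big] \\
&= E[\varepsilon^2] + E\big[(x_0^T\beta - x_0^T\hat{\beta})^2\big]
  \qquad \text{(the cross term has mean zero)} \\
&= 1 + \big(E(x_0^T\hat{\beta}) - x_0^T\beta\big)^2 + \mathrm{Var}(x_0^T\hat{\beta})
  = 1 + \mathrm{Bias}^2(x_0^T\hat{\beta}) + \mathrm{Var}(x_0^T\hat{\beta})
\end{aligned}
```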
5.
K-fold Cross Validation
· Split dataset into K groups
- leave one group out as test set
- use the rest K-1 groups as training set to train the model
- estimate prediction error of the model from the test set
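The steps above can be sketched in a few lines of base R; the data and model here (a toy lm fit) are illustrative stand-ins, not part of the original example:

```r
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # assign each observation to one of K groups
cv.err <- rep(0, K)
for (k in 1:K) {
  train <- d[fold != k, ]                 # the other K-1 groups form the training set
  test  <- d[fold == k, ]                 # the held-out group is the test set
  fit <- lm(y ~ x, data = train)
  cv.err[k] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv.err)                              # estimated prediction error
```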
6.
K-fold Cross Validation
Let E_i be the prediction error for the i-th test group; the average prediction error is
E = (1/K) Σ_{i=1}^{K} E_i
7.
Quality of an estimator
· Mean squared error of the estimator β̂:
MSE(β̂) = E[(β̂ − β)^2] = Bias^2(β̂) + Var(β̂)
· A biased estimator may achieve smaller MSE than an unbiased estimator
· MSE is useful when our goal is to understand the relationship instead of prediction
8.
Least Squares Estimator (LSE)
Y = Xβ + ε, with Y n × 1, X n × p, β p × 1, ε n × 1:
y_i = Σ_{j=1}^{p} x_ij β_j + ε_i,  i = 1, ..., n,  ε_i i.i.d. N(0, 1)
Minimize the Residual Sum of Squares (RSS):
β̂ = argmin_β (Y − Xβ)^T (Y − Xβ) = (X^T X)^{-1} X^T Y
The solution is uniquely well defined when X^T X is invertible.
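For completeness, the closed form comes from setting the gradient of the RSS to zero (the standard normal-equations derivation):

```latex
\mathrm{RSS}(\beta) = (Y - X\beta)^T (Y - X\beta), \qquad
\frac{\partial\,\mathrm{RSS}}{\partial \beta} = -2\,X^T (Y - X\beta) = 0
\;\Longrightarrow\; X^T X \hat{\beta} = X^T Y
\;\Longrightarrow\; \hat{\beta} = (X^T X)^{-1} X^T Y .
```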
9.
Pros
· E(β̂) = β: unbiased
· LSE has the minimum MSE among unbiased linear estimators, though a biased estimator may have smaller MSE than the LSE
· explicit form
· O(np^2) computation
· confidence intervals, significance of coefficients
10.
Cons
· Var(β̂) = σ^2 (X^T X)^{-1}
· Multicollinearity leads to high variance of the estimator
- exact or approximate linear relationship among predictors
- (X^T X)^{-1} tends to have large entries
· Requires n > p, i.e., the number of observations larger than the number of predictors
· Estimated prediction error: E_{x_0} EPE(x_0) = σ^2 (p/n) + σ^2
· Prediction error increases linearly as a function of p
· Hard to interpret when the number of predictors is large; we need a smaller subset that exhibits the strongest effects
11.
Example: Leukemia classification
· Leukemia data, Golub et al., Science, 1999
· There are 38 training samples and 34 test samples with a total of p = 7129 genes (p >> n)
· X_ij is the gene expression value for sample i and gene j
· Sample i either has tumor type AML or ALL
· We want to select genes relevant to tumor type
- eliminate the trivial genes
- grouped selection, as many genes are highly correlated
· LSE does not work here!
12.
Solution: regularization
· Instead of minimizing RSS, minimize (RSS + λ × penalty on the parameters)
· Trade bias for smaller variance: a biased estimator when λ ≠ 0
· Continuous variable selection (unlike AIC, BIC, subset selection)
· λ can be chosen by cross validation
13.
Ridge Regression
β̂_ridge = argmin_β { ||Y − Xβ||_2^2 + λ ||β||_2^2 }
β̂_ridge = (X^T X + λI)^{-1} X^T Y
Pros:
· works when p >> n
· handles multicollinearity
· biased, but smaller variance and smaller MSE (Mean Squared Error)
· explicit solution
Cons:
· shrinks coefficients toward zero but cannot produce a parsimonious model
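The explicit solution follows the same way as for OLS, with the penalty adding λI to the normal equations (which is also why the inverse exists even when X^T X is singular, e.g. when p > n):

```latex
\frac{\partial}{\partial \beta}\Big[(Y - X\beta)^T (Y - X\beta) + \lambda\, \beta^T \beta\Big]
 = -2\,X^T (Y - X\beta) + 2\lambda\beta = 0
\;\Longrightarrow\; \hat{\beta}_{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T Y .
```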
14.
Grouped Selection
· if two predictors are highly correlated among themselves, their estimated coefficients will be similar
· if some variables are exactly identical, they will have the same coefficients
Ridge is good for grouped selection but not good for eliminating trivial genes
15.
Example: Ridge Regression (Collinearity)
· multicollinearity: x3 = x1 + x2
· show that ridge regression beats OLS in the multicollinearity case
library(MASS)
n = 500
z = rnorm(n, 0, 1)
y = z + 0.2 * rnorm(n, 0, 1)
x1 = z + rnorm(n, 0, 1)
x2 = z + rnorm(n, 0, 1)
x3 = x1 + x2
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
16.
OLS
# OLS fails to calculate the coefficient for x3
ols.model = lm(y ~ . - 1, d)
coef(ols.model)
#     x1     x2     x3
# 0.3053 0.3187     NA
17.
Ridge Regression
# choose tuning parameter
ridge.model = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.model$lambda[which.min(ridge.model$GCV)]
# ridge regression (shrink coefficients)
coef(lm.ridge(y ~ . - 1, d, lambda = lambda.opt))
#     x1     x2     x3
# 0.1711 0.1902 0.1528
18.
Approximately multicollinear
· show that ridge regression corrects the coefficient signs and reduces the mean squared error
x3 = x1 + x2 + 0.05 * rnorm(n, 0, 1)
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
d.train = d[1:400, ]
d.test = d[401:500, ]
19.
OLS
ols.train = lm(y ~ . - 1, d.train)
coef(ols.train)
#      x1      x2     x3
# -0.3764 -0.3522 0.6839
# prediction errors
sum((d.test$y - predict(ols.train, newdata = d.test))^2)
# [1] 37.53
20.
Ridge Regression
# choose tuning parameter for ridge regression
ridge.train = lm.ridge(y ~ . - 1, d.train, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.train$lambda[which.min(ridge.train$GCV)]
ridge.model = lm.ridge(y ~ . - 1, d.train, lambda = lambda.opt)
coef(ridge.model)  # correct signs
#     x1     x2     x3
# 0.1713 0.1936 0.1430
coefs = coef(ridge.model)
sum((d.test$y - as.matrix(d.test[, -1]) %*% matrix(coefs, 3, 1))^2)
# [1] 36.87
21.
LASSO
β̂_lasso = argmin_β { ||Y − Xβ||_2^2 + λ ||β||_1 }
Or equivalently
min_β ||Y − Xβ||_2^2  s.t.  Σ_{j=1}^{p} |β_j| ≤ t
Pros
· allows p >> n
· enforces sparsity in the parameters
· as λ goes to 0 (t goes to ∞), β̂ goes to the OLS solution
· as λ goes to ∞ (t goes to 0), β̂ = 0
· quadratic programming problem; the LARS solution requires O(np^2)
22.
Cons
· if a group of predictors are highly correlated among themselves, LASSO tends to pick only one of them and shrink the others to zero
· cannot do grouped selection; tends to select one variable per group
LASSO is good for eliminating trivial genes but not good for grouped selection
23.
LARS algorithm of Efron et al. (2004)
· stepwise variable selection (least angle regression and shrinkage)
· a less greedy version of traditional forward selection methods
· solves the entire LASSO solution path efficiently
· O(np^2): the same order of computational effort as a single OLS fit
24.
LARS Path
min_β ||Y − Xβ||_2^2  s.t.  ||β||_1 ≤ s ||β̂_OLS||_1,  s ∈ [0, 1]
25.
parsimonious model
library(MASS)
n = 20
# beta is sparse
beta = matrix(c(3, 1.5, 0, 0, 2, 0, 0, 0), 8, 1)
p = length(beta)
rho = 0.3
corr = matrix(0, p, p)
for (i in seq(p)) {
  for (j in seq(p)) {
    corr[i, j] = rho^abs(i - j)
  }
}
X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
y = X %*% beta + 3 * rnorm(n, 0, 1)
d = as.data.frame(cbind(y, X))
colnames(d) = c("y", paste0("x", seq(p)))
26.
OLS
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
  X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
  y = X %*% beta + 3 * rnorm(n, 0, 1)
  d = as.data.frame(cbind(y, X))
  colnames(d) = c("y", paste0("x", seq(p)))
  # fit OLS without intercept
  ols.model = lm(y ~ . - 1, d)
  mse[i] = sum((coef(ols.model) - beta)^2)
}
median(mse)
# [1] 6.32
27.
Ridge Regression
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
  X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
  y = X %*% beta + 3 * rnorm(n, 0, 1)
  d = as.data.frame(cbind(y, X))
  colnames(d) = c("y", paste0("x", seq(p)))
  ridge.cv = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
  lambda.opt = ridge.cv$lambda[which.min(ridge.cv$GCV)]
  # fit ridge regression without intercept
  ridge.model = lm.ridge(y ~ . - 1, d, lambda = lambda.opt)
  mse[i] = sum((coef(ridge.model) - beta)^2)
}
median(mse)
# [1] 4.074
28.
LASSO
library(elasticnet)
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
  X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
  y = X %*% beta + 3 * rnorm(n, 0, 1)
  d = as.data.frame(cbind(y, X))
  colnames(d) = c("y", paste0("x", seq(p)))
  obj.cv = cv.enet(X, y, lambda = 0, s = seq(0.1, 1, length = 100), plot.it = FALSE,
                   mode = "fraction", trace = FALSE, max.steps = 80)
  s.opt = obj.cv$s[which.min(obj.cv$cv)]
  lasso.model = enet(X, y, lambda = 0, intercept = FALSE)
  coefs = predict(lasso.model, s = s.opt, type = "coefficients", mode = "fraction")
  mse[i] = sum((coefs$coefficients - beta)^2)
}
median(mse)
# [1] 3.393
29.
Elastic Net
β̂_enet = argmin_β { (Y − Xβ)^T (Y − Xβ) + λ_2 ||β||_2^2 + λ_1 ||β||_1 }
Pros
· enforces sparsity
· no limitation on the number of selected variables
· encourages a grouping effect in the presence of highly correlated predictors
Cons
· the naive elastic net suffers from double shrinkage
Correction
β̂_enet = (1 + λ_2) β̂_naive
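One way to see why LASSO machinery (e.g. LARS) applies directly here, following the elastic net paper: the naive elastic net is equivalent to a LASSO problem on augmented data. A sketch of the augmentation:

```latex
X^* = \frac{1}{\sqrt{1+\lambda_2}}
\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I_p \end{pmatrix},
\qquad
Y^* = \begin{pmatrix} Y \\ 0 \end{pmatrix}
```

Solving the LASSO on (X*, Y*) with penalty λ₁/√(1 + λ₂) recovers the naive elastic net solution up to scaling; the ridge part of the penalty is absorbed into the extra √λ₂ I_p rows.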
30.
LASSO vs Elastic Net
Construct a data set with grouped effects to show that Elastic Net outperforms LASSO in grouped selection.
· Two independent "hidden" factors z_1 and z_2
· response y = z_1 + 0.1 z_2 + N(0, 1)
· 6 predictors X = (x_1, x_2, ..., x_6) fall into two groups: x_1, x_2, x_3 relate to z_1 (the dominant factor); x_4, x_5, x_6 relate to z_2 (the minor factor), which we would like to shrink to zero
Correlated grouped covariates:
x_1 = z_1 + ε_1,  x_2 = z_1 + ε_2,  x_3 = z_1 + ε_3
x_4 = z_2 + ε_4,  x_5 = z_2 + ε_5,  x_6 = z_2 + ε_6
31.
Simulated data
N = 100
z1 = runif(N, min = 0, max = 20)
z2 = runif(N, min = 0, max = 20)
y = z1 + 0.1 * z2 + rnorm(N)
X = cbind(z1 %*% matrix(c(1, -1, 1), 1, 3), z2 %*% matrix(c(1, -1, 1), 1, 3))
X = X + matrix(rnorm(N * 6), N, 6)
32.
LASSO path
library(elasticnet)
obj.lasso = enet(X, y, lambda = 0)
plot(obj.lasso, use.color = TRUE)
33.
Elastic Net
library(elasticnet)
obj.enet = enet(X, y, lambda = 0.5)
plot(obj.enet, use.color = TRUE)
34.
How to choose the tuning parameters
For each λ in a sequence, find the s that minimizes the CV prediction error; then pick the λ (with its s) that minimizes the CV prediction error.
library(elasticnet)
obj.cv = cv.enet(X, y, lambda = 0.5, s = seq(0, 1, length = 100), mode = "fraction",
                 trace = FALSE, max.steps = 80)
35.
Prostate Cancer Example
· Predictors are eight clinical measures
· Training set with 67 observations
· Test set with 30 observations
· Model fitting and tuning parameter selection by tenfold CV on the training set
· Compare model performance by prediction mean-squared error on the test data
36.
Compare models
· medium correlation among predictors; the highest correlation is 0.76
· elastic net beats LASSO, and ridge regression beats OLS
37.
Summary
· Ridge Regression
- good for multicollinearity and grouped selection
- not good for variable selection
· LASSO
- good for variable selection
- not good for grouped selection of strongly correlated predictors
· Elastic Net
- combines the strengths of Ridge Regression and LASSO
· Regularization
- trades bias for variance reduction
- better prediction accuracy
38.
Reference
Most of the material covered in these slides is adapted from
· Paper: Zou and Hastie, "Regularization and variable selection via the elastic net"
· Slides: http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf
· The Elements of Statistical Learning
40.
Questions:
· Fit OLS, LASSO, ridge regression and elastic net to the training data and calculate the prediction error from the test data
· Simulate the data set 100 times and compare the median mean-squared errors for those models
41.
Exercise 2: Diabetes
· x: a matrix with 10 columns
· y: a numeric vector (442 rows)
· x2: a matrix with 64 columns
library(elasticnet)
data(diabetes)
colnames(diabetes)
# [1] "x"  "y"  "x2"
42.
Questions
· Fit LASSO and Elastic Net to the data with the optimal tuning parameter chosen by cross validation
· Compare the solution paths for the two methods