Ridge regression, lasso and elastic net
1. 2/13/2014
Ridge Regression, LASSO and Elastic Net
A talk given at the NYC Open Data meetup; find more at www.nycopendata.com
Yunting Sun
Google Inc
2.
Overview
· Linear Regression
· Ordinary Least Square
· Ridge Regression
· LASSO
· Elastic Net
· Examples
· Exercises
Note: make sure that you have installed the elasticnet package
library(MASS)
library(elasticnet)
3.
Linear Regression
· n observations, each with one response variable and p predictors:
Y = (y_1, ..., y_n) is n × 1, X = (X_1, ..., X_p) is n × p, and x = (x_1, ..., x_p)
· We want to find a linear combination of the predictors
- use y = x^T β to predict y
- describe the actual relationship between x and y
· Examples
- find the relationship between pressure and the boiling point of water
- use GDP to predict the interest rate (the accuracy of the prediction is important but the actual relationship may not matter)
4.
Quality of an estimator
· Suppose y = x^T β + ε with ε ~ N(0, 1), where β is the true value and β̂ is an estimator.
· Prediction error at x_0, the difference between the actual response and the model prediction:
EPE(x_0) = E[(y − x_0^T β̂)^2 | x = x_0]
EPE(x_0) = 1 + [Bias(x_0^T β̂)]^2 + Var(x_0^T β̂)
where Bias(x_0^T β̂) = E(x_0^T β̂) − x_0^T β.
· The second and third terms make up the mean squared error of x_0^T β̂ in estimating x_0^T β.
· How to estimate prediction error?
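As a sanity check, the decomposition follows by adding and subtracting E(x_0^T β̂) inside the square (a short sketch, using ε ~ N(0, 1) independent of β̂):

```latex
\begin{aligned}
\mathrm{EPE}(x_0) &= E\big[(y - x_0^T\hat{\beta})^2 \mid x = x_0\big]
  = E\big[(\varepsilon + x_0^T\beta - x_0^T\hat{\beta})^2\big] \\
&= E[\varepsilon^2] + E\big[(x_0^T\beta - x_0^T\hat{\beta})^2\big]
  \qquad \text{(the cross term has mean zero)} \\
&= 1 + \big(E(x_0^T\hat{\beta}) - x_0^T\beta\big)^2 + \mathrm{Var}(x_0^T\hat{\beta})
  = 1 + \mathrm{Bias}^2(x_0^T\hat{\beta}) + \mathrm{Var}(x_0^T\hat{\beta})
\end{aligned}
```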
5.
K-fold Cross Validation
· Split dataset into K groups
- leave one group out as test set
- use the rest K-1 groups as training set to train the model
- estimate prediction error of the model from the test set
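The steps above can be sketched in a few lines of base R; the data and model here (a toy lm fit) are illustrative stand-ins, not part of the original example:

```r
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # assign each observation to one of K groups
cv.err <- rep(0, K)
for (k in 1:K) {
  train <- d[fold != k, ]                 # the other K-1 groups form the training set
  test  <- d[fold == k, ]                 # the held-out group is the test set
  fit <- lm(y ~ x, data = train)
  cv.err[k] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv.err)                              # estimated prediction error
```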
6.
K-fold Cross Validation
Let E_i be the prediction error for the i-th test group; the average prediction error is
E = (1/K) Σ_{i=1}^{K} E_i
7.
Quality of an estimator
· Mean squared error of the estimator β̂:
MSE(β̂) = E[(β̂ − β)^2] = Bias^2(β̂) + Var(β̂)
· A biased estimator may achieve smaller MSE than an unbiased estimator
· MSE is useful when our goal is to understand the relationship instead of prediction
8.
Least Squares Estimator (LSE)
Y = Xβ + ε, with Y n × 1, X n × p, β p × 1, ε n × 1:
y_i = Σ_{j=1}^{p} x_ij β_j + ε_i,  i = 1, ..., n,  ε_i i.i.d. N(0, 1)
Minimize the Residual Sum of Squares (RSS):
β̂ = argmin_β (Y − Xβ)^T (Y − Xβ) = (X^T X)^{-1} X^T Y
The solution is uniquely well defined when X^T X is invertible.
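For completeness, the closed form comes from setting the gradient of the RSS to zero (the standard normal-equations derivation):

```latex
\mathrm{RSS}(\beta) = (Y - X\beta)^T (Y - X\beta), \qquad
\frac{\partial\,\mathrm{RSS}}{\partial \beta} = -2\,X^T (Y - X\beta) = 0
\;\Longrightarrow\; X^T X \hat{\beta} = X^T Y
\;\Longrightarrow\; \hat{\beta} = (X^T X)^{-1} X^T Y .
```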
9.
Pros
· E(β̂) = β: unbiased
· LSE has the minimum MSE among unbiased linear estimators, though a biased estimator may have smaller MSE than the LSE
· explicit form
· O(np^2) computation
· confidence intervals, significance of coefficients
10.
Cons
· Var(β̂) = σ^2 (X^T X)^{-1}
· Multicollinearity leads to high variance of the estimator
- exact or approximate linear relationship among predictors
- (X^T X)^{-1} tends to have large entries
· Requires n > p, i.e., the number of observations larger than the number of predictors
· Estimated prediction error: E_{x_0} EPE(x_0) = σ^2 (p/n) + σ^2
· Prediction error increases linearly as a function of p
· Hard to interpret when the number of predictors is large; we need a smaller subset that exhibits the strongest effects
11.
Example: Leukemia classification
· Leukemia data, Golub et al., Science, 1999
· There are 38 training samples and 34 test samples with a total of p = 7129 genes (p >> n)
· X_ij is the gene expression value for sample i and gene j
· Sample i either has tumor type AML or ALL
· We want to select genes relevant to tumor type
- eliminate the trivial genes
- grouped selection, as many genes are highly correlated
· LSE does not work here!
12.
Solution: regularization
· Instead of minimizing RSS, minimize (RSS + λ × penalty on the parameters)
· Trade bias for smaller variance: a biased estimator when λ ≠ 0
· Continuous variable selection (unlike AIC, BIC, subset selection)
· λ can be chosen by cross validation
13.
Ridge Regression
β̂_ridge = argmin_β { ||Y − Xβ||_2^2 + λ ||β||_2^2 }
β̂_ridge = (X^T X + λI)^{-1} X^T Y
Pros:
· works when p >> n
· handles multicollinearity
· biased, but smaller variance and smaller MSE (Mean Squared Error)
· explicit solution
Cons:
· shrinks coefficients toward zero but cannot produce a parsimonious model
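The explicit solution follows the same way as for OLS, with the penalty adding λI to the normal equations (which is also why the inverse exists even when X^T X is singular, e.g. when p > n):

```latex
\frac{\partial}{\partial \beta}\Big[(Y - X\beta)^T (Y - X\beta) + \lambda\, \beta^T \beta\Big]
 = -2\,X^T (Y - X\beta) + 2\lambda\beta = 0
\;\Longrightarrow\; \hat{\beta}_{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T Y .
```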
14.
Grouped Selection
· if two predictors are highly correlated among themselves, their estimated coefficients will be similar
· if some variables are exactly identical, they will have the same coefficients
Ridge is good for grouped selection but not good for eliminating trivial genes
15.
Example: Ridge Regression (Collinearity)
· multicollinearity: x3 = x1 + x2
· show that ridge regression beats OLS in the multicollinearity case
library(MASS)
n = 500
z = rnorm(n, 0, 1)
y = z + 0.2 * rnorm(n, 0, 1)
x1 = z + rnorm(n, 0, 1)
x2 = z + rnorm(n, 0, 1)
x3 = x1 + x2
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
16.
OLS
# OLS fails to calculate the coefficient for x3
ols.model = lm(y ~ . - 1, d)
coef(ols.model)
#     x1     x2     x3
# 0.3053 0.3187     NA
17.
Ridge Regression
# choose tuning parameter
ridge.model = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.model$lambda[which.min(ridge.model$GCV)]
# ridge regression (shrink coefficients)
coef(lm.ridge(y ~ . - 1, d, lambda = lambda.opt))
#     x1     x2     x3
# 0.1711 0.1902 0.1528
18.
Approximately multicollinear
· show that ridge regression corrects the coefficient signs and reduces the mean squared error
x3 = x1 + x2 + 0.05 * rnorm(n, 0, 1)
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
d.train = d[1:400, ]
d.test = d[401:500, ]
19.
OLS
ols.train = lm(y ~ . - 1, d.train)
coef(ols.train)
#      x1      x2     x3
# -0.3764 -0.3522 0.6839
# prediction errors
sum((d.test$y - predict(ols.train, newdata = d.test))^2)
# [1] 37.53
20.
Ridge Regression
# choose tuning parameter for ridge regression
ridge.train = lm.ridge(y ~ . - 1, d.train, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.train$lambda[which.min(ridge.train$GCV)]
ridge.model = lm.ridge(y ~ . - 1, d.train, lambda = lambda.opt)
coef(ridge.model)  # correct signs
#     x1     x2     x3
# 0.1713 0.1936 0.1430
coefs = coef(ridge.model)
sum((d.test$y - as.matrix(d.test[, -1]) %*% matrix(coefs, 3, 1))^2)
# [1] 36.87
21.
LASSO
β̂_lasso = argmin_β { ||Y − Xβ||_2^2 + λ ||β||_1 }
Or equivalently
min_β ||Y − Xβ||_2^2  s.t.  Σ_{j=1}^{p} |β_j| ≤ t
Pros
· allows p >> n
· enforces sparsity in the parameters
· as λ goes to 0 (t goes to ∞), β̂ goes to the OLS solution
· as λ goes to ∞ (t goes to 0), β̂ = 0
· quadratic programming problem; the LARS solution requires O(np^2)
22.
Cons
· if a group of predictors are highly correlated among themselves, LASSO tends to pick only one of them and shrink the others to zero
· cannot do grouped selection; tends to select one variable per group
LASSO is good for eliminating trivial genes but not good for grouped selection
23.
LARS algorithm of Efron et al. (2004)
· stepwise variable selection (least angle regression and shrinkage)
· a less greedy version of traditional forward selection methods
· solves the entire LASSO solution path efficiently
· O(np^2): the same order of computational effort as a single OLS fit
24.
LARS Path
min_β ||Y − Xβ||_2^2  s.t.  ||β||_1 ≤ s ||β̂_OLS||_1,  s ∈ [0, 1]
25.
parsimonious model
library(MASS)
n = 20
# beta is sparse
beta = matrix(c(3, 1.5, 0, 0, 2, 0, 0, 0), 8, 1)
p = length(beta)
rho = 0.3
corr = matrix(0, p, p)
for (i in seq(p)) {
  for (j in seq(p)) {
    corr[i, j] = rho^abs(i - j)
  }
}
X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
y = X %*% beta + 3 * rnorm(n, 0, 1)
d = as.data.frame(cbind(y, X))
colnames(d) = c("y", paste0("x", seq(p)))
26.
OLS
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
  X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
  y = X %*% beta + 3 * rnorm(n, 0, 1)
  d = as.data.frame(cbind(y, X))
  colnames(d) = c("y", paste0("x", seq(p)))
  # fit OLS without intercept
  ols.model = lm(y ~ . - 1, d)
  mse[i] = sum((coef(ols.model) - beta)^2)
}
median(mse)
# [1] 6.32
27.
Ridge Regression
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
  X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
  y = X %*% beta + 3 * rnorm(n, 0, 1)
  d = as.data.frame(cbind(y, X))
  colnames(d) = c("y", paste0("x", seq(p)))
  ridge.cv = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
  lambda.opt = ridge.cv$lambda[which.min(ridge.cv$GCV)]
  # fit ridge regression without intercept
  ridge.model = lm.ridge(y ~ . - 1, d, lambda = lambda.opt)
  mse[i] = sum((coef(ridge.model) - beta)^2)
}
median(mse)
# [1] 4.074
28.
LASSO
library(elasticnet)
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
  X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
  y = X %*% beta + 3 * rnorm(n, 0, 1)
  d = as.data.frame(cbind(y, X))
  colnames(d) = c("y", paste0("x", seq(p)))
  obj.cv = cv.enet(X, y, lambda = 0, s = seq(0.1, 1, length = 100), plot.it = FALSE,
                   mode = "fraction", trace = FALSE, max.steps = 80)
  s.opt = obj.cv$s[which.min(obj.cv$cv)]
  lasso.model = enet(X, y, lambda = 0, intercept = FALSE)
  coefs = predict(lasso.model, s = s.opt, type = "coefficients", mode = "fraction")
  mse[i] = sum((coefs$coefficients - beta)^2)
}
median(mse)
# [1] 3.393
29.
Elastic Net
β̂_enet = argmin_β { (Y − Xβ)^T (Y − Xβ) + λ_2 ||β||_2^2 + λ_1 ||β||_1 }
Pros
· enforces sparsity
· no limitation on the number of selected variables
· encourages a grouping effect in the presence of highly correlated predictors
Cons
· the naive elastic net suffers from double shrinkage
Correction
β̂_enet = (1 + λ_2) β̂_naive
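One way to see why LASSO machinery (e.g. LARS) applies directly here, following the elastic net paper: the naive elastic net is equivalent to a LASSO problem on augmented data. A sketch of the augmentation:

```latex
X^* = \frac{1}{\sqrt{1+\lambda_2}}
\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I_p \end{pmatrix},
\qquad
Y^* = \begin{pmatrix} Y \\ 0 \end{pmatrix}
```

Solving the LASSO on (X*, Y*) with penalty λ₁/√(1 + λ₂) recovers the naive elastic net solution up to scaling; the ridge part of the penalty is absorbed into the extra √λ₂ I_p rows.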
30.
LASSO vs Elastic Net
Construct a data set with grouped effects to show that Elastic Net outperforms LASSO in grouped selection.
· Two independent "hidden" factors z_1 and z_2
· response y = z_1 + 0.1 z_2 + N(0, 1)
· 6 predictors X = (x_1, x_2, ..., x_6) fall into two groups: x_1, x_2, x_3 relate to z_1 (the dominant factor); x_4, x_5, x_6 relate to z_2 (the minor factor), which we would like to shrink to zero
Correlated grouped covariates:
x_1 = z_1 + ε_1,  x_2 = z_1 + ε_2,  x_3 = z_1 + ε_3
x_4 = z_2 + ε_4,  x_5 = z_2 + ε_5,  x_6 = z_2 + ε_6
31.
Simulated data
N = 100
z1 = runif(N, min = 0, max = 20)
z2 = runif(N, min = 0, max = 20)
y = z1 + 0.1 * z2 + rnorm(N)
X = cbind(z1 %*% matrix(c(1, -1, 1), 1, 3), z2 %*% matrix(c(1, -1, 1), 1, 3))
X = X + matrix(rnorm(N * 6), N, 6)
32.
LASSO path
library(elasticnet)
obj.lasso = enet(X, y, lambda = 0)
plot(obj.lasso, use.color = TRUE)
33.
Elastic Net
library(elasticnet)
obj.enet = enet(X, y, lambda = 0.5)
plot(obj.enet, use.color = TRUE)
34.
How to choose the tuning parameters
For each λ in a sequence, find the s that minimizes the CV prediction error; then pick the λ (with its s) that minimizes the CV prediction error.
library(elasticnet)
obj.cv = cv.enet(X, y, lambda = 0.5, s = seq(0, 1, length = 100), mode = "fraction",
                 trace = FALSE, max.steps = 80)
35.
Prostate Cancer Example
· Predictors are eight clinical measures
· Training set with 67 observations
· Test set with 30 observations
· Model fitting and tuning parameter selection by tenfold CV on the training set
· Compare model performance by prediction mean-squared error on the test data
36.
Compare models
· medium correlation among predictors; the highest correlation is 0.76
· elastic net beats LASSO, and ridge regression beats OLS
37.
Summary
· Ridge Regression
- good for multicollinearity and grouped selection
- not good for variable selection
· LASSO
- good for variable selection
- not good for grouped selection of strongly correlated predictors
· Elastic Net
- combines the strengths of Ridge Regression and LASSO
· Regularization
- trades bias for variance reduction
- better prediction accuracy
38.
Reference
Most of the material covered in these slides is adapted from
· Paper: Zou and Hastie, "Regularization and variable selection via the elastic net"
· Slides: http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf
· The Elements of Statistical Learning
40.
Questions:
· Fit OLS, LASSO, ridge regression and elastic net to the training data and calculate the prediction error from the test data
· Simulate the data set 100 times and compare the median mean-squared errors for those models
41.
Exercise 2: Diabetes
· x: a matrix with 10 columns
· y: a numeric vector (442 rows)
· x2: a matrix with 64 columns
library(elasticnet)
data(diabetes)
colnames(diabetes)
# [1] "x"  "y"  "x2"
42.
Questions
· Fit LASSO and Elastic Net to the data with the optimal tuning parameter chosen by cross validation
· Compare the solution paths for the two methods