Summary: This presentation discusses estimation methods for generalized linear models (GLMs) and generalized partial linear models (GPLMs). GPLMs extend GLMs by adding a single nonparametric component to the linear predictor. Parameter estimation for GPLMs is performed by maximizing a penalized likelihood, whose penalty term controls the tradeoff between goodness of fit and smoothness of the nonparametric component; the resulting problem is solved by iterative algorithms such as Newton-Raphson.
On Foundations of Parameter Estimation for Generalized Partial Linear Models with B–Splines and Continuous Optimization
1. 5th International Summer School
Achievements and Applications of Contemporary Informatics,
Mathematics and Physics
National Technical University of Ukraine
Kiev, Ukraine, August 3-15, 2010
Gerhard-Wilhelm WEBER
Institute of Applied Mathematics, METU, Ankara, Turkey
Faculty of Economics, Business and Law, University of Siegen, Germany
Center for Research on Optimization and Control, University of Aveiro, Portugal
Universiti Teknologi Malaysia, Skudai, Malaysia
Pakize TAYLAN
Department of Mathematics, Dicle University, Diyarbakır, Turkey
Lian LIU
Roche Pharma Development Center in Asia Pacific, Shanghai, China
2. Outline
• Introduction
• Estimation for Generalized Linear Models
• Generalized Partial Linear Model (GPLM)
• Newton-Raphson and Scoring Methods
• Penalized Maximum Likelihood
• Penalized Iteratively Reweighted Least Squares (P-IRLS)
• An Alternative Solution for (P-IRLS) with CQP
• Solution Methods
• Linear Model + MARS, and Robust CMARS
• Conclusion
3. Introduction
The class of Generalized Linear Models (GLMs) has gained popularity as a statistical modeling tool.
This popularity is due to:
• the flexibility of GLMs in addressing a variety of statistical problems, and
• the availability of software (Stata, SAS, S-PLUS, R) to fit the models.
The class of GLMs extends traditional linear models by allowing:
• the mean of a dependent variable to depend on a linear predictor through a nonlinear link function, and
• the probability distribution of the response to be any member of an exponential family of distributions.
Many widely used statistical models belong to GLM:
o linear models with normal errors,
o logistic and probit models for binary data,
o log-linear models for multinomial data.
4. Introduction
Many other useful statistical models, such as those with
• Poisson, binomial,
• gamma or normal distributions,
can be formulated as GLMs by selecting an appropriate link function
and response probability distribution.
A GLM looks as follows:

H(\mu_i) = \eta_i = x_i^T \beta;

• \mu_i = E(Y_i): expected value of the response variable Y_i,
• H: smooth monotonic link function,
• x_i: observed values of the explanatory variables for the i-th case,
• \beta: vector of unknown parameters.
5. Introduction
• Assumptions: the Y_i are independent and can have any distribution from an exponential family density

Y_i \sim f_{Y_i}(y_i, \theta_i, \psi) = \exp\left( \frac{y_i \theta_i - b_i(\theta_i)}{a_i(\psi)} + c_i(y_i, \psi) \right) \quad (i = 1, 2, ..., n),

• a_i, b_i, c_i are arbitrary "scale" functions, and \theta_i is called a natural parameter.
• General expressions for the mean and variance of the dependent variable Y_i:

\mu_i = E(Y_i) = b_i'(\theta_i),
\mathrm{Var}(Y_i) = V(\mu_i)\, a_i(\psi), \quad V(\mu_i) := b_i''(\theta_i), \ a_i(\psi) := \psi / \omega_i.
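As a concrete (hypothetical) illustration of the exponential family form above, not from the slides: the Poisson distribution with rate \lambda fits it with \theta = \log\lambda, b(\theta) = e^{\theta}, a(\psi) = 1, c(y, \psi) = -\log(y!). A minimal Python sketch checks numerically that \mu = b'(\theta) and Var(Y) = b''(\theta) a(\psi) reproduce the familiar Poisson mean and variance:

```python
import math

# Poisson(lam) in exponential-family form:
#   f(y) = exp( (y*theta - b(theta)) / a(psi) + c(y, psi) )
# with theta = log(lam), b(theta) = exp(theta), a(psi) = 1, c(y) = -log(y!).
def b(theta):
    return math.exp(theta)

def db(theta, h=1e-6):           # numerical b'(theta)
    return (b(theta + h) - b(theta - h)) / (2 * h)

def d2b(theta, h=1e-4):          # numerical b''(theta)
    return (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2

lam = 3.5
theta = math.log(lam)
print(abs(db(theta) - lam) < 1e-4)    # E(Y)   = b'(theta)            = lam
print(abs(d2b(theta) - lam) < 1e-3)   # Var(Y) = b''(theta) * a(psi)  = lam
```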
6. Estimation for GLM
• Estimation and inference for GLMs are based on the theory of maximum likelihood estimation and on the least-squares approach:

l(\theta) := \sum_{i=1}^{n} \left( y_i \theta_i - b_i(\theta_i) + c_i(y_i, \psi) \right).

• The dependence of the right-hand side on \beta is solely through the dependence of the \theta_i on \beta.
• Score equations:

\sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \eta_i}\, x_{ij}\, V_i^{-1} (y_i - \mu_i) = 0, \quad \eta_i = \sum_{j=0}^{m} x_{ij} \beta_j, \ x_{i0} \equiv 1 \quad (i = 1, 2, ..., n; \ j = 0, 1, ..., m).

• The solution of the score equations is given by the Fisher scoring procedure, based on the Newton-Raphson algorithm.
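The Fisher scoring procedure for a concrete GLM can be sketched as follows (a minimal numpy example, assuming a Poisson response with canonical log link and synthetic data; not from the slides):

```python
import numpy as np

# Fisher scoring (equivalently, IRLS) for a Poisson GLM with canonical
# log link, on hypothetical synthetic data: mu_i = exp(eta_i), V_i = mu_i.
rng = np.random.default_rng(0)
n, m = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # x_{i0} = 1
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(m + 1)
for _ in range(25):
    mu = np.exp(X @ beta)                 # mu = H^{-1}(eta)
    score = X.T @ (y - mu)                # score vector (canonical link)
    info = X.T @ (mu[:, None] * X)        # expected (Fisher) information
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-12:
        break

print(np.max(np.abs(X.T @ (y - np.exp(X @ beta)))) < 1e-6)  # score ~ 0 at MLE
```

For the canonical link, observed and expected information coincide, so Fisher scoring and Newton-Raphson produce the same iterates here.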
7. Generalized Partial Linear Models (GPLMs)
• Particular semiparametric models are the Generalized Partial Linear Models (GPLMs): they extend GLMs in that the usual parametric terms are augmented by a single nonparametric component:

E(Y \mid X, T) = G\left( X^T \beta + \gamma(T) \right);

• \beta \in \mathbb{R}^m is a vector of parameters, and \gamma(\cdot) is a smooth function, which we try to estimate by splines.
• Assumption: the m-dimensional random vector X represents (typically discrete) covariates, and the q-dimensional random vector T consists of continuous covariates; (X, T) comes from a decomposition of the explanatory variables.
Other interpretations of T: role of the environment, expert opinions, Wiener processes, etc.
8. Newton-Raphson and Scoring Methods
The Newton-Raphson algorithm is based on a quadratic Taylor series approximation.
• An important statistical application of the Newton-Raphson algorithm is given by maximum likelihood estimation:

l(\theta, y) \approx l^a(\theta, y) := l(\theta^0, y) + \frac{\partial l(\theta^0, y)}{\partial \theta} (\theta - \theta^0) + \frac{1}{2} (\theta - \theta^0)^T \frac{\partial^2 l(\theta^0, y)}{\partial \theta \partial \theta^T} (\theta - \theta^0); \quad \theta^0: \text{starting value};

• l(\theta, y) = \log L(\theta, y): log-likelihood function of \theta, based on the observed data y = (y_1, y_2, ..., y_n)^T.
• Next, determine the new iterate \theta^1 from \partial l^a(\theta, y) / \partial \theta = 0:

\theta^1 := \theta^0 - C^{-1} r, \quad r := \frac{\partial l(\theta^0, y)}{\partial \theta}, \ C := \frac{\partial^2 l(\theta^0, y)}{\partial \theta \partial \theta^T}.

• Fisher's scoring method replaces C by the expectation E(C).
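A one-dimensional sketch of this iteration (hypothetical data, not from the slides): Newton-Raphson for the MLE of the rate \theta of an exponential distribution, where l(\theta) = n \log\theta - \theta \sum_i y_i and the closed-form MLE n / \sum_i y_i is available for comparison:

```python
# Newton-Raphson for the MLE of an exponential rate theta,
# l(theta) = n*log(theta) - theta*sum(y); hypothetical data.
y = [0.8, 1.3, 0.4, 2.1, 0.9]           # observed data
n, s = len(y), sum(y)

theta = 1.0                              # starting value theta^0
for _ in range(50):
    r = n / theta - s                    # r := dl/dtheta
    C = -n / theta**2                    # C := d^2 l / dtheta^2
    step = -r / C                        # theta^1 := theta^0 - C^{-1} r
    theta += step
    if abs(step) < 1e-12:
        break

print(abs(theta - n / s) < 1e-10)        # closed-form MLE is n / sum(y)
```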
9. Penalized Maximum Likelihood
• Penalized maximum likelihood criterion for GPLM:

j(\beta, \gamma) := l(\beta, \gamma; y) - \frac{1}{2} \mu \int_a^b (\gamma''(t))^2 \, dt.

• l: log-likelihood of the linear predictor; the second term penalizes the integrated squared curvature of \gamma over the given interval [a, b].
• \mu: smoothing parameter controlling the trade-off between accuracy of the data fitting and its smoothness (stability, robustness or regularity).
• Maximization of j(\beta, \gamma) is performed with B-splines through the local scoring algorithm. For this, we write a degree-k B-spline with knots at the values t_i (i = 1, 2, ..., n) for \gamma:

\gamma(t) = \sum_{j=1}^{v} \lambda_j B_{j,k}(t),

where the \lambda_j are coefficients and the B_{j,k} are degree-k B-spline basis functions.
10. Penalized Maximum Likelihood
• Degree-0 B-splines and the recursion for degree k are defined by

B_{j,0}(t) = \begin{cases} 1, & t_j \le t < t_{j+1} \\ 0, & \text{otherwise}, \end{cases}

B_{j,k}(t) = \frac{t - t_j}{t_{j+k} - t_j} B_{j,k-1}(t) + \frac{t_{j+k+1} - t}{t_{j+k+1} - t_{j+1}} B_{j+1,k-1}(t) \quad (k \ge 1).

• We write \gamma := (\gamma(t_1), ..., \gamma(t_n))^T and define an n \times v matrix B = (B_{ij}) := (B_j(t_i)); then,

\gamma = B\lambda, \quad \lambda = (\lambda_1, \lambda_2, ..., \lambda_v)^T.

• Further, define a v \times v matrix K by

K_{kl} := \int_a^b B_k''(t) B_l''(t) \, dt.
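The recursion above can be implemented directly (a sketch; the uniform knot vector and the partition-of-unity check are illustrative assumptions, not from the slides):

```python
import numpy as np

# Cox-de Boor recursion for B-spline basis functions, as on the slides;
# empty knot spans (zero denominators) are treated as contributing 0.
def bspline_basis(j, k, t, knots):
    if k == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    left = 0.0
    if knots[j + k] > knots[j]:
        left = (t - knots[j]) / (knots[j + k] - knots[j]) \
            * bspline_basis(j, k - 1, t, knots)
    right = 0.0
    if knots[j + k + 1] > knots[j + 1]:
        right = (knots[j + k + 1] - t) / (knots[j + k + 1] - knots[j + 1]) \
            * bspline_basis(j + 1, k - 1, t, knots)
    return left + right

# Example: cubic (k = 3) basis on a uniform knot vector; inside the
# interval [t_3, t_7] the basis functions sum to 1 (partition of unity).
knots = np.arange(11.0)                      # t_0, ..., t_10
for t in [3.0, 4.5, 5.2, 6.9]:
    total = sum(bspline_basis(j, 3, t, knots) for j in range(7))
    print(abs(total - 1.0) < 1e-12)
```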
11. Penalized Maximum Likelihood
• Then, the criterion j(\beta, \gamma) can be written as

j(\beta, \lambda) = l(\eta, y) - \frac{1}{2} \mu \lambda^T K \lambda.

• If we insert the least-squares estimate \hat\lambda = (B^T B)^{-1} B^T \gamma, we get

j(\beta, \gamma) = l(\eta, y) - \frac{1}{2} \mu \gamma^T \Lambda \gamma, \quad \text{where} \ \Lambda := B (B^T B)^{-1} K (B^T B)^{-1} B^T.

• Now, we will find \hat\beta and \hat\gamma that solve the optimization problem of maximizing j(\beta, \gamma).
• Let

H(\mu) = \eta(X, t) = g_1 + g_2; \quad g_1 := X\beta, \ g_2 := \gamma(t).
12. Penalized Maximum Likelihood
• To maximize j(\beta, \gamma) with respect to g_1 and g_2, we solve the following system of equations:

\frac{\partial j(\beta, \gamma)}{\partial g_1} = \left( \frac{\partial l(\eta, y)}{\partial g_1} \right)^T = 0,

\frac{\partial j(\beta, \gamma)}{\partial g_2} = \left( \frac{\partial l(\eta, y)}{\partial g_2} \right)^T - \mu \Lambda g_2 = 0,

which we treat by the Newton-Raphson method.
• These system equations are nonlinear in \beta and g_2. We linearize them around a current guess \eta^0 by

\frac{\partial l(\eta, y)}{\partial \eta} \approx \frac{\partial l(\eta^0, y)}{\partial \eta} + \frac{\partial^2 l(\eta^0, y)}{\partial \eta \partial \eta^T} (\eta - \eta^0).
13. Penalized Maximum Likelihood
• We use this equation in the system of equations:

\begin{pmatrix} C & C \\ C & C + \mu\Lambda \end{pmatrix} \begin{pmatrix} g_1^1 - g_1^0 \\ g_2^1 - g_2^0 \end{pmatrix} = \begin{pmatrix} r \\ r - \mu\Lambda g_2^0 \end{pmatrix}, \quad r := \frac{\partial l(\eta, y)}{\partial \eta}, \ C := -\frac{\partial^2 l(\eta, y)}{\partial \eta \partial \eta^T},

where (g_1^1, g_2^1) is a Newton-Raphson step from (g_1^0, g_2^0), and C and r are evaluated at \eta^0 = g_1^0 + g_2^0.
• More simple form:

(A*) \quad \begin{pmatrix} C & C \\ S_B & I \end{pmatrix} \begin{pmatrix} g_1^1 \\ g_2^1 \end{pmatrix} = \begin{pmatrix} C h \\ S_B h \end{pmatrix}, \quad h := \eta^0 + C^{-1} r, \ S_B := (C + \mu\Lambda)^{-1} C,

which can be resolved for

g_1^1 = X \left\{ X^T C (I - S_B) X \right\}^{-1} X^T C (I - S_B) h, \quad g_2^1 = S_B (h - g_1^1).
14. Penalized Maximum Likelihood
• \hat\beta and \hat\gamma can be found explicitly, without an inner backfitting loop:

\hat g_1 = X\hat\beta, \quad \hat\beta = \left\{ X^T C (I - S_B) X \right\}^{-1} X^T C (I - S_B) h,

\hat g_2 = S_B (h - X\hat\beta).

• Here, X represents the regression matrix for the input data x_i; S_B computes a weighted B-spline smoothing on the variable t_i, with weights given by C = -\partial^2 l(\eta, y) / \partial\eta \partial\eta^T; and h is the adjusted dependent variable.
15. Penalized Maximum Likelihood
• From the updated \hat\beta, the outer loop must be iterated to update \hat\eta and, hence, h and C; then, the loop is repeated until sufficient convergence is achieved. Step-size optimization is performed by \eta(\omega) = \omega\eta^1 + (1 - \omega)\eta^0, and we turn to maximizing j(\eta(\omega)).
• Standard results on the Newton-Raphson procedure ensure local convergence.
• Asymptotic properties of \hat\beta follow from

\hat\beta = R_B \left( \hat\eta + \hat C^{-1} \hat r \right) = R_B \hat h, \quad \hat r = \frac{\partial l(\hat\eta, y)}{\partial \eta},

where R_B is the weighted additive fit operator. If we replace h, R_B and C by their asymptotic versions h_0, R_{B0} and C_0, then we get the covariance matrix for \hat\beta.
16. Penalized Maximum Likelihood
\mathrm{Cov}(\hat\beta) = R_{B0} C_0^{-1} R_{B0}^T \approx R_B C^{-1} R_B^T \quad (\approx: \text{asymptotically})

and

\mathrm{Cov}(\hat g_s) = R_{Bs} C^{-1} R_{Bs}^T \quad (s = 1, 2).

• Here, h \approx h_0 has mean \eta and variance C_0^{-1} \approx C^{-1}, and R_{Bs} is the matrix that produces \hat g_s from h, based on B-splines.
• Furthermore, \hat\beta is asymptotically normally distributed with covariance matrix R_{B0} C_0^{-1} R_{B0}^T.
17. Penalized Iteratively Reweighted Least Squares (P-IRLS)
The penalized likelihood is maximized by penalized iteratively reweighted least squares: the (p+1)-st estimate \eta^{[p+1]} of the linear predictor is obtained by minimizing

(B*) \quad \left\| \sqrt{C^{[p]}} \left( h^{[p]} - \eta \right) \right\|^2 + \mu \int_a^b (\gamma''(t))^2 \, dt, \quad \eta_i = X_i^T \beta + \gamma(t_i), \ \mu_i^{[p]} = H^{-1}(\eta_i^{[p]}),

where h^{[p]} is the iteratively adjusted dependent variable, given by

h_i^{[p]} := \eta_i^{[p]} + H'(\mu_i^{[p]}) (y_i - \mu_i^{[p]});

here, H' represents the derivative of H with respect to \mu, and C^{[p]} is a diagonal weight matrix with entries

C_{ii}^{[p]} := 1 / \left( V(\mu_i^{[p]}) H'(\mu_i^{[p]})^2 \right),

where V(\mu_i^{[p]}) is proportional to the variance of Y_i according to the current estimate \mu_i^{[p]}.
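One plausible realization of the P-IRLS loop, sketched in numpy (hypothetical Poisson data with log link H = log, so H'(mu) = 1/mu and V(mu) = mu; a low-degree polynomial basis and an identity penalty stand in for the B-spline basis B and the matrix K):

```python
import numpy as np

# P-IRLS sketch: Poisson responses, log link (C_ii = mu_i), quadratic
# penalty mu_s * lam^T K lam on the spline coefficients; data hypothetical.
rng = np.random.default_rng(2)
n = 100
t = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # parametric part
B = np.column_stack([t, t**2, t**3])                    # crude basis stand-in
Z = np.column_stack([X, B])                             # design for (beta, lam)
y = rng.poisson(np.exp(0.4 + 0.3 * X[:, 1] + np.sin(2 * t)))

mu_s = 1.0
P = np.zeros((Z.shape[1], Z.shape[1]))
P[X.shape[1]:, X.shape[1]:] = mu_s * np.eye(B.shape[1]) # stand-in for mu_s * K

coef = np.zeros(Z.shape[1])
for _ in range(50):
    eta = Z @ coef
    m = np.exp(eta)                     # mu = H^{-1}(eta)
    h = eta + (y - m) / m               # adjusted dependent variable h^[p]
    C = m                               # diagonal weights C^[p]
    lhs = Z.T @ (C[:, None] * Z) + P    # penalized weighted normal equations
    rhs = Z.T @ (C * h)
    new = np.linalg.solve(lhs, rhs)
    step = np.max(np.abs(new - coef))
    coef = new
    if step < 1e-10:
        break

print(np.all(np.isfinite(coef)))
print(step < 1e-8)                      # the loop has converged
```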
18. Penalized Iteratively Reweighted Least Squares (P-IRLS)
• If we use \gamma = B\lambda in (B*), it becomes

\left\| \sqrt{C^{[p]}} \left( h^{[p]} - X\beta - B\lambda \right) \right\|^2 + \mu \lambda^T K \lambda.

• With Green and Yandell (1985), we suppose that K is of rank z \le v. Two matrices J and T can be formed such that

J^T K J = I, \quad T^T K T = 0 \quad \text{and} \quad J^T T = 0,

where J and T have v rows and full column ranks z and v - z, respectively. Then, rewriting \lambda as

(C*) \quad \lambda = J\delta + T\xi

with vectors \delta, \xi of dimensions z and v - z, respectively, (B*) becomes

\left\| \sqrt{C^{[p]}} \left( h^{[p]} - X\beta - BT\xi - BJ\delta \right) \right\|^2 + \mu \delta^T \delta.
19. Penalized Iteratively Reweighted Least Squares (P-IRLS)
• Using a Householder (QR) decomposition, the minimization can be split by separating the solution with respect to (\beta, \xi) from the one with respect to \delta:

(D*) \quad Q_1^T \sqrt{C^{[p]}} (X, BT) = R, \quad Q_2^T \sqrt{C^{[p]}} (X, BT) = 0,

where Q = (Q_1, Q_2) is orthogonal and R is nonsingular, upper triangular and of full rank m + v - z. Then, we get the bilevel minimization problem of

(E*_upper) \quad \left\| Q_1^T \sqrt{C^{[k]}} h^{[k]} - R (\beta^T, \xi^T)^T - Q_1^T \sqrt{C^{[k]}} BJ\delta \right\|^2 \quad \text{(upper level)}

with respect to (\beta, \xi), given \delta based on minimizing

(E*_lower) \quad \left\| Q_2^T \sqrt{C^{[k]}} h^{[k]} - Q_2^T \sqrt{C^{[k]}} BJ\delta \right\|^2 + \mu \delta^T \delta \quad \text{(lower level)}.
20. Penalized Iteratively Reweighted Least Squares (P-IRLS)
• The term E*_upper can be set to 0, since R is nonsingular.
• If we put

\upsilon := Q_2^T \sqrt{C^{[k]}} h^{[k]}, \quad V := Q_2^T \sqrt{C^{[k]}} BJ,

E*_lower becomes the problem of minimizing

\left\| \upsilon - V\delta \right\|^2 + \mu \left\| \delta \right\|^2,

which is a ridge regression problem. The solution is

\delta = (V^T V + \mu I)^{-1} V^T \upsilon.

The other parameters can be found as

(\beta^T, \xi^T)^T = R^{-1} Q_1^T \sqrt{C^{[k]}} \left( h^{[k]} - BJ\delta \right).

• Now, we can compute \lambda using (C*) and, finally,

\eta^{[p+1]} = X\beta + B\lambda.
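The ridge step can be sketched as follows (hypothetical \upsilon and V; the closed form is checked against the equivalent augmented least-squares system):

```python
import numpy as np

rng = np.random.default_rng(3)
z_dim, nrow, mu = 4, 12, 0.7
V = rng.normal(size=(nrow, z_dim))
ups = rng.normal(size=nrow)

# Ridge solution  delta = (V^T V + mu*I)^{-1} V^T upsilon
delta = np.linalg.solve(V.T @ V + mu * np.eye(z_dim), V.T @ ups)

# Same minimizer via the augmented least-squares system
#   min || (upsilon; 0) - (V; sqrt(mu)*I) delta ||^2
A = np.vstack([V, np.sqrt(mu) * np.eye(z_dim)])
b = np.concatenate([ups, np.zeros(z_dim)])
delta_ls = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(delta, delta_ls))
```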
21. An Alternative Solution for (P-IRLS) with CQP
• Both the penalized maximum likelihood and the P-IRLS method contain a smoothing parameter \mu. This parameter can be estimated by
o generalized cross validation (GCV),
o minimization of an unbiased risk estimator (UBRE).
• A different method to solve P-IRLS is conic quadratic programming (CQP). Using a Cholesky decomposition of the v \times v matrix K in (B*), K = U^T U, (B*) becomes

(F*) \quad \left\| W (\beta^T, \lambda^T)^T - v \right\|^2 + \mu \left\| U\lambda \right\|^2, \quad W := \sqrt{C^{[p]}} (X, B), \ v := \sqrt{C^{[p]}} h^{[p]}.

• The regression problem (F*) can be reinterpreted as

(H*) \quad \min_{\beta, \lambda} G(\beta, \lambda), \quad G(\beta, \lambda) := \left\| W (\beta^T, \lambda^T)^T - v \right\|^2,

subject to g(\lambda) \le 0, where g(\lambda) := \left\| U\lambda \right\|^2 - M, \ M \ge 0.
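The connection between the penalized problem (F*) and the constrained problem (H*) can be checked numerically: for a fixed \mu, the penalized minimizer satisfies the stationarity condition of (H*) with Lagrange multiplier \mu (all matrices below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, v_dim, mu = 30, 2, 5, 0.8
W = rng.normal(size=(n, m + v_dim))
vvec = rng.normal(size=n)
U = np.triu(rng.normal(size=(v_dim, v_dim))) + 3 * np.eye(v_dim)
K = U.T @ U                                    # K = U^T U (Cholesky form)

P = np.zeros((m + v_dim, m + v_dim))
P[m:, m:] = K                                  # ||U lam||^2 = z^T P z
zhat = np.linalg.solve(W.T @ W + mu * P, W.T @ vvec)   # penalized minimizer

grad_G = 2 * W.T @ (W @ zhat - vvec)           # gradient of G at zhat
grad_g = 2 * P @ zhat                          # gradient of ||U lam||^2
print(np.allclose(grad_G + mu * grad_g, 0, atol=1e-8))  # stationarity holds
```

With M := \|U\hat\lambda\|^2, the constraint is active and \mu plays the role of its multiplier.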
22. An Alternative Solution for (P-IRLS) with CQP
• Then, our optimization problem (H*) is equivalent to

\min_{t, \beta, \lambda} t, \quad \text{subject to} \quad \left\| W (\beta^T, \lambda^T)^T - v \right\|^2 \le t^2, \ t \ge 0, \ \left\| U\lambda \right\|^2 \le M.

Here, W and U are n \times (m+v) and v \times v matrices, and (\beta^T, \lambda^T)^T and v are (m+v)- and n-vectors, respectively.
• This means:

(I*) \quad \min_{t, \beta, \lambda} t, \quad \text{where} \quad \left\| W (\beta^T, \lambda^T)^T - v \right\| \le t, \ \left\| U\lambda \right\| \le \sqrt{M}.
23. An Alternative Solution for (P-IRLS) with CQP
• A conic quadratic programming (CQP) problem has the form

\min_x c^T x, \quad \text{where} \quad \left\| D_i x - d_i \right\| \le p_i^T x - q_i \quad (i = 1, 2, ..., k);

our problem is a CQP with

c = (1, 0_{m+v}^T)^T, \ x = (t, \beta^T, \lambda^T)^T, \ D_1 = (0_n, W), \ d_1 = v, \ p_1 = (1, 0, ..., 0)^T, \ q_1 = 0,
D_2 = (0_{v \times (m+1)}, U), \ d_2 = 0_v, \ p_2 = 0_{m+v+1}, \ q_2 = -\sqrt{M}; \ k = 2.

• We first reformulate (I*) as a primal problem:

\min t, \quad \text{such that}

\chi := \begin{pmatrix} 0_n & W \\ 1 & 0_{m+v}^T \end{pmatrix} \begin{pmatrix} t \\ \beta \\ \lambda \end{pmatrix} + \begin{pmatrix} -v \\ 0 \end{pmatrix}, \quad \eta := \begin{pmatrix} 0_{v \times (m+1)} & U \\ 0 & 0_{m+v}^T \end{pmatrix} \begin{pmatrix} t \\ \beta \\ \lambda \end{pmatrix} + \begin{pmatrix} 0_v \\ \sqrt{M} \end{pmatrix},

\chi \in L^{n+1}, \ \eta \in L^{v+1},
24. An Alternative Solution for (P-IRLS) with CQP
with ice-cream (or second-order, or Lorentz) cones

L^{l+1} := \left\{ x = (x_1, ..., x_{l+1})^T \in \mathbb{R}^{l+1} \mid x_{l+1} \ge \sqrt{x_1^2 + x_2^2 + ... + x_l^2} \right\}.

• The corresponding dual problem is

\max \ (v^T, 0)\,\omega_1 + (0_v^T, -\sqrt{M})\,\omega_2

\text{such that} \quad \begin{pmatrix} 0_n^T & 1 \\ W^T & 0_{m+v} \end{pmatrix} \omega_1 + \begin{pmatrix} 0_{(m+1) \times v} & 0_{m+1} \\ U^T & 0_v \end{pmatrix} \omega_2 = \begin{pmatrix} 1 \\ 0_{m+v} \end{pmatrix},

\omega_1 \in L^{n+1}, \ \omega_2 \in L^{v+1}.
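A short numpy sketch building the CQP data blocks D_1, d_1, p_1, q_1 above and checking, at a feasible trial point, that the cone constraint reproduces \| W(\beta^T, \lambda^T)^T - v \| \le t (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, v_dim = 8, 2, 3
W = rng.normal(size=(n, m + v_dim))
vvec = rng.normal(size=n)

# CQP data for the first cone constraint
D1 = np.column_stack([np.zeros(n), W])        # D_1 = (0_n, W)
d1 = vvec                                     # d_1 = v
p1 = np.eye(1 + m + v_dim)[0]                 # p_1 = (1, 0, ..., 0)^T
q1 = 0.0

z = rng.normal(size=m + v_dim)                # a trial (beta, lambda)
t = np.linalg.norm(W @ z - vvec) + 0.1        # a feasible t
x = np.concatenate([[t], z])                  # x = (t, beta, lambda)

lhs = np.linalg.norm(D1 @ x - d1)             # ||D_1 x - d_1||
rhs = p1 @ x - q1                             # p_1^T x - q_1 = t
print(np.isclose(lhs, np.linalg.norm(W @ z - vvec)))
print(lhs <= rhs)
```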
25. Solution Methods
• Polynomial-time algorithms are requested.
– Usually, only local information on the objective and the constraints is given.
– Such an algorithm cannot utilize a priori knowledge of the problem's structure.
– CQPs belong to the well-structured convex problems.
• Interior point algorithms:
– exploit the structure of the problem,
– yield better complexity bounds,
– exhibit much better practical performance.
26. Outlook
An important new class of GPLMs:

E(Y \mid X, T) = G\left( X^T \beta + \gamma(T) \right), \quad \text{e.g.,}

GPLM(X, T) = LM(X) + MARS(T)

(Figure: the MARS truncated-power basis functions c^-(x, \tau) = [-(x - \tau)]_+ and c^+(x, \tau) = [+(x - \tau)]_+, together with a CMARS illustration.)
28. References
[1] Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic
Press, 2004.
[2] Craven, P., and Wahba, G., Smoothing noisy data with spline functions, Numer. Math. 31
(1979), 377-403.
[3] De Boor, C., Practical Guide to Splines, Springer Verlag, 2001.
[4] Dongarra, J.J., Bunch, J.R., Moler, C.B., and Stewart, G.W., Linpack User’s Guide, Philadelphia,
SIAM, 1979.
[5] Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics 19, 1
(1991), 1-141.
[6] Green, P.J., and Yandell, B.S., Semi-Parametric Generalized Linear Models, Lecture Notes in
Statistics, 32 (1985).
[7] Hastie, T.J., and Tibshirani, R.J., Generalized Additive Models, New York, Chapman and Hall,
1990.
[8] Kincaid, D., and Cheney, W., Numerical Analysis: Mathematics of Scientific Computing, Pacific
Grove, 2002.
[9] Müller, M., Estimation and testing in generalized partial linear models – a comparative study,
Statistics and Computing 11 (2001), 299-309.
[10] Nelder, J.A., and Wedderburn, R.W.M., Generalized linear models, Journal of the Royal Statistical
Society A 135 (1972), 370-384.
[11] Nemirovski, A., Lectures on modern convex optimization, Israel Institute of Technology
http://iew3.technion.ac.il/Labs/Opt/opt/LN/Final.pdf.
29. References
[12] Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming,
SIAM, 1993.
[13] Ortega, J.M., and Rheinboldt, W.C., Iterative Solution of Nonlinear Equations in Several
Variables, Academic Press, New York, 1970.
[14] Renegar, J., Mathematical View of Interior Point Methods in Convex Programming, SIAM,
2000.
[15] Scheid, F., Numerical Analysis, McGraw-Hill Book Company, New York, 1968.
[16] Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized
additive and continuous optimization for modern applications in finance, science and
technology, Optimization 56, 5-6 (2007), pp. 1-24.
[17] Taylan, P., Weber, G.-W., and Liu, L., On foundations of parameter estimation for
generalized partial linear models with B-splines and continuous optimization, in the
proceedings of PCO 2010, 3rd Global Conference on Power Control and Optimization,
February 2-4, 2010, Gold Coast, Queensland, Australia.
[18] Weber, G.-W., Akteke-Öztürk, B., İşcanoğlu, A., Özöğür, S., and Taylan, P., Data Mining:
Clustering, Classification and Regression, four lectures given at the Graduate Summer
School on New Advances in Statistics, Middle East Technical University, Ankara, Turkey,
August 11-24, 2007 (http://www.statsummer.com/).
[19] Wood, S.N., Generalized Additive Models, An Introduction with R, New York, Chapman
and Hall, 2006.
30. Thank you very much for your attention!
http://www3.iam.metu.edu.tr/iam/images/7/73/Willi-CV.pdf