Summary: This presentation discusses estimation methods for generalized linear models (GLMs) and generalized partial linear models (GPLMs). GPLMs extend GLMs by adding a single nonparametric component to the linear predictor. Parameter estimation for GPLMs is performed by maximizing a penalized likelihood, whose penalty term controls the tradeoff between goodness of fit and smoothness of the nonparametric component; the resulting problem is solved by iterative algorithms such as Newton-Raphson.
On Foundations of Parameter Estimation for Generalized Partial Linear Models with B–Splines and Continuous Optimization
1. 5th International Summer School
Achievements and Applications of Contemporary Informatics,
Mathematics and Physics
National Technical University of Ukraine
Kiev, Ukraine, August 3-15, 2010
Gerhard-Wilhelm WEBER
Institute of Applied Mathematics, METU, Ankara, Turkey
Faculty of Economics, Business and Law, University of Siegen, Germany
Center for Research on Optimization and Control, University of Aveiro, Portugal
Universiti Teknologi Malaysia, Skudai, Malaysia
Pakize TAYLAN
Department of Mathematics, Dicle University, Diyarbakır, Turkey
Lian LIU
Roche Pharma Development Center in Asia Pacific, Shanghai, China
2. Outline
• Introduction
• Estimation for Generalized Linear Models
• Generalized Partial Linear Model (GPLM)
• Newton-Raphson and Scoring Methods
• Penalized Maximum Likelihood
• Penalized Iteratively Reweighted Least Squares (P-IRLS)
• An Alternative Solution for (P-IRLS) with CQP
• Solution Methods
• Linear Model + MARS, and Robust CMARS
• Conclusion
3. Introduction
The class of Generalized Linear Models (GLMs) has gained popularity as a statistical modeling tool.
This popularity is due to:
• the flexibility of GLMs in addressing a variety of statistical problems, and
• the availability of software (Stata, SAS, S-PLUS, R) to fit the models.
The class of GLMs extends traditional linear models by allowing:
• the mean of a dependent variable to depend on a linear predictor through a nonlinear link function, and
• the probability distribution of the response to be any member of an exponential family of distributions.
Many widely used statistical models belong to GLM:
o linear models with normal errors,
o logistic and probit models for binary data,
o log-linear models for multinomial data.
4. Introduction
Many other useful statistical models, such as those with
• Poisson, binomial,
• gamma or normal distributions,
can be formulated as GLMs by selecting an appropriate link function
and response probability distribution.
A GLM looks as follows:

H(\mu_i) = \eta_i = x_i^T \beta;

• \mu_i = E(Y_i): expected value of the response variable Y_i,
• H: smooth monotonic link function,
• x_i: observed values of the explanatory variables for the i-th case,
• \beta: vector of unknown parameters.
5. Introduction
• Assumptions: the Y_i are independent and can have any distribution from an exponential family density

Y_i \sim f_{Y_i}(y_i, \theta_i, \psi) = \exp\left( \frac{y_i \theta_i - b_i(\theta_i)}{a_i(\psi)} + c_i(y_i, \psi) \right) \quad (i = 1, 2, ..., n),

• a_i, b_i, c_i are arbitrary "scale" functions, and \theta_i is called a natural parameter.
• General expressions for the mean and variance of the dependent variable Y_i:

\mu_i = E(Y_i) = b_i'(\theta_i),
\mathrm{Var}(Y_i) = V(\mu_i)\, a_i(\psi), \quad V(\mu_i) := b_i''(\theta_i), \ a_i(\psi) := \psi / \omega_i.
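As a concrete (hypothetical) illustration of the exponential family form above, not from the slides: the Poisson distribution with rate \lambda fits it with \theta = \log\lambda, b(\theta) = e^{\theta}, a(\psi) = 1, c(y, \psi) = -\log(y!). A minimal Python sketch checks numerically that \mu = b'(\theta) and Var(Y) = b''(\theta) a(\psi) reproduce the familiar Poisson mean and variance:

```python
import math

# Poisson(lam) in exponential-family form:
#   f(y) = exp( (y*theta - b(theta)) / a(psi) + c(y, psi) )
# with theta = log(lam), b(theta) = exp(theta), a(psi) = 1, c(y) = -log(y!).
def b(theta):
    return math.exp(theta)

def db(theta, h=1e-6):           # numerical b'(theta)
    return (b(theta + h) - b(theta - h)) / (2 * h)

def d2b(theta, h=1e-4):          # numerical b''(theta)
    return (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2

lam = 3.5
theta = math.log(lam)
print(abs(db(theta) - lam) < 1e-4)    # E(Y)   = b'(theta)            = lam
print(abs(d2b(theta) - lam) < 1e-3)   # Var(Y) = b''(theta) * a(psi)  = lam
```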
6. Estimation for GLM
• Estimation and inference for GLMs are based on the theory of maximum likelihood estimation and on the least-squares approach:

l(\theta) := \sum_{i=1}^{n} \left( y_i \theta_i - b_i(\theta_i) + c_i(y_i, \psi) \right).

• The dependence of the right-hand side on \beta is solely through the dependence of the \theta_i on \beta.
• Score equations:

\sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \eta_i}\, x_{ij}\, V_i^{-1} (y_i - \mu_i) = 0, \quad \eta_i = \sum_{j=0}^{m} x_{ij} \beta_j, \ x_{i0} \equiv 1 \quad (i = 1, 2, ..., n; \ j = 0, 1, ..., m).

• The solution of the score equations is given by the Fisher scoring procedure, based on the Newton-Raphson algorithm.
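The Fisher scoring procedure for a concrete GLM can be sketched as follows (a minimal numpy example, assuming a Poisson response with canonical log link and synthetic data; not from the slides):

```python
import numpy as np

# Fisher scoring (equivalently, IRLS) for a Poisson GLM with canonical
# log link, on hypothetical synthetic data: mu_i = exp(eta_i), V_i = mu_i.
rng = np.random.default_rng(0)
n, m = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # x_{i0} = 1
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(m + 1)
for _ in range(25):
    mu = np.exp(X @ beta)                 # mu = H^{-1}(eta)
    score = X.T @ (y - mu)                # score vector (canonical link)
    info = X.T @ (mu[:, None] * X)        # expected (Fisher) information
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-12:
        break

print(np.max(np.abs(X.T @ (y - np.exp(X @ beta)))) < 1e-6)  # score ~ 0 at MLE
```

For the canonical link, observed and expected information coincide, so Fisher scoring and Newton-Raphson produce the same iterates here.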
7. Generalized Partial Linear Models (GPLMs)
• Particular semiparametric models are the Generalized Partial Linear Models (GPLMs): they extend GLMs in that the usual parametric terms are augmented by a single nonparametric component:

E(Y \mid X, T) = G\left( X^T \beta + \gamma(T) \right);

• \beta \in \mathbb{R}^m is a vector of parameters, and \gamma(\cdot) is a smooth function, which we try to estimate by splines.
• Assumption: the m-dimensional random vector X represents (typically discrete) covariates, and the q-dimensional random vector T consists of continuous covariates; (X, T) comes from a decomposition of the explanatory variables.
Other interpretations of T: role of the environment, expert opinions, Wiener processes, etc.
8. Newton-Raphson and Scoring Methods
The Newton-Raphson algorithm is based on a quadratic Taylor series approximation.
• An important statistical application of the Newton-Raphson algorithm is given by maximum likelihood estimation:

l(\theta, y) \approx l^a(\theta, y) := l(\theta^0, y) + \frac{\partial l(\theta^0, y)}{\partial \theta} (\theta - \theta^0) + \frac{1}{2} (\theta - \theta^0)^T \frac{\partial^2 l(\theta^0, y)}{\partial \theta \partial \theta^T} (\theta - \theta^0); \quad \theta^0: \text{starting value};

• l(\theta, y) = \log L(\theta, y): log-likelihood function of \theta, based on the observed data y = (y_1, y_2, ..., y_n)^T.
• Next, determine the new iterate \theta^1 from \partial l^a(\theta, y) / \partial \theta = 0:

\theta^1 := \theta^0 - C^{-1} r, \quad r := \frac{\partial l(\theta^0, y)}{\partial \theta}, \ C := \frac{\partial^2 l(\theta^0, y)}{\partial \theta \partial \theta^T}.

• Fisher's scoring method replaces C by the expectation E(C).
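A one-dimensional sketch of this iteration (hypothetical data, not from the slides): Newton-Raphson for the MLE of the rate \theta of an exponential distribution, where l(\theta) = n \log\theta - \theta \sum_i y_i and the closed-form MLE n / \sum_i y_i is available for comparison:

```python
# Newton-Raphson for the MLE of an exponential rate theta,
# l(theta) = n*log(theta) - theta*sum(y); hypothetical data.
y = [0.8, 1.3, 0.4, 2.1, 0.9]           # observed data
n, s = len(y), sum(y)

theta = 1.0                              # starting value theta^0
for _ in range(50):
    r = n / theta - s                    # r := dl/dtheta
    C = -n / theta**2                    # C := d^2 l / dtheta^2
    step = -r / C                        # theta^1 := theta^0 - C^{-1} r
    theta += step
    if abs(step) < 1e-12:
        break

print(abs(theta - n / s) < 1e-10)        # closed-form MLE is n / sum(y)
```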
9. Penalized Maximum Likelihood
• Penalized maximum likelihood criterion for GPLM:

j(\beta, \gamma) := l(\beta, \gamma; y) - \frac{1}{2} \mu \int_a^b (\gamma''(t))^2 \, dt.

• l: log-likelihood of the linear predictor; the second term penalizes the integrated squared curvature of \gamma over the given interval [a, b].
• \mu: smoothing parameter controlling the trade-off between accuracy of the data fitting and its smoothness (stability, robustness or regularity).
• Maximization of j(\beta, \gamma) is performed with B-splines through the local scoring algorithm. For this, we write a degree-k B-spline with knots at the values t_i (i = 1, 2, ..., n) for \gamma:

\gamma(t) = \sum_{j=1}^{v} \lambda_j B_{j,k}(t),

where the \lambda_j are coefficients and the B_{j,k} are degree-k B-spline basis functions.
10. Penalized Maximum Likelihood
• Degree-0 B-splines and the recursion for degree k are defined by

B_{j,0}(t) = \begin{cases} 1, & t_j \le t < t_{j+1} \\ 0, & \text{otherwise}, \end{cases}

B_{j,k}(t) = \frac{t - t_j}{t_{j+k} - t_j} B_{j,k-1}(t) + \frac{t_{j+k+1} - t}{t_{j+k+1} - t_{j+1}} B_{j+1,k-1}(t) \quad (k \ge 1).

• We write \gamma := (\gamma(t_1), ..., \gamma(t_n))^T and define an n \times v matrix B = (B_{ij}) := (B_j(t_i)); then,

\gamma = B\lambda, \quad \lambda = (\lambda_1, \lambda_2, ..., \lambda_v)^T.

• Further, define a v \times v matrix K by

K_{kl} := \int_a^b B_k''(t) B_l''(t) \, dt.
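The recursion above can be implemented directly (a sketch; the uniform knot vector and the partition-of-unity check are illustrative assumptions, not from the slides):

```python
import numpy as np

# Cox-de Boor recursion for B-spline basis functions, as on the slides;
# empty knot spans (zero denominators) are treated as contributing 0.
def bspline_basis(j, k, t, knots):
    if k == 0:
        return 1.0 if knots[j] <= t < knots[j + 1] else 0.0
    left = 0.0
    if knots[j + k] > knots[j]:
        left = (t - knots[j]) / (knots[j + k] - knots[j]) \
            * bspline_basis(j, k - 1, t, knots)
    right = 0.0
    if knots[j + k + 1] > knots[j + 1]:
        right = (knots[j + k + 1] - t) / (knots[j + k + 1] - knots[j + 1]) \
            * bspline_basis(j + 1, k - 1, t, knots)
    return left + right

# Example: cubic (k = 3) basis on a uniform knot vector; inside the
# interval [t_3, t_7] the basis functions sum to 1 (partition of unity).
knots = np.arange(11.0)                      # t_0, ..., t_10
for t in [3.0, 4.5, 5.2, 6.9]:
    total = sum(bspline_basis(j, 3, t, knots) for j in range(7))
    print(abs(total - 1.0) < 1e-12)
```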
11. Penalized Maximum Likelihood
• Then, the criterion j(\beta, \gamma) can be written as

j(\beta, \lambda) = l(\eta, y) - \frac{1}{2} \mu \lambda^T K \lambda.

• If we insert the least-squares estimate \hat\lambda = (B^T B)^{-1} B^T \gamma, we get

j(\beta, \gamma) = l(\eta, y) - \frac{1}{2} \mu \gamma^T \Lambda \gamma, \quad \text{where} \ \Lambda := B (B^T B)^{-1} K (B^T B)^{-1} B^T.

• Now, we will find \hat\beta and \hat\gamma that solve the optimization problem of maximizing j(\beta, \gamma).
• Let

H(\mu) = \eta(X, t) = g_1 + g_2; \quad g_1 := X\beta, \ g_2 := \gamma(t).
12. Penalized Maximum Likelihood
• To maximize j(\beta, \gamma) with respect to g_1 and g_2, we solve the following system of equations:

\frac{\partial j(\beta, \gamma)}{\partial g_1} = \left( \frac{\partial l(\eta, y)}{\partial g_1} \right)^T = 0,

\frac{\partial j(\beta, \gamma)}{\partial g_2} = \left( \frac{\partial l(\eta, y)}{\partial g_2} \right)^T - \mu \Lambda g_2 = 0,

which we treat by the Newton-Raphson method.
• These system equations are nonlinear in \beta and g_2. We linearize them around a current guess \eta^0 by

\frac{\partial l(\eta, y)}{\partial \eta} \approx \frac{\partial l(\eta^0, y)}{\partial \eta} + \frac{\partial^2 l(\eta^0, y)}{\partial \eta \partial \eta^T} (\eta - \eta^0).
13. Penalized Maximum Likelihood
• We use this equation in the system of equations:

\begin{pmatrix} C & C \\ C & C + \mu\Lambda \end{pmatrix} \begin{pmatrix} g_1^1 - g_1^0 \\ g_2^1 - g_2^0 \end{pmatrix} = \begin{pmatrix} r \\ r - \mu\Lambda g_2^0 \end{pmatrix}, \quad r := \frac{\partial l(\eta, y)}{\partial \eta}, \ C := -\frac{\partial^2 l(\eta, y)}{\partial \eta \partial \eta^T},

where (g_1^1, g_2^1) is a Newton-Raphson step from (g_1^0, g_2^0), and C and r are evaluated at \eta^0 = g_1^0 + g_2^0.
• More simple form:

(A*) \quad \begin{pmatrix} C & C \\ S_B & I \end{pmatrix} \begin{pmatrix} g_1^1 \\ g_2^1 \end{pmatrix} = \begin{pmatrix} C h \\ S_B h \end{pmatrix}, \quad h := \eta^0 + C^{-1} r, \ S_B := (C + \mu\Lambda)^{-1} C,

which can be resolved for

g_1^1 = X \left\{ X^T C (I - S_B) X \right\}^{-1} X^T C (I - S_B) h, \quad g_2^1 = S_B (h - g_1^1).
14. Penalized Maximum Likelihood
• \hat\beta and \hat\gamma can be found explicitly, without an inner backfitting loop:

\hat g_1 = X\hat\beta, \quad \hat\beta = \left\{ X^T C (I - S_B) X \right\}^{-1} X^T C (I - S_B) h,

\hat g_2 = S_B (h - X\hat\beta).

• Here, X represents the regression matrix for the input data x_i; S_B computes a weighted B-spline smoothing on the variable t_i, with weights given by C = -\partial^2 l(\eta, y) / \partial\eta \partial\eta^T; and h is the adjusted dependent variable.
15. Penalized Maximum Likelihood
• From the updated \hat\beta, the outer loop must be iterated to update \hat\eta and, hence, h and C; then, the loop is repeated until sufficient convergence is achieved. Step-size optimization is performed by \eta(\omega) = \omega\eta^1 + (1 - \omega)\eta^0, and we turn to maximizing j(\eta(\omega)).
• Standard results on the Newton-Raphson procedure ensure local convergence.
• Asymptotic properties of \hat\beta follow from

\hat\beta = R_B \left( \hat\eta + \hat C^{-1} \hat r \right) = R_B \hat h, \quad \hat r = \frac{\partial l(\hat\eta, y)}{\partial \eta},

where R_B is the weighted additive fit operator. If we replace h, R_B and C by their asymptotic versions h_0, R_{B0} and C_0, then we get the covariance matrix for \hat\beta.
16. Penalized Maximum Likelihood
\mathrm{Cov}(\hat\beta) = R_{B0} C_0^{-1} R_{B0}^T \approx R_B C^{-1} R_B^T \quad (\approx: \text{asymptotically})

and

\mathrm{Cov}(\hat g_s) = R_{Bs} C^{-1} R_{Bs}^T \quad (s = 1, 2).

• Here, h \approx h_0 has mean \eta and variance C_0^{-1} \approx C^{-1}, and R_{Bs} is the matrix that produces \hat g_s from h, based on B-splines.
• Furthermore, \hat\beta is asymptotically normally distributed with covariance matrix R_{B0} C_0^{-1} R_{B0}^T.
17. Penalized Iteratively Reweighted Least Squares (P-IRLS)
The penalized likelihood is maximized by penalized iteratively reweighted least squares: the (p+1)-st estimate \eta^{[p+1]} of the linear predictor is obtained by minimizing

(B*) \quad \left\| \sqrt{C^{[p]}} \left( h^{[p]} - \eta \right) \right\|^2 + \mu \int_a^b (\gamma''(t))^2 \, dt, \quad \eta_i = X_i^T \beta + \gamma(t_i), \ \mu_i^{[p]} = H^{-1}(\eta_i^{[p]}),

where h^{[p]} is the iteratively adjusted dependent variable, given by

h_i^{[p]} := \eta_i^{[p]} + H'(\mu_i^{[p]}) (y_i - \mu_i^{[p]});

here, H' represents the derivative of H with respect to \mu, and C^{[p]} is a diagonal weight matrix with entries

C_{ii}^{[p]} := 1 / \left( V(\mu_i^{[p]}) H'(\mu_i^{[p]})^2 \right),

where V(\mu_i^{[p]}) is proportional to the variance of Y_i according to the current estimate \mu_i^{[p]}.
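One plausible realization of the P-IRLS loop, sketched in numpy (hypothetical Poisson data with log link H = log, so H'(mu) = 1/mu and V(mu) = mu; a low-degree polynomial basis and an identity penalty stand in for the B-spline basis B and the matrix K):

```python
import numpy as np

# P-IRLS sketch: Poisson responses, log link (C_ii = mu_i), quadratic
# penalty mu_s * lam^T K lam on the spline coefficients; data hypothetical.
rng = np.random.default_rng(2)
n = 100
t = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # parametric part
B = np.column_stack([t, t**2, t**3])                    # crude basis stand-in
Z = np.column_stack([X, B])                             # design for (beta, lam)
y = rng.poisson(np.exp(0.4 + 0.3 * X[:, 1] + np.sin(2 * t)))

mu_s = 1.0
P = np.zeros((Z.shape[1], Z.shape[1]))
P[X.shape[1]:, X.shape[1]:] = mu_s * np.eye(B.shape[1]) # stand-in for mu_s * K

coef = np.zeros(Z.shape[1])
for _ in range(50):
    eta = Z @ coef
    m = np.exp(eta)                     # mu = H^{-1}(eta)
    h = eta + (y - m) / m               # adjusted dependent variable h^[p]
    C = m                               # diagonal weights C^[p]
    lhs = Z.T @ (C[:, None] * Z) + P    # penalized weighted normal equations
    rhs = Z.T @ (C * h)
    new = np.linalg.solve(lhs, rhs)
    step = np.max(np.abs(new - coef))
    coef = new
    if step < 1e-10:
        break

print(np.all(np.isfinite(coef)))
print(step < 1e-8)                      # the loop has converged
```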
18. Penalized Iteratively Reweighted Least Squares (P-IRLS)
• If we use \gamma = B\lambda in (B*), it becomes

\left\| \sqrt{C^{[p]}} \left( h^{[p]} - X\beta - B\lambda \right) \right\|^2 + \mu \lambda^T K \lambda.

• With Green and Yandell (1985), we suppose that K is of rank z \le v. Two matrices J and T can be formed such that

J^T K J = I, \quad T^T K T = 0 \quad \text{and} \quad J^T T = 0,

where J and T have v rows and full column ranks z and v - z, respectively. Then, rewriting \lambda as

(C*) \quad \lambda = J\delta + T\xi

with vectors \delta, \xi of dimensions z and v - z, respectively, (B*) becomes

\left\| \sqrt{C^{[p]}} \left( h^{[p]} - X\beta - BT\xi - BJ\delta \right) \right\|^2 + \mu \delta^T \delta.
19. Penalized Iteratively Reweighted Least Squares (P-IRLS)
• Using a Householder (QR) decomposition, the minimization can be split by separating the solution with respect to (\beta, \xi) from the one with respect to \delta:

(D*) \quad Q_1^T \sqrt{C^{[p]}} (X, BT) = R, \quad Q_2^T \sqrt{C^{[p]}} (X, BT) = 0,

where Q = (Q_1, Q_2) is orthogonal and R is nonsingular, upper triangular and of full rank m + v - z. Then, we get the bilevel minimization problem of

(E*_upper) \quad \left\| Q_1^T \sqrt{C^{[k]}} h^{[k]} - R (\beta^T, \xi^T)^T - Q_1^T \sqrt{C^{[k]}} BJ\delta \right\|^2 \quad \text{(upper level)}

with respect to (\beta, \xi), given \delta based on minimizing

(E*_lower) \quad \left\| Q_2^T \sqrt{C^{[k]}} h^{[k]} - Q_2^T \sqrt{C^{[k]}} BJ\delta \right\|^2 + \mu \delta^T \delta \quad \text{(lower level)}.
20. Penalized Iteratively Reweighted Least Squares (P-IRLS)
• The term E*_upper can be set to 0, since R is nonsingular.
• If we put

\upsilon := Q_2^T \sqrt{C^{[k]}} h^{[k]}, \quad V := Q_2^T \sqrt{C^{[k]}} BJ,

E*_lower becomes the problem of minimizing

\left\| \upsilon - V\delta \right\|^2 + \mu \left\| \delta \right\|^2,

which is a ridge regression problem. The solution is

\delta = (V^T V + \mu I)^{-1} V^T \upsilon.

The other parameters can be found as

(\beta^T, \xi^T)^T = R^{-1} Q_1^T \sqrt{C^{[k]}} \left( h^{[k]} - BJ\delta \right).

• Now, we can compute \lambda using (C*) and, finally,

\eta^{[p+1]} = X\beta + B\lambda.
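The ridge step can be sketched as follows (hypothetical \upsilon and V; the closed form is checked against the equivalent augmented least-squares system):

```python
import numpy as np

rng = np.random.default_rng(3)
z_dim, nrow, mu = 4, 12, 0.7
V = rng.normal(size=(nrow, z_dim))
ups = rng.normal(size=nrow)

# Ridge solution  delta = (V^T V + mu*I)^{-1} V^T upsilon
delta = np.linalg.solve(V.T @ V + mu * np.eye(z_dim), V.T @ ups)

# Same minimizer via the augmented least-squares system
#   min || (upsilon; 0) - (V; sqrt(mu)*I) delta ||^2
A = np.vstack([V, np.sqrt(mu) * np.eye(z_dim)])
b = np.concatenate([ups, np.zeros(z_dim)])
delta_ls = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(delta, delta_ls))
```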
21. An Alternative Solution for (P-IRLS) with CQP
• Both the penalized maximum likelihood and the P-IRLS method contain a smoothing parameter \mu. This parameter can be estimated by
o generalized cross validation (GCV),
o minimization of an unbiased risk estimator (UBRE).
• A different method to solve P-IRLS is conic quadratic programming (CQP). Using a Cholesky decomposition of the v \times v matrix K in (B*), K = U^T U, (B*) becomes

(F*) \quad \left\| W (\beta^T, \lambda^T)^T - v \right\|^2 + \mu \left\| U\lambda \right\|^2, \quad W := \sqrt{C^{[p]}} (X, B), \ v := \sqrt{C^{[p]}} h^{[p]}.

• The regression problem (F*) can be reinterpreted as

(H*) \quad \min_{\beta, \lambda} G(\beta, \lambda), \quad G(\beta, \lambda) := \left\| W (\beta^T, \lambda^T)^T - v \right\|^2,

subject to g(\lambda) \le 0, where g(\lambda) := \left\| U\lambda \right\|^2 - M, \ M \ge 0.
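The connection between the penalized problem (F*) and the constrained problem (H*) can be checked numerically: for a fixed \mu, the penalized minimizer satisfies the stationarity condition of (H*) with Lagrange multiplier \mu (all matrices below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, v_dim, mu = 30, 2, 5, 0.8
W = rng.normal(size=(n, m + v_dim))
vvec = rng.normal(size=n)
U = np.triu(rng.normal(size=(v_dim, v_dim))) + 3 * np.eye(v_dim)
K = U.T @ U                                    # K = U^T U (Cholesky form)

P = np.zeros((m + v_dim, m + v_dim))
P[m:, m:] = K                                  # ||U lam||^2 = z^T P z
zhat = np.linalg.solve(W.T @ W + mu * P, W.T @ vvec)   # penalized minimizer

grad_G = 2 * W.T @ (W @ zhat - vvec)           # gradient of G at zhat
grad_g = 2 * P @ zhat                          # gradient of ||U lam||^2
print(np.allclose(grad_G + mu * grad_g, 0, atol=1e-8))  # stationarity holds
```

With M := \|U\hat\lambda\|^2, the constraint is active and \mu plays the role of its multiplier.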
22. An Alternative Solution for (P-IRLS) with CQP
• Then, our optimization problem (H*) is equivalent to

\min_{t, \beta, \lambda} t, \quad \text{subject to} \quad \left\| W (\beta^T, \lambda^T)^T - v \right\|^2 \le t^2, \ t \ge 0, \ \left\| U\lambda \right\|^2 \le M.

Here, W and U are n \times (m+v) and v \times v matrices, and (\beta^T, \lambda^T)^T and v are (m+v)- and n-vectors, respectively.
• This means:

(I*) \quad \min_{t, \beta, \lambda} t, \quad \text{where} \quad \left\| W (\beta^T, \lambda^T)^T - v \right\| \le t, \ \left\| U\lambda \right\| \le \sqrt{M}.
23. An Alternative Solution for (P-IRLS) with CQP
• A conic quadratic programming (CQP) problem has the form

\min_x c^T x, \quad \text{where} \quad \left\| D_i x - d_i \right\| \le p_i^T x - q_i \quad (i = 1, 2, ..., k);

our problem is a CQP with

c = (1, 0_{m+v}^T)^T, \ x = (t, \beta^T, \lambda^T)^T, \ D_1 = (0_n, W), \ d_1 = v, \ p_1 = (1, 0, ..., 0)^T, \ q_1 = 0,
D_2 = (0_{v \times (m+1)}, U), \ d_2 = 0_v, \ p_2 = 0_{m+v+1}, \ q_2 = -\sqrt{M}; \ k = 2.

• We first reformulate (I*) as a primal problem:

\min t, \quad \text{such that}

\chi := \begin{pmatrix} 0_n & W \\ 1 & 0_{m+v}^T \end{pmatrix} \begin{pmatrix} t \\ \beta \\ \lambda \end{pmatrix} + \begin{pmatrix} -v \\ 0 \end{pmatrix}, \quad \eta := \begin{pmatrix} 0_{v \times (m+1)} & U \\ 0 & 0_{m+v}^T \end{pmatrix} \begin{pmatrix} t \\ \beta \\ \lambda \end{pmatrix} + \begin{pmatrix} 0_v \\ \sqrt{M} \end{pmatrix},

\chi \in L^{n+1}, \ \eta \in L^{v+1},
24. An Alternative Solution for (P-IRLS) with CQP
with ice-cream (or second-order, or Lorentz) cones

L^{l+1} := \left\{ x = (x_1, ..., x_{l+1})^T \in \mathbb{R}^{l+1} \mid x_{l+1} \ge \sqrt{x_1^2 + x_2^2 + ... + x_l^2} \right\}.

• The corresponding dual problem is

\max \ (v^T, 0)\,\omega_1 + (0_v^T, -\sqrt{M})\,\omega_2

\text{such that} \quad \begin{pmatrix} 0_n^T & 1 \\ W^T & 0_{m+v} \end{pmatrix} \omega_1 + \begin{pmatrix} 0_{(m+1) \times v} & 0_{m+1} \\ U^T & 0_v \end{pmatrix} \omega_2 = \begin{pmatrix} 1 \\ 0_{m+v} \end{pmatrix},

\omega_1 \in L^{n+1}, \ \omega_2 \in L^{v+1}.
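A short numpy sketch building the CQP data blocks D_1, d_1, p_1, q_1 above and checking, at a feasible trial point, that the cone constraint reproduces \| W(\beta^T, \lambda^T)^T - v \| \le t (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, v_dim = 8, 2, 3
W = rng.normal(size=(n, m + v_dim))
vvec = rng.normal(size=n)

# CQP data for the first cone constraint
D1 = np.column_stack([np.zeros(n), W])        # D_1 = (0_n, W)
d1 = vvec                                     # d_1 = v
p1 = np.eye(1 + m + v_dim)[0]                 # p_1 = (1, 0, ..., 0)^T
q1 = 0.0

z = rng.normal(size=m + v_dim)                # a trial (beta, lambda)
t = np.linalg.norm(W @ z - vvec) + 0.1        # a feasible t
x = np.concatenate([[t], z])                  # x = (t, beta, lambda)

lhs = np.linalg.norm(D1 @ x - d1)             # ||D_1 x - d_1||
rhs = p1 @ x - q1                             # p_1^T x - q_1 = t
print(np.isclose(lhs, np.linalg.norm(W @ z - vvec)))
print(lhs <= rhs)
```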
25. Solution Methods
• Polynomial-time algorithms are requested.
– Usually, only local information on the objective and the constraints is given.
– Such an algorithm cannot utilize a priori knowledge of the problem's structure.
– CQPs belong to the well-structured convex problems.
• Interior point algorithms:
– exploit the structure of the problem,
– yield better complexity bounds,
– exhibit much better practical performance.
26. Outlook
An important new class of GPLMs:

E(Y \mid X, T) = G\left( X^T \beta + \gamma(T) \right), \quad \text{e.g.,}

GPLM(X, T) = LM(X) + MARS(T)

(Figure: the MARS truncated-power basis functions c^-(x, \tau) = [-(x - \tau)]_+ and c^+(x, \tau) = [+(x - \tau)]_+, together with a CMARS illustration.)
28. References
[1] Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic
Press, 2004.
[2] Craven, P., and Wahba, G., Smoothing noisy data with spline functions, Numer. Math. 31
(1979), 377-403.
[3] De Boor, C., Practical Guide to Splines, Springer Verlag, 2001.
[4] Dongarra, J.J., Bunch, J.R., Moler, C.B., and Stewart, G.W., Linpack User’s Guide, Philadelphia,
SIAM, 1979.
[5] Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics 19, 1
(1991), 1-141.
[6] Green, P.J., and Yandell, B.S., Semi-Parametric Generalized Linear Models, Lecture Notes in
Statistics, 32 (1985).
[7] Hastie, T.J., and Tibshirani, R.J., Generalized Additive Models, New York, Chapman and Hall,
1990.
[8] Kincaid, D., and Cheney, W., Numerical Analysis: Mathematics of Scientific Computing, Pacific
Grove, 2002.
[9] Müller, M., Estimation and testing in generalized partial linear models – a comparative study,
Statistics and Computing 11 (2001), 299-309.
[10] Nelder, J.A., and Wedderburn, R.W.M., Generalized linear models, Journal of the Royal Statistical
Society A 135 (1972), 370-384.
[11] Nemirovski, A., Lectures on modern convex optimization, Israel Institute of Technology
http://iew3.technion.ac.il/Labs/Opt/opt/LN/Final.pdf.
29. References
[12] Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming,
SIAM, 1993.
[13] Ortega, J.M., and Rheinboldt, W.C., Iterative Solution of Nonlinear Equations in Several
Variables, Academic Press, New York, 1970.
[14] Renegar, J., Mathematical View of Interior Point Methods in Convex Programming, SIAM,
2000.
[15] Scheid, F., Numerical Analysis, McGraw-Hill Book Company, New York, 1968.
[16] Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized
additive and continuous optimization for modern applications in finance, science and
technology, Optimization 56, 5-6 (2007), pp. 1-24.
[17] Taylan, P., Weber, G.-W., and Liu, L., On foundations of parameter estimation for
generalized partial linear models with B-splines and continuous optimization, in the
proceedings of PCO 2010, 3rd Global Conference on Power Control and Optimization,
February 2-4, 2010, Gold Coast, Queensland, Australia.
[18] Weber, G.-W., Akteke-Öztürk, B., İşcanoğlu, A., Özöğür, S., and Taylan, P., Data Mining:
Clustering, Classification and Regression, four lectures given at the Graduate Summer
School on New Advances in Statistics, Middle East Technical University, Ankara, Turkey,
August 11-24, 2007 (http://www.statsummer.com/).
[19] Wood, S.N., Generalized Additive Models, An Introduction with R, New York, Chapman
and Hall, 2006.
30. Thank you very much for your attention!
http://www3.iam.metu.edu.tr/iam/images/7/73/Willi-CV.pdf