RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING

                                   Jerome H. Friedman
                              Department of Statistics and
                           Stanford Linear Accelerator Center,
                        Stanford University, Stanford, CA 94305
                                   (jhf@stanford.edu)
   Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In predictive (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning originated many years ago at the dawn of the computer age. Recently two new techniques have emerged that have revitalized the field. These are support vector machines and boosted decision trees. This paper provides an introduction to these two new methods, tracing their respective ancestral roots to standard kernel methods and ordinary decision trees.


                          I.   INTRODUCTION

   The predictive or machine learning problem is easy to state if difficult to solve in general. Given a set of measured values of attributes/characteristics/properties on an object (observation) x = (x1, x2, ···, xn) (often called "variables"), the goal is to predict (estimate) the unknown value of another attribute y. The quantity y is called the "output" or "response" variable, and x = {x1, ···, xn} are referred to as the "input" or "predictor" variables. The prediction takes the form of a function

                 ŷ = F(x1, x2, ···, xn) = F(x)

that maps a point x in the space of all joint values of the predictor variables to a point ŷ in the space of response values. The goal is to produce a "good" predictive F(x). This requires a definition for the quality, or lack of quality, of any particular F(x). The most commonly used measure of lack of quality is prediction "risk". One defines a "loss" criterion that reflects the cost of mistakes: L(y, ŷ) is the loss or cost of predicting a value ŷ for the response when its true value is y. The prediction risk is defined as the average loss over all predictions

                 R(F) = E_{yx} L(y, F(x))                                           (1)

where the average (expected value) is over the joint (population) distribution of all of the variables (y, x), which is represented by a probability density function p(y, x). Thus, the goal is to find a mapping function F(x) with low predictive risk.

   Given a function f of elements w in some set, the choice of w that gives the smallest value of f(w) is called arg min_w f(w). This definition applies to all types of sets including numbers, vectors, colors, or functions. In terms of this notation the optimal predictor with lowest predictive risk (called the "target function") is given by

                 F∗ = arg min_F R(F).                                               (2)

Given joint values for the input variables x, the optimal prediction for the output variable is ŷ = F∗(x).

   When the response takes on numeric values y ∈ R¹, the learning problem is called "regression" and commonly used loss functions include absolute error L(y, F) = |y − F|, and even more commonly squared-error L(y, F) = (y − F)² because algorithms for minimization of the corresponding risk tend to be much simpler. In the "classification" problem the response takes on a discrete set of K unorderable categorical values (names or class labels) y, F ∈ {c1, ···, cK} and the loss criterion L_{y,F} becomes a discrete K × K matrix.

   There are a variety of ways one can go about trying to find a good predicting function F(x). One might seek the opinions of domain experts, formally codified in the "expert systems" approach of artificial intelligence. In predictive or machine learning one uses data. A "training" data base

                 D = {yi, xi1, xi2, ···, xin}_1^N = {yi, xi}_1^N                    (3)

of N previously solved cases is presumed to exist for which the values of all variables (response and predictors) have been jointly measured. A "learning" procedure is applied to these data in order to extract (estimate) a good predicting function F(x). There are a great many commonly used learning procedures. These include linear/logistic regression, neural networks, kernel methods, decision trees, multivariate splines (MARS), etc. For descriptions of a large number of such learning procedures see Hastie, Tibshirani and Friedman 2001.
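
   As a concrete illustration of (1) and (3), the risk of any candidate predictor can be approximated by its average loss over the N training cases. The following minimal Python sketch is purely illustrative; the toy data, the crude linear candidate predictor and all names are assumptions, not part of the text.

    import numpy as np

    def empirical_risk(F, X, y, loss=lambda y, f: (y - f) ** 2):
        """Average loss of predictor F over the training data base D = {y_i, x_i}, i = 1..N."""
        predictions = np.array([F(x) for x in X])
        return np.mean(loss(y, predictions))

    # Toy data base (3): N = 100 cases, n = 3 predictor variables.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X[:, 0] + 0.1 * rng.normal(size=100)

    F = lambda x: 0.8 * x[0]          # a (deliberately crude) candidate predictor
    print(empirical_risk(F, X, y))    # estimate of the risk R(F) in (1)
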
   Most machine learning procedures have been around for a long time and most research in the field has concentrated on producing refinements to these long standing methods. However, in the past several years there has been a revolution in the field inspired by the introduction of two new approaches: the extension of kernel methods to support vector machines (Vapnik 1995), and the extension of decision trees by
boosting (Freund and Schapire 1996, Friedman 2001). It is the purpose of this paper to provide an introduction to these new developments. First the classic kernel and decision tree methods are introduced. Then the extension of kernels to support vector machines is described, followed by a description of applying boosting to extend decision tree methods. Finally, similarities and differences between these two approaches will be discussed.

   Although arguably the most influential recent developments, support vector machines and boosting are not the only important advances in machine learning in the past several years. Owing to space limitations these are the ones discussed here. There have been other important developments that have considerably advanced the field as well. These include (but are not limited to) the bagging and random forest techniques of Breiman 1996 and 2001 that are somewhat related to boosting, and the reproducing kernel Hilbert space methods of Wahba 1990 that share similarities with support vector machines. It is hoped that this article will inspire the reader to investigate these as well as other machine learning procedures.


                          II.   KERNEL METHODS

   Kernel methods for predictive learning were introduced by Nadaraya (1964) and Watson (1964). Given the training data (3), the response estimate ŷ for a set of joint values x is taken to be a weighted average of the training responses {yi}_1^N:

                 ŷ = FN(x) = Σ_{i=1}^N yi K(x, xi) / Σ_{i=1}^N K(x, xi).            (4)

The weight K(x, xi) assigned to each response value yi depends on its location xi in the predictor variable space and the location x where the prediction is to be made. The function K(x, x′) defining the respective weights is called the "kernel function", and it defines the kernel method. Often the form of the kernel function is taken to be

                 K(x, x′) = g(d(x, x′)/σ)                                           (5)

where d(x, x′) is a defined "distance" between x and x′, σ is a scale ("smoothing") parameter, and g(z) is a (usually monotone) decreasing function with increasing z; often g(z) = exp(−z²/2). Using this kernel (5), the estimate ŷ (4) is a weighted average of {yi}_1^N, with more weight given to observations i for which d(x, xi) is small. The value of σ defines "small". The distance function d(x, x′) must be specified for each particular application.
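
   A direct transcription of (4) and (5) into code makes the roles of the kernel and the smoothing parameter explicit. The Python sketch below is a minimal illustration only, assuming Euclidean distance (6) and the Gaussian-shaped g(z) mentioned above; the function and variable names are hypothetical.

    import numpy as np

    def kernel_predict(x, X_train, y_train, sigma=1.0):
        """Kernel estimate (4) with kernel (5):
        K(x, x_i) = g(d(x, x_i)/sigma),  g(z) = exp(-z^2/2),  d = Euclidean distance."""
        d = np.linalg.norm(X_train - x, axis=1)      # d(x, x_i) for every training point
        K = np.exp(-0.5 * (d / sigma) ** 2)          # kernel weights
        return np.sum(y_train * K) / np.sum(K)       # weighted average of the responses

   Note that every prediction scans the entire training sample, which is the computational drawback discussed next.
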
   Kernel methods have several advantages that make them potentially attractive. They represent a universal approximator; as the training sample size N becomes arbitrarily large, N → ∞, the kernel estimate (4) (5) approaches the optimal predicting target function (2), FN(x) → F∗(x), provided the value chosen for the scale parameter σ as a function of N approaches zero, σ(N) → 0, at a slower rate than 1/N. This result holds for almost any distance function d(x, x′); only very mild restrictions (such as convexity) are required. Another advantage of kernel methods is that no training is required to build a model; the training data set is the model. Also, the procedure is conceptually quite simple and easily explained.

   Kernel methods suffer from some disadvantages that have kept them from becoming highly used in practice, especially in data mining applications. Since there is no model, they provide no easily understood model summary. Thus, they cannot be easily interpreted. There is no way to discern how the function FN(x) (4) depends on the respective predictor variables x. Kernel methods produce a "black-box" prediction machine. In order to make each prediction, the kernel method needs to examine the entire data base. This requires enough random access memory to store the entire data set, and the computation required to make each prediction is proportional to the training sample size N. For large data sets this is much slower than that for competing methods.

   Perhaps the most serious limitation of kernel methods is statistical. For any finite N, performance (prediction accuracy) depends critically on the chosen distance function d(x, x′), especially for regression y ∈ R¹. When there are more than a few predictor variables, even the largest data sets produce a very sparse sampling in the corresponding n-dimensional predictor variable space. This is a consequence of the so called "curse-of-dimensionality" (Bellman 1962). In order for kernel methods to perform well, the distance function must be carefully matched to the (unknown) target function (2), and the procedure is not very robust to mismatches.

   As an example, consider the often used Euclidean distance function

                 d(x, x′) = [ Σ_{j=1}^n (xj − x′j)² ]^{1/2}.                        (6)

If the target function F∗(x) predominantly depends on only a small subset of the predictor variables, then performance will be poor because the kernel function (5) (6) depends on all of the predictors with equal strength. If one happened to know which variables were the important ones, an appropriate kernel could be constructed. However, this knowledge is often not available. Such "kernel customizing" is a requirement with kernel methods, but it is difficult to do without considerable a priori knowledge concerning the problem at hand.

   The performance of kernel methods tends to be fairly insensitive to the detailed choice of the function
g(z) (5), but somewhat more sensitive to the value chosen for the smoothing parameter σ. A good value depends on the (usually unknown) smoothness properties of the target function F∗(x), as well as the sample size N and the signal/noise ratio.


                          III.   DECISION TREES

   Decision trees were developed largely in response to the limitations of kernel methods. Detailed descriptions are contained in monographs by Breiman, Friedman, Olshen and Stone 1983, and by Quinlan 1992. The minimal description provided here is intended as an introduction sufficient for understanding what follows.

   A decision tree partitions the space of all joint predictor variable values x into J disjoint regions {Rj}_1^J. A response value ŷj is assigned to each corresponding region Rj. For a given set of joint predictor values x, the tree prediction ŷ = TJ(x) assigns as the response estimate the value assigned to the region containing x

                 x ∈ Rj  ⇒  TJ(x) = ŷj.                                             (7)

Given a set of regions, the optimal response values associated with each one are easily obtained, namely the value that minimizes prediction risk in that region

                 ŷj = arg min_{y′} E_y [L(y, y′) | x ∈ Rj].                         (8)

The difficult problem is to find a good set of regions {Rj}_1^J. There are a huge number of ways to partition the predictor variable space, the vast majority of which would provide poor predictive performance. In the context of decision trees, choice of a particular partition directly corresponds to choice of a distance function d(x, x′) and scale parameter σ in kernel methods. Unlike with kernel methods where this choice is the responsibility of the user, decision trees attempt to use the data to estimate a good partition.

   Unfortunately, finding the optimal partition requires computation that grows exponentially with the number of regions J, so that this is only possible for very small values of J. All tree based methods use a greedy top-down recursive partitioning strategy to induce a good set of regions given the training data set (3). One starts with a single region covering the entire space of all joint predictor variable values. This is partitioned into two regions by choosing an optimal splitting predictor variable xj and a corresponding optimal split point s. Points x for which xj ≤ s are defined to be in the left daughter region, and those for which xj > s comprise the right daughter region. Each of these two daughter regions is then itself optimally partitioned into two daughters of its own in the same manner, and so on. This recursive partitioning continues until the observations within each region all have the same response value y. At this point a recursive recombination strategy ("tree pruning") is employed in which sibling regions are in turn merged in a bottom-up manner until the number of regions J∗ that minimizes an estimate of future prediction risk is reached (see Breiman et al 1983, Ch. 3).
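
   For squared-error loss the greedy strategy just described can be sketched in a few lines. The toy Python implementation below is an illustration only: it omits the pruning step, uses an ad hoc minimum-leaf-size stopping rule, and all names are assumptions rather than anything specified in the text.

    import numpy as np

    def build_tree(X, y, min_leaf=5):
        """Greedy top-down partitioning for squared-error loss; returns a nested dict."""
        if len(y) < 2 * min_leaf:
            return {"prediction": y.mean()}          # (8): region estimate = mean response
        best = None
        for j in range(X.shape[1]):                  # candidate splitting variable x_j
            for s in np.unique(X[:, j])[:-1]:        # candidate split point s
                left = X[:, j] <= s
                if left.sum() < min_leaf or (~left).sum() < min_leaf:
                    continue
                risk = ((y[left] - y[left].mean()) ** 2).sum() + \
                       ((y[~left] - y[~left].mean()) ** 2).sum()
                if best is None or risk < best[0]:
                    best = (risk, j, s)
        if best is None:
            return {"prediction": y.mean()}
        _, j, s = best
        left = X[:, j] <= s
        return {"var": j, "split": s,
                "left":  build_tree(X[left], y[left], min_leaf),
                "right": build_tree(X[~left], y[~left], min_leaf)}

    def tree_predict(node, x):
        """(7): return the value assigned to the region containing x."""
        while "prediction" not in node:
            node = node["left"] if x[node["var"]] <= node["split"] else node["right"]
        return node["prediction"]
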

                     A.   Decision tree properties

   Decision trees are the most popular predictive learning method used in data mining. There are a number of reasons for this. As with kernel methods, decision trees represent a universal method. As the training data set becomes arbitrarily large, N → ∞, tree based predictions (7) (8) approach those of the target function (2), TJ(x) → F∗(x), provided the number of regions grows arbitrarily large, J(N) → ∞, but at a rate slower than N.

   In contrast to kernel methods, decision trees do produce a model summary. It takes the form of a binary tree graph. The root node of the tree represents the entire predictor variable space, and the (first) split into its daughter regions. Edges connect the root to two descendent nodes below it, representing these two daughter regions and their respective splits, and so on. Each internal node of the tree represents an intermediate region and its optimal split, defined by one of the predictor variables xj and a split point s. The terminal nodes represent the final region set {Rj}_1^J used for prediction (7). It is this binary tree graphic that is most responsible for the popularity of decision trees. No matter how high the dimensionality of the predictor variable space, or how many variables are actually used for prediction (splits), the entire model can be represented by this two-dimensional graphic, which can be plotted and then examined for interpretation. For examples of interpreting binary tree representations see Breiman et al 1983 and Hastie, Tibshirani and Friedman 2001.

   Tree based models have other advantages as well that account for their popularity. Training (tree building) is relatively fast, scaling as nN log N with the number of variables n and training observations N. Subsequent prediction is extremely fast, scaling as log J with the number of regions J. The predictor variables need not all be numeric valued. Trees can seamlessly accommodate binary and categorical variables. They also have a very elegant way of dealing with missing variable values in both the training data and future observations to be predicted (see Breiman et al 1983, Ch. 5.3).

   One property that sets tree based models apart from all other techniques is their invariance to monotone transformations of the predictor variables. Replacing any subset of the predictor variables {xj} by (possibly different) arbitrary strictly monotone
functions of them {xj ← mj(xj)}, gives rise to the same tree model. Thus, there is no issue of having to experiment with different possible transformations mj(xj) for each individual predictor xj, to try to find the best ones. This invariance provides immunity to the presence of extreme values "outliers" in the predictor variable space. It also provides invariance to changing the measurement scales of the predictor variables, something to which kernel methods can be very sensitive.

   Another advantage of trees over kernel methods is fairly high resistance to irrelevant predictor variables. As discussed in Section II, the presence of many such irrelevant variables can highly degrade the performance of kernel methods based on generic kernels that involve all of the predictor variables such as (6). Since the recursive tree building algorithm estimates the optimal variable on which to split at each step, predictors unrelated to the response tend not to be chosen for splitting. This is a consequence of attempting to find a good partition based on the data. Also, trees have few tunable parameters so they can be used as an "off-the-shelf" procedure.

   The principal limitation of tree based methods is that in situations not especially advantageous to them, their performance tends not to be competitive with other methods that might be used in those situations. One problem limiting accuracy is the piecewise-constant nature of the predicting model. The predictions ŷj (8) are constant within each region Rj and sharply discontinuous across region boundaries. This is purely an artifact of the model, and target functions F∗(x) (2) occurring in practice are not likely to share this property. Another problem with trees is instability. Changing the values of just a few observations can dramatically change the structure of the tree, and substantially change its predictions. This leads to high variance in potential predictions TJ(x) at any particular prediction point x over different training samples (3) that might be drawn from the system under study. This is especially the case for large trees.

   Finally, trees fragment the data. As the recursive splitting proceeds each daughter region contains fewer observations than its parent. At some point regions will contain too few observations and cannot be further split. Paths from the root to the terminal nodes tend to contain on average a relatively small fraction of all of the predictor variables that thereby define the region boundaries. Thus, each prediction involves only a relatively small number of predictor variables. If the target function is influenced by only a small number of (potentially different) variables in different local regions of the predictor variable space, then trees can produce accurate results. But, if the target function depends on a substantial fraction of the predictors everywhere in the space, trees will have problems.


                          IV.   RECENT ADVANCES

   Both kernel methods and decision trees have been around for a long time. Trees have seen active use, especially in data mining applications. The classic kernel approach has seen somewhat less use. As discussed above, both methodologies have (different) advantages and disadvantages. Recently, these two technologies have been completely revitalized in different ways by addressing different aspects of their corresponding weaknesses; support vector machines (Vapnik 1995) address the computational problems of kernel methods, and boosting (Freund and Schapire 1996, Friedman 2001) improves the accuracy of decision trees.


                     A.   Support vector machines (SVM)

   A principal goal of the SVM approach is to fix the computational problem of predicting with kernels (4). As discussed in Section II, in order to make a kernel prediction a pass over the entire training data base is required. For large data sets this can be too time consuming and it requires that the entire data base be stored in random access memory.

   Support vector machines were introduced for the two-class classification problem. Here the response variable realizes only two values (class labels) which can be respectively encoded as

                 y =  +1   label = class 1
                      −1   label = class 2.                                         (9)

The average or expected value of y given a set of joint predictor variable values x is

                 E[y | x] = 2 · Pr(y = +1 | x) − 1.                                 (10)

Prediction error rate is minimized by predicting at x the class with the highest probability, so that the optimal prediction is given by

                 y∗(x) = sign(E[y | x]).

From (4) the kernel estimate of (10) based on the training data (3) is given by

                 Ê[y | x] = FN(x) = Σ_{i=1}^N yi K(x, xi) / Σ_{i=1}^N K(x, xi)      (11)

and, assuming a strictly non negative kernel K(x, xi), the prediction estimate is

                 ŷ(x) = sign(Ê[y | x]) = sign( Σ_{i=1}^N yi K(x, xi) ).             (12)

   Note that ignoring the denominator in (11) to obtain (12) removes information concerning the absolute value of Pr(y = +1 | x); only the estimated sign of (10) is retained for classification.

   A support vector machine is a weighted kernel classifier

                 ŷ(x) = sign( a0 + Σ_{i=1}^N αi yi K(x, xi) ).                      (13)

Each training observation (yi, xi) has an associated coefficient αi additionally used with the kernel K(x, xi) to evaluate the weighted sum (13) comprising the kernel estimate ŷ(x). The goal is to choose a set of coefficient values {αi}_1^N so that many αi = 0 while still maintaining prediction accuracy. The observations associated with non zero valued coefficients {xi | αi ≠ 0} are called "support vectors". Clearly from (13) only the support vectors are required to do prediction. If the number of support vectors is a small fraction of the total number of observations, the computation required for prediction is thereby much reduced.
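
   In code, the only difference between the ordinary kernel rule (12) and the SVM rule (13) is the coefficient vector: in (13) most αi are zero, so the sum runs over the support vectors alone. The sketch below is purely illustrative (how the αi and a0 are actually obtained is the subject of the following subsections), and the names are hypothetical.

    import numpy as np

    def svm_predict(x, X_train, y_train, alpha, a0, kernel):
        """Weighted kernel classifier (13); only observations with alpha_i != 0 contribute."""
        sv = alpha != 0                               # the support vectors
        s = a0 + np.sum(alpha[sv] * y_train[sv] *
                        np.array([kernel(x, xi) for xi in X_train[sv]]))
        return np.sign(s)

   With alpha = np.ones(N) and a0 = 0 this reduces to the plain kernel classification rule (12).
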

                            1.   Kernel trick

   In order to see how to accomplish this goal consider a different formulation. Suppose that instead of using the original measured variables x = (x1, ···, xn) as the basis for prediction, one constructs a very large number of (nonlinear) functions of them

                 {zk = hk(x)}_1^M                                                   (14)

for use in prediction. Here each hk(x) is a different function (transformation) of x. For any given x, z = {zk}_1^M represents a point in an M-dimensional space where M >> dim(x) = n. Thus, the number of "variables" used for classification is dramatically expanded. The procedure constructs a simple linear classifier in z-space

                 ŷ(z) = sign( β0 + Σ_{k=1}^M βk zk )
                      = sign( β0 + Σ_{k=1}^M βk hk(x) ).

This is a highly non-linear classifier in x-space owing to the nonlinearity of the derived transformations {hk(x)}_1^M.

   An important ingredient for calculating such a linear classifier is the inner product between the points representing two observations i and j

                 zi^T zj = Σ_{k=1}^M zik zjk
                         = Σ_{k=1}^M hk(xi) hk(xj)                                  (15)
                         = H(xi, xj).

This (highly nonlinear) function of the x-variables, H(xi, xj), defines the simple bilinear inner product zi^T zj in z-space.

   Suppose for example the derived variables (14) were taken to be all d-degree polynomials in the original predictor variables {xj}_1^n. That is, zk = x_{i1(k)} x_{i2(k)} ··· x_{id(k)}, with k labeling each of the M = (n+1)^d possible sets of d integers, 0 ≤ ij(k) ≤ n, and with the added convention that x0 = 1 even though x0 is not really a component of the vector x. In this case the number of derived variables is M = (n+1)^d, which is the order of computation for obtaining zi^T zj directly from the z variables. However, using

                 zi^T zj = H(xi, xj) = (xi^T xj + 1)^d                              (16)

reduces the computation to order n, the much smaller number of originally measured variables. Thus, if for any particular set of derived variables (14), the function H(xi, xj) that defines the corresponding inner products zi^T zj in terms of the original x-variables can be found, computation can be considerably reduced.
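
   The computational claim can be checked numerically: the inner product of the explicitly constructed degree-d feature vectors agrees with (16) evaluated directly in x-space. The brute-force expansion below is an illustration only (n = 2, d = 2, hypothetical helper names).

    import numpy as np
    from itertools import product

    def z_features(x, d):
        """All degree-d products of (x_0, x_1, ..., x_n) with the convention x_0 = 1  (14)."""
        x_aug = np.concatenate(([1.0], x))
        return np.array([np.prod([x_aug[i] for i in idx])
                         for idx in product(range(len(x_aug)), repeat=d)])

    xi, xj, d = np.array([1.0, 2.0]), np.array([0.5, -1.0]), 2
    explicit   = z_features(xi, d) @ z_features(xj, d)     # order (n+1)^d work
    via_kernel = (xi @ xj + 1.0) ** d                       # order n work, eq. (16)
    print(explicit, via_kernel)                             # identical up to round-off
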
   As an example of a very simple linear classifier in z-space, consider one based on nearest-means,

                 ŷ(z) = sign( ||z − z̄−||² − ||z − z̄+||² ).                          (17)

Here z̄± are the respective means of the y = +1 and y = −1 observations

                 z̄± = (1/N±) Σ_{yi=±1} zi.

For simplicity, let N+ = N− = N/2. Choosing the midpoint between z̄+ and z̄− as the coordinate system origin, the decision rule (17) can be expressed as

                 ŷ(z) = sign( z^T (z̄+ − z̄−) )                                       (18)
                      = sign( Σ_{yi=1} z^T zi − Σ_{yi=−1} z^T zi )
                      = sign( Σ_{i=1}^N yi z^T zi )
                      = sign( Σ_{i=1}^N yi H(x, xi) ).

   Comparing this (18) with (12) (13), one sees that the ordinary kernel rule ({αi = 1}_1^N) in x-space is the nearest-means classifier in the z-space of derived variables (14) whose inner product is given by the kernel function zi^T zj = K(xi, xj). Therefore, to construct an (implicit) nearest-means classifier in z-space, all computations can be done in x-space because they only depend on evaluating inner products. The explicit transformations (14) need never be defined or even known.


                     2.   Optimal separating hyperplane

   Nearest-means is an especially simple linear classifier in z-space and it leads to no compression: {αi = 1}_1^N in (13). A support vector machine uses a more "realistic" linear classifier in z-space, that can also be computed using only inner products, for which often many of the coefficients have the value zero (αi = 0). This classifier is the "optimal" separating hyperplane (OSH).

   We consider first the case in which the observations representing the respective two classes are linearly separable in z-space. This is often the case since the dimension M (14) of that (implicitly defined) space is very large. In this case the OSH is the unique hyperplane that separates the two classes while maximizing the distance to the closest points in each class. Only this set of closest points equidistant to the OSH is required to define it. These closest points are called the support points (vectors). Their number can range from a minimum of two to a maximum of the training sample size N. The "margin" is defined to be the distance of the support points from the OSH. The z-space linear classifier is given by

                 ŷ(z) = sign( β0∗ + Σ_{k=1}^M βk∗ zk )                              (19)

where (β0∗, β∗ = {βk∗}_1^M) define the OSH. Their values can be determined using standard quadratic programming techniques.

   An OSH can also be defined for the case when the two classes are not separable in z-space by allowing some points to be on the wrong side of their class margin. The amount by which they are allowed to do so is a regularization (smoothing) parameter of the procedure. In both the separable and non separable cases the solution parameter values (β0∗, β∗) (19) are defined only by points close to the boundary between the classes. The solution for β∗ can be expressed as

                 β∗ = Σ_{i=1}^N αi∗ yi zi

with αi∗ ≠ 0 only for points on, or on the wrong side of, their class margin. These are the support vectors. The SVM classifier is thereby

                 ŷ(z) = sign( β0∗ + Σ_{i=1}^N αi∗ yi z^T zi )
                      = sign( β0∗ + Σ_{αi∗≠0} αi∗ yi K(x, xi) ).

This is a weighted kernel classifier involving only support vectors. Also (not shown here), the quadratic program used to solve for the OSH involves the data only through the inner products zi^T zj = K(xi, xj). Thus, one only needs to specify the kernel function to implicitly define the z-variables (kernel trick).

   Besides the polynomial kernel (16), other popular kernels used with support vector machines are the "radial basis function" kernel

                 K(x, x′) = exp( −||x − x′||² / 2σ² ),                              (20)

and the "neural network" kernel

                 K(x, x′) = tanh( a x^T x′ + b ).                                   (21)

Note that both of these kernels (20) (21) involve additional tuning parameters, and produce infinite dimensional derived variable (14) spaces (M = ∞).
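
   As an aside not drawn from the text: the three kernels (16), (20) and (21) correspond to the 'poly', 'rbf' and 'sigmoid' options of scikit-learn's SVC, which solves the optimal separating hyperplane problem described above. The snippet below is a hedged illustration on synthetic data; the data set and all parameter values are invented.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200))

    for kernel, kwargs in [("poly",    {"degree": 3, "coef0": 1.0}),   # (x^T x' + 1)^d, eq. (16)
                           ("rbf",     {"gamma": 0.5}),                # radial basis function, eq. (20)
                           ("sigmoid", {"gamma": 0.5, "coef0": 0.0})]: # tanh(a x^T x' + b), eq. (21)
        clf = SVC(kernel=kernel, C=1.0, **kwargs).fit(X, y)
        print(kernel, int(clf.n_support_.sum()), "support vectors")
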

                     3.   Penalized learning formulation

   The support vector machine was motivated above by the optimal separating hyperplane in the high dimensional space of the derived variables (14). There is another equivalent formulation in that space that shows that the SVM procedure is related to other well known statistically based methods. The parameters of the OSH (19) are the solution to

                 (β0∗, β∗) = arg min_{β0, β}  Σ_{i=1}^N [1 − yi (β0 + β^T zi)]+  +  λ · ||β||².     (22)

Here the expression [η]+ represents the "positive part" of its argument; that is, [η]+ = max(0, η). The "regularization" parameter λ is related to the SVM smoothing parameter mentioned above. This expression (22) represents a penalized learning problem where the goal is to minimize the empirical risk on the training data using as a loss criterion

                 L(y, F(z)) = [1 − y F(z)]+,                                        (23)

where

                 F(z) = β0 + β^T z,

subject to an increasing penalty for larger values of

                 ||β||² = Σ_j βj².                                                  (24)
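
   A minimal way to see (22) in action is to minimize it directly by (sub)gradient descent, taking z = x for simplicity (i.e., the linear kernel). This sketch is illustrative only, not how SVMs are actually trained (the text describes a quadratic program); all names and step sizes are assumptions.

    import numpy as np

    def svm_penalized_fit(X, y, lam=0.1, lr=0.01, epochs=200):
        """Minimize (22): sum_i [1 - y_i (b0 + b.z_i)]_+  +  lam * ||b||^2, with z = x."""
        b0, b = 0.0, np.zeros(X.shape[1])
        for _ in range(epochs):
            margin = y * (b0 + X @ b)
            active = margin < 1                              # cases inside the hinge (23)
            grad_b  = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * b
            grad_b0 = -y[active].sum()
            b, b0 = b - lr * grad_b, b0 - lr * grad_b0
        return b0, b                                         # predict with sign(b0 + b.x)
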



This penalty (24) is well known and often used to regularize statistical procedures, for example linear least squares regression, leading to ridge-regression (Hoerl and Kennard 1970)

                 (β0∗, β∗) = arg min_{β0, β}  Σ_{i=1}^N [yi − (β0 + β^T zi)]²  +  λ · ||β||².       (25)

   The "hinge" loss criterion (23) is not familiar in statistics. However, it is closely related to one that is well known in that field, namely the conditional negative log-likelihood associated with logistic regression

                 L(y, F(z)) = log[1 + e^{−yF(z)}].                                  (26)

In fact, one can view the SVM hinge loss as a piecewise-linear approximation to (26). Unregularized logistic regression is one of the most popular methods in statistics for treating binary response outcomes (9). Thus, a support vector machine can be viewed as an approximation to regularized logistic regression (in z-space) using the ridge-regression penalty (24).

   This penalized learning formulation forms the basis for extending SVMs to the regression setting where the response variable y assumes numeric values y ∈ R¹, rather than binary values (9). One simply replaces the loss criterion (23) in (22) with

                 L(y, F(z)) = ( |y − F(z)| − ε )+.                                  (27)

This is called the "ε-insensitive" loss and can be viewed as a piecewise-linear approximation to the Huber 1964 loss

                 L(y, F(z)) =  |y − F(z)|²/2            for  |y − F(z)| ≤ ε
                               ε(|y − F(z)| − ε/2)      for  |y − F(z)| > ε          (28)

often used for robust regression in statistics. This loss (28) is a compromise between squared-error loss (25) and absolute-deviation loss L(y, F(z)) = |y − F(z)|. The value of the "transition" point ε differentiates the errors that are treated as "outliers", being subject to absolute-deviation loss, from the other (smaller) errors that are subject to squared-error loss.
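
   The loss criteria appearing in this section are all simple one-line functions, which makes their relationships easy to inspect numerically or to plot. The definitions below follow (23), (26), (27) and (28); the names and default ε are illustrative assumptions, and the plotting step is left to the reader.

    import numpy as np

    def hinge(y, F):                      # (23): SVM classification loss
        return np.maximum(0.0, 1.0 - y * F)

    def logistic(y, F):                   # (26): negative log-likelihood of logistic regression
        return np.log1p(np.exp(-y * F))

    def eps_insensitive(y, F, eps=0.5):   # (27): SVM regression loss
        return np.maximum(0.0, np.abs(y - F) - eps)

    def huber(y, F, eps=0.5):             # (28): quadratic for small errors, linear for large ones
        r = np.abs(y - F)
        return np.where(r <= eps, 0.5 * r ** 2, eps * (r - eps / 2.0))

    # e.g. hinge(+1, np.linspace(-2, 2, 9)) traces the piecewise-linear margin penalty.
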


                          4.   SVM properties

   Support vector machines inherit most of the advantages of ordinary kernel methods discussed in Section II. In addition, they can overcome the computation problems associated with prediction, since only the support vectors (αi ≠ 0 in (13)) are required for making predictions. If the number of support vectors is much smaller than the total sample size N, computation is correspondingly reduced. This will tend to be the case when there is small overlap between the respective distributions of the two classes in the space of the original predictor variables x (small Bayes error rate).

   The computational savings in prediction are bought by a dramatic increase in the computation required for training. Ordinary kernel methods (4) require no training; the data set is the model. The quadratic program for obtaining the optimal separating hyperplane (solving (22)) requires computation proportional to the square of the sample size (N²), multiplied by the number of resulting support vectors. There has been much research on fast algorithms for training SVMs, extending computational feasibility to data sets of size N ≃ 30,000 or so. However, they are still not feasible for really large data sets N ≳ 100,000.

   SVMs share some of the disadvantages of ordinary kernel methods. They are a black-box procedure with little interpretive value. Also, as with all kernel methods, performance can be very sensitive to the kernel (distance function) choice (5). For good performance the kernel needs to be matched to the properties of the target function F∗(x) (2), which are often unknown. However, when there is a known "natural" distance for the problem, SVMs represent very powerful learning machines.


                          B.   Boosted trees

   Boosting decision trees was first proposed by Freund and Schapire 1996. The basic idea is that rather than using just a single tree for prediction, a linear combination of (many) trees

                 F(x) = Σ_{m=1}^M am Tm(x)                                          (29)

is used instead. Here each Tm(x) is a decision tree of the type discussed in Section III and am is its coefficient in the linear combination. This approach maintains the (statistical) advantages of trees, while often dramatically increasing accuracy over that of a single tree.


                            1.   Training

   The recursive partitioning technique for constructing a single tree on the training data was discussed in Section III. Algorithm 1 describes a forward stagewise method for constructing a prediction machine based on a linear combination of M trees.


                              Algorithm 1
                       Forward stagewise boosting
        1   F0(x) = 0
        2   For m = 1 to M do:
        3       (am, Tm(x)) = arg min_{a, T(x)}
        4            Σ_{i=1}^N L(yi, Fm−1(xi) + a T(xi))
        5       Fm(x) = Fm−1(x) + am Tm(x)
        6   EndFor
        7   F(x) = FM(x) = Σ_{m=1}^M am Tm(x)

   The first line initializes the predicting function to everywhere have the value zero. Lines 2 and 6 control the M iterations of the operations associated with lines 3-5. At each iteration m there is a current predicting function Fm−1(x). At the first iteration this is the initial function F0(x) = 0, whereas for m > 1 it is the linear combination of the m − 1 trees induced at the previous iterations. Lines 3 and 4 construct the tree Tm(x), and find the corresponding coefficient am, that minimize the estimated prediction risk (1) on the training data when am Tm(x) is added to the current linear combination Fm−1(x). This is then added to the current approximation Fm−1(x) on line 5, producing a current predicting function Fm(x) for the next (m + 1)st iteration.

   At the first step, a1 T1(x) is just the standard tree built on the data as described in Section III, since the current function is F0(x) = 0. At the next step, the estimated optimal tree T2(x) is found to add to it with coefficient a2, producing the function F2(x) = a1 T1(x) + a2 T2(x). This process is continued for M steps, producing a predicting function consisting of a linear combination of M trees (line 7).

   The potentially difficult part of the algorithm is constructing the optimal tree to add at each step. This will depend on the chosen loss function L(y, F). For squared-error loss

                 L(y, F) = (y − F)²

the procedure is especially straightforward, since

                 L(y, Fm−1 + aT) = (y − Fm−1 − aT)²
                                 = (rm − aT)².

Here rm = y − Fm−1 is just the error ("residual") from the current model Fm−1 at the mth iteration. Thus each successive tree is built in the standard way to best predict the errors produced by the linear combination of the previous trees. This basic idea can be extended to produce boosting algorithms for any differentiable loss criterion L(y, F) (Friedman 2001).
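
   For squared-error loss the whole of Algorithm 1 therefore reduces to repeatedly fitting a tree to the current residuals. The Python sketch below is a hedged illustration of this special case, using scikit-learn's DecisionTreeRegressor as the tree primitive; since a regression tree fitted to the residuals already carries the optimal constant in each region, the coefficient am is absorbed into the fitted leaf values. The function names are assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_trees(X, y, M=100, J=6):
        """Algorithm 1 for squared-error loss: each tree is fit to r_m = y - F_{m-1}(x)."""
        F = np.zeros(len(y))                          # line 1: F_0(x) = 0
        trees = []
        for _ in range(M):                            # lines 2-6
            r = y - F                                 # residuals from the current model
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)
            F += tree.predict(X)                      # line 5 (with a_m absorbed)
            trees.append(tree)
        return trees

    def boost_predict(trees, X):
        return np.sum([t.predict(X) for t in trees], axis=0)    # line 7
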
   As originally proposed, the standard tree construction algorithm was treated as a primitive in the boosting algorithm, inserted in lines 3 and 4 to produce a tree that best predicts the current errors {rim = yi − Fm−1(xi)}_1^N. In particular, an optimal tree size was estimated at each step in the standard tree building manner. This basically assumes that each tree will be the last one in the sequence. Since boosting often involves hundreds of trees, this assumption is far from true and as a result accuracy suffers. A better strategy turns out to be (Friedman 2001) to use a constant tree size (J regions) at each iteration, where the value of J is taken to be small, but not too small. Typically 4 ≤ J ≤ 10 works well in the context of boosting, with performance being fairly insensitive to particular choices.


                            2.   Regularization

   Even if one restricts the size of the trees entering into a boosted tree model it is still possible to fit the training data arbitrarily well, reducing training error to zero, with a linear combination of enough trees. However, as is well known in statistics, this is seldom the best thing to do. Fitting the training data too well can increase prediction risk on future predictions. This is a phenomenon called "over-fitting". Since each tree tries to best fit the errors associated with the linear combination of previous trees, the training error monotonically decreases as more trees are included. This is, however, not the case for future prediction error on data not used for training.

   Typically at the beginning, future prediction error decreases with increasing number of trees M until at some point M∗ a minimum is reached. For M > M∗, future error tends to (more or less) monotonically increase as more trees are added. Thus there is an optimal number M∗ of trees to include in the linear combination. This number will depend on the problem (target function (2), training sample size N, and signal to noise ratio). Thus, in any given situation, the value of M∗ is unknown and must be estimated from the training data itself. This is most easily accomplished by the "early stopping" strategy used in neural network training. The training data is randomly partitioned into learning and test samples. The boosting is performed using only the data in the learning sample. As iterations proceed and trees are added, prediction risk as estimated on the test sample is monitored. At the point where a definite upward trend is detected, iterations stop and M∗ is estimated as the value of M producing the smallest prediction risk on the test sample.

   Inhibiting the ability of a learning machine to fit the training data so as to increase future performance is called a "method-of-regularization". It can be motivated from a frequentist perspective in terms of the "bias-variance trade-off" (Geman, Bienenstock and Doursat 1992) or by the Bayesian introduction of a prior distribution over the space of solution functions. In either case, controlling the number of trees is not the only way to regularize. Another method commonly used in statistics is "shrinkage". In the context of boosting, shrinkage can be accomplished by replacing

          F_m(x) = F_{m-1}(x) + (ν · a_m) T_m(x).          (30)

Here the contribution to the linear combination of the estimated best tree to add at each step is reduced by a factor 0 < ν ≤ 1. This “shrinkage” factor or “learning rate” parameter controls the rate at which adding trees reduces prediction risk on the learning sample; smaller values produce a slower rate, so that more trees are required to fit the learning data to the same degree.
   Shrinkage (30) was introduced in Friedman 2001 and shown empirically to dramatically improve the performance of all boosting methods. Smaller learning rates were seen to produce more improvement, with a diminishing return for ν ≲ 0.1, provided that the estimated optimal number of trees M*(ν) for that value of ν is used. This number increases with decreasing learning rate, so that the price paid for better performance is increased computation.
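To make the mechanics concrete, the following is a minimal sketch of this procedure for squared-error loss: each tree is fit to the current residuals (lines 3 and 4 of Algorithm 1), the update is the shrunken step of (30), and a held-out test sample is used to estimate M*. The use of scikit-learn's DecisionTreeRegressor as the tree primitive, and all names and constants, are illustrative choices rather than part of the original formulation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_squared_error(X_learn, y_learn, X_test, y_test,
                        n_trees=500, nu=0.1, n_regions=6):
    """Forward stagewise boosting of small regression trees.

    Lines 3-4 of Algorithm 1: fit a J-region tree to the current
    residuals r_m = y - F_{m-1}(x).  Line 5 with shrinkage (30):
    F_m(x) = F_{m-1}(x) + nu * T_m(x), the fitted tree absorbing the
    coefficient a_m.  The test sample is used only to monitor
    prediction risk and estimate M*.
    """
    F_learn = np.zeros(len(y_learn))
    F_test = np.zeros(len(y_test))
    trees, test_risk = [], []
    for _ in range(n_trees):
        residual = y_learn - F_learn                       # r_m = y - F_{m-1}
        tree = DecisionTreeRegressor(max_leaf_nodes=n_regions).fit(X_learn, residual)
        F_learn += nu * tree.predict(X_learn)              # shrunken update (30)
        F_test += nu * tree.predict(X_test)
        trees.append(tree)
        test_risk.append(np.mean((y_test - F_test) ** 2))  # monitored risk
    M_star = int(np.argmin(test_risk)) + 1                 # early stopping
    return trees[:M_star], test_risk

# toy data: an 80/20 split into learning and test samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=2000)
trees, risk = boost_squared_error(X[:1600], y[:1600], X[1600:], y[1600:])
print("estimated M* =", len(trees), " minimum test risk =", round(min(risk), 4))
```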

              3.   Penalized learning formulation

   The introduction of shrinkage (30) in boosting was originally justified purely on empirical evidence and the reason for its success was a mystery. Recently, this mystery has been solved (Hastie, Tibshirani and Friedman 2001 and Efron, Hastie, Johnstone and Tibshirani 2002). Consider a learning machine consisting of a linear combination of all possible (J–region) trees:

          F̂(x) = Σ_m â_m T_m(x)          (31)

where

          {â_m} = arg min_{{a_m}} Σ_{i=1}^N L(y_i, Σ_m a_m T_m(x_i)) + λ · P({a_m}).          (32)

This is a penalized (regularized) linear regression, based on a chosen loss criterion L, of the response values {y_i}_1^N on the predictors (trees) {T_m(x_i)}_{i=1}^N. The first term in (32) is the prediction risk on the training data and the second is a penalty on the values of the coefficients {a_m}. This penalty is required to regularize the solution because the number of all possible J–region trees is infinite. The value of the “regularization” parameter λ controls the strength of the penalty. Its value is chosen to minimize an estimate of future prediction risk, for example based on a left out test sample.
   A commonly used penalty for regularization in statistics is the “ridge” penalty

          P({a_m}) = Σ_m a_m^2          (33)

used in ridge–regression (25) and support vector machines (22). This encourages small coefficient absolute values by penalizing the l2–norm of the coefficient vector. Another penalty becoming increasingly popular is the “lasso” (Tibshirani 1996)

          P({a_m}) = Σ_m |a_m|.          (34)

This also encourages small coefficient absolute values, but by penalizing the l1–norm. Both (33) and (34) increasingly penalize larger average absolute coefficient values. They differ in how they react to dispersion or variation of the absolute coefficient values. The ridge penalty discourages dispersion by penalizing variation in absolute values. It thus tends to produce solutions in which coefficients tend to have equal absolute values and none with the value zero. The lasso (34) is indifferent to dispersion and tends to produce solutions with a much larger variation in the absolute values of the coefficients, with many of them set to zero. The best penalty will depend on the (unknown population) optimal coefficient values. If these have more or less equal absolute values the ridge penalty (33) will produce better performance. On the other hand, if their absolute values are highly diverse, especially with a few large values and many small values, the lasso will provide higher accuracy.
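The contrast can be seen numerically. The sketch below assumes scikit-learn's Ridge and Lasso estimators as stand-ins for penalties (33) and (34), on an invented data set with a few large coefficients and many zeros; the ridge solution shrinks every coefficient while the lasso sets most of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
n, p = 300, 12
X = rng.normal(size=(n, p))
# a few large coefficients, the rest exactly zero: the situation said
# to favour the lasso penalty (34)
beta = np.array([4.0, -3.0, 2.0] + [0.0] * (p - 3))
y = X @ beta + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)   # penalizes the l2-norm, as in (33)
lasso = Lasso(alpha=0.3).fit(X, y)    # penalizes the l1-norm, as in (34)

print("ridge coefficients:", np.round(ridge.coef_, 2))   # all shrunk, none exactly zero
print("lasso coefficients:", np.round(lasso.coef_, 2))   # most set exactly to zero
```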
   As discussed in Hastie, Tibshirani and Friedman 2001, and rigorously derived in Efron et al 2002, there is a connection between boosting (Algorithm 1) with shrinkage (30) and penalized linear regression on all possible trees (31) (32) using the lasso penalty (34). They produce very similar solutions as the shrinkage parameter becomes arbitrarily small, ν → 0. The number of trees M is inversely related to the penalty strength parameter λ; more boosted trees correspond to smaller values of λ (less regularization). Using early stopping to estimate the optimal number M* is equivalent to estimating the optimal value of the penalty strength parameter λ. Therefore, one can view the introduction of shrinkage (30) with a small learning rate ν ≲ 0.1 as approximating a learning machine based on all possible (J–region) trees with a lasso penalty for regularization. The lasso is especially appropriate in this context because among all possible trees only a small number will likely represent very good predictors with population optimal absolute coefficient values substantially different from zero. As noted above, this is an especially bad situation for the ridge penalty (33), but ideal for the lasso (34).
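This correspondence can be illustrated in a toy setting by replacing the space of all trees with a small set of linear predictors: incremental forward stagewise fitting with a very small step (the linear analogue of boosting with ν → 0), stopped early, lands close to a point on the lasso path. The construction below is only a sketch; the step size, stopping point and grid of penalty strengths are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 400, 8
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Incremental forward stagewise: at each step nudge the coefficient of
# the predictor most correlated with the current residual by a tiny
# amount eps (the linear analogue of Algorithm 1 with a small nu),
# stopping well before the fit reaches the least-squares solution.
eps, n_steps = 0.002, 1200
a = np.zeros(p)
for _ in range(n_steps):
    corr = X.T @ (y - X @ a)
    j = int(np.argmax(np.abs(corr)))
    a[j] += eps * np.sign(corr[j])

# Some point on the lasso path has nearly the same coefficients:
# search a grid of penalty strengths for the closest match.
alphas = np.logspace(-3, 0, 60)
closest = min((Lasso(alpha=al).fit(X, y) for al in alphas),
              key=lambda m: np.linalg.norm(m.coef_ - a))
print("stagewise coefficients:", np.round(a, 2))
print("closest lasso solution:", np.round(closest.coef_, 2))
```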



                 4.   Boosted tree properties

   Boosted trees maintain almost all of the advantages of single tree modeling described in Section III A while often dramatically increasing their accuracy. One of the properties of single tree models leading to inaccuracy is the coarse piecewise constant nature of the resulting approximation. Since boosted tree machines are linear combinations of individual trees, they produce a superposition of piecewise constant approximations. These are of course also piecewise constant, but with many more pieces. The corresponding discontinuous jumps are very much smaller and they are able to more accurately approximate smooth target functions.
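This effect is easy to see in one dimension. In the sketch below, scikit-learn's GradientBoostingRegressor stands in for a boosted tree machine and a single small tree for an ordinary decision tree; the smooth sine target and all settings are invented for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 500)).reshape(-1, 1)
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.normal(size=500)   # smooth target

single = DecisionTreeRegressor(max_leaf_nodes=8).fit(x, y)      # one coarse tree
boosted = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                    max_leaf_nodes=8).fit(x, y)  # sum of many such trees

grid = np.linspace(0, 1, 1000).reshape(-1, 1)
truth = np.sin(2 * np.pi * grid[:, 0])
print("single tree   mean squared error:",
      round(float(np.mean((single.predict(grid) - truth) ** 2)), 4))
print("boosted trees mean squared error:",
      round(float(np.mean((boosted.predict(grid) - truth) ** 2)), 4))
# The boosted prediction is still piecewise constant, but with many more,
# much smaller jumps, so it tracks the smooth target far more closely.
```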
   Boosting also dramatically reduces the instability associated with single tree models. First, only small trees (Section IV B 1) are used, which are inherently more stable than the generally larger trees associated with single tree approximations. However, the big increase in stability results from the averaging process associated with using the linear combination of a large number of trees. Averaging reduces variance; that is why it plays such a fundamental role in statistical estimation.
   Finally, boosting mitigates the fragmentation problem plaguing single tree models. Again only small trees are used, which fragment the data to a much lesser extent than large trees. Each boosting iteration uses the entire data set to build a small tree. Each respective tree can (if dictated by the data) involve different sets of predictor variables. Thus, each prediction can be influenced by a large number of predictor variables associated with all of the trees involved in the prediction if that is estimated to produce more accurate results.
   The computation associated with boosting trees roughly scales as nN log N with the number of predictor variables n and training sample size N. Thus, it can be applied to fairly large problems. For example, problems with n ∼ 10^2–10^3 and N ∼ 10^5–10^6 are routinely feasible.
   The one advantage of single decision trees not inherited by boosting is interpretability. It is not possible to inspect the very large number of individual tree components in order to discern the relationships between the response y and the predictors x. Thus, like support vector machines, boosted tree machines produce black–box models. Techniques for interpreting boosted trees as well as other black–box models are described in Friedman 2001.
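Friedman 2001 describes such techniques, including relative variable importances and partial dependence functions. The snippet below shows how summaries of that kind might be obtained with scikit-learn's implementations; it is an illustration of the idea, not the paper's own software.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(2000, 4))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=2000)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1).fit(X, y)

# Relative influence of each input variable on the fitted model
print("variable importances:", np.round(model.feature_importances_, 3))

# Partial dependence of the prediction on the first input variable,
# averaged over the joint values of the remaining inputs
pd = partial_dependence(model, X, features=[0], grid_resolution=20)
print("partial dependence on x0:", np.round(pd["average"][0], 2))
```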

                        C.   Connections

   The preceding section has reviewed two of the most important advances in machine learning in the recent past: support vector machines and boosted decision trees. Although motivated from very different perspectives, these two approaches share fundamental properties that may account for their respective success. These similarities are most readily apparent from their respective penalized learning formulations (Section IV A 3 and Section IV B 3). Both build linear models in a very high dimensional space of derived variables, each of which is a highly nonlinear function of the original predictor variables x. For support vector machines these derived variables (14) are implicitly defined through the chosen kernel K(x, x′) defining their inner product (15). With boosted trees these derived variables are all possible (J–region) decision trees (31) (32).
   The coefficients defining the respective linear models in the derived space for both methods are solutions to a penalized learning problem (22) (32) involving a loss criterion L(y, F) and a penalty on the coefficients P({a_m}). Support vector machines use L(y, F) = (1 − yF)_+ for classification y ∈ {−1, 1}, and (27) for regression y ∈ R^1. Boosting can be used with any (differentiable) loss criterion L(y, F) (Friedman 2001). The respective penalties P({a_m}) are (24) for SVMs and (34) with boosting. Additionally, both methods have a computational trick that allows all (implicit) calculations required to solve the learning problem in the very high (usually infinite) dimensional space of the derived variables z to be performed in the space of the original variables x. For support vector machines this is the kernel trick (Section IV A 1), whereas with boosting it is forward stagewise tree building (Algorithm 1) with shrinkage (30).
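For concreteness, the loss criteria mentioned above, together with the negative gradients that a boosting step actually fits, can be written out directly; a small sketch, with function names that are ours.

```python
import numpy as np

# Loss criteria L(y, F) referred to above, written out explicitly.
def hinge_loss(y, F):          # SVM classification loss, y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * F)

def squared_error(y, F):       # regression loss used in Section IV B
    return (y - F) ** 2

def absolute_error(y, F):      # an alternative regression loss for boosting
    return np.abs(y - F)

# Gradient boosting fits each new base learner to the negative gradient
# of the chosen loss; for squared error this is just the residual y - F.
def negative_gradient_squared(y, F):
    return y - F

def negative_gradient_absolute(y, F):
    return np.sign(y - F)
```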
   The two approaches do have some basic differences. These involve the particular derived variables defining the linear model in the high dimensional space, and the penalty P({a_m}) on the corresponding coefficients. The performance of any linear learning machine based on derived variables (14) will depend on the detailed nature of those variables. That is, different transformations {h_k(x)} will produce different learners as functions of the original variables x, and for any given problem some will be better than others. The prediction accuracy achieved by a particular set of transformations will depend on the (unknown) target function F*(x) (2). With support vector machines the transformations are implicitly defined through the chosen kernel function. Thus the problem of choosing transformations becomes, as with any kernel method, one of choosing a particular kernel function K(x, x′) (“kernel customizing”).
   Although motivated here for use with decision trees, boosting can in fact be implemented using any specified “base learner” h(x; p). This is a function of the predictor variables x characterized by a set of parameters p = {p_1, p_2, · · ·}. A particular set of joint parameter values p indexes a particular function (transformation) of x, and the set of all functions induced over all possible joint parameter values defines the derived variables of the linear prediction machine in the transformed space. If all of the parameters assume values on a finite discrete set this derived space will be finite dimensional; otherwise it will have infinite dimension. When the base learner is a decision tree the parameters represent the identities of the predictor variables used for splitting, the split points, and the response values assigned to the induced regions.


The forward stagewise approach can be used with any base learner by simply substituting it for the decision tree T(x) → h(x; p) in lines 3–5 of Algorithm 1. Thus boosting provides explicit control on the choice of transformations to the high dimensional space. So far boosting has seen greatest success with decision tree base learners, especially in data mining applications, owing to their advantages outlined in Section III A. However, boosting other base learners can provide potentially attractive alternatives in some situations.
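A minimal sketch of this substitution is given below: the forward stagewise loop accepts any base-learner factory in place of the tree primitive. The factory interface and the particular scikit-learn regressors used are our own illustrative choices, not part of the original algorithm statement.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

def stagewise_boost(X, y, make_base_learner, n_steps=200, nu=0.1):
    """Forward stagewise boosting (squared-error loss) for an arbitrary
    base learner h(x; p): each step fits a fresh learner to the current
    residuals and adds it with shrinkage nu, i.e. lines 3-5 of
    Algorithm 1 with T(x) replaced by h(x; p)."""
    F = np.zeros(len(y))
    learners = []
    for _ in range(n_steps):
        h = make_base_learner().fit(X, y - F)   # fit to current residuals
        F += nu * h.predict(X)
        learners.append(h)
    return learners

def predict(learners, X, nu=0.1):
    return nu * sum(h.predict(X) for h in learners)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(1000, 3))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=1000)

# decision stumps as the base learner ...
stump_model = stagewise_boost(X, y, lambda: DecisionTreeRegressor(max_depth=2))
# ... or any other regressor, for example a nearest-neighbour smoother
knn_model = stagewise_boost(X, y, lambda: KNeighborsRegressor(n_neighbors=50))
print("training risk (stumps):", round(float(np.mean((y - predict(stump_model, X)) ** 2)), 4))
print("training risk (k-NN):  ", round(float(np.mean((y - predict(knn_model, X)) ** 2)), 4))
```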
   Another difference between SVMs and boosting is the nature of the regularizing penalty P({a_m}) that they implicitly employ. Support vector machines use the “ridge” penalty (24). The effect of this penalty is to shrink the absolute values of the coefficients {β_m} from that of the unpenalized solution λ = 0 (22), while discouraging dispersion among those absolute values. That is, it prefers solutions in which the derived variables (14) all have similar influence on the resulting linear model. Boosting implicitly uses the “lasso” penalty (34). This also shrinks the coefficient absolute values, but it is indifferent to their dispersion. It tends to produce solutions with relatively few large absolute valued coefficients and many with zero value.
   If a very large number of the derived variables in the high dimensional space are all highly relevant for prediction then the ridge penalty used by SVMs will provide good results. This will be the case if the chosen kernel K(x, x′) is well matched to the unknown target function F*(x) (2). Kernels not well matched to the target function will (implicitly) produce transformations (14) many of which have little or no relevance to prediction. The homogenizing effect of the ridge penalty is to inflate estimates of their relevance while deflating that of the truly relevant ones, thereby reducing prediction accuracy. Thus, the sharp sensitivity of SVMs to the choice of a particular kernel can be traced to the implicit use of the ridge penalty (24).
   By implicitly employing the lasso penalty (34), boosting anticipates that only a small number of its derived variables are likely to be highly relevant to prediction. The regularization effect of this penalty tends to produce large coefficient absolute values for those derived variables that appear to be relevant and small (mostly zero) values for the others. This can sacrifice accuracy if the chosen base learner happens to provide an especially appropriate space of derived variables in which a large number turn out to be highly relevant. However, this approach provides considerable robustness against less than optimal choices for the base learner and thus the space of derived variables.

                        V.   CONCLUSION

   A choice between support vector machines and boosting depends on one’s a priori knowledge concerning the problem at hand. If that knowledge is sufficient to lead to the construction of an especially effective kernel function K(x, x′) then an SVM (or perhaps other kernel method) would be most appropriate. If that knowledge can suggest an especially effective base learner h(x; p) then boosting would likely produce superior results. As noted above, boosting tends to be more robust to misspecification. These two techniques represent additional tools to be considered along with other machine learning methods. The best tool for any particular application depends on the detailed nature of that problem. As with any endeavor one must match the tool to the problem. If little is known about which technique might be best in any given application, several can be tried and effectiveness judged on independent data not used to construct the respective learning machines under consideration.
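In practice such a comparison might look like the following sketch, which fits an SVM and a boosted tree classifier on one part of the data and judges both on an independent part; the simulated data set and all settings are placeholders rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 10))
y = np.where(X[:, 0] * X[:, 1] + 0.5 * X[:, 2] > 0, 1, -1)   # toy two-class problem

# data used to build the machines vs. independent data used to judge them
X_build, X_judge, y_build, y_judge = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidates = {
    "SVM (RBF kernel)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "boosted trees": GradientBoostingClassifier(n_estimators=300, learning_rate=0.1),
}
for name, model in candidates.items():
    model.fit(X_build, y_build)
    print(name, "error rate on independent data:",
          round(1.0 - model.score(X_judge, y_judge), 3))
```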

                     VI.   ACKNOWLEDGMENTS

   Helpful discussions with Trevor Hastie are gratefully acknowledged. This work was partially supported by the Department of Energy under contract DE-AC03-76SF00515, and by grant DMS–97–64431 of the National Science Foundation.




 [1] Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
 [2] Breiman, L. (1996). Bagging predictors. Machine Learning 26, 123-140.
 [3] Breiman, L. (2001). Random forests, random features. Technical Report, University of California, Berkeley.
 [4] Breiman, L., Friedman, J. H., Olshen, R. and Stone, C. (1983). Classification and Regression Trees. Wadsworth.
 [5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2002). Least angle regression. Annals of Statistics. To appear.
 [6] Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.
 [7] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29, 1189-1232.
 [8] Geman, S., Bienenstock, E. and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation 4, 1-58.
 [9] Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning. Springer-Verlag.
[10] Hoerl, A. E. and Kennard, R. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.
[11] Nadaraya, E. A. (1964). On estimating regression. Theory Prob. Appl. 10, 186-190.
[12] Quinlan, R. (1992). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
[13] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. 58, 267-288.
[14] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
[15] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
[16] Watson, G. S. (1964). Smooth regression analysis. Sankhya Ser. A 26, 359-372.




Mais conteúdo relacionado

Mais procurados

An introduction to neural network
An introduction to neural networkAn introduction to neural network
An introduction to neural networktuxette
 
Random Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application ExamplesRandom Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application ExamplesFörderverein Technische Fakultät
 
Random Forest for Big Data
Random Forest for Big DataRandom Forest for Big Data
Random Forest for Big Datatuxette
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3arogozhnikov
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavAgile Testing Alliance
 
About functional SIR
About functional SIRAbout functional SIR
About functional SIRtuxette
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task LearningMasahiro Suzuki
 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada BoostAndres Mendez-Vazquez
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Syed Atif Naseem
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster ValidityAndres Mendez-Vazquez
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkWei Wang
 
The Sample Average Approximation Method for Stochastic Programs with Integer ...
The Sample Average Approximation Method for Stochastic Programs with Integer ...The Sample Average Approximation Method for Stochastic Programs with Integer ...
The Sample Average Approximation Method for Stochastic Programs with Integer ...SSA KPI
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryRikiya Takahashi
 
Principle of Maximum Entropy
Principle of Maximum EntropyPrinciple of Maximum Entropy
Principle of Maximum EntropyJiawang Liu
 
Fcv hum mach_geman
Fcv hum mach_gemanFcv hum mach_geman
Fcv hum mach_gemanzukun
 
Habilitation à diriger des recherches
Habilitation à diriger des recherchesHabilitation à diriger des recherches
Habilitation à diriger des recherchesPierre Pudlo
 
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...NTNU
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...Jeongmin Cha
 

Mais procurados (20)

An introduction to neural network
An introduction to neural networkAn introduction to neural network
An introduction to neural network
 
Random Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application ExamplesRandom Matrix Theory in Array Signal Processing: Application Examples
Random Matrix Theory in Array Signal Processing: Application Examples
 
Random Forest for Big Data
Random Forest for Big DataRandom Forest for Big Data
Random Forest for Big Data
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
 
About functional SIR
About functional SIRAbout functional SIR
About functional SIR
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation Talk
 
The Sample Average Approximation Method for Stochastic Programs with Integer ...
The Sample Average Approximation Method for Stochastic Programs with Integer ...The Sample Average Approximation Method for Stochastic Programs with Integer ...
The Sample Average Approximation Method for Stochastic Programs with Integer ...
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game Theory
 
Principle of Maximum Entropy
Principle of Maximum EntropyPrinciple of Maximum Entropy
Principle of Maximum Entropy
 
Fcv hum mach_geman
Fcv hum mach_gemanFcv hum mach_geman
Fcv hum mach_geman
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
Habilitation à diriger des recherches
Habilitation à diriger des recherchesHabilitation à diriger des recherches
Habilitation à diriger des recherches
 
06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes
 
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
An Importance Sampling Approach to Integrate Expert Knowledge When Learning B...
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...
 

Destaque

Elegant Resume
Elegant ResumeElegant Resume
Elegant Resumebutest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
CHAP7.DOC.doc
CHAP7.DOC.docCHAP7.DOC.doc
CHAP7.DOC.docbutest
 
Artem Volftrub анатомия интернет банка
Artem Volftrub анатомия интернет банкаArtem Volftrub анатомия интернет банка
Artem Volftrub анатомия интернет банкаguest092df8
 

Destaque (6)

(ppt
(ppt(ppt
(ppt
 
Elegant Resume
Elegant ResumeElegant Resume
Elegant Resume
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
CHAP7.DOC.doc
CHAP7.DOC.docCHAP7.DOC.doc
CHAP7.DOC.doc
 
Artem Volftrub анатомия интернет банка
Artem Volftrub анатомия интернет банкаArtem Volftrub анатомия интернет банка
Artem Volftrub анатомия интернет банка
 

Semelhante a RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING

Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzerbutest
 
Introduction
IntroductionIntroduction
Introductionbutest
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machinesnextlib
 
Introduction to conventional machine learning techniques
Introduction to conventional machine learning techniquesIntroduction to conventional machine learning techniques
Introduction to conventional machine learning techniquesXavier Rafael Palou
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligencekeerthikaA8
 
Artificial intelligence.pptx
Artificial intelligence.pptxArtificial intelligence.pptx
Artificial intelligence.pptxkeerthikaA8
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligencekeerthikaA8
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithmijsrd.com
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...cscpconf
 
Machine learning and Neural Networks
Machine learning and Neural NetworksMachine learning and Neural Networks
Machine learning and Neural Networksbutest
 
Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Editor IJARCET
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revisedKrish_ver2
 
nonlinear_rmt.pdf
nonlinear_rmt.pdfnonlinear_rmt.pdf
nonlinear_rmt.pdfGieTe
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysisbutest
 

Semelhante a RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING (20)

Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
Introduction
IntroductionIntroduction
Introduction
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Introduction to conventional machine learning techniques
Introduction to conventional machine learning techniquesIntroduction to conventional machine learning techniques
Introduction to conventional machine learning techniques
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
Artificial intelligence.pptx
Artificial intelligence.pptxArtificial intelligence.pptx
Artificial intelligence.pptx
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
Classifiers
ClassifiersClassifiers
Classifiers
 
Improving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning AlgorithmImproving Performance of Back propagation Learning Algorithm
Improving Performance of Back propagation Learning Algorithm
 
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
QMC: Transition Workshop - Approximating Multivariate Functions When Function...QMC: Transition Workshop - Approximating Multivariate Functions When Function...
QMC: Transition Workshop - Approximating Multivariate Functions When Function...
 
Ajas11 alok
Ajas11 alokAjas11 alok
Ajas11 alok
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
 
Machine learning and Neural Networks
Machine learning and Neural NetworksMachine learning and Neural Networks
Machine learning and Neural Networks
 
Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582Ijarcet vol-2-issue-4-1579-1582
Ijarcet vol-2-issue-4-1579-1582
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
nonlinear_rmt.pdf
nonlinear_rmt.pdfnonlinear_rmt.pdf
nonlinear_rmt.pdf
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 
Machine Learning and Statistical Analysis
Machine Learning and Statistical AnalysisMachine Learning and Statistical Analysis
Machine Learning and Statistical Analysis
 

Mais de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mais de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING

  • 1. RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING Jerome H. Friedman Deptartment of Statisitcs and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305 (jhf@stanford.edu) Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In prediction (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning were originated many years ago at the dawn of the computer age. Recently two new techniques have emerged that have revitalized the field. These are support vector machines and boosted decision trees. This paper provides an introduction to these two new methods tracing their respective ancestral roots to standard kernel methods and ordinary decision trees. I. INTRODUCTION Given joint values for the input variables x, the opti- mal prediction for the output variable is y = F ∗ (x). ˆ The predictive or machine learning problem When the response takes on numeric values y ∈ is easy to state if difficult to solve in gen- R1 , the learning problem is called “regression” and eral. Given a set of measured values of at- commonly used loss functions include absolute error tributes/characteristics/properties on a object (obser- L(y, F ) = |y − F |, and even more commonly squared– vation) x = (x1 , x2 , ···, xn ) (often called “variables”) error L(y, F ) = (y − F )2 because algorithms for min- the goal is to predict (estimate) the unknown value imization of the corresponding risk tend to be much of another attribute y. The quantity y is called the simpler. In the “classification” problem the response “output” or “response” variable, and x = {x1 , ···, xn } takes on a discrete set of K unorderable categorical are referred to as the “input” or “predictor” variables. values (names or class labels) y, F ∈ {c1 , · · ·, cK } and The prediction takes the form of function the loss criterion Ly,F becomes a discrete K × K ma- trix. y = F (x1 , x2 , · · ·, xn ) = F (x) ˆ There are a variety of ways one can go about trying that maps a point x in the space of all joint values of to find a good predicting function F (x). One might the predictor variables, to a point y in the space of ˆ seek the opinions of domain experts, formally codified response values. The goal is to produce a “good” pre- in the “expert systems” approach of artificial intel- dictive F (x). This requires a definition for the quality, ligence. In predictive or machine learning one uses or lack of quality, of any particular F (x). The most data. A “training” data base commonly used measure of lack of quality is predic- tion “risk”. One defines a “loss” criterion that reflects D = {yi , xi1 , xi2 , · · ·, xin }1 = {yi , xi }N N 1 (3) the cost of mistakes: L(y, y ) is the loss or cost of pre- ˆ dicting a value y for the response when its true value ˆ of N previously solved cases is presumed to exist for is y. The prediction risk is defined as the average loss which the values of all variables (response and predic- over all predictions tors) have been jointly measured. A “learning” pro- cedure is applied to these data in order to extract R(F ) = Eyx L(y, F (x)) (1) (estimate) a good predicting function F (x). There where the average (expected value) is over the joint are a great many commonly used learning procedures. 
(population) distribution of all of the variables (y, x) These include linear/logistic regression, neural net- which is represented by a probability density function works, kernel methods, decision trees, multivariate p(y, x). Thus, the goal is to find a mapping function splines (MARS), etc. For descriptions of a large num- F (x) with low predictive risk. ber of such learning procedures see Hastie, Tibshirani Given a function f of elements w in some set, the and Friedman 2001. choice of w that gives the smallest value of f (w) is Most machine learning procedures have been called arg minw f (w). This definition applies to all around for a long time and most research in the field types of sets including numbers, vectors, colors, or has concentrated on producing refinements to these functions. In terms of this notation the optimal pre- long standing methods. However, in the past several dictor with lowest predictive risk (called the “target years there has been a revolution in the field inspired function”) is given by by the introduction of two new approaches: the ex- tension of kernel methods to support vector machines F ∗ = arg min R(F ). (2) (Vapnik 1995), and the extension of decision trees by F
  • 2. 2 PHYSTAT2003, SLAC, Stanford, Calfornia, September 8-11, 2003 boosting (Freund and Schapire 1996, Friedman 2001). (4) (5) approaches the optimal predicting target func- It is the purpose of this paper to provide an introduc- tion (2), FN (x) → F ∗ (x), provided the value chosen tion to these new developments. First the classic ker- for the scale parameter σ as a function of N ap- nel and decision tree methods are introduced. Then proaches zero, σ(N ) → 0, at a slower rate than 1/N . the extension of kernels to support vector machines is This result holds for almost any distance function described, followed by a description of applying boost- d(x, x ); only very mild restrictions (such as convex- ing to extend decision tree methods. Finally, similari- ity) are required. Another advantage of kernel meth- ties and differences between these two approaches will ods is that no training is required to build a model; the be discussed. training data set is the model. Also, the procedure is Although arguably the most influential recent de- conceptually quite simple and easily explained. velopments, support vector machines and boosting are Kernel methods suffer from some disadvantages not the only important advances in machine learning that have kept them from becoming highly used in in the past several years. Owing to space limitations practice, especially in data mining applications. Since these are the ones discussed here. There have been there is no model, they provide no easily understood other important developments that have considerably model summary. Thus, they cannot be easily inter- advanced the field as well. These include (but are not preted. There is no way to discern how the function limited to) the bagging and random forest techniques FN (x) (4) depends on the respective predictor vari- of Breiman 1996 and 2001 that are somewhat related ables x. Kernel methods produce a “black–box” pre- to boosting, and the reproducing kernel Hilbert space diction machine. In order to make each prediction, the methods of Wahba 1990 that share similarities with kernel method needs to examine the entire data base. support vector machines. It is hoped that this article This requires enough random access memory to store will inspire the reader to investigate these as well as the entire data set, and the computation required to other machine learning procedures. make each prediction is proportional to the training sample size N . For large data sets this is much slower than that for competing methods. II. KERNEL METHODS Perhaps the most serious limitation of kernel meth- ods is statistical. For any finite N , performance Kernel methods for predictive learning were intro- (prediction accuracy) depends critically on the cho- duced by Nadaraya (1964) and Watson (1964). Given sen distance function d(x, x ), especially for regression the training data (3), the response estimate y for a set ˆ y ∈ R1 . When there are more than a few predictor of joint values x is taken to be a weighted average of variables, even the largest data sets produce a very the training responses {yi }N : 1 sparse sampling in the corresponding n–dimensional N N predictor variable space. This is a consequence of the so called “curse–of–dimensionality” (Bellman 1962). y = FN (x) = ˆ yi K(x, xi ) K(x, xi ). 
(4) In order for kernel methods to perform well, the dis- i=1 i=1 tance function must be carefully matched to the (un- The weight K(x, xi ) assigned to each response value known) target function (2), and the procedure is not yi depends on its location xi in the predictor variable very robust to mismatches. space and the location x where the prediction is to be As an example, consider the often used Euclidean made. The function K(x, x ) defining the respective distance function weights is called the “kernel function”, and it defines 1/2 the kernel method. Often the form of the kernel func-  n tion is taken to be d(x, x ) =  (xj − xj )2  . (6) j=1 K(x, x ) = g(d(x, x )/σ) (5) where d(x, x ) is a defined “distance” between x and If the target function F ∗ (x) dominately depends on x , σ is a scale (“smoothing”) parameter, and g(z) only a small subset of the predictor variables, then is a (usually monotone) decreasing function with in- performance will be poor because the kernel function creasing z; often g(z) = exp(−z 2 /2). Using this kernel (5) (6) depends on all of the predictors with equal (5), the estimate y (4) is a weighted average of {yi }N , ˆ 1 strength. If one happened to know which variables with more weight given to observations i for which were the important ones, an appropriate kernel could d(x, xi ) is small. The value of σ defines “small”. The be constructed. However, this knowledge is often not distance function d(x, x ) must be specified for each available. Such “kernel customizing” is a requirement particular application. with kernel methods, but it is difficult to do without Kernel methods have several advantages that make considerable a priori knowledge concerning the prob- them potentially attractive. They represent a univer- lem at hand. sal approximator; as the training sample size N be- The performance of kernel methods tends to be comes arbitrarily large, N → ∞, the kernel estimate fairly insensitive to the detailed choice of the function WEAT003
  • 3. PHYSTAT2003, SLAC, Stanford, Calfornia, September 8-11, 2003 3 g(z) (5), but somewhat more sensitive to the value region all have the same response value y. At this chosen for the smoothing parameter σ. A good value point a recursive recombination strategy (“tree prun- depends on the (usually unknown) smoothness prop- ing”) is employed in which sibling regions are in turn erties of the target function F ∗ (x), as well as the sam- merged in a bottom–up manner until the number of ple size N and the signal/noise ratio. regions J ∗ that minimizes an estimate of future pre- diction risk is reached (see Breiman et al 1983, Ch. 3). III. DECISION TREES Decision trees were developed largely in response to A. Decision tree properties the limitations of kernel methods. Detailed descrip- tions are contained in monographs by Brieman, Fried- Decision trees are the most popular predictive learn- man, Olshen and Stone 1983, and by Quinlan 1992. ing method used in data mining. There are a number The minimal description provided here is intended as of reasons for this. As with kernel methods, deci- an introduction sufficient for understanding what fol- sion trees represent a universal method. As the train- lows. ing data set becomes arbitrarily large, N → ∞, tree A decision tree partitions the space of all joint pre- based predictions (7) (8) approach those of the target dictor variable values x into J–disjoint regions {Rj }J . 1 function (2), TJ (x) → F ∗ (x), provided the number A response value yj is assigned to each corresponding ˆ of regions grows arbitrarily large, J(N ) → ∞, but at region Rj . For a given set of joint predictor values x, rate slower than N . the tree prediction y = TJ (x) assigns as the response ˆ In contrast to kernel methods, decision trees do pro- estimate, the value assigned to the region containing duce a model summary. It takes the form of a binary x tree graph. The root node of the tree represents the entire predictor variable space, and the (first) split x ∈ Rj ⇒ TJ (x) =ˆj . y (7) into its daughter regions. Edges connect the root to Given a set of regions, the optimal response values two descendent nodes below it, representing these two associated with each one are easily obtained, namely daughter regions and their respective splits, and so on. the value that minimizes prediction risk in that region Each internal node of the tree represents an interme- diate region and its optimal split, defined by a one of yj = arg min Ey [L(y, y ) | x ∈ Rj ]. ˆ (8) the predictor variables xj and a split point s. The ter- y minal nodes represent the final region set {Rj }J used 1 The difficult problem is to find a good set of regions for prediction (7). It is this binary tree graphic that is {Rj }J . There are a huge number of ways to parti- 1 most responsible for the popularity of decision trees. tion the predictor variable space, the vast majority No matter how high the dimensionality of the predic- of which would provide poor predictive performance. tor variable space, or how many variables are actually In the context of decision trees, choice of a particu- used for prediction (splits), the entire model can be lar partition directly corresponds to choice of a dis- represented by this two–dimensional graphic, which tance function d(x, x ) and scale parameter σ in ker- can be plotted and then examined for interpretation. nel methods. 
Unlike with kernel methods where this For examples of interpreting binary tree representa- choice is the responsibility of the user, decision trees tions see Breiman et al 1983 and Hastie, Tibshirani attempt to use the data to estimate a good partition. and Friedman 2001. Unfortunately, finding the optimal partition re- Tree based models have other advantages as well quires computation that grows exponentially with the that account for their popularity. Training (tree build- number of regions J, so that this is only possible for ing) is relatively fast, scaling as nN log N with the very small values of J. All tree based methods use number of variables n and training observations N . a greedy top–down recursive partitioning strategy to Subsequent prediction is extremely fast, scaling as induce a good set of regions given the training data log J with the number of regions J. The predictor set (3). One starts with a single region covering the variables need not all be numeric valued. Trees can entire space of all joint predictor variable values. This seamlessly accommodate binary and categorical vari- is partitioned into two regions by choosing an optimal ables. They also have a very elegant way of dealing splitting predictor variable xj and a corresponding op- with missing variable values in both the training data timal split point s. Points x for which xj ≤ s are and future observations to be predicted (see Breiman defined to be in the left daughter region, and those et al 1983, Ch. 5.3). for which xj > s comprise the right daughter region. One property that sets tree based models apart Each of these two daughter regions is then itself op- from all other techniques is their invariance to mono- timally partitioned into two daughters of its own in tone transformations of the predictor variables. Re- the same manner, and so on. This recursive parti- placing any subset of the predictor variables {xj } by tioning continues until the observations within each (possibly different) arbitrary strictly monotone func- WEAT003
  • 4. 4 PHYSTAT2003, SLAC, Stanford, Calfornia, September 8-11, 2003 tions of them { xj ← mj (xj )}, gives rise to the same IV. RECENT ADVANCES tree model. Thus, there is no issue of having to exper- iment with different possible transformations mj (xj ) Both kernel methods and decision trees have been for each individual predictor xj , to try to find the around for a long time. Trees have seen active use, best ones. This invariance provides immunity to the especially in data mining applications. The clas- presence of extreme values “outliers” in the predictor sic kernel approach has seen somewhat less use. As variable space. It also provides invariance to chang- discussed above, both methodologies have (different) ing the measurement scales of the predictor variables, advantages and disadvantages. Recently, these two something to which kernel methods can be very sen- technologies have been completely revitalized in dif- sitive. ferent ways by addressing different aspects of their corresponding weaknesses; support vector machines Another advantage of trees over kernel methods (Vapnik 1995) address the computational problems of is fairly high resistance to irrelevant predictor vari- kernel methods, and boosting (Freund and Schapire ables. As discussed in Section II, the presence of many 1996, Friedman 2001) improves the accuracy of deci- such irrelevant variables can highly degrade the per- sion trees. formance of kernel methods based on generic kernels that involve all of the predictor variables such as (6). Since the recursive tree building algorithm estimates A. Support vector machines (SVM) the optimal variable on which to split at each step, predictors unrelated to the response tend not to be chosen for splitting. This is a consequence of attempt- A principal goal of the SVM approach is to fix the ing to find a good partition based on the data. Also, computational problem of predicting with kernels (4). trees have few tunable parameters so they can be used As discussed in Section II, in order to make a kernel as an “off–the–shelf” procedure. prediction a pass over the entire training data base is required. For large data sets this can be too time The principal limitation of tree based methods consuming and it requires that the entire data base is that in situations not especially advantageous to be stored in random access memory. them, their performance tends not to be competitive Support vector machines were introduced for the with other methods that might be used in those situa- two–class classification problem. Here the response tions. One problem limiting accuracy is the piecewise– variable realizes only two values (class labels) which constant nature of the predicting model. The predic- can be respectively encoded as tions yj (8) are constant within each region Rj and ˆ sharply discontinuous across region boundaries. This +1 label = class 1 y= . (9) is purely an artifact of the model, and target functions −1 label = class 2 F ∗ (x) (2) occurring in practice are not likely to share this property. Another problem with trees is instabil- The average or expected value of y given a set of joint ity. Changing the values of just a few observations can predictor variable values x is dramatically change the structure of the tree, and sub- stantially change its predictions. This leads to high E [y | x] = 2 · Pr(y = +1 | x) − 1. 
IV. RECENT ADVANCES

Both kernel methods and decision trees have been around for a long time. Trees have seen active use, especially in data mining applications. The classic kernel approach has seen somewhat less use. As discussed above, both methodologies have (different) advantages and disadvantages. Recently, these two technologies have been completely revitalized in different ways by addressing different aspects of their corresponding weaknesses: support vector machines (Vapnik 1995) address the computational problems of kernel methods, and boosting (Freund and Schapire 1996, Friedman 2001) improves the accuracy of decision trees.

A. Support vector machines (SVM)

A principal goal of the SVM approach is to fix the computational problem of predicting with kernels (4). As discussed in Section II, in order to make a kernel prediction a pass over the entire training data base is required. For large data sets this can be too time consuming, and it requires that the entire data base be stored in random access memory.

Support vector machines were introduced for the two–class classification problem. Here the response variable realizes only two values (class labels), which can be respectively encoded as

$$ y = \begin{cases} +1 & \text{label = class 1} \\ -1 & \text{label = class 2} \end{cases} \qquad (9) $$

The average or expected value of y given a set of joint predictor variable values x is

$$ E[y \mid x] = 2 \cdot \Pr(y = +1 \mid x) - 1. \qquad (10) $$

Prediction error rate is minimized by predicting at x the class with the highest probability, so that the optimal prediction is given by

$$ y^*(x) = \operatorname{sign}\big( E[y \mid x] \big). $$

From (4) the kernel estimate of (10) based on the training data (3) is given by

$$ \hat{E}[y \mid x] = F_N(x) = \sum_{i=1}^{N} y_i\, K(x, x_i) \Big/ \sum_{i=1}^{N} K(x, x_i), \qquad (11) $$

and, assuming a strictly non negative kernel K(x, xi), the prediction estimate is

$$ \hat{y}(x) = \operatorname{sign}\big( \hat{E}[y \mid x] \big) = \operatorname{sign}\Big( \sum_{i=1}^{N} y_i\, K(x, x_i) \Big). \qquad (12) $$
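A direct implementation of the kernel classification rule (11)–(12) is only a few lines. The sketch below is a bare illustration of those formulas; the Gaussian kernel, its width, and the toy data are choices made here rather than details taken from the text. It also makes the computational cost visible: every prediction touches all N training points.

```python
import numpy as np

def gaussian_kernel(x, X_train, sigma=1.0):
    return np.exp(-np.sum((x - X_train) ** 2, axis=-1) / (2.0 * sigma ** 2))

def kernel_classify(x, X_train, y_train, sigma=1.0):
    """Eq. (11)-(12): sign of the kernel-weighted average of the +/-1 labels."""
    w = gaussian_kernel(x, X_train, sigma)          # one weight per training point
    e_hat = np.sum(w * y_train) / np.sum(w)         # estimate of E[y|x], eq. (11)
    return np.sign(e_hat)                           # eq. (12)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular class boundary
test_point = np.array([0.1, 0.2])
print(kernel_classify(test_point, X, y))                 # expected: -1 (inside the circle)
```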
Note that ignoring the denominator in (11) to obtain (12) removes information concerning the absolute value of Pr(y = +1 | x); only the estimated sign of (10) is retained for classification.

A support vector machine is a weighted kernel classifier

$$ \hat{y}(x) = \operatorname{sign}\Big( a_0 + \sum_{i=1}^{N} \alpha_i\, y_i\, K(x, x_i) \Big). \qquad (13) $$

Each training observation (yi, xi) has an associated coefficient αi additionally used with the kernel K(x, xi) to evaluate the weighted sum (13) comprising the kernel estimate ŷ(x). The goal is to choose a set of coefficient values $\{\alpha_i\}_1^N$ so that many αi = 0 while still maintaining prediction accuracy. The observations associated with non zero valued coefficients, $\{x_i \mid \alpha_i \ne 0\}$, are called "support vectors". Clearly from (13) only the support vectors are required to do prediction. If the number of support vectors is a small fraction of the total number of observations, the computation required for prediction is thereby much reduced.

1. Kernel trick

In order to see how to accomplish this goal, consider a different formulation. Suppose that instead of using the original measured variables x = (x1, ···, xn) as the basis for prediction, one constructs a very large number of (nonlinear) functions of them,

$$ \{ z_k = h_k(x) \}_1^M, \qquad (14) $$

for use in prediction. Here each hk(x) is a different function (transformation) of x. For any given x, $z = \{z_k\}_1^M$ represents a point in an M–dimensional space where M >> dim(x) = n. Thus, the number of "variables" used for classification is dramatically expanded. The procedure constructs a simple linear classifier in z–space,

$$ \hat{y}(z) = \operatorname{sign}\Big( \beta_0 + \sum_{k=1}^{M} \beta_k z_k \Big) = \operatorname{sign}\Big( \beta_0 + \sum_{k=1}^{M} \beta_k h_k(x) \Big). $$

This is a highly non–linear classifier in x–space owing to the nonlinearity of the derived transformations $\{h_k(x)\}_1^M$.

An important ingredient for calculating such a linear classifier is the inner product between the points
representing two observations i and j,

$$ z_i^T z_j = \sum_{k=1}^{M} z_{ik} z_{jk} = \sum_{k=1}^{M} h_k(x_i)\, h_k(x_j) = H(x_i, x_j). \qquad (15) $$

This (highly nonlinear) function of the x–variables, H(xi, xj), defines the simple bilinear inner product $z_i^T z_j$ in z–space.

Suppose, for example, the derived variables (14) were taken to be all d–degree polynomials in the original predictor variables $\{x_j\}_1^n$. That is, $z_k = x_{i_1(k)}\, x_{i_2(k)} \cdots x_{i_d(k)}$, with k labeling each of the $M = (n+1)^d$ possible sets of d integers, $0 \le i_j(k) \le n$, and with the added convention that $x_0 = 1$ even though $x_0$ is not really a component of the vector x. In this case the number of derived variables is $M = (n+1)^d$, which is the order of computation for obtaining $z_i^T z_j$ directly from the z variables. However, using

$$ z_i^T z_j = H(x_i, x_j) = (x_i^T x_j + 1)^d \qquad (16) $$

reduces the computation to order n, the much smaller number of originally measured variables. Thus, if for any particular set of derived variables (14) the function H(xi, xj) that defines the corresponding inner products $z_i^T z_j$ in terms of the original x–variables can be found, computation can be considerably reduced.

As an example of a very simple linear classifier in z–space, consider one based on nearest–means,

$$ \hat{y}(z) = \operatorname{sign}\big( \|z - \bar{z}_-\|^2 - \|z - \bar{z}_+\|^2 \big). \qquad (17) $$

Here $\bar{z}_\pm$ are the respective means of the y = +1 and y = −1 observations,

$$ \bar{z}_\pm = \frac{1}{N_\pm} \sum_{y_i = \pm 1} z_i. $$

For simplicity, let $N_+ = N_- = N/2$. Choosing the midpoint between $\bar{z}_+$ and $\bar{z}_-$ as the coordinate system origin, the decision rule (17) can be expressed as

$$ \hat{y}(z) = \operatorname{sign}\big( z^T (\bar{z}_+ - \bar{z}_-) \big) = \operatorname{sign}\Big( \sum_{y_i = 1} z^T z_i - \sum_{y_i = -1} z^T z_i \Big) = \operatorname{sign}\Big( \sum_{i=1}^{N} y_i\, z^T z_i \Big) = \operatorname{sign}\Big( \sum_{i=1}^{N} y_i\, H(x, x_i) \Big). \qquad (18) $$
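The identity (16) is easy to verify numerically. The following sketch is my own illustration, with the degree d = 2 and the random vectors chosen arbitrarily: it builds the explicit degree-d monomial features with the $x_0 = 1$ convention and checks that their inner product matches $(x_i^T x_j + 1)^d$.

```python
import numpy as np
from itertools import product

def poly_features(x, d):
    """All degree-d monomials x_{i1}*...*x_{id}, 0 <= i_j <= n, with x_0 = 1.
    This is the explicit (n+1)^d dimensional z-vector of eq. (14)."""
    x_ext = np.concatenate(([1.0], x))                     # prepend the x_0 = 1 convention
    return np.array([np.prod([x_ext[i] for i in idx])
                     for idx in product(range(len(x_ext)), repeat=d)])

rng = np.random.default_rng(3)
xi, xj = rng.normal(size=5), rng.normal(size=5)
d = 2

explicit = poly_features(xi, d) @ poly_features(xj, d)     # O((n+1)^d) work in z-space
kernel   = (xi @ xj + 1.0) ** d                            # O(n) work, eq. (16)
print(np.isclose(explicit, kernel))                        # expected: True
```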
Comparing this (18) with (12) and (13), one sees that the ordinary kernel rule ($\{\alpha_i = 1\}_1^N$) in x–space is the nearest–means classifier in the z–space of derived variables (14) whose inner product is given by the kernel function $z_i^T z_j = K(x_i, x_j)$. Therefore, to construct an (implicit) nearest–means classifier in z–space, all computations can be done in x–space because they only depend on evaluating inner products. The explicit transformations (14) need never be defined or even known.

2. Optimal separating hyperplane

Nearest–means is an especially simple linear classifier in z–space and it leads to no compression: $\{\alpha_i = 1\}_1^N$ in (13). A support vector machine uses a more "realistic" linear classifier in z–space, one that can also be computed using only inner products, for which often many of the coefficients have the value zero (αi = 0). This classifier is the "optimal" separating hyperplane (OSH).

We consider first the case in which the observations representing the respective two classes are linearly separable in z–space. This is often the case, since the dimension M (14) of that (implicitly defined) space is very large. In this case the OSH is the unique hyperplane that separates the two classes while maximizing the distance to the closest points in each class. Only this set of closest points equidistant to the OSH is required to define it. These closest points are called the support points (vectors). Their number can range from a minimum of two to a maximum of the training sample size N. The "margin" is defined to be the distance of the support points from the OSH. The z–space linear classifier is given by

$$ \hat{y}(z) = \operatorname{sign}\Big( \beta_0^* + \sum_{k=1}^{M} \beta_k^* z_k \Big), \qquad (19) $$

where $(\beta_0^*, \beta^* = \{\beta_k^*\}_1^M)$ define the OSH. Their values can be determined using standard quadratic programming techniques.

An OSH can also be defined for the case when the two classes are not separable in z–space by allowing some points to be on the wrong side of their class margin. The amount by which they are allowed to do so is a regularization (smoothing) parameter of the procedure. In both the separable and non separable cases the solution parameter values $(\beta_0^*, \beta^*)$ (19) are defined only by points close to the boundary between the classes. The solution for β* can be expressed as

$$ \beta^* = \sum_{i=1}^{N} \alpha_i^*\, y_i\, z_i, $$

with $\alpha_i^* \ne 0$ only for points on, or on the wrong side of, their class margin. These are the support vectors.
The SVM classifier is thereby

$$ \hat{y}(z) = \operatorname{sign}\Big( \beta_0^* + \sum_{i=1}^{N} \alpha_i^*\, y_i\, z^T z_i \Big) = \operatorname{sign}\Big( \beta_0^* + \sum_{\alpha_i^* \ne 0} \alpha_i^*\, y_i\, K(x, x_i) \Big). $$

This is a weighted kernel classifier involving only support vectors. Also (not shown here), the quadratic program used to solve for the OSH involves the data only through the inner products $z_i^T z_j = K(x_i, x_j)$. Thus, one only needs to specify the kernel function to implicitly define the z–variables (kernel trick).

Besides the polynomial kernel (16), other popular kernels used with support vector machines are the "radial basis function" kernel

$$ K(x, x') = \exp\big( -\|x - x'\|^2 / 2\sigma^2 \big), \qquad (20) $$

and the "neural network" kernel

$$ K(x, x') = \tanh\big( a\, x^T x' + b \big). \qquad (21) $$

Note that both of these kernels (20) (21) involve additional tuning parameters, and produce infinite dimensional derived variable (14) spaces (M = ∞).
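For orientation, the sketch below fits a kernel SVM with the radial basis function kernel (20) using scikit-learn, which is an assumption of this illustration rather than a tool discussed in the text, and reports how many of the N training observations end up as support vectors; only these enter the prediction sum.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular class boundary again

# C plays the role of the margin-violation (regularization) parameter,
# gamma corresponds to 1/(2*sigma^2) in the kernel (20).
svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

print("training points:", len(X))
print("support vectors:", svm.support_vectors_.shape[0])
print("prediction at origin:", svm.predict([[0.0, 0.0]])[0])   # expected: -1
```

With well separated classes the support set is typically a small fraction of the data, which is exactly the compression described above.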
3. Penalized learning formulation

The support vector machine was motivated above by the optimal separating hyperplane in the high dimensional space of the derived variables (14). There is another, equivalent formulation in that space which shows that the SVM procedure is related to other well known statistically based methods. The parameters of the OSH (19) are the solution to

$$ (\beta_0^*, \beta^*) = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} \big[ 1 - y_i(\beta_0 + \beta^T z_i) \big]_+ + \lambda \cdot \|\beta\|^2. \qquad (22) $$

Here the expression $[\eta]_+$ represents the "positive part" of its argument; that is, $[\eta]_+ = \max(0, \eta)$. The "regularization" parameter λ is related to the SVM smoothing parameter mentioned above. This expression (22) represents a penalized learning problem where the goal is to minimize the empirical risk on the training data using as a loss criterion

$$ L(y, F(z)) = [1 - y F(z)]_+, \qquad (23) $$

where

$$ F(z) = \beta_0 + \beta^T z, $$

subject to an increasing penalty for larger values of

$$ \|\beta\|^2 = \sum_j \beta_j^2. \qquad (24) $$

This penalty (24) is well known and often used to regularize statistical procedures, for example linear least squares regression, leading to ridge–regression (Hoerl and Kennard 1970):

$$ (\beta_0^*, \beta^*) = \arg\min_{\beta_0, \beta} \sum_{i=1}^{N} \big[ y_i - (\beta_0 + \beta^T z_i) \big]^2 + \lambda \cdot \|\beta\|^2. \qquad (25) $$

The "hinge" loss criterion (23) is not familiar in statistics. However, it is closely related to one that is well known in that field, namely the conditional negative log–likelihood associated with logistic regression,

$$ L(y, F(z)) = \log\big[ 1 + e^{-y F(z)} \big]. \qquad (26) $$

In fact, one can view the SVM hinge loss as a piecewise–linear approximation to (26). Unregularized logistic regression is one of the most popular methods in statistics for treating binary response outcomes (9). Thus, a support vector machine can be viewed as an approximation to regularized logistic regression (in z–space) using the ridge–regression penalty (24).

This penalized learning formulation forms the basis for extending SVMs to the regression setting, where the response variable y assumes numeric values y ∈ R¹ rather than binary values (9). One simply replaces the loss criterion (23) in (22) with

$$ L(y, F(z)) = \big( |y - F(z)| - \varepsilon \big)_+. \qquad (27) $$

This is called the "ε–insensitive" loss and can be viewed as a piecewise–linear approximation to the Huber 1964 loss

$$ L(y, F(z)) = \begin{cases} |y - F(z)|^2 / 2 & |y - F(z)| \le \varepsilon \\ \varepsilon\,\big( |y - F(z)| - \varepsilon/2 \big) & |y - F(z)| > \varepsilon \end{cases} \qquad (28) $$

often used for robust regression in statistics. This loss (28) is a compromise between squared–error loss (25) and absolute–deviation loss L(y, F(z)) = |y − F(z)|. The value of the "transition" point ε differentiates the errors that are treated as "outliers", being subject to absolute–deviation loss, from the other (smaller) errors that are subject to squared–error loss.
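The four loss criteria just discussed are one-liners, and writing them side by side makes the piecewise-linear approximations concrete. The sketch below simply evaluates them on a grid; the notation follows (23) and (26)–(28), with margin = yF for classification and r = y − F for regression, and the grids and ε value are arbitrary choices made here.

```python
import numpy as np

def hinge(margin):                 # eq. (23), margin = y*F(z)
    return np.maximum(0.0, 1.0 - margin)

def logistic_nll(margin):          # eq. (26), negative log-likelihood of logistic regression
    return np.log1p(np.exp(-margin))

def eps_insensitive(r, eps=1.0):   # eq. (27), r = y - F(z)
    return np.maximum(0.0, np.abs(r) - eps)

def huber(r, eps=1.0):             # eq. (28)
    a = np.abs(r)
    return np.where(a <= eps, 0.5 * r ** 2, eps * (a - 0.5 * eps))

margins = np.linspace(-2, 2, 5)
residuals = np.linspace(-3, 3, 5)
print("hinge    ", hinge(margins))
print("logistic ", np.round(logistic_nll(margins), 3))
print("eps-ins  ", eps_insensitive(residuals))
print("huber    ", huber(residuals))
```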
4. SVM properties

Support vector machines inherit most of the advantages of ordinary kernel methods discussed in Section II. In addition, they can overcome the computational problems associated with prediction, since only the support vectors (αi ≠ 0 in (13)) are required for making predictions. If the number of support vectors is much smaller than the total sample size N, computation is correspondingly reduced. This will tend to be the case when there is small overlap between the respective distributions of the two classes in the space of the original predictor variables x (small Bayes error rate).

The computational savings in prediction are bought by a dramatic increase in the computation required for training. Ordinary kernel methods (4) require no training; the data set is the model. The quadratic program for obtaining the optimal separating hyperplane (solving (22)) requires computation proportional to the square of the sample size (N²), multiplied by the number of resulting support vectors. There has been much research on fast algorithms for training SVMs, extending computational feasibility to data sets of size N ≈ 30,000 or so. However, they are still not feasible for really large data sets (N ≳ 100,000).

SVMs share some of the disadvantages of ordinary kernel methods. They are a black–box procedure with little interpretive value. Also, as with all kernel methods, performance can be very sensitive to the kernel (distance function) choice (5). For good performance the kernel needs to be matched to the properties of the target function F*(x) (2), which are often unknown. However, when there is a known "natural" distance for the problem, SVMs represent very powerful learning machines.

B. Boosted trees

Boosting decision trees was first proposed by Freund and Schapire 1996. The basic idea is that rather than using just a single tree for prediction, a linear combination of (many) trees,

$$ F(x) = \sum_{m=1}^{M} a_m T_m(x), \qquad (29) $$

is used instead. Here each Tm(x) is a decision tree of the type discussed in Section III and am is its coefficient in the linear combination. This approach maintains the (statistical) advantages of trees, while often dramatically increasing accuracy over that of a single tree.

1. Training

The recursive partitioning technique for constructing a single tree on the training data was discussed in Section III. Algorithm 1 describes a forward stagewise method for constructing a prediction machine based on a linear combination of M trees.
Algorithm 1: Forward stagewise boosting

1   $F_0(x) = 0$
2   For m = 1 to M do:
3     $(a_m, T_m(x)) = \arg\min_{a,\,T(x)}$
4         $\sum_{i=1}^{N} L\big( y_i,\, F_{m-1}(x_i) + a\,T(x_i) \big)$
5     $F_m(x) = F_{m-1}(x) + a_m T_m(x)$
6   EndFor
7   $F(x) = F_M(x) = \sum_{m=1}^{M} a_m T_m(x)$

The first line initializes the predicting function to everywhere have the value zero. Lines 2 and 6 control the M iterations of the operations associated with lines 3–5. At each iteration m there is a current predicting function Fm−1(x). At the first iteration this is the initial function F0(x) = 0, whereas for m > 1 it is the linear combination of the m − 1 trees induced at the previous iterations. Lines 3 and 4 construct the tree Tm(x), and find the corresponding coefficient am, that minimize the estimated prediction risk (1) on the training data when am Tm(x) is added to the current linear combination Fm−1(x). This is then added to the current approximation Fm−1(x) on line 5, producing the current predicting function Fm(x) for the next, (m + 1)st, iteration.

At the first step, a1 T1(x) is just the standard tree built on the data as described in Section III, since the current function is F0(x) = 0. At the next step, the estimated optimal tree T2(x) is found to add to it with coefficient a2, producing the function F2(x) = a1 T1(x) + a2 T2(x). This process is continued for M steps, producing a predicting function consisting of a linear combination of M trees (line 7).

The potentially difficult part of the algorithm is constructing the optimal tree to add at each step. This will depend on the chosen loss function L(y, F). For squared–error loss,

$$ L(y, F) = (y - F)^2, $$

the procedure is especially straightforward, since

$$ L(y, F_{m-1} + aT) = (y - F_{m-1} - aT)^2 = (r_m - aT)^2. $$

Here rm = y − Fm−1 is just the error ("residual") from the current model Fm−1 at the mth iteration. Thus each successive tree is built in the standard way to best predict the errors produced by the linear combination of the previous trees. This basic idea can be extended to produce boosting algorithms for any differentiable loss criterion L(y, F) (Friedman 2001).
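For squared–error loss the stagewise procedure above is a short loop: each round fits a small tree to the current residuals and adds it to the model. The sketch below is a minimal rendering of that idea; it uses scikit-learn regression trees as the tree builder and fixes am = 1, which are simplifications made here rather than details prescribed by Algorithm 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_squared_error(X, y, M=100, n_leaves=6):
    """Forward stagewise boosting with squared-error loss:
    each tree is fit to the residuals of the current linear combination."""
    trees, F = [], np.zeros(len(y))          # line 1: F_0(x) = 0
    for _ in range(M):                       # lines 2-6
        r = y - F                            # r_m = y - F_{m-1}
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves).fit(X, r)
        trees.append(tree)
        F = F + tree.predict(X)              # line 5 with a_m = 1
    return trees

def boost_predict(trees, X):
    return np.sum([t.predict(X) for t in trees], axis=0)   # line 7

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(500, 4))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500)
trees = boost_squared_error(X, y, M=200)
print("training RMSE:", np.sqrt(np.mean((y - boost_predict(trees, X)) ** 2)))
```

Replacing the residual by the negative gradient of a chosen differentiable loss gives the general gradient boosting machine described in Friedman 2001.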
As originally proposed, the standard tree construction algorithm was treated as a primitive in the boosting algorithm, inserted in lines 3 and 4 to produce a tree that best predicts the current errors $\{r_{im} = y_i - F_{m-1}(x_i)\}_1^N$. In particular, an optimal tree size was estimated at each step in the standard tree building manner. This basically assumes that each tree will be the last one in the sequence. Since boosting often involves hundreds of trees, this assumption is far from true, and as a result accuracy suffers. A better strategy turns out to be (Friedman 2001) to use a constant tree size (J regions) at each iteration, where the value of J is taken to be small, but not too small. Typically 4 ≤ J ≤ 10 works well in the context of boosting, with performance being fairly insensitive to particular choices.

2. Regularization

Even if one restricts the size of the trees entering into a boosted tree model, it is still possible to fit the training data arbitrarily well, reducing training error to zero, with a linear combination of enough trees. However, as is well known in statistics, this is seldom the best thing to do. Fitting the training data too well can increase prediction risk on future predictions. This is a phenomenon called "over–fitting". Since each tree tries to best fit the errors associated with the linear combination of previous trees, the training error monotonically decreases as more trees are included. This is, however, not the case for future prediction error on data not used for training.

Typically, at the beginning future prediction error decreases with increasing number of trees M until at some point M* a minimum is reached. For M > M*, future error tends to (more or less) monotonically increase as more trees are added. Thus there is an optimal number M* of trees to include in the linear combination. This number will depend on the problem (target function (2), training sample size N, and signal to noise ratio). Thus, in any given situation, the value of M* is unknown and must be estimated from the training data itself. This is most easily accomplished by the "early stopping" strategy used in neural network training. The training data is randomly partitioned into learning and test samples. The boosting is performed using only the data in the learning sample. As iterations proceed and trees are added, prediction risk as estimated on the test sample is monitored. At the point where a definite upward trend is detected, iterations stop and M* is estimated as the value of M producing the smallest prediction risk on the test sample.

Inhibiting the ability of a learning machine to fit the training data so as to increase future performance is called a "method–of–regularization". It can be motivated from a frequentist perspective in terms of the "bias–variance trade–off" (Geman, Bienenstock and Doursat 1992) or by the Bayesian introduction of a prior distribution over the space of solution functions.
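The early stopping recipe described above is mechanical to implement. The sketch below uses scikit-learn's GradientBoostingRegressor as a stand-in for Algorithm 1, which is an assumption of this illustration, and its staged_predict iterator to monitor test-sample risk as trees are added, taking M* as the minimizing value.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(2000, 4))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=2000)

# Random partition into learning and test samples.
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, max_leaf_nodes=6,
                                  learning_rate=0.1).fit(X_learn, y_learn)

# Test-sample risk after each of the M = 500 stages.
test_risk = [np.mean((y_test - pred) ** 2) for pred in model.staged_predict(X_test)]
M_star = int(np.argmin(test_risk)) + 1
print("estimated M* =", M_star, " test MSE =", round(min(test_risk), 4))
```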
In either case, controlling the number of trees is not the only way to regularize. Another method commonly used in statistics is "shrinkage". In the context of boosting, shrinkage can be accomplished by replacing line 5 in Algorithm 1 by

$$ F_m(x) = F_{m-1}(x) + (\nu \cdot a_m)\, T_m(x). \qquad (30) $$

Here the contribution to the linear combination of the estimated best tree to add at each step is reduced by a factor 0 < ν ≤ 1. This "shrinkage" factor or "learning rate" parameter controls the rate at which adding trees reduces prediction risk on the learning sample; smaller values produce a slower rate, so that more trees are required to fit the learning data to the same degree.

Shrinkage (30) was introduced in Friedman 2001 and shown empirically to dramatically improve the performance of all boosting methods. Smaller learning rates were seen to produce more improvement, with diminishing returns for ν ≲ 0.1, provided that the estimated optimal number of trees M*(ν) for that value of ν is used. This number increases with decreasing learning rate, so that the price paid for better performance is increased computation.

3. Penalized learning formulation

The introduction of shrinkage (30) in boosting was originally justified purely on empirical evidence, and the reason for its success was a mystery. Recently, this mystery has been solved (Hastie, Tibshirani and Friedman 2001 and Efron, Hastie, Johnstone and Tibshirani 2002). Consider a learning machine consisting of a linear combination of all possible (J–region) trees,

$$ \hat{F}(x) = \sum_m \hat{a}_m T_m(x), \qquad (31) $$

where

$$ \{\hat{a}_m\} = \arg\min_{\{a_m\}} \sum_{i=1}^{N} L\Big( y_i,\, \sum_m a_m T_m(x_i) \Big) + \lambda \cdot P(\{a_m\}). \qquad (32) $$

This is a penalized (regularized) linear regression, based on a chosen loss criterion L, of the response values $\{y_i\}_1^N$ on the predictors (trees) $\{T_m(x_i)\}_{i=1}^N$. The first term in (32) is the prediction risk on the training data and the second is a penalty on the values of the coefficients {am}. This penalty is required to regularize the solution because the number of all possible J–region trees is infinite. The value of the "regularization" parameter λ controls the strength of the penalty. Its value is chosen to minimize an estimate of future prediction risk, for example based on a left out test sample.

A commonly used penalty for regularization in statistics is the "ridge" penalty

$$ P(\{a_m\}) = \sum_m a_m^2, \qquad (33) $$

used in ridge–regression (25) and support vector machines (22). This encourages small coefficient absolute values by penalizing the l2–norm of the coefficient vector. Another penalty becoming increasingly popular is the "lasso" (Tibshirani 1996),

$$ P(\{a_m\}) = \sum_m |a_m|. \qquad (34) $$

This also encourages small coefficient absolute values, but by penalizing the l1–norm. Both (33) and (34) increasingly penalize larger average absolute coefficient values. They differ in how they react to dispersion or variation of the absolute coefficient values. The ridge penalty discourages dispersion by penalizing variation in absolute values. It thus tends to produce solutions in which coefficients tend to have equal absolute values and none with the value zero. The lasso (34) is indifferent to dispersion and tends to produce solutions with a much larger variation in the absolute values of the coefficients, with many of them set to zero. The best penalty will depend on the (unknown population) optimal coefficient values. If these have more or less equal absolute values, the ridge penalty (33) will produce better performance. On the other hand, if their absolute values are highly diverse, especially with a few large values and many small values, the lasso will provide higher accuracy.
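This qualitative difference between the two penalties is easy to see on a small linear problem. The sketch below illustrates the penalties (33) and (34) on an ordinary linear regression rather than on the tree expansion (31), using scikit-learn as an assumed tool and arbitrary penalty strengths; it fits the same sparse problem with both and counts the coefficients set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(7)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [4.0, -3.0, 2.0]                 # only a few truly relevant predictors
y = X @ true_coef + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

print("ridge: coefficients exactly zero =", int(np.sum(ridge.coef_ == 0.0)))   # typically 0
print("lasso: coefficients exactly zero =", int(np.sum(lasso.coef_ == 0.0)))   # typically most of the irrelevant ones
print("lasso nonzero coefficients:", np.round(lasso.coef_[np.abs(lasso.coef_) > 0], 2))
```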
As discussed in Hastie, Tibshirani and Friedman 2001, and rigorously derived in Efron et al 2002, there is a connection between boosting (Algorithm 1) with shrinkage (30) and penalized linear regression on all possible trees (31) (32) using the lasso penalty (34). They produce very similar solutions as the shrinkage parameter becomes arbitrarily small, ν → 0. The number of trees M is inversely related to the penalty strength parameter λ; more boosted trees corresponds to smaller values of λ (less regularization). Using early stopping to estimate the optimal number M* is equivalent to estimating the optimal value of the penalty strength parameter λ. Therefore, one can view the introduction of shrinkage (30) with a small learning rate ν ≲ 0.1 as approximating a learning machine based on all possible (J–region) trees with a lasso penalty for regularization. The lasso is especially appropriate in this context because, among all possible trees, only a small number will likely represent very good predictors with population optimal absolute coefficient values substantially different from zero. As noted above, this is an especially bad situation for the ridge penalty (33), but ideal for the lasso (34).

4. Boosted tree properties

Boosted trees maintain almost all of the advantages of single tree modeling described in Section III A while often dramatically increasing their accuracy. One of the properties of single tree models leading to inaccuracy is the coarse piecewise constant nature of the resulting approximation. Since boosted tree machines
are linear combinations of individual trees, they produce a superposition of piecewise constant approximations. These are of course also piecewise constant, but with many more pieces. The corresponding discontinuous jumps are very much smaller, and they are able to more accurately approximate smooth target functions.

Boosting also dramatically reduces the instability associated with single tree models. First, only small trees (Section IV B 1) are used, which are inherently more stable than the generally larger trees associated with single tree approximations. However, the big increase in stability results from the averaging process associated with using the linear combination of a large number of trees. Averaging reduces variance; that is why it plays such a fundamental role in statistical estimation.

Finally, boosting mitigates the fragmentation problem plaguing single tree models. Again, only small trees are used, which fragment the data to a much lesser extent than large trees. Each boosting iteration uses the entire data set to build a small tree. Each respective tree can (if dictated by the data) involve different sets of predictor variables. Thus, each prediction can be influenced by a large number of predictor variables associated with all of the trees involved in the prediction, if that is estimated to produce more accurate results.

The computation associated with boosting trees roughly scales as nN log N with the number of predictor variables n and training sample size N. Thus, it can be applied to fairly large problems. For example, problems with n ≈ 10²–10³ and N ≈ 10⁵–10⁶ are routinely feasible.

The one advantage of single decision trees not inherited by boosting is interpretability. It is not possible to inspect the very large number of individual tree components in order to discern the relationships between the response y and the predictors x. Thus, like support vector machines, boosted tree machines produce black–box models. Techniques for interpreting boosted trees as well as other black–box models are described in Friedman 2001.

C. Connections

The preceding section has reviewed two of the most important advances in machine learning in the recent past: support vector machines and boosted decision trees. Although motivated from very different perspectives, these two approaches share fundamental properties that may account for their respective success. These similarities are most readily apparent from their respective penalized learning formulations (Section IV A 3 and Section IV B 3). Both build linear models in a very high dimensional space of derived variables, each of which is a highly nonlinear function
of the original predictor variables x. For support vector machines these derived variables (14) are implicitly defined through the chosen kernel K(x, x′) defining their inner product (15). With boosted trees these derived variables are all possible (J–region) decision trees (31) (32).

The coefficients defining the respective linear models in the derived space for both methods are solutions to a penalized learning problem (22) (32) involving a loss criterion L(y, F) and a penalty on the coefficients P({am}). Support vector machines use L(y, F) = (1 − yF)+ for classification, y ∈ {−1, 1}, and (27) for regression, y ∈ R¹. Boosting can be used with any (differentiable) loss criterion L(y, F) (Friedman 2001). The respective penalties P({am}) are (24) for SVMs and (34) with boosting. Additionally, both methods have a computational trick that allows all (implicit) calculations required to solve the learning problem in the very high (usually infinite) dimensional space of the derived variables z to be performed in the space of the original variables x. For support vector machines this is the kernel trick (Section IV A 1), whereas with boosting it is forward stagewise tree building (Algorithm 1) with shrinkage (30).

The two approaches do have some basic differences. These involve the particular derived variables defining the linear model in the high dimensional space, and the penalty P({am}) on the corresponding coefficients. The performance of any linear learning machine based on derived variables (14) will depend on the detailed nature of those variables. That is, different transformations {hk(x)} will produce different learners as functions of the original variables x, and for any given problem some will be better than others. The prediction accuracy achieved by a particular set of transformations will depend on the (unknown) target function F*(x) (2). With support vector machines the transformations are implicitly defined through the chosen kernel function. Thus the problem of choosing transformations becomes, as with any kernel method, one of choosing a particular kernel function K(x, x′) ("kernel customizing").

Although motivated here for use with decision trees, boosting can in fact be implemented using any specified "base learner" h(x; p). This is a function of the predictor variables x characterized by a set of parameters p = {p1, p2, ···}. A particular set of joint parameter values p indexes a particular function (transformation) of x, and the set of all functions induced over all possible joint parameter values defines the derived variables of the linear prediction machine in the transformed space. If all of the parameters assume values on a finite discrete set this derived space will be finite dimensional; otherwise it will have infinite dimension. When the base learner is a decision tree, the parameters represent the identities of the predictor variables used for splitting, the split points, and the response values assigned to the induced regions. The forward stagewise approach can be used with any base learner by simply substituting it for the decision tree, T(x) → h(x; p), in lines 3–5 of Algorithm 1. Thus boosting provides explicit control on the choice of transformations to the high dimensional space. So far boosting has seen greatest success with decision tree base learners, especially in data mining applications, owing to their advantages outlined in Section III A. However, boosting other base learners can provide potentially attractive alternatives in some situations.
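Substituting a different base learner into the stagewise loop is a one-line change. The sketch below is again a simplification with am = 1 and squared-error loss; the base_learner factory argument and the helper names are devices introduced here, not notation from the text. It boosts whatever regressor the factory returns, with a depth-1 "stump" as one possible choice of h(x; p).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_any_learner(X, y, base_learner, M=200, nu=0.1):
    """Forward stagewise fitting (Algorithm 1 with shrinkage (30)) for an
    arbitrary base learner h(x; p) supplied as a factory function."""
    members, F = [], np.zeros(len(y))
    for _ in range(M):
        h = base_learner().fit(X, y - F)      # fit h(x; p) to the current residuals
        members.append(h)
        F = F + nu * h.predict(X)             # shrunken update, eq. (30) with a_m = 1
    return members

def stagewise_predict(members, X, nu=0.1):
    return nu * np.sum([h.predict(X) for h in members], axis=0)

rng = np.random.default_rng(8)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(2 * X[:, 0]) + 0.2 * rng.normal(size=500)

stump = lambda: DecisionTreeRegressor(max_depth=1)        # one possible h(x; p)
members = boost_any_learner(X, y, stump)
print("training RMSE:", np.sqrt(np.mean((y - stagewise_predict(members, X)) ** 2)))
```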
Another difference between SVMs and boosting is the nature of the regularizing penalty P({am}) that they implicitly employ. Support vector machines use the "ridge" penalty (24). The effect of this penalty is to shrink the absolute values of the coefficients {βm} from those of the unpenalized solution λ = 0 (22), while discouraging dispersion among those absolute values. That is, it prefers solutions in which the derived variables (14) all have similar influence on the resulting linear model. Boosting implicitly uses the "lasso" penalty (34). This also shrinks the coefficient absolute values, but it is indifferent to their dispersion. It tends to produce solutions with relatively few large absolute valued coefficients and many with zero value.

If a very large number of the derived variables in the high dimensional space are all highly relevant for prediction, then the ridge penalty used by SVMs will provide good results. This will be the case if the chosen kernel K(x, x′) is well matched to the unknown target function F*(x) (2). Kernels not well matched to the target function will (implicitly) produce transformations (14) many of which have little or no relevance to prediction. The homogenizing effect of the ridge penalty is to inflate estimates of their relevance while deflating those of the truly relevant ones, thereby reducing prediction accuracy. Thus, the sharp sensitivity of SVMs to the choice of a particular kernel can be traced to the implicit use of the ridge penalty (24).

By implicitly employing the lasso penalty (34), boosting anticipates that only a small number of its derived variables are likely to be highly relevant to prediction. The regularization effect of this penalty tends to produce large coefficient absolute values for those derived variables that appear to be relevant and small (mostly zero) values for the others. This can sacrifice accuracy if the chosen base learner happens to provide an especially appropriate space of derived variables in which a large number turn out to be highly relevant. However, this approach provides considerable robustness against less than optimal choices for the base learner, and thus the space of derived variables.

V. CONCLUSION

A choice between support vector machines and boosting depends on one's a priori knowledge concerning the problem at hand. If that knowledge is sufficient to lead to the construction of an especially effective kernel function K(x, x′), then an SVM (or perhaps another kernel method) would be most appropriate. If that knowledge can suggest an especially effective base learner h(x; p), then boosting would likely produce superior results. As noted above, boosting tends to be more robust to misspecification. These two techniques represent additional tools to be considered along with other machine learning methods. The best tool for any particular application depends on the detailed nature of that problem. As with any endeavor, one must match the tool to the problem. If little is known about which technique might be best in any given application, several can be tried, and effectiveness judged on independent data not used to construct the respective learning machines under consideration.
VI. ACKNOWLEDGMENTS

Helpful discussions with Trevor Hastie are gratefully acknowledged. This work was partially supported by the Department of Energy under contract DE-AC03-76SF00515, and by grant DMS–97–64431 of the National Science Foundation.

[1] Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
[2] Breiman, L. (1996). Bagging predictors. Machine Learning 26, 123-140.
[3] Breiman, L. (2001). Random forests, random features. Technical Report, University of California, Berkeley.
[4] Breiman, L., Friedman, J. H., Olshen, R. and Stone, C. (1983). Classification and Regression Trees. Wadsworth.
[5] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2002). Least angle regression. Annals of Statistics. To appear.
[6] Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, 148–156.
[7] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29, 1189-1232.
[8] Geman, S., Bienenstock, E. and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation 4, 1-58.
[9] Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning. Springer–Verlag.
[10] Hoerl, A. E. and Kennard, R. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67.
[11] Nadaraya, E. A. (1964). On estimating regression. Theory Prob. Appl. 10, 186-190.
[12] Quinlan, R. (1992). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
[13] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. 58, 267-288.
[14] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.
[15] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
[16] Watson, G. S. (1964). Smooth regression analysis. Sankhya Ser. A 26, 359-372.