Random Forests and Boosting
SLDM III © Hastie & Tibshirani - October 9, 2013

Ensemble Learners
• Classification Trees (Done)
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
• Details in chapters 10, 15 and 16

Methods for dramatically improving the performance of weak
learners such as Trees. Classification trees are adaptive and robust,
but do not make good prediction models. The techniques discussed
here enhance their performance considerably.


Properties of Trees
• ✔ Can handle huge datasets
• ✔ Can handle mixed predictors—quantitative and qualitative
• ✔ Easily ignore redundant variables
• ✔ Handle missing data elegantly through surrogate splits
• ✔ Small trees are easy to interpret
• ✘ Large trees are hard to interpret
• ✘ Prediction performance is often poor


[Figure: Pruned classification tree for the SPAM data. The root node (email, 600/1536 misclassified) splits on ch$ < 0.0555; subsequent splits use remove, hp, ch!, george, free, CAPMAX, business, receive, edu, our, CAPAVE and 1999, and each terminal node is labeled email or spam together with its misclassification count.]

ROC curve for pruned tree on SPAM data

[Figure: ROC curve (sensitivity versus specificity) for the pruned tree on the SPAM data; legend: TREE − Error: 8.7%.]

Overall error rate on test data: 8.7%.
ROC curve obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.

We may want specificity to be high, and suffer some spam:
Specificity: 95% ⇒ Sensitivity: 79%


Toy Classification Problem
Bayes Error Rate: 0.25
[Figure: Training data in (X1, X2); the two classes (labeled 0 and 1) overlap substantially, and the Bayes decision boundary is drawn in black.]

• Data X and Y, with Y taking values +1 or −1.
• Here X ∈ IR^2.
• The black boundary is the Bayes Decision Boundary - the best one can do. Here the Bayes error is 25%.
• Goal: given N training pairs (Xi, Yi), produce a classifier Ĉ(X) ∈ {−1, 1}.

• Also estimate the probability of the class labels Pr(Y = +1|X).


Toy Example — No Noise
Bayes Error Rate: 0
[Figure: Training sample of 200 points in (X1, X2); class 0 points occupy a central region and class 1 points surround them, with no overlap between the classes.]

• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes error is 0%.


Classification Tree
[Figure: Classification tree fit to the training sample. The root node (94/200 misclassified) splits on x.2 < −1.06711; subsequent splits on x.2 and x.1 yield eight terminal nodes, each labeled 0 or 1 with its misclassification count.]


Decision Boundary: Tree
Error Rate: 0.073

[Figure: Decision boundary of the tree overlaid on the training data; the boundary is made up of axis-aligned rectangles.]

• The eight-node tree is boxy, and has a test-error rate of 7.3%.
• When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.

Model Averaging
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
• Random Forests (Breiman, 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.


Bagging
• Bagging or bootstrap aggregation averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see pp. 282.
• Suppose C(S, x) is a fitted classifier, such as a tree, fit to our training data S, producing a predicted class label at input point x.
• To bag C, we draw bootstrap samples S*1, ..., S*B, each of size N with replacement from the training data. Then
  Cbag(x) = Majority Vote { C(S*b, x), b = 1, ..., B }
  (see the sketch below).
• Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.

[Figure: The original tree and trees grown on eleven bootstrap samples (b = 1, ..., 11). The bootstrap trees differ even in their first split (e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985, x.4 < −1.36), illustrating the instability of a single tree.]

Decision Boundary: Bagging

Error Rate: 0.032

[Figure: Decision boundary from bagging overlaid on the training data; it is noticeably smoother than the single tree's boundary.]

• Bagging reduces error by variance reduction.
• Bagging averages many trees, and produces smoother decision boundaries.
• Bias is slightly increased because bagged trees are slightly shallower.


Random forests
• refinement of bagged trees; very popular — see chapter 15.
• at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2(p), where p is the number of features (see the sketch below).
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate.
• random forests tries to improve on bagging by "de-correlating" the trees, thereby reducing the variance.
• Each tree has the same expectation, so increasing the number of trees does not alter the bias of bagging or random forests.

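A minimal sketch with the randomForest package, assuming a data frame `spam` whose factor column `type` is the spam/email label; `mtry` plays the role of m, and the OOB error comes free from the bootstrap sampling.

```r
library(randomForest)

set.seed(1)
p   <- ncol(spam) - 1                         # number of predictors
fit <- randomForest(type ~ ., data = spam,
                    ntree = 500,
                    mtry  = floor(sqrt(p)),   # m = sqrt(p), the usual default
                    importance = TRUE)

print(fit)          # OOB error rate and confusion matrix
varImpPlot(fit)     # variable importance
```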
Wisdom of Crowds

[Figure: Expected number correct out of 10 versus P, the probability of an informed person being correct (0.25 to 1.00); the consensus (orange) curve lies above the individual (teal) curve.]

• 10 movie categories, 4 choices in each category.
• 50 people; for each question, 15 informed and 35 guessers.
• For informed people, Pr(correct) = p and Pr(incorrect) = (1 − p)/3 for each wrong choice.
• For guessers, Pr(all choices) = 1/4.

Consensus (orange) improves over individuals (teal) even for moderate p.

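A Monte Carlo sketch of this setup (an illustration under the stated assumptions, not part of the slides): 15 informed voters are correct with probability p and otherwise pick uniformly among the 3 wrong answers, 35 guessers pick uniformly among all 4 answers, and the consensus answer is the plurality vote.

```r
# Expected number correct (out of 10 questions) for a single informed voter
# versus the consensus of the crowd, estimated by simulation.
crowd_sim <- function(p, n_questions = 10, n_informed = 15, n_guessers = 35,
                      n_reps = 2000) {
  consensus <- replicate(n_reps, {
    correct <- 0
    for (q in seq_len(n_questions)) {
      # choice 1 is the correct answer; choices 2-4 are wrong
      informed <- ifelse(runif(n_informed) < p, 1L,
                         sample(2:4, n_informed, replace = TRUE))
      guessers <- sample(1:4, n_guessers, replace = TRUE)
      tally    <- table(factor(c(informed, guessers), levels = 1:4))
      # plurality vote (ties break toward choice 1, a small simplification)
      if (which.max(tally) == 1) correct <- correct + 1
    }
    correct
  })
  c(individual = n_questions * p, consensus = mean(consensus))
}

crowd_sim(p = 0.6)   # the consensus typically beats a single informed voter
```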
ROC curve for TREE, SVM and Random Forest on SPAM data

[Figure: ROC curves (sensitivity versus specificity) for the three classifiers on the SPAM data; legend: Random Forest − Error: 4.75%, SVM − Error: 6.7%, TREE − Error: 8.7%.]

Random Forest dominates both other methods on the SPAM data — 4.75% error. Used 500 trees with default settings for the randomForest package in R.
RF: decorrelation

[Figure: Correlation between trees as a function of m, the number of randomly selected splitting variables (m from 1 to 46).]

• Var f̂_RF(x) = ρ(x) σ²(x).
• The correlation is because we are growing trees to the same data.
• The correlation decreases as m decreases — m is the number of variables sampled at each split.

Based on a simple experiment with random forest regression models.

Random Forest Ensemble

[Figure: Mean squared error, squared bias, and variance of the random forest ensemble as functions of m (0 to 50).]

• The smaller m, the lower the variance of the random forest ensemble.
• Small m is also associated with higher bias, because important variables can be missed by the sampling.

Boosting
• Breakthrough invention of Freund & Schapire (1997).

[Diagram: A cascade of classifiers C1(x), C2(x), C3(x), ..., CM(x); the first is fit to the training sample and each subsequent classifier to a reweighted sample.]

• Average many trees, each grown to reweighted versions of the training data.
• Weighting decorrelates the trees, by focussing on regions missed by past trees.
• Final classifier is a weighted average of the classifiers:
  C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ].
Note: Cm(x) ∈ {−1, +1}.

Boosting vs Bagging

[Figure: Test error versus number of terms (0 to 400) for Bagging and AdaBoost, both with 100-node trees; AdaBoost reaches a substantially lower test error.]

• 2000 points from Nested Spheres in R^10.
• Bayes error rate is 0%.
• Trees are grown best-first without pruning.
• Leftmost term is a single tree.

AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M repeat steps (a)–(d):
   (a) Fit a classifier Cm(x) to the training data using weights wi.
   (b) Compute the weighted error of the newest tree
       errm = Σ_{i=1}^{N} wi I(yi ≠ Cm(xi)) / Σ_{i=1}^{N} wi.
   (c) Compute αm = log[(1 − errm)/errm].
   (d) Update weights for i = 1, ..., N:
       wi ← wi · exp[αm · I(yi ≠ Cm(xi))]
       and renormalize the wi to sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ].

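A minimal sketch of this algorithm with stumps in R, assuming a predictor data frame `X` and labels `y` coded −1/+1; `rpart` with `maxdepth = 1` is the weak learner, and the weights wi enter through rpart's `weights` argument.

```r
library(rpart)

# AdaBoost with stumps: y must be coded -1/+1.
adaboost_stumps <- function(X, y, M = 100) {
  N  <- nrow(X)
  w  <- rep(1 / N, N)                          # step 1: uniform weights
  df <- data.frame(X, y = factor(y))
  stumps <- vector("list", M)
  alpha  <- numeric(M)
  for (m in seq_len(M)) {
    fit  <- rpart(y ~ ., data = df, weights = w, method = "class",
                  control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- as.numeric(as.character(predict(fit, df, type = "class")))
    err  <- sum(w * (pred != y)) / sum(w)      # step 2(b): weighted error
    err  <- min(max(err, 1e-10), 1 - 1e-10)    # guard against err = 0 or 1
    alpha[m] <- log((1 - err) / err)           # step 2(c)
    w <- w * exp(alpha[m] * (pred != y))       # step 2(d): upweight mistakes
    w <- w / sum(w)
    stumps[[m]] <- fit
  }
  list(stumps = stumps, alpha = alpha)
}

# Step 3: sign of the weighted vote over the M stumps.
adaboost_predict <- function(model, Xnew) {
  votes <- sapply(seq_along(model$stumps), function(m) {
    model$alpha[m] *
      as.numeric(as.character(predict(model$stumps[[m]], Xnew, type = "class")))
  })
  sign(rowSums(votes))
}
```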

Boosting Stumps

[Figure: Test error versus boosting iterations (0 to 400); reference lines mark the error of a single stump and of a 400-node tree.]

• A stump is a two-node tree, after a single split.
• Boosting stumps works remarkably well on the nested-spheres problem.

Training Error

[Figure: Train and test error versus number of terms (0 to 600); the training error (green curve) falls to zero while the test error (red curve) keeps decreasing.]

• Nested spheres in 10 dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero (green curve).
• Further iterations continue to improve test error in many examples (red curve).

Noisy Problems

[Figure: Train and test error versus number of terms (0 to 600), with a horizontal line marking the Bayes error of 0.25.]

• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.

Stagewise Additive Modeling
Boosting builds an additive model

  F(x) = Σ_{m=1}^{M} βm b(x; γm).

Here b(x; γm) is a tree, and γm parametrizes the splits.
We do things like that in statistics all the time!
• GAMs: F(x) = Σ_j fj(xj)
• Basis expansions: F(x) = Σ_{m=1}^{M} θm hm(x)
Traditionally the parameters fj, θm are fit jointly (i.e. by least squares or maximum likelihood).
With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.


Example: Least Squares Boosting
1. Start with function F0(x) = 0 and residual r = y; set m = 0.
2. m ← m + 1
3. Fit a CART regression tree to r, giving g(x)
4. Set fm(x) ← ε · g(x)
5. Set Fm(x) ← Fm−1(x) + fm(x), r ← r − fm(x), and repeat steps 2–5 many times
• g(x) is typically a shallow tree, e.g. constrained to have only k = 3 splits (depth).
• Stagewise fitting (no backfitting!). Slows down the (over-)fitting process.
• 0 < ε ≤ 1 is the shrinkage parameter; it slows overfitting even further.

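A minimal sketch of this procedure in R, assuming a numeric response `y` and a predictor data frame `X`; `eps` is the shrinkage parameter ε and shallow `rpart` regression trees play the role of g(x).

```r
library(rpart)

# Least squares boosting: repeatedly fit a shallow regression tree to the
# current residuals, shrink it by eps, and add it to the model.
ls_boost <- function(X, y, M = 500, eps = 0.01, depth = 3) {
  r <- y                            # start with F0(x) = 0, so residual r = y
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    df <- data.frame(X, r = r)
    g  <- rpart(r ~ ., data = df,
                control = rpart.control(maxdepth = depth, cp = 0))
    trees[[m]] <- g
    r <- r - eps * predict(g, df)   # update residuals: r <- r - f_m(x)
  }
  list(trees = trees, eps = eps)
}

# Prediction sums the shrunken trees: F_M(x) = sum over m of eps * g_m(x).
ls_boost_predict <- function(model, Xnew) {
  preds <- sapply(model$trees, function(g) predict(g, Xnew))
  model$eps * rowSums(preds)
}
```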

Least Squares Boosting and ε

[Figure: Prediction error versus number of steps m, for a single tree and for boosting with ε = 1 and ε = .01.]

• If instead of trees we use linear regression terms, boosting with small ε is very similar to the lasso!


AdaBoost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model
  F(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^{M} αm fm(x)
  by stagewise fitting using the loss function L(y, F(x)) = exp(−y F(x)).
• Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See pp. 343.


Why Exponential Loss?

[Figure: Loss as a function of the margin y·F, comparing misclassification, exponential, binomial deviance, squared error, and support vector losses.]

• e^{−yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
• Leads to a simple reweighting scheme.
• y · F is the margin; a positive margin means correct classification.
• Has the logit transform as population minimizer:
  F*(x) = (1/2) log[ Pr(Y = 1|x) / Pr(Y = −1|x) ].
• Other more robust loss functions exist, like binomial deviance.

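Filling in the one-line calculation behind the population minimizer: minimize the conditional expectation of the exponential loss pointwise in F(x).

```latex
% Population minimizer of the exponential loss, writing p = Pr(Y = 1 | x):
\begin{align*}
\mathrm{E}\!\left[e^{-YF(x)}\mid x\right] &= p\,e^{-F(x)} + (1-p)\,e^{F(x)},\\
\text{setting the derivative to zero: } -p\,e^{-F(x)} + (1-p)\,e^{F(x)} &= 0
\;\Longrightarrow\; e^{2F(x)} = \frac{p}{1-p},\\
F^{*}(x) &= \tfrac{1}{2}\,\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}.
\end{align*}
```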

General Boosting Algorithms
The boosting paradigm can be extended to general loss functions
— not only squared error or exponential.
1. Initialize F0 (x) = 0.
2. For m = 1 to M :
(a) Compute
    (βm, γm) = argmin_{β,γ} Σ_{i=1}^{N} L(yi, Fm−1(xi) + β b(xi; γ)).
(b) Set Fm(x) = Fm−1(x) + βm b(x; γm).
Sometimes we replace step (b) in item 2 by
(b*) Set Fm(x) = Fm−1(x) + ε βm b(x; γm).
Here ε is a shrinkage factor; in the gbm package in R, the default is ε = 0.001. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.


Gradient Boosting
• General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
• Tree size is a parameter that determines the order of interaction (next slide).
• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in section 10.10, and is implemented in Greg Ridgeway's gbm package in R; a minimal sketch follows.

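A minimal sketch using the gbm package, assuming a data frame `spam` whose 0/1 column `spam01` indicates spam; the settings are illustrative choices echoing the slides (shrinkage 0.001, shallow trees), not tuned values.

```r
library(gbm)

set.seed(1)
# Gradient boosting for the binary spam/email problem: bernoulli deviance,
# shallow trees, and a small shrinkage factor, fit in a stagewise fashion.
fit <- gbm(spam01 ~ ., data = spam,
           distribution = "bernoulli",
           n.trees = 2500,
           interaction.depth = 4,   # 4 splits per tree, i.e. about 5 terminal nodes
           shrinkage = 0.001,
           cv.folds = 5)

best <- gbm.perf(fit, method = "cv")                   # number of trees by CV
phat <- predict(fit, spam, n.trees = best, type = "response")
summary(fit, n.trees = best)                           # relative variable importance
plot(fit, i.var = "remove", n.trees = best)            # partial dependence on one predictor
```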

Spam Data

[Figure: Test error (about 0.040 to 0.070) versus number of trees (0 to 2500) for Bagging, Random Forest, and Gradient Boosting with 5-node trees.]

Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model
  f(x) = log[ Pr(spam|x) / Pr(email|x) ]
using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4.5%, compared to 5.3% for an additive GAM, 4.75% for Random Forests, and 8.7% for CART.


Spam: Variable Importance
[Figure: Relative importance (0 to 100) of the predictors, ordered from least to most important. The least important include 3d, addresses, labs, telnet, 857 and 415; the most important are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george and CAPTOT.]

Spam: Partial Dependence

[Figure: Partial dependence of the fitted function on four predictors: !, remove, edu and hp.]

California Housing Data

[Figure: Test average absolute error (about 0.32 to 0.44) versus number of trees (0 to 1000) for random forests (RF m=2, RF m=6) and gradient boosting (GBM depth=4, GBM depth=6).]

Tree Size

[Figure: Test error versus number of terms (0 to 400) for boosting with stumps, 10-node trees, 100-node trees, and Adaboost.]

The tree size J determines the interaction order of the model:
  η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · ·

Stumps win!
Since the true decision boundary is the surface of a sphere, the function that describes it has the form
  f(X) = X1² + X2² + · · · + Xp² − c = 0.
Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.

[Figure: Coordinate Functions for Additive Logistic Trees — the estimated coordinate functions f1(x1), ..., f10(x10), one panel per coordinate.]

Learning Ensembles
Ensemble Learning consists of two steps:
1. Construction of a dictionary D = {T1(X), T2(X), ..., TM(X)} of basis elements (weak learners) Tm(X).
2. Fitting a model f(X) = Σ_{m∈D} αm Tm(X).
Simple examples of ensembles
• Linear regression: The ensemble consists of the coordinate functions Tm(X) = Xm. The fitting is done by least squares.
• Random Forests: The ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages.
• Gradient Boosting: The ensemble is grown in an adaptive fashion, but then simply averaged at the end.


Learning Ensembles and the Lasso
Proposed by Popescu and Friedman (2004):
Fit the model f(X) = Σ_{m∈D} αm Tm(X) in step (2) using Lasso regularization:
  α(λ) = argmin_α Σ_{i=1}^{N} L[ yi, α0 + Σ_{m=1}^{M} αm Tm(xi) ] + λ Σ_{m=1}^{M} |αm|.
Advantages:
• Often ensembles are large, involving thousands of small trees. The post-processing selects a smaller subset of these and combines them efficiently.
• Often the post-processing improves the process that generated the ensemble.

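A minimal sketch of the post-processing step in R, assuming a fitted randomForest classifier `rf` grown on predictors `X` and 0/1 labels `y`; the dictionary is the set of individual tree predictions and glmnet's lasso picks a sparse weighted combination of them.

```r
library(randomForest)
library(glmnet)

# Dictionary: the individual tree predictions on the training data.
# predict(..., predict.all = TRUE) returns one column per tree.
tree_preds <- predict(rf, newdata = X, predict.all = TRUE)$individual
T_mat <- matrix(as.numeric(tree_preds), nrow = nrow(X))   # 0/1 vote of each tree

# Lasso post-processing: a sparse, weighted recombination of the trees.
post <- cv.glmnet(x = T_mat, y = y, family = "binomial", alpha = 1)

coefs <- as.numeric(coef(post, s = "lambda.min"))[-1]
sum(coefs != 0)                                            # number of trees retained
phat <- predict(post, newx = T_mat, s = "lambda.min", type = "response")
```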

Post-Processing Random Forest Ensembles

[Figure: Spam data: test error (about 0.04 to 0.09) versus number of trees (0 to 500) for Random Forest, Random Forest (5%, 6), and Gradient Boost (5 node).]

Software
• R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
  – randomForest: implementation of Leo Breiman's algorithms.
  – rpart: Terry Therneau's implementation of classification and regression trees.
  – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
• Salford Systems: Commercial implementation of trees, random forests and gradient boosting.
• Splus (Insightful): Commercial version of S.
• Weka: GPL software from the University of Waikato, New Zealand. Includes Trees, Random Forests and many other procedures.


Mais conteúdo relacionado

Mais procurados

Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest Rupak Roy
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
Data mining PPT
Data mining PPTData mining PPT
Data mining PPTKapil Rode
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingTony Nguyen
 
12. Random Forest
12. Random Forest12. Random Forest
12. Random ForestFAO
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hakky St
 
Enhanced Deep Residual Networks for Single Image Super-Resolution
Enhanced Deep Residual Networks for Single Image Super-ResolutionEnhanced Deep Residual Networks for Single Image Super-Resolution
Enhanced Deep Residual Networks for Single Image Super-ResolutionNAVER Engineering
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmPınar Yahşi
 
Random forest
Random forestRandom forest
Random forestUjjawal
 
Lean Master Data Management
Lean Master Data ManagementLean Master Data Management
Lean Master Data Managementnnorthrup
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 
Similarity Measures (pptx)
Similarity Measures (pptx)Similarity Measures (pptx)
Similarity Measures (pptx)JackDi2
 
Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?
Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?
Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?Thomas Hjelde Thoresen
 
Introduction to random forest and gradient boosting methods a lecture
Introduction to random forest and gradient boosting methods   a lectureIntroduction to random forest and gradient boosting methods   a lecture
Introduction to random forest and gradient boosting methods a lectureShreyas S K
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extractionskylian
 

Mais procurados (20)

Introduction to Random Forest
Introduction to Random Forest Introduction to Random Forest
Introduction to Random Forest
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
IMAGE SEGMENTATION.
IMAGE SEGMENTATION.IMAGE SEGMENTATION.
IMAGE SEGMENTATION.
 
Data mining PPT
Data mining PPTData mining PPT
Data mining PPT
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
12. Random Forest
12. Random Forest12. Random Forest
12. Random Forest
 
Ibm based mdm poc
Ibm based mdm pocIbm based mdm poc
Ibm based mdm poc
 
L14. Anomaly Detection
L14. Anomaly DetectionL14. Anomaly Detection
L14. Anomaly Detection
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 
Enhanced Deep Residual Networks for Single Image Super-Resolution
Enhanced Deep Residual Networks for Single Image Super-ResolutionEnhanced Deep Residual Networks for Single Image Super-Resolution
Enhanced Deep Residual Networks for Single Image Super-Resolution
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
Random forest
Random forestRandom forest
Random forest
 
Lean Master Data Management
Lean Master Data ManagementLean Master Data Management
Lean Master Data Management
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Similarity Measures (pptx)
Similarity Measures (pptx)Similarity Measures (pptx)
Similarity Measures (pptx)
 
Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?
Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?
Temporal Convolutional Networks - Dethroning RNN's for sequence modelling?
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
Introduction to random forest and gradient boosting methods a lecture
Introduction to random forest and gradient boosting methods   a lectureIntroduction to random forest and gradient boosting methods   a lecture
Introduction to random forest and gradient boosting methods a lecture
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extraction
 
Pca ppt
Pca pptPca ppt
Pca ppt
 

Destaque

Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learningSri Ambati
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2OSri Ambati
 
Q trade presentation
Q trade presentationQ trade presentation
Q trade presentationewig123
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3arogozhnikov
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionSeonho Park
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodHonglin Yu
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217Sri Ambati
 
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsGilles Louppe
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewAdam Pah
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesPier Luca Lanzi
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandrySri Ambati
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleSajith Edirisinghe
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning Shiraz316
 
Cybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith BarthurCybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith BarthurSri Ambati
 

Destaque (20)

Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learning
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2O
 
Q trade presentation
Q trade presentationQ trade presentation
Q trade presentation
 
Tree advanced
Tree advancedTree advanced
Tree advanced
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning Method
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217
 
L4. Ensembles of Decision Trees
L4. Ensembles of Decision TreesL4. Ensembles of Decision Trees
L4. Ensembles of Decision Trees
 
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random Forests
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overview
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers Ensembles
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark Landry
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - Kaggle
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning
 
Cybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith BarthurCybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith Barthur
 
ISAX
ISAXISAX
ISAX
 

Mais de Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Mais de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Último (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)

  • 1. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Ensemble Learners • Classification Trees (Done) • Bagging: Averaging Trees • Random Forests: Cleverer Averaging of Trees • Boosting: Cleverest Averaging of Trees • Details in chaps 10, 15 and 16 Methods for dramatically improving the performance of weak learners such as Trees. Classification trees are adaptive and robust, but do not make good prediction models. The techniques discussed here enhance their performance considerably. 227
  • 2. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Properties of Trees • [ ]Can handle huge datasets • [ ]Can handle mixed predictors—quantitative and qualitative • [ ]Easily ignore redundant variables • [ ]Handle missing data elegantly through surrogate splits • [ ]Small trees are easy to interpret • ["]Large trees are hard to interpret • ["]Prediction performance is often poor 228
  • 3. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting email 600/1536 ch$<0.0555 ch$>0.0555 email 280/1177 spam 48/359 remove<0.06 hp<0.405 hp>0.405 remove>0.06 email 180/1065 spam ch!<0.191 spam george<0.005 george>0.005 email email spam 80/652 0/209 36/123 16/81 free<0.065 free>0.065 email email email spam 77/423 3/229 16/94 9/29 CAPMAX<10.5 business<0.145 CAPMAX>10.5 business>0.145 email email email spam 20/238 57/185 14/89 3/5 receive<0.125 edu<0.045 receive>0.125 edu>0.045 email spam email email 19/236 1/2 48/113 9/72 our<1.2 our>1.2 email spam 37/101 1/12 0/3 CAPAVE<2.7505 CAPAVE>2.7505 email hp<0.03 hp>0.03 email 6/109 email 100/204 80/861 0/22 george<0.15 CAPAVE<2.907 george>0.15 CAPAVE>2.907 ch!>0.191 email email spam 26/337 9/112 spam 19/110 spam 7/227 1999<0.58 1999>0.58 spam 18/109 email 0/1 229
  • 4. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting ROC curve for pruned tree on SPAM data 0.8 1.0 SPAM Data TREE − Error: 8.7% 0.0 0.2 0.4 Sensitivity 0.6 o 0.0 0.2 0.4 0.6 Specificity 0.8 1.0 Overall error rate on test data: 8.7%. ROC curve obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0 . Sensitivity: proportion of true spam identified Specificity: proportion of true email identified. We may want specificity to be high, and suffer some spam: Specificity : 95% =⇒ Sensitivity : 79% 230
• 5. Toy Classification Problem (Bayes error rate: 0.25)
  [Figure: scatterplot of the two classes in the (X1, X2) plane, with the Bayes decision boundary drawn in black.]
  • Data X and Y, with Y taking values +1 or -1. Here X ∈ R^2.
  • The black boundary is the Bayes decision boundary, the best one can do; here its error rate is 25%.
  • Goal: given N training pairs (Xi, Yi), produce a classifier Ĉ(X) ∈ {-1, +1}.
  • Also estimate the probability of the class labels, Pr(Y = +1 | X).
• 6. Toy Example: No Noise (Bayes error rate: 0)
  [Figure: scatterplot of the two classes, which are perfectly separated by a circular boundary in the (X1, X2) plane.]
  • Deterministic problem; the noise comes from the sampling distribution of X.
  • Use a training sample of size 200.
  • Here the Bayes error is 0%.
• 7. Classification Tree
  [Figure: classification tree fit to the no-noise toy example; the root split is x.2 < -1.06711 (94/200 misclassified at the root), followed by further splits on x.1 and x.2, giving an eight-node tree.]
• 8. Decision Boundary: Tree (error rate: 0.073)
  [Figure: the axis-parallel decision boundary of the eight-node tree overlaid on the training data.]
  • The eight-node tree is boxy, and has a test-error rate of 7.3%.
  • When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
• 9. Model Averaging
  Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
  • Bagging (Breiman, 1996): fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
  • Boosting (Freund & Schapire, 1996): fit many large or small trees to reweighted versions of the training data, and classify by weighted majority vote.
  • Random Forests (Breiman, 1999): a fancier version of bagging.
  In general: Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.
• 10. Bagging
  • Bagging, or bootstrap aggregation, averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see p. 282.
  • Suppose C(S, x) is a fitted classifier, such as a tree, fit to our training data S, producing a predicted class label at input point x.
  • To bag C, we draw bootstrap samples S*1, ..., S*B, each of size N, with replacement from the training data. Then C_bag(x) = Majority Vote {C(S*b, x)}_{b=1}^{B}.
  • Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.
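As a concrete illustration of the recipe above, here is a minimal R sketch of bagging classification trees with rpart. The data frame train (with a factor response y), the test frame test, and the choice B = 25 are hypothetical stand-ins, not the slides' settings.

```r
## Bagging sketch: B bootstrap trees, combined by majority vote.
library(rpart)

set.seed(1)
B <- 25
N <- nrow(train)
votes <- replicate(B, {
  idx  <- sample(N, N, replace = TRUE)                   # bootstrap sample S*b
  tree <- rpart(y ~ ., data = train[idx, ], method = "class",
                control = rpart.control(cp = 0))         # grow a large tree
  as.character(predict(tree, newdata = test, type = "class"))
})
## Majority vote over the B trees for each test observation
c_bag <- apply(votes, 1, function(v) names(which.max(table(v))))
```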
• 11. [Figure: the original tree and trees grown on eleven bootstrap samples (b = 1, ..., 11); each bootstrap tree splits on a slightly different variable and cut point, e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985, x.4 < -1.36.]
• 12. Decision Boundary: Bagging (error rate: 0.032)
  [Figure: the bagged decision boundary overlaid on the training data; it is much smoother than the single-tree boundary.]
  • Bagging reduces error by variance reduction.
  • Bagging averages many trees, and produces smoother decision boundaries.
  • Bias is slightly increased because bagged trees are slightly shallower.
• 13. Random Forests
  • A refinement of bagged trees; very popular (chapter 15).
  • At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = sqrt(p) or log2(p), where p is the number of features.
  • For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate.
  • Random forests try to improve on bagging by "de-correlating" the trees, which reduces the variance.
  • Each tree has the same expectation, so increasing the number of trees does not alter the bias of bagging or random forests.
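In practice the whole procedure above is a single call to the randomForest package mentioned on the software slide. A minimal sketch, assuming a hypothetical training frame spam_train with factor response type:

```r
## Random forest sketch: mtry plays the role of m, and the OOB error comes for free.
library(randomForest)

p  <- ncol(spam_train) - 1
rf <- randomForest(type ~ ., data = spam_train,
                   ntree = 500,             # number of bootstrap trees
                   mtry  = floor(sqrt(p)),  # m variables tried at each split
                   importance = TRUE)
rf$err.rate[rf$ntree, "OOB"]                # out-of-bag error estimate
```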
• 14. Wisdom of Crowds
  [Figure: expected number correct out of 10 for the consensus (orange) and for an individual (teal), as a function of p, the probability that an informed person is correct.]
  • 10 movie categories, 4 choices in each category.
  • 50 people; for each question, 15 are informed and 35 are guessers.
  • For informed people, Pr(correct) = p and Pr(each incorrect choice) = (1 - p)/3.
  • For guessers, Pr(each choice) = 1/4.
  Consensus (orange) improves over individuals (teal) even for moderate p.
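The consensus curve on this slide can be approximated by simulation. The sketch below is my own illustration of the stated setup (answer 1 plays the role of the correct choice), not code from the slides.

```r
## Wisdom-of-crowds simulation: 10 questions, 4 choices, 15 informed + 35 guessers.
simulate_crowd <- function(p, n_questions = 10, n_informed = 15, n_guess = 35) {
  consensus_correct <- replicate(n_questions, {
    informed <- sample(1:4, n_informed, replace = TRUE,
                       prob = c(p, rep((1 - p) / 3, 3)))  # choice 1 is "correct"
    guessers <- sample(1:4, n_guess, replace = TRUE)      # uniform guessing
    which.max(tabulate(c(informed, guessers), nbins = 4)) == 1  # ties -> first max
  })
  sum(consensus_correct)
}
mean(replicate(2000, simulate_crowd(p = 0.6)))   # expected correct out of 10
```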
• 15. ROC curve for TREE, SVM and Random Forest on SPAM data
  [Figure: ROC curves (sensitivity vs. specificity) for the three classifiers. Random Forest error: 4.75%; SVM error: 6.7%; TREE error: 8.7%.]
  Random Forest dominates both other methods on the SPAM data, with 4.75% error. Used 500 trees with default settings for the randomForest package in R.
• 16. RF: Decorrelation
  [Figure: correlation between trees as a function of m, the number of randomly selected splitting variables.]
  • Var f̂_RF(x) = ρ(x) σ^2(x).
  • The correlation arises because we are growing trees on the same data.
  • The correlation decreases as m decreases; m is the number of variables sampled at each split.
  Based on a simple experiment with random-forest regression models.
• 17. Random Forest Ensemble: Bias and Variance
  [Figure: mean squared error, squared bias and variance of the random-forest ensemble as a function of m.]
  • The smaller m, the lower the variance of the random-forest ensemble.
  • Small m is also associated with higher bias, because important variables can be missed by the sampling.
• 18. Boosting
  [Figure: schematic of boosting; classifiers C1(x), C2(x), ..., CM(x) are fit in sequence, the first to the training sample and each subsequent one to a reweighted sample.]
  • Breakthrough invention of Freund & Schapire (1997).
  • Average many trees, each grown to reweighted versions of the training data.
  • Weighting decorrelates the trees, by focussing on regions missed by past trees.
  • The final classifier is a weighted average of classifiers: C(x) = sign[ Σ_{m=1}^{M} α_m C_m(x) ]. Note: C_m(x) ∈ {-1, +1}.
• 19. Boosting vs Bagging
  [Figure: test error vs. number of terms for Bagging and AdaBoost, using 100-node trees.]
  • 2000 points from the nested-spheres problem in R^10.
  • Bayes error rate is 0%.
  • Trees are grown best-first without pruning.
  • The leftmost term is a single tree.
• 20. AdaBoost (Freund & Schapire, 1996)
  1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N.
  2. For m = 1 to M repeat steps (a)–(d):
     (a) Fit a classifier C_m(x) to the training data using weights w_i.
     (b) Compute the weighted error of the newest tree, err_m = Σ_{i=1}^{N} w_i I(y_i ≠ C_m(x_i)) / Σ_{i=1}^{N} w_i.
     (c) Compute α_m = log[(1 - err_m)/err_m].
     (d) Update the weights for i = 1, ..., N: w_i ← w_i · exp[α_m · I(y_i ≠ C_m(x_i))], and renormalize so the w_i sum to 1.
  3. Output C(x) = sign[ Σ_{m=1}^{M} α_m C_m(x) ].
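For concreteness, here is a compact R sketch that follows steps 1–3 above, using rpart stumps (maxdepth = 1) as the base classifiers. The function names and tuning values are illustrative; labels y are assumed coded -1/+1 and the predictors sit in a data frame x.

```r
## AdaBoost-with-stumps sketch (illustrative, not the slides' code).
library(rpart)

adaboost_stumps <- function(x, y, M = 100) {
  N <- length(y); w <- rep(1 / N, N)
  trees <- vector("list", M); alpha <- numeric(M)
  stump_ctrl <- rpart.control(maxdepth = 1, cp = 0, minsplit = 2, minbucket = 1)
  for (m in 1:M) {
    fit  <- rpart(factor(y) ~ ., data = x, weights = w,
                  method = "class", control = stump_ctrl)       # step (a)
    pred <- ifelse(predict(fit, x, type = "class") == "1", 1, -1)
    err  <- sum(w * (pred != y)) / sum(w)                        # step (b)
    alpha[m] <- log((1 - err) / err)                             # step (c)
    w <- w * exp(alpha[m] * (pred != y)); w <- w / sum(w)        # step (d)
    trees[[m]] <- fit
  }
  list(trees = trees, alpha = alpha)
}

## Final classifier C(x) = sign( sum_m alpha_m C_m(x) )
predict_adaboost <- function(fit, newdata) {
  scores <- sapply(seq_along(fit$trees), function(m)
    fit$alpha[m] * ifelse(predict(fit$trees[[m]], newdata, type = "class") == "1", 1, -1))
  sign(rowSums(scores))
}
```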
• 21. Boosting Stumps
  [Figure: test error vs. boosting iterations; horizontal reference lines mark a single stump and a 400-node tree.]
  • A stump is a two-node tree, after a single split.
  • Boosting stumps works remarkably well on the nested-spheres problem.
• 22. Training Error
  [Figure: training and test error vs. number of terms.]
  • Nested spheres in 10 dimensions; Bayes error is 0%.
  • Boosting drives the training error to zero (green curve).
  • Further iterations continue to improve the test error in many examples (red curve).
• 23. Noisy Problems
  [Figure: training and test error vs. number of terms; a horizontal line marks the Bayes error.]
  • Nested Gaussians in 10 dimensions; Bayes error is 25%.
  • Boosting with stumps.
  • Here the test error does increase, but quite slowly.
• 24. Stagewise Additive Modeling
  Boosting builds an additive model F(x) = Σ_{m=1}^{M} β_m b(x; γ_m). Here b(x; γ_m) is a tree, and γ_m parametrizes the splits. We do things like that in statistics all the time!
  • GAMs: F(x) = Σ_j f_j(x_j)
  • Basis expansions: F(x) = Σ_{m=1}^{M} θ_m h_m(x)
  Traditionally the parameters f_m, θ_m are fit jointly (i.e. by least squares or maximum likelihood). With boosting, the parameters (β_m, γ_m) are fit in a stagewise fashion. This slows the process down, and overfits less quickly.
• 25. Example: Least Squares Boosting
  1. Start with function F_0(x) = 0 and residual r = y; set m = 0.
  2. m ← m + 1.
  3. Fit a CART regression tree to r, giving g(x).
  4. Set f_m(x) ← ε · g(x).
  5. Set F_m(x) ← F_{m-1}(x) + f_m(x) and r ← r - f_m(x); repeat steps 2–5 many times.
  • g(x) is typically a shallow tree, e.g. constrained to have k = 3 splits only (depth).
  • Stagewise fitting (no backfitting!) slows down the (over-)fitting process.
  • 0 < ε ≤ 1 is a shrinkage parameter; it slows overfitting even further.
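A minimal R sketch of this loop, using shallow rpart regression trees as g(x); the data frame x, the numeric response y, and the values of M, eps and the tree depth are hypothetical choices, not the slides'.

```r
## Least squares boosting sketch with shallow regression trees.
library(rpart)

ls_boost <- function(x, y, M = 500, eps = 0.1, depth = 3) {
  r    <- y                                 # F_0(x) = 0, so the residual starts at y
  Fhat <- rep(0, length(y))
  for (m in 1:M) {
    g <- rpart(r ~ ., data = x,
               control = rpart.control(maxdepth = depth, cp = 0))  # shallow tree
    fm   <- eps * predict(g, x)             # f_m(x) <- eps * g(x)
    Fhat <- Fhat + fm                       # F_m <- F_{m-1} + f_m
    r    <- r - fm                          # update the residual
  }
  Fhat                                      # fitted values F_M(x_i) on the training data
}
```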
• 26. Least Squares Boosting and ε
  [Figure: prediction error vs. number of steps m, for a single tree, ε = 1 and ε = 0.01.]
  • If instead of trees we use linear regression terms, boosting with small ε is very similar to the lasso!
• 27. AdaBoost: Stagewise Modeling
  • AdaBoost builds an additive logistic regression model F(x) = log[ Pr(Y = 1 | x) / Pr(Y = -1 | x) ] = Σ_{m=1}^{M} α_m f_m(x), by stagewise fitting using the loss function L(y, F(x)) = exp(-y F(x)).
  • Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See p. 343.
• 28. Why Exponential Loss?
  [Figure: loss as a function of the margin y·F for the misclassification, exponential, binomial deviance, squared error and support vector losses.]
  • Leads to a simple reweighting scheme.
  • e^{-yF(x)} is a monotone, smooth upper bound on the misclassification loss at x.
  • y·F is the margin; a positive margin means correct classification.
  • Has the logit transform as its population minimizer: F*(x) = (1/2) log[ Pr(Y = 1 | x) / Pr(Y = -1 | x) ].
  • Other loss functions, like binomial deviance, are more robust.
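For completeness, here is a short derivation (standard, though not shown on the slide) of why the exponential loss has the logit as its population minimizer: minimize the conditional expected loss pointwise in F(x).

```latex
\begin{aligned}
E\!\left[e^{-YF(x)} \mid x\right]
  &= \Pr(Y=1\mid x)\,e^{-F(x)} + \Pr(Y=-1\mid x)\,e^{F(x)},\\
\frac{\partial}{\partial F(x)} E\!\left[e^{-YF(x)} \mid x\right]
  &= -\Pr(Y=1\mid x)\,e^{-F(x)} + \Pr(Y=-1\mid x)\,e^{F(x)} \;=\; 0\\
\Longrightarrow\quad
F^{*}(x) &= \tfrac{1}{2}\,\log\frac{\Pr(Y=1\mid x)}{\Pr(Y=-1\mid x)}.
\end{aligned}
```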
• 29. General Boosting Algorithms
  The boosting paradigm can be extended to general loss functions, not only squared error or exponential.
  1. Initialize F_0(x) = 0.
  2. For m = 1 to M:
     (a) Compute (β_m, γ_m) = argmin_{β,γ} Σ_{i=1}^{N} L(y_i, F_{m-1}(x_i) + β b(x_i; γ)).
     (b) Set F_m(x) = F_{m-1}(x) + β_m b(x; γ_m).
  Sometimes we replace step (b) by
     (b*) Set F_m(x) = F_{m-1}(x) + ε β_m b(x; γ_m).
  Here ε is a shrinkage factor; in the gbm package in R, the default is ε = 0.001. Shrinkage slows the stagewise model-building even more, and typically leads to better performance.
• 30. Gradient Boosting
  • A general boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
  • Gradient Boosting builds additive tree models, for example for representing the logits in logistic regression.
  • Tree size is a parameter that determines the order of interaction (next slide).
  • Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
  • Gradient Boosting is described in detail in section 10.10, and is implemented in Greg Ridgeway's gbm package in R.
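A minimal sketch of such a fit with the gbm package, in the spirit of the spam example on the following slides; spam_train and spam_test with a 0/1 response is_spam are hypothetical stand-ins, and the tuning values are illustrative rather than the slides' exact settings.

```r
## Gradient boosting sketch with gbm: an additive logistic (bernoulli) model.
library(gbm)

fit <- gbm(is_spam ~ ., data = spam_train,
           distribution = "bernoulli",    # logistic loss -> additive logit model
           n.trees = 2500,                # number of boosting iterations M
           interaction.depth = 5,         # tree size / interaction order
           shrinkage = 0.01,              # the shrinkage factor eps
           cv.folds = 5)                  # cross-validation to choose M

best_M <- gbm.perf(fit, method = "cv")    # estimated best number of trees
phat   <- predict(fit, newdata = spam_test, n.trees = best_M,
                  type = "response")      # fitted Pr(spam | x)
```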
• 31. Spam Data
  [Figure: test error vs. number of trees for Bagging, Random Forest and Gradient Boosting (5-node trees).]
• 32. Spam Example Results
  With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model f(x) = log[ Pr(spam | x) / Pr(email | x) ] using trees with J = 6 terminal nodes. Gradient Boosting achieves a test error of 4.5%, compared to 5.3% for an additive GAM, 4.75% for Random Forests, and 8.7% for CART.
• 33. Spam: Variable Importance
  [Figure: relative importance (0–100) of the predictors; the most important are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george and CAPTOT, while variables such as 3d, addresses, labs, telnet, 857 and 415 matter least.]
• 34. Spam: Partial Dependence
  [Figure: partial dependence of the fitted log-odds of spam on the four predictors !, remove, edu and hp.]
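Continuing the hypothetical gbm sketch given after slide 30, the importance and partial-dependence displays of the last two slides correspond to summary() and plot() on the fitted object; the predictor name "remove" is assumed to be a column of the hypothetical spam_train.

```r
## Variable importance and partial dependence for the hypothetical gbm fit above.
summary(fit, n.trees = best_M)                   # relative influence of each predictor
plot(fit, i.var = "remove", n.trees = best_M)    # partial dependence on one predictor
```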
• 35. California Housing Data
  [Figure: test average absolute error vs. number of trees for RF (m = 2), RF (m = 6), GBM (depth = 4) and GBM (depth = 6).]
• 36. Tree Size
  [Figure: test error vs. number of terms for AdaBoost with stumps, 10-node and 100-node trees.]
  The tree size J determines the interaction order of the model:
  η(X) = Σ_j η_j(X_j) + Σ_{jk} η_{jk}(X_j, X_k) + Σ_{jkl} η_{jkl}(X_j, X_k, X_l) + ···
• 37. Stumps Win!
  Since the true decision boundary is the surface of a sphere, the function that describes it has the form f(X) = X1^2 + X2^2 + ... + Xp^2 - c = 0. Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions.
  [Figure: coordinate functions f1(x1), f2(x2), ..., f10(x10) for the additive logistic trees.]
• 38. Learning Ensembles
  Ensemble learning consists of two steps:
  1. Construction of a dictionary D = {T1(X), T2(X), ..., TM(X)} of basis elements (weak learners) T_m(X).
  2. Fitting a model f(X) = Σ_{m∈D} α_m T_m(X).
  Simple examples of ensembles:
  • Linear regression: the ensemble consists of the coordinate functions T_m(X) = X_m; the fitting is done by least squares.
  • Random Forests: the ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split; the fitting simply averages.
  • Gradient Boosting: the ensemble is grown in an adaptive fashion, but then simply averaged at the end.
• 39. Learning Ensembles and the Lasso
  Proposed by Popescu and Friedman (2004): fit the model f(X) = Σ_{m∈D} α_m T_m(X) in step (2) using lasso regularization:
  α̂(λ) = argmin_α Σ_{i=1}^{N} L[y_i, α_0 + Σ_{m=1}^{M} α_m T_m(x_i)] + λ Σ_{m=1}^{M} |α_m|.
  Advantages:
  • Ensembles are often large, involving thousands of small trees; the post-processing selects a smaller subset of these and combines them efficiently.
  • Often the post-processing improves the process that generated the ensemble.
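A rough sketch of this post-processing idea in R: treat each tree of a fitted regression forest as a basis function T_m(x) and fit the α_m with the lasso via glmnet. The objects rf, x and y are hypothetical, and this is only one convenient way to extract a per-tree basis, not the authors' exact procedure.

```r
## Lasso post-processing of a random-forest ensemble (illustrative sketch).
library(randomForest)
library(glmnet)

## N x M matrix of per-tree predictions T_m(x_i) from a fitted regression forest
T_train <- predict(rf, newdata = x, predict.all = TRUE)$individual

cvfit <- cv.glmnet(T_train, y, alpha = 1)   # lasso path with CV over lambda
coef(cvfit, s = "lambda.min")               # many alpha_m are exactly zero
```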
• 40. Post-Processing Random Forest Ensembles
  [Figure: test error vs. number of trees on the Spam data for Random Forest, Random Forest (5%, 6) and Gradient Boosting (5-node trees).]
• 41. Software
  • R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
    – randomForest: implementation of Leo Breiman's algorithms.
    – rpart: Terry Therneau's implementation of classification and regression trees.
    – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
  • Salford Systems: commercial implementation of trees, random forests and gradient boosting.
  • Splus (Insightful): commercial version of S.
  • Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.