Salford Systems Webex Training
Salford Systems
http://www.salford-systems.com
CART® Decision Tree Basics
• We start with a simple analysis of some market research
data using CART
• This introduction assumes no background in data mining or
predictive analytics
• We do assume you have had some experience reviewing
data with the purpose of discovering interesting and/or
predictive patterns
Beginning with CART
• CART is the perfect place to start learning about data mining
• Widely regarded as one of the most important tools in data
mining and also the easiest to understand and master
– Decision trees are still the most popular data analysis tool among
experienced data miners
• Delivers easy to understand analyses of complex data
– Allows for very sophisticated analyses especially when a structured
series of trees are developed
– Effective Exploratory Data Analysis (EDA) to support more
conventional modeling (e.g., logistic regression)
Classification with CART®
Real world study early 1990s
• Fixed line service provider offering a
new mobile phone service
• Wants to identify customers most
likely to accept new mobile offer
• Data set based on limited market trial
• 830 usable records
• 67 attributes and target including
– Demographics
– Attitudes and Needs
– Pricing for handset & minutes
Mobile Phone Offer
• Data is a sample of land line telephone customers of a
European telco
• At the time mobile phones were very rare in the country in
question
• The Company realized the time was right to introduce
mobile phones on a substantial scale to their existing fixed
line customer base
• Key questions:
– WHO to target with the marketing campaigns for the new product
– HOW MUCH to charge for the handset
Nature of the Research
• Company arranged to make real world offers to about 1,000
existing land line customers
• Everyone was presented the same offer (only one model of
phone and one service plan available)
• The PRICE of the handset was varied randomly over a large
range of prices from near zero to about $300
• Goal was to learn who responded positively and at what
price points
• Offers were made in person as part of a one hour visit in
which much was learned about the household (media
preferences, number of children, distance to work, etc)
• Target variable RESPONSE: Coded 0 or 1 (NO, YES)
• 65 available predictors include variables like:
Nature of the Data
HANDPRIC Cost of handset (one time fee)
USEPRICE Usage cost (per month, 100 minutes)
TELEBILC Landline home phone bill average
CITY Resident in which of 5 major cities
AGE Coded in 5 year increments
HOUSIZ Possible proxy for income, coded 1-6
SEX Male, Female, Unknown
EDUCATN Coded 1-7 (through postgrad)
Analysis File Overview in CART 6.0
Set Up the Model
(Select Target, allowable predictors)
Only requirement is to select TARGET (dependent) variable. CART will do everything
else automatically
CART:
Does its own variable selection
• Embedded variable (feature) selection means that modeler
can let the software make its own choice of predictors
• Modeler will often want to limit the model to focus on
selected inputs
– Exclude ID variables and merge keys
– Exclude clones of the dependent variable
– Exclude data pertaining to the future (relative to the dependent)
– E.g. restrict a model to easily available predictors
– Test predictive power of purchased external data
• Modeling automation can allow exploration of a vast space
of pre-selected predictors (see later slides)
In the example we run the CART model
• CART completes analysis and gives access to all results
from the NAVIGATOR
– Shown on next slide
• Upper section displays tree of a selected size
– number of terminal nodes
• Lower section displays error rate for trees of all possible
sizes
• Green bar marks most accurate tree
• We display a compact 10 node tree for further scrutiny
CART Model Viewer
Access reports and drill into model details
Most accurate tree is marked with the green bar. Above we select the 10 node tree
for convenience of a more compact display. Note the train/test area under the ROC curve
Root Node: Hover Mouse
Tree starts with all training data
Displays details of TARGET variable in overall training data
Above we see that 15.2% of the 830 households accepted the offer
Goal of the analysis is now to extract patterns characteristic of responders
Goal is to split node: separate responders
• Details of root node split
• If we could only use a single piece of information to separate responders from
non-responders CART chooses the HANDSET PRICE
• Those offered the phone with a price > 130 contain only 9.9% responders
• Those offered a lower price respond at 21.9%
CART Splitting Rules
• We discuss the details later
• Here we just point out that the split CART displays is
– "the best of all possible splits"
• Subject to the splitting criteria you have chosen and any
constraints imposed
• How do we know this split is "best"?
• Because CART actually tries all possible splits looking for
the best
– Exhaustive brute force search
– Advanced algorithms used to make this search fast
– As much as 100 times faster than other decision trees
Grow progressively bigger tree: One split at a time
• Binary recursive partitioning repeated until further splitting
impossible (e.g. data exhausted)
• This leads us to the largest possible or "maximal" tree
Maximal tree is raw material for best model
• Goal is to find optimal tree embedded inside maximal tree
• Will find optimal tree via "pruning"
• Like backwards stepwise regression
• Challenge: A tree with 100 terminal nodes can be pruned back to 99
terminal nodes by eliminating any one of the 99 penultimate nodes
• Now the 99 new terminal nodes can be cut back to 98 by eliminating
any one of the surviving 98 penultimate nodes
• Something like 99! possible trees. How do we find the best?
Pruning Sequence
• CART automatically generates a pruning sequence which
develops a preferred sequence of progressively smaller
trees
• We can prove that for a given tree size the CART tree in the
sequence will be the best performing tree of all possible
trees of that size
• In our sequence, the 10 node tree is guaranteed to be more
accurate than any other 10 node tree you could extract from
the maximal tree
• You as the user never need to worry about this
• "Better" is defined in terms of performance on the training
data as we need the tree sequence before we can test
Error Curve: Plots Accuracy vs Model Size
• Requires test data
• Can use cross-validation (sample reuse) if data is scarce
• Curve typically U-shaped
• Too small is not good and neither is too large
• Can look at any tree in the sequence of pruned subtrees
• Error is what BFOS call an "honest" estimate of model performance
Pick a modest sized tree to examine
Note high response in this RED colored node
Response of 38.5% in this segment vs. 15.2% overall
Lift = 2.53
Navigator allows access to all model info
• The terminal nodes are color coded to represent results
– RED nodes are "hot" and contain high concentrations of the class of
interest (buyers)
– BLUE nodes are "cold" and contain very low concentrations of the
class of interest
– PINK and WHITE nodes have moderate concentrations
• We first look to see if we have any RED nodes
– Explore any red nodes via mouse hover
• Then we drill down to see a tree schematic revealing the
main drivers of the tree
Select "Splitters" View
Selects a streamlined overview of the tree showing ONLY primary splitters
Model Overview: Main Drivers
(Red = Good Response, Blue = Poor Response)
High values of a split variable always go to the right; low values go left
Examine Extreme Right-most Terminal Node
• Hover mouse over node to see inside
• Even though this node is on the "high price" side of the tree it still
exhibits the strongest response across all terminal node
segments (43.5% response)
• Rules defining this node are shown on the next slide
Rules can be extracted in a variety of languages
• Here we select rules expressed in C for one node of interest
• Entire tree can also be rendered in Java, XML/PMML, or SAS
Continuing down the tree
• We note that even if the new product is offered at a high
price we can still find prospects very interested:
– Those that have a high average landline bill and own a pager
– This group displays greatest probability of response (43.5%)
Classic Detailed Tree Display
Analyst can select details to be displayed
Control Over Details Displayed in Nodes
At left an example
showing the class
bar chart is
displayed
Separate controls for
internal and terminal
nodes
Configure Print Image Interactively
Shrink to one page, include header/footer
Tree Performance Measures
and Principal Message
• In addition to the details of the tree (splits, split values), CART reports:
• Variable Importance Ranking
• Confusion Matrix (Prediction Success Matrix)
• Gains, ROC
Variable Importance Ranking
(Relative impact on outcomes)
Three major ways of computing variable importance are available. Above, the
default display.
Predictive Accuracy
(How often right, how often wrong)
This model is not very accurate but ranks responders well
Gains Curve
In the top decile the model captures about 23% of responders
Performance Evaluation: ROC Curve
Observations on CART Tree
Contrasts with Conventional Stats
• CART leverages the rank order of a predictor to split
– Transforming predictor X into Log(X) will not change the tree
– Of course the tree will be expressed in terms of Log(X) but this will not
change the location of the split
– The traditional statistician's experiments with alternative transforms are
unnecessary
• CART is immune to outliers in predictors
– Suppose X has values 1,2,3,…,100, 900
– To CART this is the same as 1,2,3,…,100, 101
– All CART "sees" is the rank order
• We will see later that CART has built-in missing value
handling
• So no worry about outliers, missing values, transformations
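The rank-order point is easy to demonstrate. Below is a tiny Python sketch with hypothetical ages (not from the webinar data): a monotone transform such as log() leaves every split partition unchanged, and the outlier affects nothing but the printed split value.

import math

ages = [20, 25, 30, 35, 60, 65, 70, 900]   # 900 is an extreme outlier
logs = [math.log(a) for a in ages]

# Any split "x <= c" produces the same left/right partition on both scales
# because log() preserves rank order; only the reported split value changes.
left_raw = [a for a in ages if a <= 35]
left_log = [x for x in logs if x <= math.log(35)]
print(len(left_raw), len(left_log))        # 4 4 -- identical partitions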
CART Methodology: Partition Data
Into Two Segments
• Partitioning line parallel to
an axis
• Root node split first
– Split at 2.450
– Isolates all the type 1 species from the rest of the sample
• This gives us two child
nodes
– One is a Terminal Node
with only type 1 species
– The other contains only
type 2 and 3
• Note: entire data set
divided into two separate
parts
Second Split: Partitions
Only Portion of the Data
• Again, partition with line
parallel to one of the two
axes
• CART selects
PETALWID to split this
NODE
– Split it at 1.75
– Gives a tree with a
misclassification rate of 4%
• Split applies only to a
single partition of the
data
• Each partition is
analyzed separately
Discriminant Analysis Uses Oblique Lines
• Linear combinations are difficult to understand and explain
• CART does permit "oblique" splits based on linear combinations of small sets
of variables but this is rarely desirable
CART Representation of a Surface
Model clearly non-linear
Height of bar represents probability of response
Remaining axes represent values of two predictors
Greatest prob of response here in corner to the right
CART Splitting Process
• Standard splits are based on ONE predictor and the form of
a database RULE
• A data record goes left if
splitter_variable <= split value
• Examples: A data record goes left
• if AGE<=35
• if CREDIT_SCORE <= 700
• if TELEPHONE_BILL <= 50
Searching all splits facilitated by sorting
• On left we sort by TELEBILC, on right by TRAVTIMR
• Test smallest value first, then next smallest, etc., moving all the way down the column
• The arrow shows a split sending 10 cases to the left and all other data to the right
Example Root Node Split
Continuous Splitter
From our Euro_telco_mini.xls example
Split is TELEBILC <= 50
Alternative Split Points
What if we split the data at TELEBILC <= 25?
Note that the response rates of the two nodes under this split are very similar
They are quite different after splitting at the optimal value
Two splits separate quite differently
The first pane shows two segments with 14.3% and 15.5% response
The second pane shows two segments with 12.7% and 19.8%
Our goal in CART is to generate substantially different segments and we
accomplish this by experimenting with every possible split value for every predictor
CART Splitting Process: More
• Splitter variables need not be numeric, they can be text
• Splitter variables need not be ordered
• A data record goes left
• if CITY$ = "London" OR "Madrid" OR "Paris"
• if DIAGNOSIS = 111 OR 35 OR 9999
• CART considers all possible splits based on a categorical predictor
• Example: four regions - A, B, C, D can be split 7 ways (2^3 - 1 = 7)
• Each decision is a possible split of the node and each is evaluated
• Note: A on the left and B,C,D on the right is the same split as its mirror image A
on the right and B,C,D on the left
• So we only list one version of this split
– It is which cases stay together that matters not which side of the tree they are on
Splits on K-level categorical predictors
2^(K-1) - 1 ways to split
Left Right
1 A B, C, D
2 B A, C, D
3 C A, B, D
4 D A, B, C
5 A, B C, D
6 A, C B, D
7 A, D B, C
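The enumeration above can be reproduced with a few lines of Python. This is an illustrative sketch, not Salford code; fixing the first level on the left removes the mirror-image duplicates, which is why K levels give 2^(K-1) - 1 splits.

from itertools import combinations

def categorical_splits(levels):
    # Fix the first level on the left to avoid counting mirror images twice.
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = [first] + list(extra)
            right = [lv for lv in levels if lv not in left]
            if right:                      # both children must be non-empty
                splits.append((left, right))
    return splits

for left, right in categorical_splits(["A", "B", "C", "D"]):
    print(left, "|", right)                # prints the 7 splits listed above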
Categorical Split Caution:
Dangers of HLCs (High Level Categoricals)
• Because categorical variables generate 2^(K-1) - 1 ways to split
the data, high values of K can be problematic
• K=33 is not an unusually large number of levels yet allows
for about 4 billion ways to split the data
• When the number of possible splits exceeds the number of
records in the data the categorical variable has an
advantage over any continuous splitter
– A continuous variable with a unique value in every row of the data
gives us a choice of split points equal to the number of rows of data
• Later we will discuss several ways to deal with HLCs
including repackaging the high cardinality categoricals into
lower cardinality versions and penalties
Example Root Node Split
Categorical Splitter
From our Euro_telco_mini.xls example
Observe that we have to LIST the values that go to each child node
CART Competitor Splits
• The CART mechanism for splitting data is always the same
• We are given a block of data
– Could be all of our data and we are starting from scratch
– Could be a small part of our data obtained after already doing a lot of
slicing and dicing
• When we work with a block of data we do not take into
account how we got to that block of data
• We do not consider any information which might be
available outside of the block of data
• The block of data to be analyzed is our entire universe and
nothing else exists for us
Getting Ready to Split
• For a block of data to be split
– It must contain a sufficient number of data records (ATOM)
– We can tell CART what the minimum must be
– Default is just TWO records
– In large database analysis we might reasonably set the minimum
quite a bit higher
– ATOM values such as 10, 20, 50, 100, 200 have cropped up in our
practical work
• If you are working with a small database such as those
encountered in biomedical research (e.g. 200 records total) you
will want to allow the ATOM size to be small
• If you are working with hundreds of thousands or millions of
records there is no harm in trying a minimum size like 200
Still Getting Ready to Split
• If we have a classification problem such as modeling
response to a marketing offer where there are two outcomes
– Responded
– Did Not Respond
• To be splittable the block of data cannot be "pure", i.e.
composed of all responders or all non-responders
– True regardless of how large the block of data is
– Splitting is designed to separate the responders from the non-
responders so we need a mixture to have something to do
• The data records cannot all have exactly the same values
for the predictors
– CART will be looking for a useful difference in a predictor between
responders and non-responders
Observation on Dummy Variable Predictors
• If you split a node using a continuous variable there is
always the chance that this same variable is used again in a
subsequent split for descendent nodes
• Once a node is split with a dummy variable this variable can
never be used again in descendant nodes
– Because a descendant node will contain either all 0 or all 1 values for
this variable. Hence it cannot split.
• If a dummy variable is introduced into the tree below the root
it might appear in more than one location in the tree
– But one use will never be the ancestor of the other use
Making The Split
• To split the block of data (which we will henceforth refer to
as splitting the node) we search each available predictor
• For every predictor we make a trial split at every distinct
value of the predictor
• For each trial split we compute a goodness of split measure
normally referred to as the "improvement"
• For each predictor we find the split value that yields the best
improvement
• Once every predictor has been searched to find the best
split point we rank the splitters in descending order and then
use the best overall splitter to grow the tree
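A minimal Python sketch of this exhaustive search, using the Gini impurity as the goodness measure (illustrative only; the variable names and toy data are hypothetical, and production CART uses far faster algorithms):

def gini(labels):
    # Gini impurity of a list of 0/1 class labels
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 * p1 - (1.0 - p1) * (1.0 - p1)

def best_split(rows, target, predictors):
    # Try every distinct value of every predictor; keep the split with the
    # largest improvement (parent impurity minus weighted child impurity).
    parent = gini([r[target] for r in rows])
    best = None                            # (improvement, variable, value)
    for var in predictors:
        for value in sorted({r[var] for r in rows}):
            left  = [r[target] for r in rows if r[var] <= value]
            right = [r[target] for r in rows if r[var] >  value]
            if not left or not right:
                continue
            w = len(left) / len(rows)
            improvement = parent - w * gini(left) - (1 - w) * gini(right)
            if best is None or improvement > best[0]:
                best = (improvement, var, value)
    return best

rows = [{"AGE": 25, "BILL": 30, "RESPONSE": 1},
        {"AGE": 40, "BILL": 80, "RESPONSE": 0},
        {"AGE": 31, "BILL": 45, "RESPONSE": 1},
        {"AGE": 55, "BILL": 90, "RESPONSE": 0}]
print(best_split(rows, "RESPONSE", ["AGE", "BILL"]))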
Ranked List of Splitters
• The ranked list of splitters is also known as the competitor
list
• CART always computes the entire list as this is the only way
to know for sure which split is best
• To save space CART normally only displays the top 5
competitors within a node
– You can request a larger number in your options settings
• The root node at the top of the tree always displays the
complete list of competitors even if there are thousands of
predictors
Why Care about Competitor Splits?
• Useful to know if the best splitter is far better than all the rest
or only slightly better
• Useful to know which predictors show up near the top
– Are they very different from each other or are they all reflecting the same
underlying information?
• Useful to know if a strong but perhaps 2nd best predictor
splits the data more evenly than the best
– We might want to try FORCING that 2nd best predictor into the root to
see what happens
– Sometimes this yields an overall better tree
• Pattern of top splitters may reflect problems
– Top 3 competitors may all be "too good to be true" and we might
need to drop them all from the analysis
Surrogate Splits
• Surrogate splits were first introduced by the authors of
CART in their classic monograph Classification and
Regression Trees, 1984.
• Surrogate splits are mimics or substitutes for the primary
splitter of a node
• An ideal surrogate splits the data in exactly the same way as
the primary split
– The "association" measure reflects how close to perfect a given
surrogate is
Why Surrogates?
• Surrogates have two primary functions:
– To split data when the primary splitter is missing
– To reveal common patterns among predictors in a data set
• CART searches for surrogate splitters in every node in the
tree
– Surrogates are searched for even when there is no missing data
– No guarantee that useful surrogates can be found
– CART attempts to find at least five surrogates for every node but this
number can be modified
– Number of surrogates actually found normally varies from node to
node
CART and Missing Values in Deployment
• CART is the only learning machine that is prepared to deal
with any pattern of missing values in future data
• Even if the training data have no missings CART develops
strategies to deal with the eventuality of any variable or
variables being missing
• Some learning machines cannot handle missing values at all
• Other learning machines can only deal with missing value
patterns that they have been trained on (seen before)
– E.g., handle X5 = missing only if X5 was ever missing in the training
data
• CART has no such restrictions and is always ready for any
pattern of missings
Surrogates in Action:
Euro_telco_mini.xls
Remember to check off CITY, MARITAL and RESPONSE as "categorical"
Manually Prune Back to the 10-node tree
Just click on the blue curve in the lower panel to select a smaller, easier to manage
tree. Then double-click on the left child of the root node (see arrow above)
Look at the Left Child of the "Root"
The primary splitter predicting subscription to a new mobile phone offer is the
monthly telephone bill (TELEBILC), dividing the node into spenders of more or less
than $50 per month
Surrogate for TELEBILC
• If this variable were missing for any reason (database error,
person recently moved, new customer) we do not know
whether to move down the tree to the left or to the right
• Surrogate variable can be used in place of the missing
primary splitter. In this case the surrogate is of the form
go to the left if MARITAL=1
• Left is associated with LOW spending on the telephone bill
• CART suggests that single person households spend less
while households headed by married or divorced persons
spend more
Surrogates and Direction
• A surrogate is intended to be a substitute for the primary
splitter making similar left/right decisions
• But surrogates may work in the opposite direction so every
continuous variable surrogate is supplied with a "tag"
– The letter "s" after the split point stands for "standard"
– The letter "r" after the split point stands for "reverse"
• If a surrogate is negatively correlated with the primary
splitter then it will split in the reverse direction
– Categorical splitters are always organized so that the levels that
correspond to left in the primary splitter go left in the surrogate
Normally Surrogates Make Sense
• Our primary splitter is the average monthly spend of a
household on a fixed line telephone account
• Our surrogates include marital status, commute time to
work, age, and the city of residence
– Longer commutes are associated with larger spend on the phone
– Older head of household also is associated with larger spend
– We cannot interpret the CITY variable at this point because we don’t
know the identity of the cities
• In general surrogates help us understand the primary splitter
– Especially helpful in survey research
How to Compute Surrogates?
• This is a technical question which we will not cover here
– The CART monograph contains a wealth of technical information
although it can be a challenging read
• However, we will discuss the main ideas
• The top surrogate is
– A single variable
– A single split (in the same format as any primary splitter)
– Intended to mimic as closely as possible how data is partitioned by
the primary splitter into LEFT and RIGHT nodes
• To get a surrogate think of generating a one split CART tree
where the dependent variable is {LEFT or RIGHT} as
defined by the primary splitter. (There are many details)
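A conceptual Python sketch of that idea, assuming the simplified one-split-tree view described above (the real CART computation involves additional weighting details):

def best_surrogate(rows, primary_var, primary_value, candidate_vars):
    # The "target" is the left/right assignment made by the primary splitter.
    went_left = [r[primary_var] <= primary_value for r in rows]
    best = None                        # (matches, variable, value, direction)
    for var in candidate_vars:
        for value in {r[var] for r in rows}:
            for direction in ("s", "r"):          # standard or reverse
                if direction == "s":
                    pred = [r[var] <= value for r in rows]
                else:
                    pred = [r[var] > value for r in rows]
                matches = sum(p == w for p, w in zip(pred, went_left))
                if best is None or matches > best[0]:
                    best = (matches, var, value, direction)
    return best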
What is "Association"?
• Association is a measure of the strength of the surrogate
• The lowest possible reported score is 0 (useless)
• The highest possible score is 1 (perfect clone)
• CART starts from the default rule: if you don't know which way to send a
data record down a tree go with the majority (sometimes weighted majority)
• If when training the tree most cases went left then in the absence of
other information also go left
• The default makes mistakes of course because it always sends every
record to the same majority side
– Association measures how much better the surrogate is than the
default rule (percent reduction in errors made)
• Default rule is the "surrogate of last resort"
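As a sketch, association can be computed roughly like this (unweighted version; the monograph gives the exact details):

def association(went_left, surrogate_pred):
    # Percent reduction in misassignments relative to the default rule of
    # sending every record to the majority side.
    n = len(went_left)
    majority_left = sum(went_left) >= n / 2
    default_errors = sum(w != majority_left for w in went_left)
    surrogate_errors = sum(w != p for w, p in zip(went_left, surrogate_pred))
    if default_errors == 0:
        return 0.0                     # default rule is already perfect
    return (default_errors - surrogate_errors) / default_errors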
Competitors and Surrogates:
Different Objectives
Competitors yield the best possible split when using that variable
Surrogate yields the best possible mimic of the primary splitter and goodness of
split may be sacrificed to match some aspect of the primary splitter
Note that C2 is a competitor with one split point and a surrogate with a different
split point
Grow another tree on GB2000.XLS
• We prefer this data set because it has no missing values,
making working through examples much easier
• Don’t forget: CART always computes surrogates and in this
way the CART tree is always prepared for future missings
• We will not be trying to make sense of this tree
– will look just at the mechanics
• Note the root node splitter and the top surrogate
Root Node Split
Root Splitter:
M1 <= -.04645
Top Surrogate:
C2 <= -.10835
Main Splitter vs. Best Surrogate
              Main Splitter        Surrogate
              Left     Right       Left     Right
Class 1       672      328         626      374
Class 2       252      748         300      700
Total         924      1076        926      1074
Best Surrogate must closely match not only the record counts in the child nodes
but also the distribution of the target variable
Modeling ROOTSPLIT with CART
Observation: Modeling the root node split (we have to create a new variable
to reflect this) will not necessarily match the surrogate report
Other factors must be taken into account. Here we get the right variable but
not the right split point
Main Splitter VS Best Surrogate
Model Root Split As a Binary Target
              Main Splitter       Surrogate          Alternate
              Left     Right      Left     Right     Left     Right
Class 1       672      328        626      374       598      402
Class 2       252      748        300      700       288      712
Total         924      1076       926      1074      886      1114
Best Surrogate must closely match record counts in the child nodes and the
distribution of the target variable
Modeling root split on available predictors will not match surrogate exactly
Variable Importance in CART
• It is hard to imagine now but in 1984 when the CART
monograph was first published data analysts did not
generally rank variables
• Although researchers would informally pay attention to
t-statistics or p-values associated with the coefficients of
regressions, the practice of ranking predictors was frowned
upon
• Since the advent of modern data analytic methods
researchers expect to see a variable importance ranking for
all models
• It all started with CART!
CART concept of Variable Importance
• Variable importance is intended to measure how much work a
variable does in a particular tree
• Variable importance is thus tied to a specific model
• A variable might be most important in one model and not
important at all in a different model built on the same data
• The fact that a variable is important does not mean that we need
it! If we were deprived of the use of an important variable it might
be that other available variables could substitute for it or do the
same predictive work
• Variable Importance describes the role of a variable in a specific
tree
Variable Importance and Tree Size
• Every tree in the CART sequence has its own variable
importance list
• A small tree will typically have only a few important variables
• A large tree will typically have many more important
variables
– Because with more nodes there are more chances for more variables
to play a role in the tree
• Usually we focus on the tree CART has identified as optimal
but this should not deter you from selecting another (usually
smaller) tree
Splitter Improvement Scores
• Recall that every splitter (and every surrogate) has an associated
"improvement" score which measures how good a splitter is
• The improvement score for a splitter in a node is always scaled down by
the percent of data that actually pass through the node
• 100% of all data pass through the root node so the root node splitter is
always scaled by 100%
• But a child node of the root might have say 30% of the data pass through
– whatever improvement we compute for the split of that node will be multiplied by 0.30
• Splits lower in the tree have only a small fraction of full data passing
through so their adjusted improvement scores tend to be small
Variable Importance Computation
• To construct a variable importance score for a variable we
start by locating every node that variable split
• We add up all of the improvement scores generated by that
variable in those nodes
• Then we go through every node in which this variable acted as a
surrogate and add up all those improvement scores as well
• The grand total is the raw importance score
• After obtaining raw importance scores for every variable we
rescale the results so that the best score is always 100
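A sketch of the bookkeeping in Python, assuming we already have each node's splitter, surrogates, and node-weighted improvement scores (the dictionary layout here is hypothetical):

def variable_importance(nodes):
    # nodes: list of dicts such as
    #   {"splitter": "TELEBILC", "improvement": 0.063,
    #    "surrogates": [("MARITAL", 0.041), ("AGE", 0.020)]}
    # where improvements are already scaled by the fraction of data in the node.
    raw = {}
    for node in nodes:
        raw[node["splitter"]] = raw.get(node["splitter"], 0.0) + node["improvement"]
        for var, imp in node["surrogates"]:
            raw[var] = raw.get(var, 0.0) + imp
    top = max(raw.values())
    return {var: 100.0 * score / top for var, score in raw.items()}  # best = 100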
Variations on Importance Scores
• Breiman, Friedman, Olshen and Stone discuss one idea
they ultimately rejected:
– Including competitor improvement scores as well
• This turns out to be a bad idea because it leads to double-
counting
– If a variable is the 2nd best splitter in a node there is an excellent
chance that the same split will score well in the child nodes
– If we were to give the splitter credit in the parent node for being a
competitor we would probably end up giving the exact same split
credit again lower down in the tree
– Another way to think about this: a split is trying to enter the tree. If we
do not accept the split right away the same split may keep trying to
enter the tree lower down
– We only want to give this split credit once
BATTERY LOVO
• Leave One Variable Out (LOVO)
– Available in SPM PRO EX versions but you can accomplish the
process manually as well
• Take your best modeling set up including your preferred list
of predictors
• BATTERY LOVO runs a set of models that are identical to
your preferred set up except that one variable has been
excluded
• To be complete we run a "drop just one variable" model for
each variable in your KEEP list
• If you have 20 variables then BATTERY LOVO will run 20
models (each of which will have 19 predictors)
– Now rank the models from worst to best
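If the automated battery is unavailable, the manual equivalent is a simple loop (sketch; fit_and_score is a hypothetical stand-in for your model-fitting and test-evaluation routine):

def battery_lovo(fit_and_score, keep_list):
    # Refit once per variable, each time dropping that one variable, then
    # rank the variables by how much the test score deteriorates.
    results = []
    for var in keep_list:
        predictors = [v for v in keep_list if v != var]
        results.append((var, fit_and_score(predictors)))
    return sorted(results, key=lambda item: item[1])   # worst model first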
BATTERY LOVO Importance Ranking
• Using the LOVO procedure tests how much our model
deteriorates if we were to remove a given variable
• It is sensible to say that a variable is very important if losing
it damages the model substantially
• Conversely, if losing a variable does no harm then we could
conclude that the variable is useless
• CAUTION: the LOVO ranking could be quite different from
the CART internal ranking and both rankings are "right"
– CART measures how much work a variable actually does
– LOVO measures how much it hurts to lose a variable
Randomization Test
• Leo Breiman introduced yet another concept of variable
importance measure related to his work on tree ensembles
• Start with your test data
– Score this data with your preferred model to obtain baseline
performance
– Take the first predictor in the test data and randomly shuffle its
values in the column of data
– The values are unchanged but values are relocated to rows they do
not belong on
– Now score again. We would expect performance to drop because
one predictor has been damaged. Repeat say 100 times and
average the performance deterioration.
– Doing this for all variables will produce performance degradation
scores; the larger the score, the more important the variable
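A minimal sketch of the shuffle test in Python (score_fn is a hypothetical function returning test performance, e.g. area under the ROC curve, for the already-fitted model on a list of row dictionaries):

import random

def shuffle_importance(score_fn, test_rows, predictors, n_reps=100):
    baseline = score_fn(test_rows)
    drops = {}
    for var in predictors:
        original = [r[var] for r in test_rows]
        total = 0.0
        for _ in range(n_reps):
            shuffled = original[:]
            random.shuffle(shuffled)               # damage one predictor
            for row, value in zip(test_rows, shuffled):
                row[var] = value
            total += baseline - score_fn(test_rows)
        for row, value in zip(test_rows, original):   # restore the column
            row[var] = value
        drops[var] = total / n_reps                # average deterioration
    return drops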
Randomization Test
• As of December 2011 this test is only available from the
command line of recent versions of SPM
• After growing a CART tree and saving the grove issue these
commands from the command line or an SPM Notepad
SCORE VARIMP=YES NPREPS=100
• You may readily run with NPREPS=30 but the results are
more reliable with a larger number of replications
Results from Random Shuffling:
Baseline ROC=.85320
Rank Score ROC_After Variable
1 100 0.82144 M1
2 63.21 0.83312 RES
3 45.57 0.83873 LS
4 25.9 0.84498 CR
5 22.66 0.84601 C2
6 21.29 0.84644 BU
7 5.84 0.85135 DT
8 4.25 0.85185 A1
9 4.23 0.85186 PRE
10 3.49 0.85209 OC
11 3.18 0.85219 MAR
12 2.29 0.85248 YM
13 1.64 0.85268 LT
14 0 0.8532 DP
15 0 0.8532 TRA
16 0 0.8532 GEN
17 0 0.8532 A2
18 0 0.8532 B
19 0 0.8532 CP2
20 0 0.8532 CD2
21 0 0.8532 D1
22 0 0.8532 E
23 0 0.8532 M
24 0 0.8532 CH
25 0 0.8532 TY$
Which Importance Score Should I Use?
• The internal CART variable importance scores are the
easiest and the fastest to obtain and are a great starting
point
• LOVO scores are useful when your goal is to assess
whether you can live without a predictor
• Importance is a function of the OVERALL tree including
deepest nodes
• Suppose you grow a large exploratory tree — review
importances
• Then find an optimal tree via test set or CV yielding smaller
tree
• Optimal tree SAME as exploratory tree in the top nodes
• YET importances might be quite different.
• WHY? Because larger tree uses more nodes to compute the
importance
• When comparing results be sure to compare similar or same
sized trees
Variable Importance Caution
Train/Test Consistency Checks
• Unlike classical statistics data mining models generally do
not rely on training data to assess model quality
• In the SPM data mining suite we are always focused on test
data model performance
– This is the only way to reliably protect against overfitting
• Every modeling method including our classical statistical
models in SPM 7.0 offers test data performance measures
• Generally these measures are overall model performance
indicators
– Measures say nothing about internal model details
CART Tree Assessment
• CART uses test data performance of every tree in the back-
pruned sequence of progressively smaller trees to identify
the overall best performer on classification accuracy
• CART also notes which tree achieves the best test data
Area Under the ROC (AUROC) curve on the Navigator
What more can we do?
• CART performance measures have always been overall-tree
scores
• No specific attention is paid to node-specific performance
• However, in real world applications we often want to pay
close attention to individual nodes
– Might use the rank order of the nodes in important decisions
– Prefer to rely on nodes that are most accurate in their predictions of
event rates (response)
• Therefore we need an additional tool for assessing CART
tree performance at the node level
• Provided by the PRO EX feature we call TTC
– Train/Test Consistency checks
Use the GB2000.XLS data set
Model setup to select TARGET as the dependent variable
CART as the modeling method
On the TEST tab we opt for 50% randomly selected test partition
TTC in CART and SPM PRO EX
• The TTC report is available from the navigator which
displays for every CART model
– Look for the TTC button near the bottom of the navigator
• TTC relies on separate train and test data partitions which
means that TTC is not available when using cross-validation
TTC Display
Upper panel of TTC display contains one line in the table for every sized tree
Bottom row represents the 2 node tree. Top line is for largest tree grown
TTC: Select Target Class
In this case TARGET=2 represents BAD which is our focus class
You the modeler get to choose which class to focus on; there is no "right" class
TTC Upper Panel
Rank Match: Do the train and test samples rank order the nodes in the same way
(a statistical test allows for insignificant "wobbles")
Direction Agreement: Do the train and test samples agree as to whether a node is
"above average" or "below average" (response, lift, event rate). Again a statistical
test allows for insignificant violations
Click on 14 node tree in TTC upper panel
Red curve is training data and shows node specific lift (node response / overall
response)
Dark Blue horizontal line is the LIFT=1.0 reference line
Light blue line with green triangles displays test data
3rd ranked node in train data would be ranked 1st or 2nd in test data
TTC Details
For the 14 node tree we are told that agreement on "direction" fails 1 time
and the rank order agreement fails 5 times (scroll to right to see this)
The statistical sensitivity of the test is controlled by the z-score selected in the
Thresholds area to the right of the display. Defaults are 1.00
Setting this threshold to 2.00 will allow much more train/test divergence
Changing TTC Sensitivity Threshold
Changing the thresholds to 2.00 permits moderate deviations and treats them as
statistical noise. After changing thresholds click on "Apply" if display has not updated
We prefer to use the 1.00 threshold as this points us to trees with very high
consistency that decision makers like to see. It does point to rather small trees.
TTC: Display for 6 node tree
Much more defensible tree as train and test data align very well
Summary
• TTC focuses on two types of train-test disagreement
• DIRECTION: Is this node a response node or not?
– We regard disagreement on this fundamental topic as fatal
• RANK ORDER: Are the richest nodes as identified by the
training data confirmed in test data?
– Without this we cannot defend deployment of a tree
• TTC allows us to quickly identify which tree in the pruning
sequence is the largest satisfying train/test consistency
• TTC optimal tree is often rather close in size to Breiman’s 1
SE rule tree
– But 1 SE rule does not look inside nodes at all
– 1 SE rule is available for cross-validation while TTC is not
Controlling Node Sizes In CART
With ATOM and MINCHILD
• Today’s topic is on the technical side but very easy to
understand
• Concepts are relevant to all Salford tree-based tools
including TreeNet and Random Forests
• Controlling the sizes of terminal nodes is a practical matter
• If you are using CART, for example, to segment a database
you might want to make it impossible to create segments
that are too small
• Altering terminal node size can also influence performance
details of the optimal tree
Background: Obtaining Optimal Trees
• CART theory teaches us that we cannot arrive at the optimal
tree via a stopping rule
• The CART authors devoted quite a bit of energy to
researching this topic
• For any stopping rule it is possible to construct data sets for
which that stopping rule will not work
• We will end up stopping too early and we will miss important
data structure
• Result discovered both by experimentation and via
mathematical construction
Grow First Then Prune
• CART methodology is thus to start with an unlimited growing
phase
• Grow the largest possible tree first
• Think of this as a search engine for discovering possibly
valuable trees
• THEN use pruning to arrive at the optimal tree or a set of
trees that yield both acceptable predictive performance and
simplicity
• CART also insists that we have a test method to make our
final tree selection. That is the topic of another session.
Maximum Tree Size
• CART theory tells us that trees should be grown to their
maximum size during the growing phase
• Thus, trees should be grown until we either
– Run out of data (1 record left and thus there is nothing to split)
– Node impossible to split because pure (all GOOD or all BAD)
– Node impossible to split because all records have identical values for
predictors
• Experience tells us that if you start with 1,000 records in a
typical binary classification problem you should expect about
500 terminal nodes in the largest possible tree
– But could be many fewer
• Let’s try for biggest possible tree with the GB2000.xls data
An Unlimited Tree
Using GB2000.xls
To get 349 nodes we set the test method to EXPLORE, ATOM=2, MINCHILD=1
Terminal Node Sample Sizes
We obtain this frequency chart by clicking the graph icon in the center left area
of the navigator. We can see that many but not all terminal nodes are small.
Bottom Left Most Part of Tree
We get a relatively large node to the
extreme left (all class 2)
Remaining three terminal nodes in this
snippet are also all "pure" but much
smaller
Obvious why the tree has to stop here
as there is nothing left to do once a
node is pure
Obtained by right clicking the node of interest and selecting "Display Tree"
Practical Maximal Trees
• In real world practice it may not be necessary to push the
tree growth to the literal maximum
• Essential to grow a large tree
– Large enough to include the optimal tree
• We can control the size of the maximal CART tree in a
number of ways
– Some controls tell CART to stop early
– Other controls limit CART’s freedom to produce small nodes
Key Controls over Splits:
ATOM and MINCHILD
• ATOM
– ATOM terminates splitting along a branch of the tree when the node
sample size is too small
– If a node contains fewer than ATOM data records then STOP
– 10 is commonly used but you might set this much larger
• MINCHILD
– MINCHILD prevents creation of child nodes that are too small
– The smallest possible value is 1 meaning that in splitting a node we
would be permitted to send 1 solitary record to a child node and all
other records to the other child node
– Larger values are sensible and desirable. Values such as 5, 10, 20,
30, 50 could work well depending on the data. We have used values
as large as 200
Setting ATOM and MINCHILD
On Advanced Tab of
Model Setup
Parent control
(ATOM)
Terminal node min
(MINCHILD)
Setting ATOM and MINCHILD
• ATOM: Minimum size required for a node to be a parent
• MINCHILD: Minimum size allowed for a child
• We recommend that ATOM be set to three times MINCHILD
• ATOM must be at least twice MINCHILD to allow a split
consistent with MINCHILD
• If you set inconsistent values for ATOM and MINCHILD they
will be reset automatically to be consistent
• To get the control you want be sure that ATOM is at least
twice MINCHILD
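The consistency rule is easy to encode; a sketch:

def check_node_controls(atom, minchild):
    # A parent must be able to fill two children of size MINCHILD,
    # so ATOM must be at least 2 * MINCHILD (3x is the recommendation).
    if atom < 2 * minchild:
        atom = 2 * minchild            # mimic the automatic reset
    return atom, minchild

print(check_node_controls(atom=10, minchild=10))   # -> (20, 10)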
ATOM and MINCHILD
• ATOM controls the right to be a parent
• Parent must generate two children
• Parent must contain enough data to be able to fill two child
nodes
• So parent must have at least 2*MINCHILD records
ATOM and MINCHILD
• By allowing ATOM to be three times MINCHILD you give
CART some flexibility in finding the split
10 records 10 records
• Min-------------------------------|-----------------------------------Max
split
Suppose ATOM=20 and MINCHILD=10. Then we must split
this node into two exactly equal child nodes of 10 records
each. There is no flexibility here
• If no such split can be found because of clumping of values
of the variable then the node cannot be split on that variable
ATOM is 3 times MINCHILD
10 records 10 records 10 records
• Min------------------*------|--------------------*--------------------Max
left child ..….. split region…... right child
• In the example above ATOM=30 and the region of possible splitting
points lies in between the two asterisks
• There can be just one split point. So long as the smaller side has at least
10 records (in this example of MINCHILD=10) there is freedom to
choose
• To give CART flexibility as to where to locate this last split (at the bottom
of the tree) we need to have ATOM > 2*MINCHILD
• Not mandatory but worth keeping in mind. So first choose MINCHILD
and then set ATOM sensibly
An Unappealing Node Split:
Could be prevented by using a larger MINCHILD
Only one record is sent to the right and the remaining 1999 records go left
Can prevent such splits with a control which does not allow a child to be created
with fewer than the specified number of records
Experiment to get Best Settings
SPM PRO EX
Battery Tab of Model Setup
Select ATOM and
MINCHILD
Modify values to be
tested, optionally
We used a 50% random
sample for testing
Choosing ATOM and MINCHILD
Settings of ATOM=10
and MINCHILD=5 yield
a Rel. error within 1% of
the literal best
Direct Control Over Tree Size (Almost)
• You also have the option of LIMITing the tree in a variety of
ways, including limiting the DEPTH of the tree
• To get to the LIMITS menu item you must first go to the
Classic Output
Growing Limits Dialog
DEPTH=1 will allow just
one split
Controlling tree size via a
DEPTH limit may yield
inferior results
We tend to use it only
when wanting extremely
small trees such as one split
LIMITS Details
• A tree of depth=1 can have only two terminal nodes
• With each additional depth level we allow for a doubling of
the number of terminal nodes
• Potential sizes are then 2,4,8,16 etc.
• However, depth limits do not guarantee a specific number of
terminal nodes, only that no terminal node will be deeper than
was allowed
LIMIT DEPTH=1
We sometimes want to start a CART analysis by splitting just the ROOT node and
then reviewing the entire ranked list of potential splitters
Mostly useful for very large data sets as this reduces compute time substantially
LIMIT DEPTH=2
Maximum length of any branch will allow two splits between the root node and
any terminal node. But some branches might stop early due to pre-pruning.
Depth Limit=3
Method GINI
With METHOD GINI you may not get every branch of the tree exhibited to
the full depth you wanted (due to a technical matter: "pre-pruning")
Depth Limit=3
METHOD PROB
You have a better chance of getting every branch grown out to full depth using
METHOD PROB
Concluding Remarks
• Setting ATOM (smallest legal parent) and MINCHILD
(smallest legal child) can help to speed up large database
runs
• Modest limitation will not harm performance if we take care
with the settings
• Can and should use experimentation to find best settings
• In some circumstances setting these controls to values
larger than their minimums can improve performance on test
data
CART and the PRIORS Parameter
• If you are a casual user of CART you probably can get by
without knowing anything about PRIORS
• The default settings of CART handle PRIORS in a way that
is well suited for almost all classification problems
• A casual user will probably not want to review or understand
the more technical output, which is printed to the plain text
"classic output" window
• BUT there are some very effective uses of CART that
require judicious manipulation of the PRIORS settings
• Therefore a basic understanding of PRIORS may be helpful
and worth the effort
Classic Reference
• The original CART monograph, published in 1984, remains
one of the great classics of machine learning
• Classification and Regression Trees by Leo Breiman,
Jerome Friedman, Richard Olshen, and Charles Stone, CRC
Press
• Available also in paperback and as e-book from Amazon:
• http://www.amazon.com/Classification-Regression-Trees-Leo-Breiman/dp/0412048418/
• Not the easiest reading but well worth having as a reference
and contains fascinating discussions regarding the decisions
the authors made in crafting CART
• Contains extensive discussion of priors as well as all major
concepts relevant to CART. Still worthwhile reading.
CART Monograph Details
For The Casual User
• Thinking about a binary 0/1 classification problem we have
two ways of evaluating a CART generated segment
– Assign the segment to the majority class (more than 50%)
– If there are more 1s than 0s then the segment is labeled "1"
– Assign the segment to the class with a LIFT greater than 1
– We start with a baseline event rate (fraction of 1s in the data)
– Look at the ratio of the event rate in the node to the event rate in the sample
• Ratio of event rate in segment to event rate in root
– Any segment with a better than baseline event rate is labeled "1"
• CART by default uses the LIFT concept for making
decisions (known in CART-speak as PRIORS EQUAL)
• You can elect to use the first method via PRIORS DATA
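The two labeling rules side by side, as a sketch (the numbers in the usage line match the example split on the next slide):

def label_majority(node_rate):
    # PRIORS DATA flavor: class 1 only if it is the absolute majority
    return 1 if node_rate > 0.5 else 0

def label_lift(node_rate, baseline_rate):
    # PRIORS EQUAL flavor: class 1 whenever the node beats the baseline,
    # i.e. lift greater than 1
    return 1 if node_rate / baseline_rate > 1.0 else 0

print(label_majority(0.36), label_lift(0.36, 0.214))   # -> 0 1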
Example Split: Priors Equal
Almost 80% GOOD (Class 0) Remainder BAD (Class 1)
Left child is considered a BAD dominant node because 36% BAD > 21.4% BAD
Priors equal simply ensures that we think in these "relative to what we started
with" terms
PRIORS EQUAL or PRIORS DATA
• PRIORS EQUAL is almost always the right choice
– Is the DEFAULT and almost always yields useful results
• PRIORS DATA focuses on absolute majority and not relative
counts in the data
– Will rarely work with highly unbalanced data (e.g., a 10:1 ratio of 0 to 1)
• PRIORS can be expressed as a ratio
– Default 1:1
– You can set priors to whatever ratio you like
• 1.2:1 as we did in the previous example
• 5:1
• 10:1
– Changing priors usually changes results, sometimes dramatically
– Extreme priors often make getting any tree impossible
Setting PRIORS
Mechanics
To set your own PRIORS
first click the SPECIFY
option
The default settings of 1:1
can now be changed
To the left the dialog is
allowing me to alter the
entry for Class 0
Once entered I will be
given the opportunity to
make a new entry for
Class 1
If PRIORS can change results then what is right?
• The results CART gives you are intended to reflect what you
consider important and what makes sense given your
objectives
• PRIORS EQUAL usually reflects what most people want
• If tweaking the PRIORS and changing them gives you better
results given your objectives then use the tweaked priors
Advice on PRIORS
• Start with the default of EQUAL
– Most users never get beyond this!
• BATTERY PRIORS
– CART PRO EX runs an automatic sweep across dozens of different
settings to display the consequences of tweaking the priors
– Results are then summarized in tables and charts
– Useful when you want to achieve a specific balance of accuracy
across the dependent variable classes
– Choose the setting that is practically best
• Otherwise, you can experiment manually to measure the
impact of a change
PRIORS: Under the Hood
• To understand how PRIORS affect core CART calculations
we need to start with a brief review of splitting rules
• We will only discuss the Gini to illustrate the key concepts
Start With Gini Splitting Rule:
Two classes
• Very simple formula for the two class (binary) dependent variable
• Label the classes as Class 0 and Class 1 and in a specific node in
a tree we represent the shares of the data for the two classes as
p0 and p1
These two must sum to 1 (p0 + p1 = 1)
• The measure of diversity (or impurity) in a given subset of data
(e.g. a node) is given by
Impurity = 1 – p0*p0 – p1*p1
• Impurity will equal 0 if either sample share is equal to 1 (100%)
• Impurity will equal 0.50 when both sample shares are equal (50%)
1 – (.5*.5) – (.5*.5) = 1 - .25 - .25 = .50
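In code, for the two-class case (sketch):

def gini_impurity(p0, p1):
    # Gini diversity of a node; p0 + p1 must equal 1
    return 1.0 - p0 * p0 - p1 * p1

print(gini_impurity(1.0, 0.0))   # 0.0  (pure node)
print(gini_impurity(0.5, 0.5))   # 0.5  (maximally mixed node)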
Splitting Criteria and Impurity
• The Gini measure is just a sensible way to represent how
diverse the data is in a node (for a classification problem)
– Extensive experience shows it works well and is a good measure
– You do have a choice of 6 different splitting methods in CART
• Useful because it can be used for any number of classes
– Every class has a share
– Square the shares and subtract them all from 1
• We use the Gini measure as a way to rank competing splits
• Split A will be considered better if it produces child nodes with
less diversity (on average) than does split B
• We measure the goodness of split by looking at the
reduction in impurity relative to the node being split (the
parent)
Improvement Calculation
• Hypothetical Example
Parent Node Impurity = 0.50
Left Child Impurity = 0.30 (20% of data); Right Child Impurity = 0.20 (80% of data)
Left child improves diversity by 0.20 (0.50 - 0.30)
Right child improves diversity by 0.30 (0.50 - 0.20)
Weighted average impurity is 0.2*0.3 + 0.8*0.2 = 0.22
Improvement from the parent is 0.5 - 0.22 = 0.28
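The same arithmetic in a few lines of Python, reproducing the hypothetical numbers above:

parent_impurity = 0.50
left_impurity, left_share = 0.30, 0.20
right_impurity, right_share = 0.20, 0.80

weighted = left_share * left_impurity + right_share * right_impurity
print(round(weighted, 2))                      # 0.22
print(round(parent_impurity - weighted, 2))    # 0.28  improvement of the split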
Graphing Gini Impurity (2 classes)
• Impurity formula here
simplifies to 2p(1-p)
• Impurity is greatest
when p=(1-p)= 0.5
• Impurity is low when p
is near either extreme
of 0 or 1 as the node is
dominated by one class
• Declines slowly near
p=.5 and accelerates as
it approaches 0 or 1
Graph is of 2*[2*p*(1-p)] to make it easier to read
Split Improvement Measurement
(No Missing Values for Splitter)
Parent Impurity = 0.50
Left Child Impurity = 0.3967; fraction of data in left child = 55%
Right Child Impurity = 0.3457; fraction of data in right child = 45%
Weighted average of child node diversity = 0.3737
Overall improvement of split = 0.1262
As expressed in the CART monograph
Parent node impurity minus weighted average of the impurities in each
child node
• pL = probability of case going left (fraction of node going left)
• pR = probability of case going right (fraction of node going right)
• t = node
• s = splitting rule
• i = impurity
improvement(s, t) = i(t) - pL * i(tL) - pR * i(tR)
Unbalanced Data and PRIORS EQUAL
• Calculations for all key quantities become weighted when
we use the CART default and the original data is
unbalanced
• Weighting is used to calculate
– Fraction of the data belonging to each class
– Fraction of the data in the left and right child nodes
– Gini impurity in each node
– Resulting improvement of the split (reduction in impurity)
• We no longer can use simple ratios
• Good news is that the mechanism for weighting is very
simple and easy to remember
– All counts are expressed as count in the node divided by the
corresponding count in the root node
Calculations for Priors
• Our training sample starts with N0 examples of class 0 and
N1 examples of class 1
• Now look at any node t in the CART tree
– N0(t) examples of class 0
– N1(t) examples of class 1
• Fraction of class 0 will now be calculated as (simplified)
• In other words we convert every count to the ratio of a count in a
node (t) to the corresponding count in the root (sample)
• Then the math is the same as usual
p0(t) = (N0(t)/N0) / [ (N0(t)/N0) + (N1(t)/N1) ]
What fraction of the data is in a node
• Again we use ratios instead of counts to calculate
• For priors equal we just average
– Fraction of all the Class 0 in a node
– Fraction of all the Class 1 in a node
• If the priors are not equal then all ratios are first multiplied by
the corresponding prior (which acts as a weight)
• When priors are equal the terms all cancel out
© Copyright Salford Systems 2013
$$p(0|t) = \frac{\pi_0\, N_0(t)/N_0}{\pi_0\, N_0(t)/N_0 + \pi_1\, N_1(t)/N_1}$$
where $\pi_0$ and $\pi_1$ are the class priors; with equal priors they cancel, recovering the formula above
Priors Incorporated Into Splitting
• $p_i(t)$ = proportion of class $i$ in node $t$
• $\mathrm{Gini}(t) = 1 - \sum_i p_i(t)^2$
• If priors DATA then the prior for class $i$ is $\pi(i) = N_i/N$ and the
proportion of class $i$ in node $t$ is simply
$$p_i(t) = \frac{N_i(t)}{N(t)}$$
• Otherwise proportions are always calculated as weighted
shares using the priors-adjusted
$$p(i|t) = \frac{\pi(i)\, N_i(t)/N_i}{\sum_j \pi(j)\, N_j(t)/N_j}$$
© Copyright Salford Systems 2013
Run a Real World Example
79% Class 0 (Good) 21% Class 1 (Bad)
© Copyright Salford Systems 2013
Data set BAD_RARE_X.XLS MODEL BAD = X15 just one predictor
Test method: 20% random sample for test
© Copyright Salford Systems 2013
We only want to look at the root node split. But tree is quite predictive!
Root Node Split:
Under PRIORS EQUAL
© Copyright Salford Systems 2013
Main splitter improvement is reported to be .06264
Observe that the left-hand child is considered to be Class 1 because its
Class 1 share of 41% is greater than the root share of 21.4%
Classic Output
Typical user rarely consults classic output
© Copyright Salford Systems 2013
Start by confirming the total record counts in the parent and child nodes
This agrees with the previous diagram in the GUI
Next Confirm Target Class Breakdown
© Copyright Salford Systems 2013
Here we see the same counts for Class 0 and Class 1 as in GUI
Priors Adjusted Computations
© Copyright Salford Systems 2013
Note first that the parent node is reported to have 50% class 0 and 50% class 1
This is guaranteed for the root node under priors equal
With 2 classes each is treated as if it represented half the data
With 3 classes each would be treated as if it represented 1/3 of the data
Our calculations of the Gini impurity would be based on these priors adjusted
shares of the data (or node)
The class breakdowns in the child nodes (left and right) are priors adjusted
using the formulas presented earlier
Spreadsheet to Reproduce Results
© Copyright Salford Systems 2013
Column C contains the counts for each class in the parent and child nodes
Column H at the top records the priors
Column G displays the priors adjusted shares (raw shares are in Column D)
Column F displays raw and priors adjusted child node probabilities
Column J displays the Gini diversity in the parent and child nodes and the
improvement generated by the weighted average of the child diversities
All we need to input are the class counts and the priors and formulas do the rest
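A minimal Python sketch of the same logic; the class counts below are invented for illustration (they are not the BAD_RARE_X.XLS data), but the priors-adjusted arithmetic follows the formulas presented earlier:

```python
def adjusted_shares(n_node, n_root, priors):
    """Priors-adjusted class shares in a node: prior_i * N_i(t) / N_i,
    normalized to sum to 1. Also returns the node's adjusted 'mass'."""
    raw = [p * nt / nr for p, nt, nr in zip(priors, n_node, n_root)]
    mass = sum(raw)
    return [r / mass for r in raw], mass

def gini(shares):
    return 1.0 - sum(s * s for s in shares)

# Invented counts for illustration (NOT the BAD_RARE_X.XLS data)
root, left, right = [700, 130], [300, 90], [400, 40]   # class 0, class 1
priors = [0.5, 0.5]                                    # PRIORS EQUAL, 2 classes

p_shares, _ = adjusted_shares(root, root, priors)      # root is always 50/50 here
l_shares, l_mass = adjusted_shares(left, root, priors)
r_shares, r_mass = adjusted_shares(right, root, priors)
p_left = l_mass / (l_mass + r_mass)    # priors-adjusted fraction going left
improvement = (gini(p_shares)
               - p_left * gini(l_shares)
               - (1 - p_left) * gini(r_shares))
print(round(improvement, 5))
```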
Conclusion
• Priors are an advanced control that the casual user need not
worry about
• The default setting is almost always reasonable and almost
always yields valuable results
• Tweaking the priors can change the details of the tree and
can alter results
– Sometimes considerably
– Can be worth running some experiments
• Further discussion in another tutorial
© Copyright Salford Systems 2013
Modeling automation Report
Develop model using a variety of strategies
Here we display results for each of the 6 major tree growing methods. Entropy
yields the best performance here. This is one of 18 different automation schemes.
© Copyright Salford Systems 2013
Summary of Variable Importance Results
Across alternative modeling strategies
© Copyright Salford Systems 2013
Performance Curves of Alternative Models
Error plotted against model complexity
Four strategies yield similar results; one yields much worse results
© Copyright Salford Systems 2013
Alternative Modeling Automation Strategies
Analyst Can Run All Strategies if desired
© Copyright Salford Systems 2013
Automated Modeling:
Vary Penalty on False Positives
© Copyright Salford Systems 2013
Accuracy among YES and NO groups
As the penalty on false positives is varied (automatically)
© Copyright Salford Systems 2013
Automatic Shaving:
Backwards Elimination of Least Important Feature
© Copyright Salford Systems 2013
Hot Spot Detection:
Search many trees for high value segments
Lift in node plotted against sample size: Examination of individual nodes
from many different trees to find best segments
© Copyright Salford Systems 2013
Tabular detail: Hot spot search for special nodes
Tree 18 Node 25 defines a segment with 85.3% of the target class
Sample size in this segment is N=265 in the test set
Clicking on any row brings up tree for examination and review
© Copyright Salford Systems 2013
Constrained Trees
• Many predictive models can benefit from Salford's patent
pending "Structured Trees"
• Trees constrained in how they are grown to reflect decision
support requirements
• In mobile phone example: want tree to first segment on
customer characteristics and then complete using price
variables
– Price variables are under the control of the company
– Customer characteristics are not under company control
© Copyright Salford Systems 2013
Visualizing separate regions of tree
© Copyright Salford Systems 2013
Constraint Dialog
Model set up specifying allowable ranges for predictors
Green indicates where in the tree the variables of each group are allowed to
appear © Copyright Salford Systems 2013
Constrained Tree
Mobile Phone Price variables appear only at bottom
Demographic and spend information at top of tree
Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom
© Copyright Salford Systems 2013
Model Deployment -1
Translate Model into Reusable Programming Code
New version supports JAVA, C, PMML, SQL, SAS®
© Copyright Salford Systems 2013
Automatically Generated Code
Can be deployed directly
© Copyright Salford Systems 2013
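To illustrate the flavor of exported scoring code, here is a hypothetical Python sketch; the split variables exist in this data set, but the thresholds and tree structure are invented purely for illustration, and real exported code mirrors the actual fitted tree:

```python
def score(record):
    """Hypothetical shape of auto-generated scoring code for a tiny tree.
    Thresholds and structure are invented for illustration only."""
    if record["TELEBILC"] <= 50.0:        # landline bill below threshold
        if record["HANDPRIC"] <= 130.0:   # handset priced low enough
            return 1                      # predicted responder
        return 0
    return 0
```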
Deployment –II
Use Salford Scoring Engine/Server
Controllable via scripting; can be deployed in batch mode on a server
© Copyright Salford Systems 2013
Cross-Validation: Part 1
• Built-in automatic method of self testing a model for
reliability
• Honest assessment of the performance characteristics of a
model
– Will the model perform as expected on previously unseen (new) data?
• Available for all principal Salford data mining engines
• The 1984 CART monograph was decisive in introducing cross-validation
into data mining
• Many important details relevant to decision trees and
sequences of models developed in the monograph for the
first time
© Copyright Salford Systems 2013
Cross-Validation is a Testing Method
• Why go through special trouble to construct a sophisticated testing
method when we can just hold back some test data?
• When working with plentiful data it makes perfect sense to reserve a
good portion for testing
– E.g. Credit risk data set with 150,000 training records and 100,000 test
records, real world example
– Direct Marketing data sets with 300,000 training records and 50,000 test
records
• Not all analytical projects have access to large volumes of data
© Copyright Salford Systems 2013
Principal Reason for Cross-Validation
Data Scarcity
• When relevant data is scarce we face a data allocation
dilemma
– If we reserve sufficient data to conduct a reliable test we find
ourselves lacking training data
– If we insist on having enough training data to build a good model we
will have little or nothing left for testing
• Train Test
• o---------------------------------------------------------------|-------------o
• A common division of data is 80% train 20% test
• With 300 data records in total this would amount to 240 train and 60 test
© Copyright Salford Systems 2013
Tough decision:
How much data to allocate to test
• Train Test
• o---------|-------------------------------------------------------------------o
• Train Test
• o------------------------------|----------------------------------------------o
• Train Test
• o-------------------------------------------------|---------------------------o
• Train Test
• o------------------------------------------------------------------------|----o
© Copyright Salford Systems 2013
Unbalanced Target Data
• In most classification studies the target (dependent variable)
data distribution is unbalanced
• Usually one large data segment (non-event) and a smaller
data segment (event) which is the subject of the analysis
– Who purchases on an e-commerce website?
– Who clicks on a banner ad?
– Who benefits from a given medical treatment?
– What conditions lead to a manufacturing flaw?
• When the data is substantially unbalanced the sample size
problem is magnified dramatically
– Think of your sample size as being equal to the smaller class
– If you only have 100 clicks that is your data set size
– Does not matter much that you have 1 million non-clicks.
© Copyright Salford Systems 2013
Cross-Validation Strategy:
Sample Re-use
• Any one train/test partition of the data that leaves enough
data for training will yield weak test results
– based on just a fragment of the available data
• But what if we were to repeat this process many times
– using different test partitions?
• Imagine the following: we divide the data into many 90/10
train/test partitions and repeat the modeling and testing
• Suppose that in every trial we get at least 75% of the test
data events classified correctly
• This would increase our confidence dramatically in the
reliability of the model performance
– Because we have multiple, at least slightly different, tests
© Copyright Salford Systems 2013
Cross-Validation Technical Details
• Cross-Validation requires a specialized preparation of the data
somewhat different than our example of repeated train/test partitioning
• We start by dividing the data into K partitions. In the original CART
monograph Breiman, Friedman, Olshen, and Stone set K=10
• K=10 has become an industry standard due both to Breiman et al. and
other studies that followed (see final slides for details)
• The K partitions should all have the same distribution of the target
variable (same fraction of events) and if possible be equal in size
– Care is needed to get this right when the data cannot be evenly divided into K parts
• This is all done automatically for you in SPM software
© Copyright Salford Systems 2013
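A minimal sketch of one way to build such stratified folds (SPM does this internally; this illustrative version uses plain Python):

```python
import random

def stratified_folds(y, k=10, seed=17):
    """Assign each record to one of k folds so that every fold has
    (almost) the same class mix as the full sample."""
    rng = random.Random(seed)
    fold_of = [0] * len(y)
    for cls in set(y):
        idx = [i for i, v in enumerate(y) if v == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            fold_of[i] = j % k          # deal this class out round-robin
    return fold_of

# The 830-record 704/126 target mix from the Euro_Telco_Mini table a few slides below
y = [0] * 704 + [1] * 126
folds = stratified_folds(y)
print([folds.count(f) for f in range(10)])   # roughly 83 records per fold
```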
Cross-Validation Train/Test Procedure:
K mutually exclusive partitions, 1 Test, K-1 Train
[Diagram: the data divided into partitions 1 through 10; in each cycle one partition is marked Test and the remaining nine are marked Learn, rotating through all partitions]
Above, each partition is in the train sample 9 times and in the test sample 1 time
Build K Models
• Once the data has been partitioned into the K parts we are ready to build
K models
– If we have 10 data partitions then we will build 10 models
• Each model is constructed by reserving one part for test and the
remaining K-1 parts for training
– If K=5 then each model will be based on an 80/20 split of data
– If K=10 then each model will be based on a 90/10 split
– There is nothing wrong with considering K=15 or K=20 or more
• In this strategy it is important to observe that each of the K blocks of data
is used as a test sample exactly once
• If we could somehow combine all the test results we would have an
aggregated test sample equal in size to that of the training data
© Copyright Salford Systems 2013
Euro_Telco_Mini.xls Data Set
Class=0 Class=1
CVCycle Learn Test Learn Test CVW
1 634 70 113 13 0.1026161
2 633 71 114 12 0.0960758
3 634 70 113 13 0.1026161
4 633 71 114 12 0.0960758
5 634 70 113 13 0.1026161
6 633 71 114 12 0.0960758
7 634 70 113 13 0.1026161
8 634 70 113 13 0.1026161
9 633 71 114 12 0.0960758
10 634 70 113 13 0.1026161
• Here we see the breakdown of the 830 record data set into the 10 CV folds
• Table shows sample counts for majority and minority classes for learn and test
partitions for each fold
• Observe that CART has succeeded in making each fold almost identical in the
learn/test division and in the balance between TARGET=0 and TARGET=1
• Last column is the WEIGHT that CART uses on each fold for certain
calculations
Confusion Matrix
Prediction Success Matrix
• In two-class (e.g. Yes/No) classification, test results can be
represented via the 2x2 confusion matrix
© Copyright Salford Systems 2013
Predicted Y=0 Predicted Y=1
Actual Y=0 20 4
Actual Y=1 1 5
Hypothetical results for the test set of a single Cross-validation fold
Note test sample is quite small but there will be a number of these (e.g. 10)
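From this hypothetical fold, the test misclassification rate works out to the off-diagonal counts over the total:
$$\frac{4 + 1}{20 + 4 + 1 + 5} = \frac{5}{30} \approx 16.7\%$$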
Aligning the CV Trees
All automatic and the user never sees this
Main CV1 CV2 CV3 CV4 CV5 CV6 CV7 CV8 CV9 CV10
Nodes 2 2 3 2 2 2 2 2 2 2 2
Complexity 0.01523 0.11543 0.04915 0.12949 0.08684 0.1178 0.09157 0.11464 0.11911 0.11201 0.10531
Nodes 4 6 4 4 4 5 4 4 5 4 4
Complexity 0.01487 0.01736 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.02285
Nodes 5 7 4 4 4 5 4 4 5 4 7
Complexity 0.01189 0.01455 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.01342
Nodes 9 8 4 8 4 9 4 9 6 8 10
Complexity 0.00893 0.01118 0.02034 0.01042 0.03128 0.01219 0.03642 0.01229 0.0114 0.01259 0.01157
• We would expect that the trees would be aligned by number of nodes and this is
approximately what happens
• CART aligns the trees by a measure of "complexity" discussed in other sessions
• Alignment is required to determine the estimated error rate of the main tree when it has
been pruned to a specific size (complexity)
• Thus when the main tree is pruned to 4 terminal nodes we align each CV tree appropriately.
Seven of the CV trees are also pruned to 4 nodes, two are pruned to 5 nodes, and one to
6 nodes
Summing the Confusion Matrices
• Each CV fold generates a test confusion matrix based on a
completely separate subset of data
• When summed the test partitions are equal to the entire
original training data
• Summing the confusion matrices yields an aggregate matrix
that is based on a sample equal to the original data set
• If we started with 300 records the assembled confusion
matrix consists of 300 test records
• Not a "trick". Each record was genuinely reserved for test
one time and was classified correctly or incorrectly in its fold
• We have thus arrived at the largest possible test sample we
could create: as if 100% of the data was used for test!
© Copyright Salford Systems 2013
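A sketch of the whole procedure using scikit-learn as a stand-in for illustration (this is not the SPM implementation; X and y are assumed to be NumPy arrays):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def cv_confusion(X, y, k=10, seed=17):
    """Fit k trees, test each on the one fold it never saw during
    training, and sum the k test confusion matrices."""
    total = np.zeros((2, 2), dtype=int)
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X[train_idx], y[train_idx])
        pred = tree.predict(X[test_idx])
        total += confusion_matrix(y[test_idx], pred, labels=[0, 1])
    # Every record lands in exactly one test fold, so total sums to len(y)
    return total   # rows = actual 0/1, columns = predicted 0/1
```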
Test Results Extracted From Cross-Validation
• Cross-validation is not a method for building a model
• Cross-validation is a method for indirectly testing a model
that on its own has no test performance results
• In classic cross-validation we throw away the K models built
on parts of the data. We keep only test results.
• Modern options for using these K different models exist and
you can save them in SPM
– Could be used in a committee or ensemble of models
– One of the CV models might turn out to be more interesting than the
main model
© Copyright Salford Systems 2013
Does Cross-Validation Really Work?
• We have tested CV by extracting a small training data set from
a much larger database
• We used CV to obtain a "simulated" test performance
• We then tested our main model against a genuine large test
sample extracted from the larger database
• Our results were always in remarkable agreement: CV gave
essentially the same results as the true test set method
• The CART monograph also discusses similar experiments
conducted by Breiman Friedman Olshen and Stone (BFOS)
• They come to the same conclusion while observing that 5-
fold cross-validation tends to understate model performance
and that 20-fold may be slightly more accurate than 10-fold
© Copyright Salford Systems 2013
How Many Folds?
• How many folds do we need to run to obtain reliable results?
• Think about 2 fold CV
– Divide the data into two parts
– First train on part 1 and test on part 2
– Then reverse roles of train and test
– Assemble results
• Problem with 2-fold CV is that we train on only half the
available data
– This is a severe disadvantage to the learning process unless we
have a large amount of data
• The spirit of CV is to use as much training data as possible
© Copyright Salford Systems 2013
How many CV folds?
• In the original CART monograph the authors Breiman,
Friedman, Olshen and Stone discussed some experiments
• Using small numbers such as 5-fold was typically
pessimistic
– Results suggested the model was not as good as it really was
• Using a substantial number of folds such as 20 was
generally only slightly more accurate than 10-fold
– CART authors suggested 10-fold as a default
– Results hold for classification problems
• These classification model results were re-confirmed in a 1995
paper by Ronny Kohavi
– A Study of Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection. In International Joint Conference on Artificial Intelligence
(IJCAI 1995)
© Copyright Salford Systems 2013
Creating Your Own Folds:
Needs to be done with care with smaller samples
• Suppose you have 100 records divided as
– 92 records Y=0
– 8 records Y=1
• Each fold must have at least one record for each target
class
• Best we can do then is to have 8 folds
• But we cannot divide 92 into 8 equal parts
– 7 parts with 11 records Y=0 (response rate=.0833)
– 1 part with 15 records Y=0 (response rate=.0625)
• Better to divide as
– 4 parts with 11 records Y=0 (response rate=.0833)
– 4 parts with 12 records Y=0 (response rate=.0769)
– More equal balance across the folds yields more stable results
© Copyright Salford Systems 2013
Points to Remember
• The "main" model in CV is always built on all the training data
– Nothing is held back for testing
• If you were to run CV in several different ways
– Vary the number of folds
– Vary construction of CV folds by varying random number seed
• You would always get the exact same main model
– Only the estimates of test performance could differ
• Are the results sensitive to these parameters?
– BATTERY CV re-runs the analysis with different numbers of folds
• Larger numbers should converge
– BATTERY CVR uses the same number of folds but creates the K partitions
based on different random number seeds
• Is expected to yield reasonably stable results
• Unstable results suggest considerable uncertainty regarding your model
© Copyright Salford Systems 2013
Cross-Validation: Part II
• In part I we reviewed the main ideas behind cross-validation
• We pointed out that CV is a method for testing a model
• Especially useful when there is a shortage of data but can
be used in any circumstance
• A main model is built on all training data with nothing held
back for testing
• An additional set of K different models are built on different
partitions of the data holding back some of the data for test
• The test results for the K models are aggregated and then
used as an estimate of the test set performance of the
"main" model
© Copyright Salford Systems 2013
Cross-Validation Train/Test Procedure:
K mutually exclusive partitions, 1 Test, K-1 Train
[Diagram: the data divided into partitions 1 through 10; in each cycle one partition is marked Test and the remaining nine are marked Learn, rotating through all partitions]
Above, each partition is in the train sample 9 times and in the test sample 1 time
Alignment of Results
• In this session we discuss a somewhat technical topic
related to the mechanics of aligning test results from K CV
models and the main model
• Recall that CART grows a large tree and then prunes it back
• Back pruning is conducted via "cost-complexity"
• Back pruning might prune off more than one terminal node
at a time
• Back pruning might prune back several nodes along the
same branch
• CV generates K different models each with its own maximal
tree and its own sequence of back-pruned trees
© Copyright Salford Systems 2013
CV Mechanics
• Main model has no test data; each CV model has test data
© Copyright Salford Systems 2013
[Diagram: the Main Model alongside CV Models 1 through 10; test results from all CV folds are combined and attributed to the main model]
CART and CV Details
• A CART tree model is actually a family of progressively
smaller tree models one of which is normally deemed
"optimal"
• So we don’t just have a main model and K CV models
• We have a main tree sequence and K CV tree sequences
• For every tree in the main sequence we need to match it up
with its corresponding tree in each CV sequence
• The most obvious way to do this is by tree size
• To estimate the error rate of the 2-node tree in the main tree
sequence match it up with the K 2-node trees found via CV
• Then proceed to match up every other tree size found
© Copyright Salford Systems 2013
CART Tree Alignment
• Matching up trees from the different sequences is much
more complicated than this
• Each CV tree has its own sequence and its own maximal
size
• These sequences may not all contain the same tree sizes
• The main tree might contain a subtree with 8 terminal nodes
but not every CV tree will contain an 8 node tree
– Back pruning sometimes skips over certain sizes jumping directly say
from 9 terminal nodes to 7
• Not all tree sequences will have the same number of nodes
in the maximal tree
© Copyright Salford Systems 2013
Alignment via Cost Complexity
• Cost complexity prunes trees by examining a trade off between error rate
(cost) and size of the tree (complexity)
• Error rate can be taken to be misclassification rate for this discussion (on
the training data)
• Suppose our maximal tree has a training data misclassification rate of
.00 (not uncommon on training data) but that the tree is very large (e.g.
1000 terminal nodes)
• Suppose we penalized terminal nodes at the rate of .0001
• Then the error rate of 0 would be counterbalanced by a penalty of
1000*(.0001)=0.10
• If we could prune off 500 nodes we would reduce the penalty to .05 but
of course our misclassification rate would probably increase
• If the increase in misclassification rate were say .04 then the total of
misclass rate + penalty would be only .04 + .05 = .09 a benefit!
© Copyright Salford Systems 2013
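In the notation of the CART monograph, the quantity being traded off is the cost-complexity measure
$$R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|$$
where $R(T)$ is the tree's (training) misclassification cost, $|\widetilde{T}|$ its number of terminal nodes, and $\alpha$ the penalty per node. The example above compares $0 + 0.0001 \times 1000 = 0.10$ against $0.04 + 0.0001 \times 500 = 0.09$.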
CART Cost Complexity Pruning
• CART automatically tests different penalties to try to induce
a smaller tree
• We always start with a penalty of 0 and then start gradually
increasing it
• To prune back we prune off the so-called "weakest-link"
which is the node that increases the misclassification rate of
the whole tree the least
• This means that the sample size of the node is taken into account
• A progressive search algorithm for finding the next penalty is
described in the CART monograph
© Copyright Salford Systems 2013
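For reference, the weakest-link criterion from the CART monograph: for each internal node $t$ compute
$$g(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}$$
where $R(t)$ is the cost of node $t$ treated as a leaf and $R(T_t)$ is the cost of the subtree rooted at $t$. The node with the smallest $g(t)$ is pruned first, and that $g(t)$ becomes the next penalty value in the sequence.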
Cost-Complexity is the key to Alignment
• For every CART tree sequence a specific penalty on nodes
(e.g. .001) leads immediately to exactly one tree of a specific
size
• We can only find this tree by going through the pruning
sequence (no shortcuts)
• We align the CART CV trees by the penalty (complexity)
rather than the tree size
• So for a given penalty we find the tree that corresponds to it
both in the main tree sequence and also in each CV tree
• These aligned trees are used to extract the performance
measures that will finally be assigned to the main tree of that
size
© Copyright Salford Systems 2013
Table of Alignments:
Special extract report not automatically generated
© Copyright Salford Systems 2013
• Table displays the aligned trees corresponding to each tree in the main sequence
• In the first row the main tree has been pruned to 2 nodes as have all but one of the CV
trees
• When the main tree is pruned to 7 nodes it is aligned with trees of varying sizes ranging
from 4 to 7 terminal nodes
• The complexity penalties appear under the terminal node counts
• Complexity penalties always increase as the tree becomes smaller

Mais conteúdo relacionado

Mais procurados

Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringChakrit Phain
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering Ashek Farabi
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)Learnbay Datascience
 
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90minsSuccession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90minsPam Dawling
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 
Autoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time SeriesAutoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time SeriesYiqun Hu
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningLeo Salemann
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forestsMarc Garcia
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Simplilearn
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptxRoshan86572
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 

Mais procurados (20)

Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
 
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90minsSuccession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Decision tree
Decision treeDecision tree
Decision tree
 
Time series forecasting
Time series forecastingTime series forecasting
Time series forecasting
 
Decision tree
Decision treeDecision tree
Decision tree
 
Autoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time SeriesAutoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time Series
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
 
Genetic Algorithm
Genetic AlgorithmGenetic Algorithm
Genetic Algorithm
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptx
 
EM Algorithm
EM AlgorithmEM Algorithm
EM Algorithm
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 

Destaque

Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsSalford Systems
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Salford Systems
 
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATEREGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATEChaoyi WU
 
Analysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome VariableAnalysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome VariableArthur8898
 
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...Salford Systems
 
Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...CA Technologies
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetSalford Systems
 
Data mining for diabetes readmission
Data mining for diabetes readmissionData mining for diabetes readmission
Data mining for diabetes readmissionYi Chun (Nancy) Chien
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Mohammed Musah
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Ryan Rosario
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Decision tree lecture 3
Decision tree lecture 3Decision tree lecture 3
Decision tree lecture 3Laila Fatehy
 
Decision tree powerpoint presentation templates
Decision tree powerpoint presentation templatesDecision tree powerpoint presentation templates
Decision tree powerpoint presentation templatesSlideTeam.net
 

Destaque (20)

Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForests
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications
 
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATEREGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
 
Analysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome VariableAnalysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome Variable
 
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
 
Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNet
 
Data mining for diabetes readmission
Data mining for diabetes readmissionData mining for diabetes readmission
Data mining for diabetes readmission
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
R crash course
R crash courseR crash course
R crash course
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
PCA
PCAPCA
PCA
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Decision tree lecture 3
Decision tree lecture 3Decision tree lecture 3
Decision tree lecture 3
 
Decision tree powerpoint presentation templates
Decision tree powerpoint presentation templatesDecision tree powerpoint presentation templates
Decision tree powerpoint presentation templates
 
Pca ppt
Pca pptPca ppt
Pca ppt
 
Decision trees
Decision treesDecision trees
Decision trees
 

Semelhante a Using CART For Beginners with A Teclo Example Dataset

Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7Salford Systems
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Maarten Smeets
 
Presentation 7.pptx
Presentation 7.pptxPresentation 7.pptx
Presentation 7.pptxShivam327815
 
Oracle communications data model product overview
Oracle communications data model   product overviewOracle communications data model   product overview
Oracle communications data model product overviewGreenHamster
 
Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013The Edge
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientistMatthew Evans
 
Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019Phil Watt
 
Oracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & TuningOracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & TuningChris Muir
 
Salford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologySalford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologyVladyslav Frolov
 
Chapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction MethodsChapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction MethodsBenjJamiesonDuag2
 
Data analytics for engineers- introduction
Data analytics for engineers-  introductionData analytics for engineers-  introduction
Data analytics for engineers- introductionRINUSATHYAN
 
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Johnny Miller
 
Cassandra Applications Benchmarking
Cassandra Applications BenchmarkingCassandra Applications Benchmarking
Cassandra Applications Benchmarkingniallmilton
 
Data structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdfData structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdfDukeCalvin
 

Semelhante a Using CART For Beginners with A Teclo Example Dataset (20)

Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7
 
Intro to ml_2021
Intro to ml_2021Intro to ml_2021
Intro to ml_2021
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
2013 Fieldbus Update
2013 Fieldbus Update2013 Fieldbus Update
2013 Fieldbus Update
 
EuSpRIG 2013 Introducing Morphit
EuSpRIG 2013 Introducing MorphitEuSpRIG 2013 Introducing Morphit
EuSpRIG 2013 Introducing Morphit
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!
 
Presentation 7.pptx
Presentation 7.pptxPresentation 7.pptx
Presentation 7.pptx
 
Oracle communications data model product overview
Oracle communications data model   product overviewOracle communications data model   product overview
Oracle communications data model product overview
 
Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientist
 
Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019
 
Oracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & TuningOracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & Tuning
 
Salford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologySalford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of Technology
 
Chapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction MethodsChapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction Methods
 
Data analytics for engineers- introduction
Data analytics for engineers-  introductionData analytics for engineers-  introduction
Data analytics for engineers- introduction
 
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
 
Mini datathon
Mini datathonMini datathon
Mini datathon
 
Cassandra Applications Benchmarking
Cassandra Applications BenchmarkingCassandra Applications Benchmarking
Cassandra Applications Benchmarking
 
Data structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdfData structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdf
 

Mais de Salford Systems

Datascience101presentation4
Datascience101presentation4Datascience101presentation4
Datascience101presentation4Salford Systems
 
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Salford Systems
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningSalford Systems
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerSalford Systems
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like YouSalford Systems
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To RememberSalford Systems
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to marsSalford Systems
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher EducationSalford Systems
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingSalford Systems
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hivSalford Systems
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning CombinationSalford Systems
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSalford Systems
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998Salford Systems
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPMSalford Systems
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012Salford Systems
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning CombinationSalford Systems
 
Paradigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learningParadigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learningSalford Systems
 
Global Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate ChangeGlobal Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate ChangeSalford Systems
 

Mais de Salford Systems (20)

Datascience101presentation4
Datascience101presentation4Datascience101presentation4
Datascience101presentation4
 
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data Mining
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele Cutler
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To Remember
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher Education
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modeling
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hiv
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
 
SPM v7.0 Feature Matrix
SPM v7.0 Feature MatrixSPM v7.0 Feature Matrix
SPM v7.0 Feature Matrix
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARS
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPM
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
 
Text mining tutorial
Text mining tutorialText mining tutorial
Text mining tutorial
 
Paradigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learningParadigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learning
 
Global Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate ChangeGlobal Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate Change
 

Último

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
Using CART for Beginners with a Telco Example Dataset

  • 10. CART: Does its own variable selection • Embedded variable (feature) selection means that modeler can let the software make its own choice of predictors • Modeler will often want to limit the model to focus on selected inputs – Exclude ID variables and merge keys – Exclude clones of the dependent variable – Exclude data pertaining to the future (relative to the dependent) – E.g. restrict a model to easily available predictors – Test predictive power of purchased external data • Modeling automation can allow exploration of a vast space of pre-selected predictors (see later slides) © Copyright Salford Systems 2013
  • 11. In this example we run a CART model • CART completes the analysis and gives access to all results from the NAVIGATOR – Shown on next slide • Upper section displays the tree of a selected size – number of terminal nodes • Lower section displays the error rate for trees of all possible sizes • Green bar marks the most accurate tree • We display a compact 10 node tree for further scrutiny © Copyright Salford Systems 2013
  • 12. CART Model Viewer Access reports and drill into model details Most accurate tree is marked with the green bar. Above we select the 10 node tree for convenience of a more compact display. Note train/test area under ROC curve © Copyright Salford Systems 2013
  • 13. Root Node: Hover Mouse Tree starts with all training data Displays details of TARGET variable in overall training data Above we see that 15.2% of the 830 households accepted the offer Goal of the analysis is now to extract patterns characteristic of responders © Copyright Salford Systems 2013
  • 14. Goal is to split node: separate responders • Details of root node split • If we could only use a single piece of information to separate responders from non-responders CART chooses the HANDSET PRICE • Those offered the phone with a price > 130 contain only 9.9% responders • Those offered a lower price respond at 21.9% © Copyright Salford Systems 2013
  • 15. CART Splitting Rules • We discuss the details later • Here we just point out that the split CART displays is – "the best of all possible splits" • Subject to the splitting criteria you have chosen and any constraints imposed • How do we know this split is "best"? • Because CART actually tries all possible splits looking for the best – Exhaustive brute force search – Advanced algorithms used to make this search fast – As much as 100 times faster than other decision trees © Copyright Salford Systems 2013
  • 16. Grow progressively bigger tree: One split at a time © Copyright Salford Systems 2013 • Binary recursive partitioning repeated until further splitting impossible (e.g. data exhausted) • This leads us to the largest possible or "maximal" tree
  • 17. Maximal tree is raw material for best model © Copyright Salford Systems 2013 • Goal is to find the optimal tree embedded inside the maximal tree • Will find the optimal tree via "pruning" • Like backwards stepwise regression • Challenge: A tree with 100 terminal nodes can be pruned back to 99 terminal nodes by eliminating any one of the 99 penultimate nodes • Now the 99 new terminal nodes can be cut back to 98 by eliminating any one of the surviving 98 penultimate nodes • Something like 99! possible trees. How do we find the best?
  • 18. Pruning Sequence • CART automatically generates a pruning sequence which develops a preferred sequence of progressively smaller trees • We can prove that for a given tree size the CART tree in the sequence will be the best performing tree of all possible trees of that size • In our sequence, the 10 node tree is guaranteed to be more accurate than any other 10 node tree you could extract from the maximal tree • You as the user never need to worry about this • "Better" is defined in terms of performance on the training data as we need the tree sequence before we can test © Copyright Salford Systems 2013
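The nested pruning sequence can be reproduced outside of SPM. Below is a minimal sketch using scikit-learn's cost-complexity pruning, which is only an analogue of Salford CART's pruning sequence, not its actual implementation; the dataset is synthetic and every parameter value is illustrative.

```python
# Sketch: grow a maximal tree, recover its pruning sequence, and let test
# data pick the best pruned subtree (scikit-learn analogue, not Salford SPM).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=830, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# The pruning path defines a nested sequence of progressively smaller trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Fit one tree per pruning step and keep the one with the best test accuracy
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_te, y_te),
)
print(best.get_n_leaves(), "terminal nodes; test accuracy", best.score(X_te, y_te))
```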
  • 19. Error Curve: Plots Accuracy vs Model Size © Copyright Salford Systems 2013 • Requires test data • Can use cross-validation (sample reuse) if data is scarce • Curve typically U-shaped • Too small is not good and neither is too large • Can look at any tree in the sequence of pruned subtrees • Error is what BFOS call an “honest” estimate of model performance
  • 20. Pick a modest sized tree to examine Note the high response in this RED colored node: 38.5% in this segment vs. 15.2% overall, Lift = 2.53 © Copyright Salford Systems 2013
  • 21. Navigator allows access to all model info • The terminal nodes are color coded to represent results – RED nodes are "hot" and contain high concentrations of the class of interest (buyers) – BLUE nodes are "cold" and contain very low concentrations of the class of interest – PINK and WHITE nodes have moderate concentrations • We first look to see if we have any RED nodes – Explore any red nodes via mouse hover • Then we drill down to see a tree schematic revealing the main drivers of the tree © Copyright Salford Systems 2013
  • 22. Select "splitters" View Selects a streamlined overview of the tree showing ONLY primary splitters © Copyright Salford Systems 2013
  • 23. Model Overview: Main Drivers (Red = Good Response, Blue = Poor Response) High values of a split variable always go to the right; low values go left © Copyright Salford Systems 2013
  • 24. Examine Extreme Right-most Terminal Node • Hover mouse over node to see inside • Even though this node is on the "high price" side of the tree it still exhibits the strongest response across all terminal node segments (43.5% response) • Rules defining this node are shown on next slide © Copyright Salford Systems 2013
  • 25. Rules can be extracted in a variety of languages • Here we select rules expressed in C for one node of interest. The entire tree can also be rendered in Java, XML/PMML, or SAS © Copyright Salford Systems 2013
  • 26. Continuing down the tree • We note that even if the new product is offered at a high price we can still find prospects very interested: – Those that have a high average landline bill and own a pager – This group displays greatest probability of response (43.5%) © Copyright Salford Systems 2013
  • 27. Classic Detailed Tree Display Analyst can select details to be displayed © Copyright Salford Systems 2013
  • 28. Control Over Details Displayed in Nodes At left, an example in which the class bar chart is displayed Separate controls for internal and terminal nodes © Copyright Salford Systems 2013
  • 29. Configure Print Image Interactively Shrink to one page, include header/footer © Copyright Salford Systems 2013
  • 30. Tree Performance Measures and Principal Message • In addition to the details of the tree (splits, split values) • Variable Importance Ranking • Confusion Matrix (Prediction Success Matrix) • Gains, ROC © Copyright Salford Systems 2013
  • 31. Variable Importance Ranking (Relative impact on outcomes) Three major ways of computing variable importance. Above, the default display. © Copyright Salford Systems 2013
  • 32. Predictive Accuracy (How often right, how often wrong) This model is not very accurate but ranks responders well © Copyright Salford Systems 2013
  • 33. Gains Curve In the top decile the model captures about 23% of responders © Copyright Salford Systems 2013
  • 34. Performance Evaluation: ROC Curve © Copyright Salford Systems 2013
  • 35. Observations on CART Tree Contrasts with Conventional Stats • CART leverages rank order of predictor to split – Transforming predictor X into Log(X) will not change the tree – Of course the tree will be expressed in terms of Log(X) but this will not change the location of the split – The traditional statistician's experiments with alternative transforms are unnecessary • CART is immune to outliers in predictors – Suppose X has values 1,2,3,…,100, 900 – To CART this is the same as 1,2,3,…,100, 101 – All CART "sees" is the rank order • We will see later that CART has built-in missing value handling • So no worry about outliers, missing values, transformations © Copyright Salford Systems 2013
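The rank-order invariance claim is easy to check. The sketch below assumes scikit-learn's DecisionTreeClassifier as a stand-in for CART and uses made-up data: the same tree is fit on X and on Log(X), and the predictions match row for row even though the displayed split thresholds differ.

```python
# Sketch: a monotone transform (log) leaves a decision tree's partition of
# the data, and therefore its predictions, unchanged.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(500, 1))                     # strictly positive predictor
y = (X[:, 0] > 40).astype(int) ^ (rng.random(500) < 0.1)   # noisy threshold target

t1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(np.log(X), y)

# Same rank order of values -> same split partitions -> identical predictions
print(np.array_equal(t1.predict(X), t2.predict(np.log(X))))  # True
```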
  • 36. CART Methodology: Partition Data Into Two Segments • Partitioning line parallel to an axis • Root node split first – 2.450 – Isolates all the type 1 species from rest of the sample • This gives us two child nodes – One is a Terminal Node with only type 1 species – The other contains only type 2 and 3 • Note: entire data set divided into two separate parts © Copyright Salford Systems 2013
  • 37. Second Split: Partitions Only Portion of the Data • Again, partition with line parallel to one of the two axes • CART selects PETALWID to split this NODE – Split it at 1.75 – Gives a tree with a misclassification rate of 4% • Split applies only to a single partition of the data • Each partition is analyzed separately © Copyright Salford Systems 2013
  • 38. Discriminant Analysis Uses Oblique Lines • Linear combinations are difficult to understand and explain • CART does permit "oblique" splits based on linear combinations of small sets of variables but this is rarely desirable © Copyright Salford Systems
  • 39. CART Representation of a Surface Model clearly non-linear Height of bar represents probability of response Remaining axes represent values of two predictors Greatest prob of response here in corner to the right
  • 40. CART Splitting Process • Standard splits are based on ONE predictor and take the form of a database RULE • A data record goes left if splitter_variable <= split value • Examples: A data record goes left • if AGE<=35 • if CREDIT_SCORE <= 700 • if TELEPHONE_BILL <= 50 © Copyright Salford Systems 2013
  • 41. Searching all splits facilitated by sorting • On left we sort by TELEBILC, on right by TRAVTIMR • Test smallest value first, then next smallest, etc moving all the way down the column • The arrow shows a split sending 10 cases to the left and all other data to the right
  • 42. Example Root Node Split Continuous Splitter © Copyright Salford Systems 2013 From our Euro_telco_mini.xls example Split is TELEBILC <= 50
  • 43. Alternative Split Points What if we split the data at TELEBILC <= 25? © Copyright Salford Systems 2013 Note that the response rates of the two nodes under this split are very similar They are much different after splitting at the optimal value
  • 44. Two splits separate quite differently © Copyright Salford Systems 2013 The first pane shows two segments with 14.3% and 15.5% response The second pane shows two segments with 12.7% and 19.8% response Our goal in CART is to generate substantially different segments and we accomplish this by experimenting with every possible split value for every predictor
  • 45. CART Splitting Process: More • Splitter variables need not be numeric, they can be text • Splitter variables need not be ordered • A data record goes left • if CITY$ = "London" OR "Madrid" OR "Paris" • if DIAGNOSIS = 111 OR 35 OR 9999 © Copyright Salford Systems 2013
  • 46. • CART considers all possible splits based on a categorical predictor • Example: four regions - A, B, C, D can be split 7 ways (2^3 - 1 = 7) • Each decision is a possible split of the node and each is evaluated • Note: A on the left and B,C,D on the right is the same split as its mirror image A on the right and B,C,D on the left • So we only list one version of this split – It is which cases stay together that matters not which side of the tree they are on Splits on K-level categorical predictors: 2^(K-1) - 1 ways to split
     Left    Right
  1  A       B, C, D
  2  B       A, C, D
  3  C       A, B, D
  4  D       A, B, C
  5  A, B    C, D
  6  A, C    B, D
  7  A, D    B, C
© Copyright Salford Systems 2013
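The enumeration above is simple to reproduce. A minimal sketch, assuming nothing beyond the Python standard library: fixing one level on the left removes each split's mirror image and yields exactly the 2^(K-1) - 1 count.

```python
# Sketch: enumerate the 2^(K-1) - 1 distinct left/right splits of a
# categorical predictor, ignoring mirror images.
from itertools import combinations

def categorical_splits(levels):
    """Yield each distinct (left, right) split exactly once."""
    k = len(levels)
    rest = levels[1:]
    # Pin levels[0] on the left so a split and its mirror are not both listed
    for r in range(0, k - 1):
        for combo in combinations(rest, r):
            left = {levels[0], *combo}
            right = set(levels) - left   # always non-empty since r <= k - 2
            yield left, right

splits = list(categorical_splits(["A", "B", "C", "D"]))
print(len(splits))                       # 7, i.e. 2**(4 - 1) - 1
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```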
  • 47. Categorical Split Caution: Dangers of HLCs (High Level Categoricals) • Because categorical variables generate 2^(K-1) - 1 ways to split the data high values of K can be problematic • K=33 is not an unusually large number of levels yet allows for about 4 billion ways to split the data • When the number of possible splits exceeds the number of records in the data the categorical variable has an advantage over any continuous splitter – A continuous variable with a unique value in every row of the data gives us a choice of split points equal to the number of rows of data • Later we will discuss several ways to deal with HLCs including repackaging the high cardinality categoricals into lower cardinality versions and penalties © Copyright Salford Systems 2013
  • 48. Example Root Node Split Categorical Splitter © Copyright Salford Systems 2013 From our Euro_telco_mini.xls example Observe that we have to LIST the values that go to each child node
  • 49. CART Competitor Splits • The CART mechanism for splitting data is always the same • We are given a block of data – Could be all of our data and we are starting from scratch – Could be a small part of our data obtained after already doing a lot of slicing and dicing • When we work with a block of data we do not take into account how we got to that block of data • We do not consider any information which might be available outside of the block of data • The block of data to be analyzed is our entire universe and nothing else exists for us © Copyright Salford Systems 2013
  • 50. Getting Ready to Split • For a block of data to be split – It must contain a sufficient number of data records (ATOM) – We can tell CART what the minimum must be – Default is just TWO records – In large database analysis we might reasonably set the minimum quite a bit higher – ATOM values such as 10, 20, 50, 100, 200 have cropped up in our practical work • If you are working with a small database such as those encountered in biomedical research (e.g. 200 records total) you will want to allow the ATOM size to be small • If you are working with hundreds of thousands or millions of records there is no harm in trying a minimum size like 200 © Copyright Salford Systems 2013
  • 51. Still Getting Ready to Split • If we have a classification problem such as modeling response to a marketing offer where there are two outcomes – Responded – Did Not Respond • To be splittable the block of data cannot be "pure", i.e. composed of all responders or all non-responders – True regardless of how large the block of data is – Splitting is designed to separate the responders from the non-responders so we need a mixture to have something to do • The data records cannot all have exactly the same values for the predictors – CART will be looking for a useful difference in a predictor between responders and non-responders © Copyright Salford Systems 2013
  • 52. Observation on Dummy Variable Predictors • If you split a node using a continuous variable there is always the chance that this same variable is used again in a subsequent split for descendent nodes • Once a node is split with a dummy variable this variable can never be used again in descendant nodes – Because a descendant node will contain either all 0 or all 1 values for this variable. Hence it cannot split. • If a dummy variable is introduced into the tree below the root it might appear in more than one location in the tree – But one use will never be the ancestor of the other use © Copyright Salford Systems 2013
  • 53. Making The Split • To split the block of data (which we will henceforth refer to as splitting the node) we search each available predictor • For every predictor we make a trial split at every distinct value of the predictor • For each trial split we compute a goodness of split measure normally referred to as the "improvement" • For each predictor we find the split value that yields the best improvement • Once every predictor has been searched to find the best split point we rank the splitters in descending order and then use the best overall splitter to grow the tree © Copyright Salford Systems 2013
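A compact sketch of this search loop, written from the description above rather than from Salford's code: it uses the Gini impurity (introduced formally in a later slide) and ignores surrogates, priors, and missing values.

```python
# Sketch: exhaustive best-split search. For every predictor, try every
# distinct value as a split point and keep the split with the best
# Gini improvement (simplified; no surrogates, priors, or missings).
import numpy as np

def gini(y):
    """Gini impurity 1 - sum(p_i^2) of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    parent = gini(y)
    best = (None, None, 0.0)                  # (column, split value, improvement)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j])[:-1]:     # every distinct value but the max
            go_left = X[:, j] <= v
            p_left = go_left.mean()
            # improvement = i(t) - pL * i(tL) - pR * i(tR)
            imp = parent - p_left * gini(y[go_left]) \
                         - (1 - p_left) * gini(y[~go_left])
            if imp > best[2]:
                best = (j, v, imp)
    return best

X = np.array([[25, 1], [40, 0], [55, 1], [62, 0], [30, 1]], dtype=float)
y = np.array([1, 0, 0, 0, 1])
print(best_split(X, y))   # (0, 30.0, 0.48): a perfect split on column 0
```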
  • 54. Ranked List of Splitters • The ranked list of splitters is also known as the competitor list • CART always computes the entire list as this is the only way to know for sure which split is best • To save space CART normally only displays the top 5 competitors within a node – You can request a larger number in your options settings • The root node at the top of the tree always displays the complete list of competitors even if there are thousands of predictors © Copyright Salford Systems 2013
  • 55. Why Care about Competitor Splits? • Useful to know if the best splitter is far better than all the rest or only slightly better • Useful to know which predictors show up near the top – Are they very different from each other or are they all reflecting the same underlying information • Useful to know if a strong but perhaps 2nd best predictor splits the data more evenly than the best – We might want to try FORCING that 2nd best predictor into the root to see what happens – Sometimes this yields an overall better tree • Pattern of top splitters may reflect problems – Top 3 competitors may all be "too good to be true" and we might need to drop them all from the analysis © Copyright Salford Systems 2013
  • 56. Surrogate Splits • Surrogate splits were first introduced by the authors of CART in their classic monograph Classification and Regression Trees, 1984. • Surrogate splits are mimics or substitutes for the primary splitter of a node • An ideal surrogate splits the data in exactly the same way as the primary split – The "association" measure reflects how close to perfect a given surrogate is © Copyright Salford Systems 2013
  • 57. Why Surrogates? • Surrogates have two primary functions: – To split data when the primary splitter is missing – To reveal common patterns among predictors in a data set • CART searches for surrogate splitters in every node in the tree – Surrogates are searched for even when there is no missing data – No guarantee that useful surrogates can be found – CART attempts to find at least five surrogates for every node but this number can be modified – Number of surrogates actually found normally varies from node to node © Copyright Salford Systems 2013
  • 58. CART and Missing Values in Deployment • CART is the only learning machine that is prepared to deal with any pattern of missing values in future data • Even if the training data have no missings CART develops strategies to deal with the eventuality of any variable or variables being missing • Some learning machines cannot handle missing values at all • Other learning machines can only deal with missing value patterns that they have been trained on (seen before) – E.g. handle X5=missing only if X5 was ever missing in the training data • CART has no such restrictions and is always ready for any pattern of missings © Copyright Salford Systems 2013
  • 59. Surrogates in Action: Euro_telco_mini.xls © Copyright Salford Systems 2013 Remember to check off CITY, MARITAL and RESPONSE as "categorical"
  • 60. Manually Prune Back to the 10-node tree © Copyright Salford Systems 2013 Just click on the blue curve in the lower panel to select a smaller easier to manage tree. Then double click on the left child of the root node (see arrow above)
  • 61. Look at the Left Child of the "Root" © Copyright Salford Systems 2013 The primary splitter predicting subscription to a new mobile phone offer is the monthly telephone bill (TELEBILC) dividing the node into spenders of more or less than $50 per month
  • 62. Surrogate for TELEBILC • If this variable were missing for any reason (database error, person recently moved, new customer) we do not know whether to move down the tree to the left or to the right • Surrogate variable can be used in place of the missing primary splitter. In this case the surrogate is of the form go to the left if MARITAL=1 • Left is associated with LOW spending on the telephone bill • CART suggests that single person households spend less while households headed by married or divorced persons spend more © Copyright Salford Systems 2013
  • 63. Surrogates and Direction • A surrogate is intended to be a substitute for the primary splitter making similar left/right decisions • But surrogates may work in the opposite direction so every continuous variable surrogate is supplied with a "tag" – The letter "s" after the split point stands for "standard" – The letter "r" after the split point stands for "reverse" • If a surrogate is negatively correlated with the primary splitter then it will split in the reverse direction – Categorical splitters are always organized so that the levels that correspond to left in the primary splitter go left in the surrogate © Copyright Salford Systems 2013
  • 64. Normally Surrogates Make Sense • Our primary splitter is the average monthly spend of a household on a fixed line telephone account • Our surrogates include marital status, commute time to work, age, and the city of residence – Longer commutes are associated with larger spend on the phone – Older head of household also is associated with larger spend – We cannot interpret the CITY variable at this point because we don’t know the identity of the cities • In general surrogates help us understand the primary splitter – Especially helpful in survey research © Copyright Salford Systems 2013
  • 65. How to Compute Surrogates? • This is a technical question which we will not cover here – The CART monograph contains a wealth of technical information although it can be a challenging read • However, we will discuss the main ideas • The top surrogate is – A single variable – A single split (in the same format as any primary splitter) – Intended to mimic as closely as possible how data is partitioned by the primary splitter into LEFT and RIGHT nodes • To get a surrogate think of generating a one split CART tree where the dependent variable is {LEFT or RIGHT} as defined by the primary splitter. (There are many details) © Copyright Salford Systems 2013
  • 66. What is "Association"? • Association is a measure of the strength of the surrogate • The lowest possible reported score is 0 (useless) • The highest possible score is 1 (perfect clone) • CART starts from the default rule: if you don’t know which way to send a data record down a tree go with the majority (sometimes weighted majority) • If when training the tree most cases went left then in the absence of other information also go left • The default makes mistakes of course because it always sends every record to the same majority side – Association measures how much better the surrogate is than the default rule (percent reduction in errors made) • Default rule is the "surrogate of last resort" © Copyright Salford Systems 2013
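A rough rendering of that calculation in code. This is a simplified reading of the monograph's association measure; the helper names and the 60/40 example data are invented for illustration.

```python
# Sketch: association as the percent reduction in left/right assignment
# errors relative to the go-with-the-majority default rule.
import numpy as np

def association(primary_goes_left, surrogate_goes_left):
    default_left = primary_goes_left.mean() >= 0.5       # majority direction
    default_errors = np.sum(primary_goes_left != default_left)
    surrogate_errors = np.sum(primary_goes_left != surrogate_goes_left)
    if default_errors == 0:
        return 0.0                                        # nothing to improve on
    # percent reduction in errors made, floored at 0 (useless surrogate)
    return max(0.0, (default_errors - surrogate_errors) / default_errors)

rng = np.random.default_rng(0)
primary = rng.random(200) < 0.6                  # about 60% of cases go left
surrogate = primary ^ (rng.random(200) < 0.05)   # mimics primary, 5% of flips
print(round(association(primary, surrogate), 3)) # a strong surrogate scores near 1
```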
  • 67. Competitors and Surrogates: Different Objectives © Copyright Salford Systems 2013 Competitors yield the best possible split when using that variable Surrogate yields the best possible mimic of the primary splitter and goodness of split may be sacrificed to match some aspect of the primary splitter Note that C2 is a competitor with one split point and a surrogate with a different split point
  • 68. Grow another tree on GB2000.XLS • We prefer this data set because it has no missing values making working through examples much easier • Don’t forget: CART always computes surrogates and in this way the CART tree is always prepared for future missings • We will not be trying to make sense of this tree – will look just at the mechanics • Note the root node splitter and the top surrogate © Copyright Salford Systems 2013
  • 69. Root Node Split © Copyright Salford Systems 2013 Root Splitter: M1 <= -.04645 Top Surrogate: C2 <= -.10835
  • 70. Main Splitter vs. Best Surrogate
             Main Splitter      Surrogate
             Left      Right    Left      Right
  Class 1    672       328      626       374
  Class 2    252       748      300       700
  Total      924       1076     926       1074
© Copyright Salford Systems 2013 Best Surrogate must closely match not only the record counts in the child nodes but also the distribution of the target variable
  • 71. Modeling ROOTSPLIT with CART © Copyright Salford Systems 2013 Observation: Modeling the root node split (we have to create a new variable to reflect this) will not necessarily match the surrogate report Other factors must be taken into account. Here we get the right variable but not the right split point
  • 72. Main Splitter vs. Best Surrogate vs. Modeling the Root Split As a Binary Target
             Main Splitter      Surrogate          Alternate
             Left      Right    Left      Right    Left      Right
  Class 1    672       328      626       374      598       402
  Class 2    252       748      300       700      288       712
  Total      924       1076     926       1074     886       1114
© Copyright Salford Systems 2013 Best Surrogate must closely match record counts in the child nodes and the distribution of the target variable Modeling the root split on available predictors will not match the surrogate exactly
  • 73. Variable Importance in CART • It is hard to imagine now but in 1984 when the CART monograph was first published data analysts did not generally rank variables • Although informally researchers would pay attention to t-statistics or p-values associated with the coefficients of regressions researchers frowned on the practice of ranking predictors • Since the advent of modern data analytic methods researchers expect to see a variable importance ranking for all models • It all started with CART! © Copyright Salford Systems 2013
  • 74. CART concept of Variable Importance • Variable importance is intended to measure how much work a variable does in a particular tree • Variable importance is thus tied to a specific model • A variable might be most important in one model and not important at all in a different model built on the same data • The fact that a variable is important does not mean that we need it! If we were deprived of the use of an important variable it might be that other available variables could substitute for it or do the same predictive work • Variable Importance describes the role of a variable in a specific tree © Copyright Salford Systems 2013
  • 75. Variable Importance and Tree Size • Every tree in the CART sequence has its own variable importance list • A small tree will typically have only a few important variables • A large tree will typically have many more important variables – Because with more nodes there are more chances for more variables to play a role in the tree • Usually we focus on the tree CART has identified as optimal but this should not deter you from selecting another (usually smaller) tree © Copyright Salford Systems 2013
  • 76. Splitter Improvement Scores • Recall that every splitter (and every surrogate) has an associated "improvement" score which measures how good a splitter is • The improvement score for a splitter in a node is always scaled down by the percent of data that actually pass through the node • 100% of all data pass through the root node so the root node splitter is always scaled by 100% • But a child node of the root might have say 30% of the data pass through – whatever improvement we compute for the split of that node will be multiplied by 0.30 • Splits lower in the tree have only a small fraction of the full data passing through so their adjusted improvement scores tend to be small © Copyright Salford Systems 2013
  • 77. Variable Importance Computation • To construct a variable importance score for a variable we start by locating every node that the variable split • We add up all of the improvement scores generated by that variable in those nodes • Then we go through every node in which this variable acted as a surrogate and add up all those improvement scores as well • The grand total is the raw importance score • After obtaining raw importance scores for every variable we rescale the results so that the best score is always 100 © Copyright Salford Systems 2013
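In code, the accumulation step looks roughly like the sketch below. The node list is hypothetical (the variable names echo the telco example, but the improvement numbers are made up), and real CART applies details omitted here.

```python
# Sketch: accumulate node improvement scores per variable (primary splits
# plus surrogates), then rescale so the top variable gets 100.
from collections import defaultdict

# (variable, role, improvement already scaled by fraction of data in node)
nodes = [
    ("TELEBILC", "primary",   0.120),
    ("MARITAL",  "surrogate", 0.080),
    ("HANDPRIC", "primary",   0.060),
    ("TELEBILC", "surrogate", 0.030),
    ("AGE",      "primary",   0.015),
]

raw = defaultdict(float)
for var, _role, improvement in nodes:
    raw[var] += improvement                 # primaries and surrogates both count

top = max(raw.values())
for var in sorted(raw, key=raw.get, reverse=True):
    print(f"{var:10s} {100 * raw[var] / top:6.1f}")   # best variable scores 100
```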
  • 78. Variations on Importance Scores • Breiman, Friedman, Olshen and Stone discuss one idea they ultimately rejected: – Including competitor improvement scores as well • This turns out to be a bad idea because it leads to double-counting – If a variable is the 2nd best splitter in a node there is an excellent chance that the same split will score well in the child nodes – If we were to give the splitter credit in the parent node for being a competitor we would probably end up giving the exact same split credit again lower down in the tree – Another way to think about this: a split is trying to enter the tree. If we do not accept the split right away the same split may keep trying to enter the tree lower down – We only want to give this split credit once © Copyright Salford Systems 2013
  • 79. BATTERY LOVO • Leave One Variable Out (LOVO) – Available in SPM PRO EX versions but you can accomplish the process manually as well • Take your best modeling set up including your preferred list of predictors • BATTERY LOVO runs a set of models that are identical to your preferred set up except that one variable has been excluded • To be complete we run a "drop just one variable" model for each variable in your KEEP list • If you have 20 variables then BATTERY LOVO will run 20 models (each of which will have 19 predictors) – Now rank the models from worst to best © Copyright Salford Systems 2013
  • 80. BATTERY LOVO Importance Ranking • Using the LOVO procedure tests how much our model deteriorates if we were to remove a given variable • It is sensible to say that a variable is very important if losing it damages the model substantially • Conversely, if losing a variable does no harm then we could conclude that the variable is useless • CAUTION: the LOVO ranking could be quite different from the CART internal ranking and both rankings are "right" – CART measures how much work a variable actually does – LOVO measures how much it hurts to lose a variable © Copyright Salford Systems 2013
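The manual version of BATTERY LOVO is a short loop. A sketch assuming scikit-learn and synthetic data; test accuracy stands in for whatever performance measure you prefer.

```python
# Sketch: manual leave-one-variable-out. Refit once per predictor with that
# predictor dropped, then rank variables by the damage their removal does.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_score(cols):
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    return clf.fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)

baseline = fit_score(list(range(X.shape[1])))
for j in range(X.shape[1]):
    keep = [c for c in range(X.shape[1]) if c != j]
    # A large drop means the model cannot easily live without variable j
    print(f"drop var {j}: test accuracy changes by {fit_score(keep) - baseline:+.3f}")
```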
  • 81. Randomization Test • Leo Breiman introduced yet another concept of variable importance measure related to his work on tree ensembles • Start with your test data – Score this data with your preferred model to obtain baseline performance – Take the first predictor in the test data and randomly shuffle its values in the column of data – The values are unchanged but values are relocated to rows they do not belong on – Now score again. We would expect performance to drop because one predictor has been damaged. Repeat say 100 times and average the performance deterioration. – Doing this for all variables will produce performance degradation scores and the larger the score the more important the variable © Copyright Salford Systems 2013
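A sketch of that shuffle test, assuming a fitted scikit-learn-style classifier with predict_proba and using test-set ROC area as the performance measure; the function and argument names are invented for illustration.

```python
# Sketch: permutation (shuffle) importance for one predictor column.
import numpy as np
from sklearn.metrics import roc_auc_score

def shuffle_importance(model, X_test, y_test, col, n_reps=100, seed=0):
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    drops = []
    for _ in range(n_reps):
        X_shuf = X_test.copy()
        # Damage just this one predictor: same values, wrong rows
        X_shuf[:, col] = rng.permutation(X_shuf[:, col])
        drops.append(base - roc_auc_score(
            y_test, model.predict_proba(X_shuf)[:, 1]))
    return float(np.mean(drops))   # larger average drop = more important
```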
  • 82. Randomization Test • As of December 2011 this test is only available from the command line of recent versions of SPM • After growing a CART tree and saving the grove issue these commands from the command line or an SPM Notepad SCORE VARIMP=YES NPREPS=100 • You may readily run with NPREPS=30 but the results are more reliable with a larger number of replications © Copyright Salford Systems 2013
  • 83. Results from Random Shuffling: Baseline ROC = .85320 © Copyright Salford Systems 2013
  Rank  Score   ROC_After  Variable
  1     100     0.82144    M1
  2     63.21   0.83312    RES
  3     45.57   0.83873    LS
  4     25.9    0.84498    CR
  5     22.66   0.84601    C2
  6     21.29   0.84644    BU
  7     5.84    0.85135    DT
  8     4.25    0.85185    A1
  9     4.23    0.85186    PRE
  10    3.49    0.85209    OC
  11    3.18    0.85219    MAR
  12    2.29    0.85248    YM
  13    1.64    0.85268    LT
  14    0       0.8532     DP
  15    0       0.8532     TRA
  16    0       0.8532     GEN
  17    0       0.8532     A2
  18    0       0.8532     B
  19    0       0.8532     CP2
  20    0       0.8532     CD2
  21    0       0.8532     D1
  22    0       0.8532     E
  23    0       0.8532     M
  24    0       0.8532     CH
  25    0       0.8532     TY$
  • 84. Which Importance Score Should I Use? • The internal CART variable importance scores are the easiest and the fastest to obtain and are a great starting point • LOVO scores are useful when your goal is to assess whether you can live without a predictor © Copyright Salford Systems 2013
  • 85. • Importance is a function of the OVERALL tree including deepest nodes • Suppose you grow a large exploratory tree — review importances • Then find an optimal tree via test set or CV yielding smaller tree • Optimal tree SAME as exploratory tree in the top nodes • YET importances might be quite different. • WHY? Because larger tree uses more nodes to compute the importance • When comparing results be sure to compare similar or same sized trees Variable Importance Caution © Copyright Salford Systems 2013
  • 86. Train/Test Consistency Checks • Unlike classical statistics data mining models generally do not rely on training data to assess model quality • In the SPM data mining suite we are always focused on test data model performance – This is the only way to reliably protect against overfitting • Every modeling method including our classical statistical models in SPM 7.0 offers test data performance measures • Generally these measures are overall model performance indicators – Measures say nothing about internal model details © Copyright Salford Systems 2013
  • 87. CART Tree Assessment • CART uses test data performance of every tree in the back-pruned sequence of progressively smaller trees to identify the overall best performer on classification accuracy • CART also notes which tree achieves the best test data Area Under the ROC (AUROC) curve on the Navigator © Copyright Salford Systems 2013
  • 88. What more can we do? • CART performance measures have always been overall-tree scores • No specific attention is paid to node-specific performance • However, in real world applications we often want to pay close attention to individual nodes – Might use the rank order of the nodes in important decisions – Prefer to rely on nodes that are most accurate in their predictions of event rates (response) • Therefore we need an additional tool for assessing CART tree performance at the node level • Provided by the PRO EX feature we call TTC – Train/Test Consistency checks © Copyright Salford Systems 2013
  • 89. Use the GB2000.XLS data set © Copyright Salford Systems 2013 Model setup to select TARGET as the dependent variable CART as the modeling method On the TEST tab we opt for 50% randomly selected test partition
  • 90. TTC in CART and SPM PRO EX • The TTC report is available from the navigator which displays for every CART model – Look for the TTC button near the bottom of the navigator • TTC relies on separate train and test data partitions which means that TTC is not available when using cross-validation © Copyright Salford Systems 2013
  • 91. TTC Display © Copyright Salford Systems 2013 Upper panel of TTC display contains one line in the table for every sized tree Bottom row represents the 2 node tree. Top line is for largest tree grown
  • 92. TTC: Select Target Class © Copyright Salford Systems 2013 In this case TARGET=2 represents BAD which is our focus class You the modeler get to choose which class to focus on; there is no "right" class
  • 93. TTC Upper Panel © Copyright Salford Systems 2013 Rank Match: Do the train and test samples rank order the nodes in the same way (a statistical test allows for insignificant "wobbles") Direction Agreement: Do the train and test samples agree as to whether a node is "above average" or "below average" (response, lift, event rate). Again a statistical test allows for insignificant violations
  • 94. Click on 14 node tree in TTC upper panel © Copyright Salford Systems 2013 Red curve is training data and shows node specific lift (node response / overall response) Dark Blue horizontal line is the LIFT=1.0 reference line Light blue line with green triangles displays test data 3rd ranked node in train data would be ranked 1st or 2nd in test data
  • 95. TTC Details © Copyright Salford Systems 2013 For the 14 node tree we are told that agreement on "direction" fails 1 time And the rank order agreement fails 5 times (scroll to right to see this) The statistical sensitivity of the test is controlled by the z-score selected in the Thresholds area to the right of the display. Defaults are 1.00 Setting this threshold to 2.00 will allow much more train/test divergence
  • 96. Changing TTC Sensitivity Threshold © Copyright Salford Systems 2013 Changing the thresholds to 2.00 permits moderate deviations and treats them as statistical noise. After changing thresholds click on "Apply" if the display has not updated We prefer to use the 1.00 threshold as this points us to trees with very high consistency that decision makers like to see. It does point to rather small trees.
  • 97. TTC: Display for 6 node tree © Copyright Salford Systems 2013 Much more defensible tree as train and test data align very well
  • 98. Summary • TTC focuses on two types of train-test disagreement • DIRECTION: Is this node a response node or not? – We regard disagreement on this fundamental topic to be fatal • RANK ORDER: Are the richest nodes as identified by the training data confirmed in test data? – Without this we cannot defend deployment of a tree • TTC allows us to quickly identify which tree in the pruning sequence is the largest satisfying train/test consistency • TTC optimal tree is often rather close in size to Breiman’s 1 SE rule tree – But 1 SE rule does not look inside nodes at all – 1 SE rule is available for cross-validation while TTC is not © Copyright Salford Systems 2013
  • 99. Controlling Node Sizes In CART With ATOM and MINCHILD • Today’s topic is on the technical side but very easy to understand • Concepts are relevant to all Salford tree-based tools including TreeNet and Random Forests • Controlling the sizes of terminal nodes is a practical matter • If you are using CART, for example, to segment a database you might want to make it impossible to create segments that are too small • Altering terminal node size can also influence performance details of the optimal tree © Copyright Salford Systems 2013
  • 100. Background: Obtaining Optimal Trees • CART theory teaches us that we cannot arrive at the optimal tree via a stopping rule • The CART authors devoted quite a bit of energy to researching this topic • For any stopping rule it is possible to construct data sets for which that stopping rule will not work • We will end up stopping too early and we will miss important data structure • Result discovered both by experimentation and via mathematical construction © Copyright Salford Systems 2013
  • 101. Grow First Then Prune • CART methodology is thus to start with an unlimited growing phase • Grow the largest possible tree first • Think of this as a search engine for discovering possibly valuable trees • THEN use pruning to arrive at the optimal tree or a set of trees that yield both acceptable predictive performance and simplicity • CART also insists that we have a test method to make our final tree selection. That is the topic of another session. © Copyright Salford Systems 2013
  • 102. Maximum Tree Size • CART theory tells us that trees should be grown to their maximum size during the growing phase • Thus, trees should be grown until we either – Run out of data (1 record left and thus there is nothing to split) – Node impossible to split because pure (all GOOD or all BAD) – Node impossible to split because all records have identical values for predictors • Experience tells us that if you start with 1,000 records in a typical binary classification problem you should expect about 500 terminal nodes in the largest possible tree – But there could be many fewer • Let’s try for the biggest possible tree with the GB2000.xls data © Copyright Salford Systems 2013
  • 103. An Unlimited Tree Using GB2000.xls © Copyright Salford Systems 2013 To get 349 nodes we set the test method to EXPLORE, MINCHILD=2, ATOM=1
  • 104. Terminal Node Sample Sizes © Copyright Salford Systems 2013 We obtain this frequency chart by clicking the graph icon in the center left area of the navigator. We can see that many but not all terminal nodes are small.
  • 105. Bottom Left Most Part of Tree © Copyright Salford Systems 2013 We get a relatively large node to the extreme left (all class 2) Remaining three terminal nodes in this snippet are also all "pure" but much smaller Obvious why the tree has to stop here as there is nothing left to do once a node is pure Obtained by right clicking the node of interest and selecting "Display Tree"
  • 106. Practical Maximal Trees • In real world practice it may not be necessary to push the tree growth to the literal maximum • Essential to grow a large tree – Large enough to include the optimal tree • We can control the size of the maximal CART tree in a number of ways – Some controls tell CART to stop early – Other controls limit CART’s freedom to produce small nodes © Copyright Salford Systems 2013
  • 107. Key Controls over Splits: ATOM and MINCHILD • ATOM – ATOM terminates splitting along a branch of the tree when the node sample size is too small – If a node contains fewer than ATOM data records then STOP – 10 is commonly used but you might set this much larger • MINCHILD – MINCHILD prevents creation of child nodes that are too small – The smallest possible value is 1 meaning that in splitting a node we would be permitted to send 1 solitary record to a child node and all other records to the other child node – Larger values are sensible and desirable. Values such as 5, 10, 20, 30, 50 could work well depending on the data. We have used values as large as 200 © Copyright Salford Systems 2013
  • 108. Setting ATOM and MINCHILD © Copyright Salford Systems 2013 On Advanced Tab of Model Setup Parent control (ATOM) Terminal node min (MINCHILD)
  • 109. Setting ATOM and MINCHILD • ATOM: Minimum size required for a node to be a parent • MINCHILD: Minimum size allowed for a child • We recommend that ATOM be set to three times MINCHILD • ATOM must be at least twice MINCHILD to allow a split consistent with MINCHILD • If you set inconsistent values for ATOM and MINCHILD they will be reset automatically to be consistent • To get the control you want be sure that ATOM is at least twice MINCHILD © Copyright Salford Systems 2013
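For readers who work in scikit-learn, the closest built-in analogues are shown below; the parameter names differ from SPM's and the mapping is approximate, not official.

```python
# Sketch: rough scikit-learn analogues of ATOM and MINCHILD (an assumed
# mapping, not Salford terminology).
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=30,   # plays the role of ATOM: 30 records to be a parent
    min_samples_leaf=10,    # plays the role of MINCHILD: no child under 10 records
    random_state=0,
)
# Following the advice above, the parent minimum is three times the leaf
# minimum, leaving the search some room to place the final split.
```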
  • 110. ATOM and MINCHILD • ATOM controls the right to be a parent • Parent must generate two children • Parent must contain enough data to be able to fill two child nodes • So parent must have at least 2*MINCHILD records © Copyright Salford Systems 2013
  • 111. ATOM and MINCHILD • By allowing ATOM to be three times MINCHILD you give CART some flexibility in finding the split 10 records 10 records • Min-------------------------------|-----------------------------------Max split Suppose ATOM=20 and MINCHILD=10. Then we must split this node into two exactly equal child nodes of 10 records each. There is no flexibility here • If no such split can be found because of clumping of values of the variable then the node cannot be split on that variable © Copyright Salford Systems 2013
  • 112. ATOM is 3 times MINCHILD 10 records 10 records 10 records • Min------------------*------|--------------------*--------------------Max left child ..….. split region…... right child • In the example above ATOM=30 and the region of possible splitting points lies in between the two asterisks • There can be just one split point. So long as the smaller side has at least 10 records (in this example of MINCHILD=10) there is freedom to choose • To give CART flexibility as to where to locate this last split (at the bottom of the tree) we need to have ATOM > 2*MINCHILD • Not mandatory but worth keeping in mind. So first choose MINCHILD and then set ATOM sensibly © Copyright Salford Systems 2013
  • 113. An Unappealing Node Split: Could be prevented by using a larger MINCHILD © Copyright Salford Systems 2013 Only one record is sent to the right and the remaining 1999 records go left Can prevent such splits with a control which does not allow a child to be created with fewer than the specified number of records
  • 114. Experiment to get Best Settings © Copyright Salford Systems 2013 SPM PRO EX Battery Tab of Model Setup Select ATOM and MINCHILD Modify the values to be tested, optionally We used a 50% random sample for testing
  • 115. Choosing ATOM and MINCHILD © Copyright Salford Systems 2013 Settings of ATOM=10 and MINCHILD=5 yield a Rel. error within 1% of the literal best
  • 116. Direct Control Over Tree Size (Almost) • You also have the option of LIMITing the tree in a variety of ways including limiting the DEPTH of the tree • To get to the LIMITS menu item you must first go to the Classic Output © Copyright Salford Systems 2013
  • 117. Growing Limits Dialog © Copyright Salford Systems 2013 DEPTH=1 will allow just one split Controlling tree size via a DEPTH limit may yield inferior results We tend to use it only when wanting extremely small trees such as one split
  • 118. LIMITS Details • A tree of depth=1 can have only two terminal nodes • With each additional depth level we allow for a doubling of the number of terminal nodes • Potential sizes are then 2,4,8,16 etc. • However, depth limits do not guarantee a specific number of terminal nodes only that no terminal node will be deeper than was allowed © Copyright Salford Systems 2013
  • 119. LIMIT DEPTH=1 © Copyright Salford Systems 2013 We sometimes want to start a CART analysis by splitting just the ROOT node and then reviewing the entire ranked list of potential splitters Mostly useful for very large data sets as this reduces compute time substantially
  • 120. LIMIT DEPTH=2 © Copyright Salford Systems 2013 Maximum length of any branch will allow two splits between the root node and any terminal node. But some branches might stop early due to pre-pruning.
  • 121. Depth Limit=3 Method GINI © Copyright Salford Systems 2013 With METHOD GINI you may not get every branch of the tree exhibited to the full depth you wanted (due to a technical matter – "pre-pruning")
  • 122. Depth Limit=3 METHOD PROB © Copyright Salford Systems 2013 You have a better chance of getting every branch grown out to full depth using METHOD PROB
  • 123. Concluding Remarks • Setting ATOM (smallest legal parent) and MINCHILD (smallest legal child) can help to speed up large database runs • Modest limitation will not harm performance if we take care with the settings • Can and should use experimentation to find best settings • In some circumstances setting these controls to values larger than their minimums can improve performance on test data © Copyright Salford Systems 2013
  • 124. CART and the PRIORS Parameter • If you are a casual user of CART you probably can get by without knowing anything about PRIORS • The default settings of CART handle PRIORS in a way that is well suited for almost all classification problems • A casual user will probably not want to review or understand the more technical output which is printed to the plain text "classic output" window • BUT there are some very effective uses of CART that require judicious manipulation of the PRIORS settings • Therefore a basic understanding of PRIORS may be helpful and worth the effort © Copyright Salford Systems 2013
  • 125. Classic Reference • The original CART monograph, published in 1984, remains one of the great classics of machine learning • Classification and Regression Trees by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, CRC Press • Available also in paperback and as e-book from Amazon: • http://www.amazon.com/Classification-Regression-Trees-Leo-Breiman/dp/0412048418/ • Not the easiest reading but well worth having as a reference and contains fascinating discussions regarding the decisions the authors made in crafting CART • Contains extensive discussion of priors as well as all major concepts relevant to CART. Still worthwhile reading. © Copyright Salford Systems 2013
  • 126. CART Monograph Details © Copyright Salford Systems 2013
  • 127. For The Casual User • Thinking about a binary 0/1 classification problem we have two ways of evaluating a CART generated segment – Assign the segment to the majority class (more than 50%) – If there are more 1s than 0s then the segment is labeled "1" – Assign the segment to the class with a LIFT greater than 1 – We start with a baseline event rate (fraction of 1s in the data) – Look at the ratio of the event rate in the node to the event rate in the sample • Ratio of event rate in segment to event rate in root – Any segment with a better than baseline event rate is labeled "1" • CART by default uses the LIFT concept for making decisions (known in CART-speak as PRIORS EQUAL) • You can elect to use the first method via PRIORS DATA © Copyright Salford Systems 2013
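The two decision rules reduce to a few lines of code. A sketch using the 38.5% node and 15.2% baseline from the earlier response-node example; the function names are invented.

```python
# Sketch: majority-rule labeling (PRIORS DATA flavor) versus lift-rule
# labeling (PRIORS EQUAL flavor) for a binary node.
def label_majority(node_rate):
    """Label 1 only if 1s outnumber 0s in the node."""
    return 1 if node_rate > 0.5 else 0

def label_lift(node_rate, baseline_rate):
    """Label 1 if the node's event rate beats the overall baseline."""
    return 1 if node_rate / baseline_rate > 1.0 else 0

# A 38.5% response node against a 15.2% baseline: lift = 2.53
print(label_majority(0.385))        # 0: under half, so the majority rule says "0"
print(label_lift(0.385, 0.152))     # 1: well above baseline, so the lift rule says "1"
```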
  • 128. Example Split: Priors Equal © Copyright Salford Systems 2013 Almost 80% GOOD (Class 0) Remainder BAD (Class 1) Left child is considered a BAD dominant node because 36% BAD > 21.4% BAD Priors equal simply ensures that we think in these "relative to what we started with" terms
  • 129. PRIORS EQUAL or PRIORS DATA • PRIORS EQUAL is almost always the right choice – Is the DEFAULT and almost always yields useful results • PRIORS DATA focuses on absolute majority and not relative counts in the data – Will rarely work with highly unbalanced data (e.g. a 10:1 ratio of 0 to 1) • PRIORS can be expressed as a ratio – Default 1:1 – You can set priors to whatever ratio you like • 1.2:1 as we did in the previous example • 5:1 • 10:1 – Changing priors usually changes results, sometimes dramatically – Extreme priors often make getting any tree impossible © Copyright Salford Systems 2013
  • 130. Setting PRIORS Mechanics © Copyright Salford Systems 2013 To set your own PRIORS first click the SPECIFY option The default settings of 1:1 can now be changed To the left the dialog is allowing me to alter the entry for Class 0 Once entered I will be given the opportunity to make a new entry for Class 1
  • 131. If PRIORS can change results then what is right? • The results CART gives you are intended to reflect what you consider important and what makes sense given your objectives • PRIORS EQUAL usually reflects what most people want • If tweaking the PRIORS and changing them gives you better results given your objectives then use the tweaked priors © Copyright Salford Systems 2013
  • 132. Advice on PRIORS • Start with the default of EQUAL – Most users never get beyond this! • BATTERY PRIORS – CART PRO EX runs an automatic sweep across dozens of different settings to display the consequences of tweaking the priors – Results are then summarized in tables and charts – Useful when you want to achieve a specific balance of accuracy across the dependent variable classes – Choose the setting that is practically best • Otherwise, you can experiment manually to measure the impact of a change © Copyright Salford Systems 2013
  • 133. PRIORS: Under the Hood • To understand how PRIORS affect core CART calculations we need to start with a brief review of splitting rules • We will only discuss the Gini to illustrate the key concepts © Copyright Salford Systems 2013
  • 134. Start With Gini Splitting Rule: Two classes • Very simple formula for the two class (binary) dependent variable • Label the classes as Class 0 and Class 1 and in a specific node in a tree we represent the shares of the data for the two classes as p0 and p1 These two must sum to 1 (p0 + p1 = 1) • The measure of diversity (or impurity) in a given subset of data (e.g. a node) is given by Impurity = 1 – p0*p0 – p1*p1 • Impurity will equal 0 if either sample share is equal to 1 (100%) • Impurity will equal 0.50 when both sample shares are equal (50%) 1 – (.5*.5) – (.5*.5) = 1 - .25 - .25 = .50 © Copyright Salford Systems 2013
  • 135. Splitting Criteria and Impurity • The Gini measure is just a sensible way to represent how diverse the data is in a node (for a classification problem) – Extensive experience shows it works well, a good measure – You do have a choice of 6 different splitting methods in CART • Useful because it can be used for any number of classes – Every class has a share – Square the shares and subtract them all from 1 • We use the Gini measure as a way to rank competing splits • Split A will be considered better if it produces child nodes with less diversity (on average) than does split B • We measure the goodness of split by looking at the reduction in impurity relative to the node being split (the parent) © Copyright Salford Systems 2013
• 136. Improvement Calculation • Hypothetical Example © Copyright Salford Systems 2013 Parent Node Impurity = 0.50. Left Child Impurity = .30 (20% of data). Right Child Impurity = .20 (80% of data). The left child reduces impurity by 0.20 (0.50 – 0.30) and the right child by 0.30 (0.50 – 0.20). Weighted average child impurity is .2*.3 + .8*.2 = .22; improvement over the parent is .5 - .22 = .28
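A short sketch (illustrative Python; function name mine) that reproduces the arithmetic above:

```python
def split_improvement(parent_imp, frac_left, left_imp, right_imp):
    """Impurity reduction: parent impurity minus the weighted child average."""
    weighted_child = frac_left * left_imp + (1.0 - frac_left) * right_imp
    return parent_imp - weighted_child

# The hypothetical example: 20% of the data goes left
print(split_improvement(0.50, 0.20, 0.30, 0.20))  # ~0.28 (0.50 - 0.22)
```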
• 137. Graphing Gini Impurity (2 classes) • Impurity formula here simplifies to 2p(1-p) • Impurity is greatest when p = (1-p) = 0.5 • Impurity is low when p is near either extreme of 0 or 1, as the node is dominated by one class • Declines slowly near p = .5 and accelerates as it approaches 0 or 1 [Figure: Gini impurity plotted against p on the interval 0 to 1; the graph shows 2*[2*p*(1-p)] to make it easier to read] © Copyright Salford Systems 2013
• 138. Split Improvement Measurement (No Missing Values for Splitter) © Copyright Salford Systems 2013 [Diagram: parent node split into left and right children, each annotated with N and percent] Parent Impurity = 0.50. Left Child Impurity = 0.3967, fraction of data in left child = 55%. Right Child Impurity = 0.3457, fraction of data in right child = 45%. Weighted average of child node diversity = .55*.3967 + .45*.3457 = .3737. Overall improvement of split = .50 - .3737 = .1262
• 139. As expressed in the CART monograph: parent node impurity minus the weighted average of the impurities in each child node Δi(s, t) = i(t) - pL*i(tL) - pR*i(tR) • pL = probability of case going left (fraction of node going left) • pR = probability of case going right (fraction of node going right) • t = node • s = splitting rule • i = impurity © Copyright Salford Systems 2013
• 140. Unbalanced Data and PRIORS EQUAL • Calculations for all key quantities become weighted when we use the CART default and the original data is unbalanced • Weighting is used to calculate – Fraction of the data belonging to each class – Fraction of the data in the left and right child nodes – Gini impurity in each node – Resulting improvement of the split (reduction in impurity) • We can no longer use simple ratios • Good news is that the mechanism for weighting is very simple and easy to remember – All counts are expressed as the count in the node divided by the corresponding count in the root node © Copyright Salford Systems 2013
• 141. Calculations for Priors • Our training sample starts with N0 examples of class 0 and N1 examples of class 1 • Now look at any node t in the CART tree – N0(t) examples of class 0 – N1(t) examples of class 1 • Fraction of class 0 will now be calculated as (simplified): (N0(t)/N0) / [(N0(t)/N0) + (N1(t)/N1)] • In other words we convert every count to the ratio of a count in a node (t) to the corresponding count in the root (sample) • Then the math is the same as usual © Copyright Salford Systems 2013
• 142. What fraction of the data is in a node • Again we use ratios instead of counts to calculate • For priors equal we just average – Fraction of all the Class 0 in a node – Fraction of all the Class 1 in a node • If the priors are not equal then all ratios are first multiplied by the corresponding prior (which acts as a weight): (P0*N0(t)/N0) / [(P0*N0(t)/N0) + (P1*N1(t)/N1)] • When priors are equal the prior terms all cancel out © Copyright Salford Systems 2013
• 143. Priors Incorporated Into Splitting • pi(t) = proportion of class i in node t • Gini = 1 - Σi pi(t)^2 • If PRIORS DATA then π(i) = Ni/N and the proportions reduce to the raw data shares: pi(t) = Ni(t)/N(t) • Otherwise proportions are always calculated as weighted shares using the priors-adjusted pi: pi(t) = [π(i)*Ni(t)/Ni] / Σj [π(j)*Nj(t)/Nj] © Copyright Salford Systems 2013
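A sketch of these two formulas in Python (my variable names; π is written as `priors`, and the counts are hypothetical):

```python
def priors_adjusted_shares(node_counts, root_counts, priors):
    """p_i(t): class proportions in node t, reweighted by the priors.

    node_counts[i] = Ni(t), root_counts[i] = Ni, priors[i] = pi(i).
    With PRIORS DATA (priors equal to the root class shares) this
    reduces to the raw proportions Ni(t)/N(t)."""
    weighted = [p * nt / n for p, nt, n in zip(priors, node_counts, root_counts)]
    total = sum(weighted)
    return [w / total for w in weighted]

def gini(shares):
    """Gini = 1 minus the sum of squared class shares."""
    return 1.0 - sum(s * s for s in shares)

# An unbalanced root node (79% vs 21%) under PRIORS EQUAL is treated as 50/50:
print(priors_adjusted_shares([790, 210], [790, 210], [0.5, 0.5]))  # [0.5, 0.5]
```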
• 144. Run a Real World Example 79% Class 0 (Good), 21% Class 1 (Bad) © Copyright Salford Systems 2013 Data set: BAD_RARE_X.XLS. Model: BAD = X15 (just one predictor)
• 145. Test method: 20% random sample for test © Copyright Salford Systems 2013 We only want to look at the root node split, but the tree is quite predictive!
  • 146. Root Node Split: Under PRIORS EQUAL © Copyright Salford Systems 2013 Main splitter improvement is reported to be .06264 Observe that the left hand child is considered to be Class 1 because the node Class 1 share of 41% is greater than the root share of 21.4%
  • 147. Classic Output Typical user rarely consults classic output © Copyright Salford Systems 2013 Start by confirming the total record counts in the parent and child nodes Agrees with previous diagram in GUI
  • 148. Next Confirm Target Class Breakdown © Copyright Salford Systems 2013 Here we see the same counts for Class 0 and Class 1 as in GUI
• 149. Priors Adjusted Computations © Copyright Salford Systems 2013 Note first that the parent node is reported to have 50% class 0 and 50% class 1. This is guaranteed for the root node under priors equal. With 2 classes each is treated as if it represented half the data; with 3 classes each would be treated as if it represented 1/3 of the data. Our calculations of the Gini impurity are based on these priors-adjusted shares of the data (or node). The class breakdowns in the child nodes (left and right) are priors-adjusted using the formulas presented earlier
• 150. Spreadsheet to Reproduce Results © Copyright Salford Systems 2013 Column C contains the counts for each class in the parent and child nodes. Column H at the top records the priors. Column G displays the priors-adjusted shares (raw shares are in Column D). Column F displays raw and priors-adjusted child node probabilities. Column J displays the Gini diversity in the parent and child nodes and the improvement generated by the weighted average of the child diversities. All we need to input are the class counts and the priors; formulas do the rest
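A hedged Python sketch (mine, not the spreadsheet itself) performing the same computation end to end from class counts and priors; the counts in the example call are hypothetical:

```python
def priors_split_improvement(parent, left, right, priors):
    """Gini improvement of a split under priors, from raw class counts.

    parent, left, right are [N_class0, N_class1]; the parent is treated
    as the root node, so all ratios are taken relative to it."""
    def weights(node):
        # Priors-weighted ratios of node counts to root counts
        return [p * n / r for p, n, r in zip(priors, node, parent)]

    def node_gini_and_mass(node):
        w = weights(node)
        mass = sum(w)                   # priors-adjusted fraction of data in node
        shares = [x / mass for x in w]  # priors-adjusted class shares
        return 1.0 - sum(s * s for s in shares), mass

    g_parent, m_parent = node_gini_and_mass(parent)
    g_left, m_left = node_gini_and_mass(left)
    g_right, m_right = node_gini_and_mass(right)
    p_left, p_right = m_left / m_parent, m_right / m_parent
    return g_parent - p_left * g_left - p_right * g_right

# Hypothetical counts for an unbalanced sample under PRIORS EQUAL:
print(priors_split_improvement([790, 210], [200, 150], [590, 60], [0.5, 0.5]))
# roughly 0.106
```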
  • 151. Conclusion • Priors are an advanced control that the casual user need not worry about • The default setting is almost always reasonable and almost always yields valuable results • Tweaking the priors can change the details of the tree and can alter results – Sometimes considerably – Can be worth running some experiments • Further discussion in another tutorial © Copyright Salford Systems 2013
• 152. Modeling automation Report Develop model using a variety of strategies Here we display results for each of the 6 major tree growing methods. Entropy yields the best performance here. This is one of 18 different automation schemes. © Copyright Salford Systems 2013
  • 153. Summary of Variable Importance Results Across alternative modeling strategies © Copyright Salford Systems 2013
• 154. Performance Curves of Alternative Models Error plotted against model complexity Four strategies yield similar results; one performs much worse © Copyright Salford Systems 2013
  • 155. Alternative Modeling Automation Strategies Analyst Can Run All Strategies if desired © Copyright Salford Systems 2013
  • 156. Automated Modeling: Vary Penalty on False Positives © Copyright Salford Systems 2013
  • 157. Accuracy among YES and NO groups As penalty on false positive is varied (automatically) © Copyright Salford Systems 2013
• 158. Automatic Shaving: Backwards Elimination of Least Important Feature © Copyright Salford Systems 2013
  • 159. Hot Spot Detection: Search many trees for high value segments Lift in node plotted against sample size: Examination of individual nodes from many different trees to find best segments © Copyright Salford Systems 2013
• 160. Tabular detail: Hot spot search for special nodes Tree 18 Node 25 defines a segment with 85.3% of the target class. Sample size in this segment is N=265 in the test set. Clicking on any row brings up the tree for examination and review © Copyright Salford Systems 2013
• 161. Constrained Trees • Many predictive models can benefit from Salford's patent-pending "Structured Trees" • Trees constrained in how they are grown to reflect decision support requirements • In the mobile phone example: we want the tree to first segment on customer characteristics and then complete using price variables – Price variables are under the control of the company – Customer characteristics are not under company control © Copyright Salford Systems 2013
  • 162. Visualizing separate regions of tree © Copyright Salford Systems 2013
• 163. Constraint Dialog Model set up specifying allowable ranges for predictors Green indicates where in the tree the variables of each group are allowed to appear © Copyright Salford Systems 2013
  • 164. Constrained Tree Mobile Phone Price variables appear only at bottom Demographic and spend information at top of tree Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom © Copyright Salford Systems 2013
  • 165. Model Deployment -1 Translate Model into Reusable Programming Code New version supports JAVA, C, PMML, SQL, SAS® © Copyright Salford Systems 2013
  • 166. Automatically Generated Code Can be deployed directly © Copyright Salford Systems 2013
• 167. Deployment – II Use Salford Scoring Engine/Server Controllable via scripting; can be deployed in batch mode on a server © Copyright Salford Systems 2013
• 168. Cross-Validation: Part 1 • Built-in automatic method of self-testing a model for reliability • Honest assessment of the performance characteristics of a model – Will the model perform as expected on previously unseen (new) data? • Available for all principal Salford data mining engines • The 1984 CART monograph was decisive in introducing cross-validation into data mining • Many important details relevant to decision trees and sequences of models were developed in the monograph for the first time © Copyright Salford Systems 2013
  • 169. Cross-Validation is a Testing Method • Why go through special trouble to construct a sophisticated testing method when we can just hold back some test data? • When working with plentiful data it makes perfect sense to reserve a good portion for testing – E.g. Credit risk data set with 150,000 training records and 100,000 test records, real world example – Direct Marketing data sets with 300,000 training records and 50,000 test records • Not all analytical projects have access to large volumes of data © Copyright Salford Systems 2013
  • 170. Principal Reason for Cross-Validation Data Scarcity • When relevant data is scarce we face a data allocation dilemma – If we reserve sufficient data to conduct a reliable test we find ourselves lacking training data – If we insist on having enough training data to build a good model we will have little or nothing left for testing • Train Test • o---------------------------------------------------------------|-------------o • A common division of data is 80% train 20% test • With 300 data records in total this would amount to 240 train and 60 test © Copyright Salford Systems 2013
  • 171. Tough decision: How much data to allocate to test • Train Test • o---------|-------------------------------------------------------------------o • Train Test • o------------------------------|----------------------------------------------o • Train Test • o-------------------------------------------------|---------------------------o • Train Test • o------------------------------------------------------------------------|----o © Copyright Salford Systems 2013
  • 172. Unbalanced Target Data • In most classification studies the target (dependent variable) data distribution is unbalanced • Usually one large data segment (non-event) and a smaller data segment (event) which is the subject of the analysis – Who purchases on an e-commerce website? – Who clicks on a banner ad? – Who benefits from a given medical treatment? – What conditions lead to a manufacturing flaw? • When the data is substantially unbalanced the sample size problem is magnified dramatically – Think of your sample size as being equal to the smaller class – If you only have 100 clicks that is your data set size – Does not matter much that you have 1 million non-clicks. © Copyright Salford Systems 2013
  • 173. Cross-Validation Strategy: Sample Re-use • Any one train/test partition of the data that leaves enough data for training will yield weak test results – based on just a fragment of the available data • But what if we were to repeat this process many times – using different test partitions? • Imagine the following: we divide the data into many 90/10 train/test partitions and repeat the modeling and testing • Suppose that in every trial we get at least 75% of the test data events classified correctly • This would increase our confidence dramatically in the reliability of the model performance – Because we have multiple at least slightly different tests © Copyright Salford Systems 2013
• 174. Cross-Validation Technical Details • Cross-Validation requires a specialized preparation of the data, somewhat different than our example of repeated train/test partitioning • We start by dividing the data into K partitions. In the original CART monograph Breiman, Friedman, Olshen, and Stone set K=10 • K=10 has become an industry standard due both to Breiman et al. and other studies that followed (see final slides for details) • The K partitions should all have the same distribution of the target variable (same fraction of events) and if possible be equal in size – It takes care to get this right when the data cannot be evenly divided into K parts • This is all done automatically for you in SPM software © Copyright Salford Systems 2013
• 175. Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 Test, K-1 Train [Figure: partitions 1 through 10 shown in a row for each CV cycle; in each cycle a different partition is marked Test and the remaining nine are marked Learn] Each partition is in the train sample 9 times and in the test sample 1 time
• 176. Build K Models • Once the data has been partitioned into the K parts we are ready to build K models – If we have 10 data partitions then we will build 10 models • Each model is constructed by reserving one part for test and the remaining K-1 parts for training – If K=5 then each model will be based on an 80/20 split of data – If K=10 then each model will be based on a 90/10 split – There is nothing wrong with considering K=15 or K=20 or more • In this strategy it is important to observe that each of the K blocks of data is used as a test sample exactly once • If we could somehow combine all the test results we would have an aggregated test sample equal in size to that of the training data © Copyright Salford Systems 2013
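In outline (an illustrative sketch only; SPM performs this internally, and `fit`/`evaluate` stand in for whatever modeling and scoring routines are used):

```python
def cross_validate(folds, data, fit, evaluate):
    """folds: K disjoint lists of record indices covering the data."""
    fold_results = []
    for test_idx in folds:
        train_idx = [i for f in folds if f is not test_idx for i in f]
        model = fit([data[i] for i in train_idx])  # train on K-1 parts
        fold_results.append(
            evaluate(model, [data[i] for i in test_idx]))  # test on held-out part
    return fold_results  # one test result per fold, aggregated afterwards
```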
• 177. Euro_Telco_Mini.xls Data Set
CVCycle  Class=0 Learn  Class=0 Test  Class=1 Learn  Class=1 Test  CVW
1        634            70            113            13            0.1026161
2        633            71            114            12            0.0960758
3        634            70            113            13            0.1026161
4        633            71            114            12            0.0960758
5        634            70            113            13            0.1026161
6        633            71            114            12            0.0960758
7        634            70            113            13            0.1026161
8        634            70            113            13            0.1026161
9        633            71            114            12            0.0960758
10       634            70            113            13            0.1026161
• Here we see the breakdown of the 830 record data set into the 10 CV folds • Table shows sample counts for majority and minority classes for learn and test partitions for each fold • Observe that CART has succeeded in making each fold almost identical in the learn/test division and in the balance between TARGET=0 and TARGET=1 • Last column is the WEIGHT that CART uses on each fold for certain calculations
• 178. Confusion Matrix (Prediction Success Matrix) • In two-class (e.g. Yes/No) classification, test results can be represented via the 2x2 confusion matrix © Copyright Salford Systems 2013
            Predicted Y=0   Predicted Y=1
Actual Y=0  20              4
Actual Y=1  1               5
Hypothetical results for the test set of a single cross-validation fold. Note the test sample is quite small, but there will be a number of these (e.g. 10)
• 179. Aligning the CV Trees (all automatic; the user never sees this)
            Main     CV1      CV2      CV3      CV4      CV5      CV6      CV7      CV8      CV9      CV10
Nodes       2        2        3        2        2        2        2        2        2        2        2
Complexity  0.01523  0.11543  0.04915  0.12949  0.08684  0.1178   0.09157  0.11464  0.11911  0.11201  0.10531
Nodes       4        6        4        4        4        5        4        4        5        4        4
Complexity  0.01487  0.01736  0.02034  0.01598  0.03128  0.01518  0.03642  0.02188  0.01815  0.02083  0.02285
Nodes       5        7        4        4        4        5        4        4        5        4        7
Complexity  0.01189  0.01455  0.02034  0.01598  0.03128  0.01518  0.03642  0.02188  0.01815  0.02083  0.01342
Nodes       9        8        4        8        4        9        4        9        6        8        10
Complexity  0.00893  0.01118  0.02034  0.01042  0.03128  0.01219  0.03642  0.01229  0.0114   0.01259  0.01157
• We would expect the trees to be aligned by number of nodes and this is approximately what happens • CART aligns the trees by a measure of "complexity" discussed in other sessions • Alignment is required to determine the estimated error rate of the main tree when it has been pruned to a specific size (complexity) • Thus when the main tree is pruned to 4 terminal nodes we align each CV tree appropriately: seven of the CV trees are also pruned to 4 nodes, two are pruned to 5 nodes, and one to 6 nodes
• 180. Summing the Confusion Matrices • Each CV fold generates a test confusion matrix based on a completely separate subset of data • When summed the test partitions are equal to the entire original training data • Summing the confusion matrices yields an aggregate matrix that is based on a sample equal to the original data set • If we started with 300 records the assembled confusion matrix consists of 300 test records • Not a "trick": each record was genuinely reserved for test one time and was classified correctly or incorrectly in its fold • We have thus arrived at the largest possible test sample we could create: as if 100% of the data was used for test! © Copyright Salford Systems 2013
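A minimal sketch of the aggregation step (illustrative Python; the 2x2 layout matches the confusion matrix slide above, and the fold results are hypothetical):

```python
def sum_confusion(matrices):
    """Element-wise sum of the K per-fold 2x2 test confusion matrices."""
    total = [[0, 0], [0, 0]]
    for m in matrices:
        for i in range(2):
            for j in range(2):
                total[i][j] += m[i][j]
    return total

# Ten hypothetical folds, each like the earlier example: 30 test records apiece
fold_matrices = [[[20, 4], [1, 5]]] * 10
print(sum_confusion(fold_matrices))  # [[200, 40], [10, 50]] -- 300 test records in all
```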
  • 181. Test Results Extracted From Cross-Validation • Cross-validation is not a method for building a model • Cross-validation is a method for indirectly testing a model that on its own has no test performance results • In classic cross-validation we throw away the K models built on parts of the data. We keep only test results. • Modern options for using these K different models exist and you can save them in SPM – Could be used in a committee or ensemble of models – One of the CV models might turn out to be more interesting than the main model © Copyright Salford Systems 2013
• 182. Does Cross-Validation Really Work? • We have tested CV by extracting a small training data set from a much larger database • We used CV to obtain a "simulated" test performance • We then tested our main model against a genuine large test sample extracted from the larger database • Our results were always remarkably in agreement: CV gave essentially the same results as the true test set method • The CART monograph also discusses similar experiments conducted by Breiman, Friedman, Olshen and Stone (BFOS) • They come to the same conclusion while observing that 5-fold cross-validation tends to understate model performance and that 20-fold may be slightly more accurate than 10-fold © Copyright Salford Systems 2013
• 183. How Many Folds? • How many folds do we need to run to obtain reliable results? • Think about 2-fold CV – Divide the data into two parts – First train on part 1 and test on part 2 – Then reverse the roles of train and test – Assemble results • The problem with 2-fold CV is that we train on only half the available data – This is a severe disadvantage to the learning process unless we have a large amount of data • The spirit of CV is to use as much training data as possible © Copyright Salford Systems 2013
• 184. How many CV folds? • In the original CART monograph the authors Breiman, Friedman, Olshen and Stone discussed some experiments • Using small numbers such as 5-fold was typically pessimistic – Results suggested the model was not as good as it really was • Using a substantial number of folds such as 20 was generally only slightly more accurate than 10-fold – The CART authors suggested 10-fold as a default – Results hold for classification problems • These classification model results were re-confirmed in a 1995 paper by Ronny Kohavi – A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. International Joint Conference on Artificial Intelligence (IJCAI 1995) © Copyright Salford Systems 2013
• 185. Creating Your Own Folds: Needs to be done with care with smaller samples • Suppose you have 100 records divided as – 92 records Y=0 – 8 records Y=1 • Each fold must have at least one record for each target class • The best we can do then is 8 folds • But we cannot divide 92 into 8 equal parts – 7 parts with 11 records Y=0 (response rate = .0833) – 1 part with 15 records Y=0 (response rate = .0625) • Better to divide as – 4 parts with 11 records Y=0 (response rate = .0833) – 4 parts with 12 records Y=0 (response rate = .0769) – A more equal balance across the folds yields more stable results (see the sketch below) © Copyright Salford Systems 2013
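A sketch of how such balanced folds could be built (my own illustration, not the SPM algorithm): round-robin assignment within each class keeps fold sizes within one record of each other.

```python
import random

def stratified_folds(idx_class0, idx_class1, k, seed=0):
    """Spread each class across k folds as evenly as possible."""
    random.seed(seed)
    folds = [[] for _ in range(k)]
    for class_indices in (idx_class0, idx_class1):
        shuffled = random.sample(class_indices, len(class_indices))
        for i, record in enumerate(shuffled):
            folds[i % k].append(record)  # round-robin assignment
    return folds

# 92 records of Y=0 and 8 of Y=1 into 8 folds: every fold gets exactly
# one Y=1 record, and the Y=0 records split as four 12s and four 11s.
folds = stratified_folds(list(range(92)), list(range(92, 100)), 8)
print([len(f) for f in folds])
```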
• 186. Points to Remember • The "main" model in CV is always built on all the training data – Nothing is held back for testing • If you were to run CV in several different ways – Vary the number of folds – Vary construction of CV folds by varying the random number seed • You would always get the exact same main model – Only the estimates of test performance could differ • Are the results sensitive to these parameters? – BATTERY CV re-runs the analysis with different numbers of folds • Larger numbers should converge – BATTERY CVR uses the same number of folds but creates the K partitions based on different random number seeds • Is expected to yield reasonably stable results • Unstable results suggest considerable uncertainty regarding your model © Copyright Salford Systems 2013
• 187. Cross-Validation: Part II • In part I we reviewed the main ideas behind cross-validation • We pointed out that CV is a method for testing a model • Especially useful when there is a shortage of data but can be used in any circumstance • A main model is built on all training data with nothing held back for testing • An additional set of K different models is built on different partitions of the data, holding back some of the data for test • The test results for the K models are aggregated and then used as an estimate of the test set performance of the "main" model © Copyright Salford Systems 2013
• 188. Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 Test, K-1 Train [Figure repeated from Part I: partitions 1 through 10 shown in a row for each CV cycle; in each cycle a different partition is marked Test and the remaining nine are marked Learn] Each partition is in the train sample 9 times and in the test sample 1 time
• 189. Alignment of Results • In this session we discuss a somewhat technical topic related to the mechanics of aligning test results from K CV models and the main model • Recall that CART grows a large tree and then prunes it back • Back pruning is conducted via "cost-complexity" • Back pruning might prune off more than one terminal node at a time • Back pruning might prune back several nodes along the same branch • CV generates K different models each with its own maximal tree and its own sequence of back-pruned trees © Copyright Salford Systems 2013
• 190. CV Mechanics • The main model has no test data; each CV model has test data © Copyright Salford Systems 2013 [Diagram: the Main Model shown alongside CV Models 1 through 10; test results from all CV folds are combined and attributed to the main model]
• 191. CART and CV Details • A CART tree model is actually a family of progressively smaller tree models, one of which is normally deemed "optimal" • So we don't just have a main model and K CV models • We have a main tree sequence and K CV tree sequences • For every tree in the main sequence we need to match it up with its corresponding tree in each CV sequence • The most obvious way to do this is by tree size • To estimate the error rate of the 2-node tree in the main tree sequence match it up with the K 2-node trees found via CV • Then proceed to match up every other tree size found © Copyright Salford Systems 2013
  • 192. CART Tree Alignment • Matching up trees from the different sequences is much more complicated than this • Each CV tree has its own sequence and its own maximal size • These sequences may not all contain the same tree sizes • The main tree might contain a subtree with 8 terminal nodes but not every CV tree will contain an 8 node tree – Back pruning sometimes skips over certain sizes jumping directly say from 9 terminal nodes to 7 • Not all tree sequences will have the same number of nodes in the maximal tree © Copyright Salford Systems 2013
• 193. Alignment via Cost Complexity • Cost complexity prunes trees by examining a trade-off between error rate (cost) and size of the tree (complexity) • Error rate can be taken to be the misclassification rate for this discussion (on the training data) • Suppose our maximal tree has a training data misclassification rate of .00 (not uncommon on training data) but that the tree is very large (e.g. 1000 terminal nodes) • Suppose we penalized terminal nodes at the rate of .0001 • Then the error rate of 0 would be counterbalanced by a penalty of 1000*(.0001) = 0.10 • If we could prune off 500 nodes we would reduce the penalty to .05, though of course our misclassification rate would probably increase • If the increase in misclassification rate were, say, .04 then the total of misclassification rate + penalty would be only .04 + .05 = .09, a benefit! © Copyright Salford Systems 2013
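The trade-off can be written as a penalized cost, R_alpha(T) = R(T) + alpha*|T|, where |T| is the number of terminal nodes. The arithmetic above in a few lines of Python (values taken from the worked example):

```python
alpha = 0.0001                      # penalty per terminal node
big_tree    = 0.00 + alpha * 1000   # error 0, 1000 terminal nodes -> cost 0.10
pruned_tree = 0.04 + alpha * 500    # error .04, 500 terminal nodes -> cost 0.09
print(big_tree, pruned_tree)        # pruning wins: 0.09 < 0.10
```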
• 194. CART Cost Complexity Pruning • CART automatically tests different penalties to try to induce a smaller tree • We always start with a penalty of 0 and then gradually increase it • To prune back we prune off the so-called "weakest link", the node that increases the misclassification rate of the whole tree the least • This means that the sample size of the node is taken into account • A progressive search algorithm for finding the next penalty is described in the CART monograph © Copyright Salford Systems 2013
• 195. Cost-Complexity is the key to Alignment • For every CART tree sequence a specific penalty on nodes (e.g. .001) leads immediately to exactly one tree of a specific size • We can only find this tree by going through the pruning sequence (no shortcuts) • We align the CART CV trees by the penalty (complexity) rather than by tree size • So for a given penalty we find the tree that corresponds to it both in the main tree sequence and in each CV tree sequence • These aligned trees are used to extract the performance measures that will finally be assigned to the main tree of that size © Copyright Salford Systems 2013
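A sketch of the lookup (my own framing, not the SPM internals): if each pruning sequence is stored as (complexity threshold, tree) pairs, with thresholds increasing as trees shrink, then for a given penalty we take the smallest tree whose threshold does not exceed it.

```python
def tree_for_penalty(sequence, penalty):
    """sequence: [(threshold, tree), ...] ordered from the largest tree
    (smallest threshold) to the smallest tree (largest threshold)."""
    chosen = sequence[0][1]
    for threshold, tree in sequence:
        if threshold <= penalty:
            chosen = tree   # this tree is optimal at penalties >= its threshold
        else:
            break
    return chosen
```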
  • 196. Table of Alignments: Special extract report not automatically generated © Copyright Salford Systems 2013 • Table displays the aligned trees corresponding to each tree in the main sequence • In the first row the main tree has been pruned to 2 nodes as have all but one of the CV trees • When the main tree is pruned to 7 nodes it is aligned with trees of varying sizes ranging from 4 to 7 terminal nodes • The complexity penalties appear under the terminal node counts • Complexity penalties always increase as the tree becomes smaller