Salford Systems Webex Training
Salford Systems
http://www.salford-systems.com
CART® Decision Tree Basics
• We start with a simple analysis of some market research
data using CART
• This introduction assumes no background in data mining or
predictive analytics
• We do assume you have had some experience reviewing
data with the purpose of discovering interesting and/or
predictive patterns
Beginning with CART
• CART is the perfect place to start learning about data mining
• Widely regarded as one of the most important tools in data
mining and also the easiest to understand and master
– Decision trees are still the most popular data analysis tool among
experienced data miners
• Delivers easy to understand analyses of complex data
– Allows for very sophisticated analyses especially when a structured
series of trees are developed
– Effective Exploratory Data Analysis (EDA) to support more
conventional modeling (e.g., logistic regression)
Classification with CART®
Real world study early 1990s
• Fixed line service provider offering a
new mobile phone service
• Wants to identify customers most
likely to accept new mobile offer
• Data set based on limited market trial
• 830 usable records
• 67 attributes and target including
– Demographics
– Attitudes and Needs
– Pricing for handset & minutes
Mobile Phone Offer
• Data is a sample of land line telephone customers of a
European telco
• At the time mobile phones were very rare in the country in
question
• The Company realized the time was right to introduce
mobile phones on a substantial scale to their existing fixed
line customer base
• Key questions:
– WHO to target with the marketing campaigns for the new product
– HOW MUCH to charge for the handset
Nature of the Research
• Company arranged to make real world offers to about 1,000
existing land line customers
• Everyone was presented the same offer (only one model of
phone and one service plan available)
• The PRICE of the handset was varied randomly over a large
range of prices from near zero to about $300
• Goal was to learn who responded positively and at what
price points
• Offers were made in person as part of a one hour visit in
which much was learned about the household (media
preferences, number of children, distance to work, etc)
• Target variable RESPONSE: Coded 0 or 1 (NO, YES)
• 65 available predictors include variables like:
Nature of the Data
HANDPRIC Cost of handset (one time fee)
USEPRICE Usage cost (per month, 100 minutes)
TELEBILC Landline home phone bill average
CITY Resident in which of 5 major cities
AGE Coded in 5 year increments
HOUSIZ Possible proxy for income, coded 1-6
SEX Male, Female, Unknown
EDUCATN Coded 1-7 (through postgrad)
Analysis File Overview in CART 6.0
Set Up the Model
(Select Target, allowable predictors)
Only requirement is to select TARGET (dependent) variable. CART will do everything
else automatically
CART:
Does its own variable selection
• Embedded variable (feature) selection means that modeler
can let the software make its own choice of predictors
• Modeler will often want to limit the model to focus on
selected inputs
– Exclude ID variables and merge keys
– Exclude clones of the dependent variable
– Exclude data pertaining to the future (relative to the dependent)
– E.g. restrict a model to easily available predictors
– Test predictive power of purchased external data
• Modeling automation can allow exploration of a vast space
of pre-selected predictors (see later slides)
In the example we run the CART model
• CART completes analysis and gives access to all results
from the NAVIGATOR
– Shown on next slide
• Upper section displays tree of a selected size
– number of terminal nodes
• Lower section displays error rate for trees of all possible
sizes
• Green bar marks most accurate tree
• We display a compact 10 node tree for further scrutiny
CART Model Viewer
Access reports and drill into model details
Most accurate tree is marked with the green bar. Above we select the 10 node tree
for convenience of a more compact display. Note the train/test area under the ROC curve
Root Node: Hover Mouse
Tree starts with all training data
Displays details of TARGET variable in overall training data
Above we see that 15.2% of the 830 households accepted the offer
Goal of the analysis is now to extract patterns characteristic of responders
Goal is to split node: separate responders
• Details of root node split
• If we could only use a single piece of information to separate responders from
non-responders CART chooses the HANDSET PRICE
• Those offered the phone with a price > 130 contain only 9.9% responders
• Those offered a lower price respond at 21.9%
CART Splitting Rules
• We discuss the details later
• Here we just point out that the split CART displays is
– "the best of all possible splits"
• Subject to the splitting criteria you have chosen and any
constraints imposed
• How do we know this split is "best"?
• Because CART actually tries all possible splits looking for
the best
– Exhaustive brute force search
– Advanced algorithms used to make this search fast
– As much as 100 times faster than other decision trees
Grow progressively bigger tree: One split at a time
• Binary recursive partitioning repeated until further splitting
impossible (e.g. data exhausted)
• This leads us to the largest possible or "maximal" tree
Maximal tree is raw material for best model
• Goal is to find optimal tree embedded inside maximal tree
• Will find optimal tree via "pruning"
• Like backwards stepwise regression
• Challenge: A tree with 100 terminal nodes can be pruned back to 99
terminal nodes by eliminating any one of the 99 penultimate nodes
• Now the 99 new terminal nodes can be cut back to 98 by eliminating
any one of the surviving 98 penultimate nodes
• Something like 99! possible trees. How do we find the best?
Pruning Sequence
• CART automatically generates a pruning sequence which
develops a preferred sequence of progressively smaller
trees
• We can prove that for a given tree size the CART tree in the
sequence will be the best performing tree of all possible
trees of that size
• In our sequence, the 10 node tree is guaranteed to be more
accurate than any other 10 node tree you could extract from
the maximal tree
• You as the user never need to worry about this
• "Better" is defined in terms of performance on the training
data as we need the tree sequence before we can test
Error Curve: Plots Accuracy vs Model Size
• Requires test data
• Can use cross-validation (sample reuse) if data is scarce
• Curve typically U-shaped
• Too small is not good and neither is too large
• Can look at any tree in the sequence of pruned subtrees
• Error is what BFOS call an "honest" estimate of model performance
Pick a modest sized tree to examine
Note high response in this RED colored node
Response of 38.5% in this segment vs. 15.2% overall
Lift = 2.53
Navigator allows access to all model info
• The terminal nodes are color coded to represent results
– RED nodes are "hot" and contain high concentrations of the class of
interest (buyers)
– BLUE nodes are "cold" and contain very low concentrations of the
class of interest
– PINK and WHITE nodes have moderate concentrations
• We first look to see if we have any RED nodes
– Explore any red nodes via mouse hover
• Then we drill down to see a tree schematic revealing the
main drivers of the tree
Select "Splitters" View
Selects a streamlined overview of the tree showing ONLY primary splitters
Model Overview: Main Drivers
(Red = Good Response, Blue = Poor Response)
High values of a split variable always go to the right; low values go left
Examine Extreme Right-most Terminal Node
• Hover mouse over node to see inside
• Even though this node is on the "high price" side of the tree it still
exhibits the strongest response across all terminal node
segments (43.5% response)
• Rules defining this node are shown on the next slide
Rules can be extracted in a variety of languages
• Here we select rules expressed in C for one node of interest
• Entire tree can also be rendered in Java, XML/PMML, or SAS
Continuing down the tree
• We note that even if the new product is offered at a high
price we can still find prospects very interested:
– Those that have a high average landline bill and own a pager
– This group displays greatest probability of response (43.5%)
Classic Detailed Tree Display
Analyst can select details to be displayed
Control Over Details Displayed in Nodes
At left an example
showing the class
bar chart is
displayed
Separate controls for
internal and terminal
nodes
Configure Print Image Interactively
Shrink to one page, include header/footer
Tree Performance Measures
and Principal Message
• In addition to the details of the tree (splits, split values), CART reports:
• Variable Importance Ranking
• Confusion Matrix (Prediction Success Matrix)
• Gains, ROC
Variable Importance Ranking
(Relative impact on outcomes)
Three major ways of computing variable importance are available. Above, the
default display.
Predictive Accuracy
(How often right, how often wrong)
This model is not very accurate but ranks responders well
Gains Curve
In the top decile the model captures about 23% of responders
Performance Evaluation: ROC Curve
Observations on CART Tree
Contrasts with Conventional Stats
• CART leverages the rank order of a predictor to split
– Transforming predictor X into Log(X) will not change the tree
– Of course the tree will be expressed in terms of Log(X) but this will not
change the location of the split
– The traditional statistician's experiments with alternative transforms are
unnecessary
• CART is immune to outliers in predictors
– Suppose X has values 1,2,3,…,100, 900
– To CART this is the same as 1,2,3,…,100, 101
– All CART "sees" is the rank order
• We will see later that CART has built-in missing value
handling
• So no worry about outliers, missing values, transformations
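The rank-order point is easy to demonstrate. Below is a tiny Python sketch with hypothetical ages (not from the webinar data): a monotone transform such as log() leaves every split partition unchanged, and the outlier affects nothing but the printed split value.

import math

ages = [20, 25, 30, 35, 60, 65, 70, 900]   # 900 is an extreme outlier
logs = [math.log(a) for a in ages]

# Any split "x <= c" produces the same left/right partition on both scales
# because log() preserves rank order; only the reported split value changes.
left_raw = [a for a in ages if a <= 35]
left_log = [x for x in logs if x <= math.log(35)]
print(len(left_raw), len(left_log))        # 4 4 -- identical partitions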
CART Methodology: Partition Data
Into Two Segments
• Partitioning line parallel to
an axis
• Root node split first
– Split at 2.450
– Isolates all the type 1 species from the rest of the sample
• This gives us two child
nodes
– One is a Terminal Node
with only type 1 species
– The other contains only
type 2 and 3
• Note: entire data set
divided into two separate
parts
Second Split: Partitions
Only Portion of the Data
• Again, partition with line
parallel to one of the two
axes
• CART selects
PETALWID to split this
NODE
– Split it at 1.75
– Gives a tree with a
misclassification rate of 4%
• Split applies only to a
single partition of the
data
• Each partition is
analyzed separately
Discriminant Analysis Uses Oblique Lines
• Linear combinations are difficult to understand and explain
• CART does permit "oblique" splits based on linear combinations of small sets
of variables but this is rarely desirable
CART Representation of a Surface
Model clearly non-linear
Height of bar represents probability of response
Remaining axes represent values of two predictors
Greatest prob of response here in corner to the right
CART Splitting Process
• Standard splits are based on ONE predictor and the form of
a database RULE
• A data record goes left if
splitter_variable <= split value
• Examples: A data record goes left
• if AGE<=35
• if CREDIT_SCORE <= 700
• if TELEPHONE_BILL <= 50
Searching all splits facilitated by sorting
• On left we sort by TELEBILC, on right by TRAVTIMR
• Test smallest value first, then next smallest, etc., moving all the way down the column
• The arrow shows a split sending 10 cases to the left and all other data to the right
Example Root Node Split
Continuous Splitter
From our Euro_telco_mini.xls example
Split is TELEBILC <= 50
Alternative Split Points
What if we split the data at TELEBILC <= 25?
Note that the response rates of the two nodes under this split are very similar
They are quite different after splitting at the optimal value
Two splits separate quite differently
The first pane shows two segments with 14.3% and 15.5% response
The second pane shows two segments with 12.7% and 19.8%
Our goal in CART is to generate substantially different segments and we
accomplish this by experimenting with every possible split value for every predictor
CART Splitting Process: More
• Splitter variables need not be numeric, they can be text
• Splitter variables need not be ordered
• A data record goes left
• if CITY$ = "London" OR "Madrid" OR "Paris"
• if DIAGNOSIS = 111 OR 35 OR 9999
• CART considers all possible splits based on a categorical predictor
• Example: four regions - A, B, C, D can be split 7 ways (2^3 - 1 = 7)
• Each decision is a possible split of the node and each is evaluated
• Note: A on the left and B,C,D on the right is the same split as its mirror image A
on the right and B,C,D on the left
• So we only list one version of this split
– It is which cases stay together that matters not which side of the tree they are on
Splits on K-level categorical predictors
2^(K-1) - 1 ways to split
Left Right
1 A B, C, D
2 B A, C, D
3 C A, B, D
4 D A, B, C
5 A, B C, D
6 A, C B, D
7 A, D B, C
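The enumeration above can be reproduced with a few lines of Python. This is an illustrative sketch, not Salford code; fixing the first level on the left removes the mirror-image duplicates, which is why K levels give 2^(K-1) - 1 splits.

from itertools import combinations

def categorical_splits(levels):
    # Fix the first level on the left to avoid counting mirror images twice.
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = [first] + list(extra)
            right = [lv for lv in levels if lv not in left]
            if right:                      # both children must be non-empty
                splits.append((left, right))
    return splits

for left, right in categorical_splits(["A", "B", "C", "D"]):
    print(left, "|", right)                # prints the 7 splits listed above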
Categorical Split Caution:
Dangers of HLCs (High Level Categoricals)
• Because categorical variables generate 2^(K-1) - 1 ways to split
the data, high values of K can be problematic
• K=33 is not an unusually large number of levels yet allows
for about 4 billion ways to split the data
• When the number of possible splits exceeds the number of
records in the data the categorical variable has an
advantage over any continuous splitter
– A continuous variable with a unique value in every row of the data
gives us a choice of split points equal to the number of rows of data
• Later we will discuss several ways to deal with HLCs
including repackaging the high cardinality categoricals into
lower cardinality versions and penalties
Example Root Node Split
Categorical Splitter
From our Euro_telco_mini.xls example
Observe that we have to LIST the values that go to each child node
CART Competitor Splits
• The CART mechanism for splitting data is always the same
• We are given a block of data
– Could be all of our data and we are starting from scratch
– Could be a small part of our data obtained after already doing a lot of
slicing and dicing
• When we work with a block of data we do not take into
account how we got to that block of data
• We do not consider any information which might be
available outside of the block of data
• The block of data to be analyzed is our entire universe and
nothing else exists for us
Getting Ready to Split
• For a block of data to be split
– It must contain a sufficient number of data records (ATOM)
– We can tell CART what the minimum must be
– Default is just TWO records
– In large database analysis we might reasonably set the minimum
quite a bit higher
– ATOM values such as 10, 20, 50, 100, 200 have cropped up in our
practical work
• If you are working with a small database such as those
encountered in biomedical research (e.g. 200 records total) you
will want to allow the ATOM size to be small
• If you are working with hundreds of thousands or millions of
records there is no harm in trying a minimum size like 200
Still Getting Ready to Split
• If we have a classification problem such as modeling
response to a marketing offer where there are two outcomes
– Responded
– Did Not Respond
• To be splittable the block of data cannot be "pure", i.e.
composed of all responders or all non-responders
– True regardless of how large the block of data is
– Splitting is designed to separate the responders from the non-
responders so we need a mixture to have something to do
• The data records cannot all have exactly the same values
for the predictors
– CART will be looking for a useful difference in a predictor between
responders and non-responders
Observation on Dummy Variable Predictors
• If you split a node using a continuous variable there is
always the chance that this same variable is used again in a
subsequent split for descendent nodes
• Once a node is split with a dummy variable this variable can
never be used again in descendant nodes
– Because a descendant node will contain either all 0 or all 1 values for
this variable. Hence it cannot split.
• If a dummy variable is introduced into the tree below the root
it might appear in more than one location in the tree
– But one use will never be the ancestor of the other use
Making The Split
• To split the block of data (which we will henceforth refer to
as splitting the node) we search each available predictor
• For every predictor we make a trial split at every distinct
value of the predictor
• For each trial split we compute a goodness of split measure
normally referred to as the "improvement"
• For each predictor we find the split value that yields the best
improvement
• Once every predictor has been searched to find the best
split point we rank the splitters in descending order and then
use the best overall splitter to grow the tree
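A minimal Python sketch of this exhaustive search, using the Gini impurity as the goodness measure (illustrative only; the variable names and toy data are hypothetical, and production CART uses far faster algorithms):

def gini(labels):
    # Gini impurity of a list of 0/1 class labels
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 * p1 - (1.0 - p1) * (1.0 - p1)

def best_split(rows, target, predictors):
    # Try every distinct value of every predictor; keep the split with the
    # largest improvement (parent impurity minus weighted child impurity).
    parent = gini([r[target] for r in rows])
    best = None                            # (improvement, variable, value)
    for var in predictors:
        for value in sorted({r[var] for r in rows}):
            left  = [r[target] for r in rows if r[var] <= value]
            right = [r[target] for r in rows if r[var] >  value]
            if not left or not right:
                continue
            w = len(left) / len(rows)
            improvement = parent - w * gini(left) - (1 - w) * gini(right)
            if best is None or improvement > best[0]:
                best = (improvement, var, value)
    return best

rows = [{"AGE": 25, "BILL": 30, "RESPONSE": 1},
        {"AGE": 40, "BILL": 80, "RESPONSE": 0},
        {"AGE": 31, "BILL": 45, "RESPONSE": 1},
        {"AGE": 55, "BILL": 90, "RESPONSE": 0}]
print(best_split(rows, "RESPONSE", ["AGE", "BILL"]))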
Ranked List of Splitters
• The ranked list of splitters is also known as the competitor
list
• CART always computes the entire list as this is the only way
to know for sure which split is best
• To save space CART normally only displays the top 5
competitors within a node
– You can request a larger number in your options settings
• The root node at the top of the tree always displays the
complete list of competitors even if there are thousands of
predictors
Why Care about Competitor Splits?
• Useful to know if the best splitter is far better than all the rest
or only slightly better
• Useful to know which predictors show up near the top
– Are they very different from each other or are they all reflecting the same
underlying information?
• Useful to know if a strong but perhaps 2nd best predictor
splits the data more evenly than the best
– We might want to try FORCING that 2nd best predictor into the root to
see what happens
– Sometimes this yields an overall better tree
• Pattern of top splitters may reflect problems
– Top 3 competitors may all be "too good to be true" and we might
need to drop them all from the analysis
Surrogate Splits
• Surrogate splits were first introduced by the authors of
CART in their classic monograph Classification and
Regression Trees, 1984.
• Surrogate splits are mimics or substitutes for the primary
splitter of a node
• An ideal surrogate splits the data in exactly the same way as
the primary split
– The "association" measure reflects how close to perfect a given
surrogate is
Why Surrogates?
• Surrogates have two primary functions:
– To split data when the primary splitter is missing
– To reveal common patterns among predictors in a data set
• CART searches for surrogate splitters in every node in the
tree
– Surrogates are searched for even when there is no missing data
– No guarantee that useful surrogates can be found
– CART attempts to find at least five surrogates for every node but this
number can be modified
– Number of surrogates actually found normally varies from node to
node
CART and Missing Values in Deployment
• CART is the only learning machine that is prepared to deal
with any pattern of missing values in future data
• Even if the training data have no missings CART develops
strategies to deal with the eventuality of any variable or
variables being missing
• Some learning machines cannot handle missing values at all
• Other learning machines can only deal with missing value
patterns that they have been trained on (seen before)
– E.g., handle X5 = missing only if X5 was ever missing in the training
data
• CART has no such restrictions and is always ready for any
pattern of missings
Surrogates in Action:
Euro_telco_mini.xls
Remember to check off CITY, MARITAL and RESPONSE as "categorical"
Manually Prune Back to the 10-node tree
Just click on the blue curve in the lower panel to select a smaller, easier to manage
tree. Then double-click on the left child of the root node (see arrow above)
Look at the Left Child of the "Root"
The primary splitter predicting subscription to a new mobile phone offer is the
monthly telephone bill (TELEBILC), dividing the node into spenders of more or less
than $50 per month
Surrogate for TELEBILC
• If this variable were missing for any reason (database error,
person recently moved, new customer) we do not know
whether to move down the tree to the left or to the right
• Surrogate variable can be used in place of the missing
primary splitter. In this case the surrogate is of the form
go to the left if MARITAL=1
• Left is associated with LOW spending on the telephone bill
• CART suggests that single person households spend less
while households headed by married or divorced persons
spend more
Surrogates and Direction
• A surrogate is intended to be a substitute for the primary
splitter making similar left/right decisions
• But surrogates may work in the opposite direction so every
continuous variable surrogate is supplied with a "tag"
– The letter "s" after the split point stands for "standard"
– The letter "r" after the split point stands for "reverse"
• If a surrogate is negatively correlated with the primary
splitter then it will split in the reverse direction
– Categorical splitters are always organized so that the levels that
correspond to left in the primary splitter go left in the surrogate
Normally Surrogates Make Sense
• Our primary splitter is the average monthly spend of a
household on a fixed line telephone account
• Our surrogates include marital status, commute time to
work, age, and the city of residence
– Longer commutes are associated with larger spend on the phone
– Older head of household also is associated with larger spend
– We cannot interpret the CITY variable at this point because we don’t
know the identity of the cities
• In general surrogates help us understand the primary splitter
– Especially helpful in survey research
How to Compute Surrogates?
• This is a technical question which we will not cover here
– The CART monograph contains a wealth of technical information
although it can be a challenging read
• However, we will discuss the main ideas
• The top surrogate is
– A single variable
– A single split (in the same format as any primary splitter)
– Intended to mimic as closely as possible how data is partitioned by
the primary splitter into LEFT and RIGHT nodes
• To get a surrogate think of generating a one split CART tree
where the dependent variable is {LEFT or RIGHT} as
defined by the primary splitter. (There are many details)
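A conceptual Python sketch of that idea, assuming the simplified one-split-tree view described above (the real CART computation involves additional weighting details):

def best_surrogate(rows, primary_var, primary_value, candidate_vars):
    # The "target" is the left/right assignment made by the primary splitter.
    went_left = [r[primary_var] <= primary_value for r in rows]
    best = None                        # (matches, variable, value, direction)
    for var in candidate_vars:
        for value in {r[var] for r in rows}:
            for direction in ("s", "r"):          # standard or reverse
                if direction == "s":
                    pred = [r[var] <= value for r in rows]
                else:
                    pred = [r[var] > value for r in rows]
                matches = sum(p == w for p, w in zip(pred, went_left))
                if best is None or matches > best[0]:
                    best = (matches, var, value, direction)
    return best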
What is "Association"?
• Association is a measure of the strength of the surrogate
• The lowest possible reported score is 0 (useless)
• The highest possible score is 1 (perfect clone)
• CART starts from the default rule: if you don't know which way to send a
data record down a tree go with the majority (sometimes weighted majority)
• If when training the tree most cases went left then in the absence of
other information also go left
• The default makes mistakes of course because it always sends every
record to the same majority side
– Association measures how much better the surrogate is than the
default rule (percent reduction in errors made)
• Default rule is the "surrogate of last resort"
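As a sketch, association can be computed roughly like this (unweighted version; the monograph gives the exact details):

def association(went_left, surrogate_pred):
    # Percent reduction in misassignments relative to the default rule of
    # sending every record to the majority side.
    n = len(went_left)
    majority_left = sum(went_left) >= n / 2
    default_errors = sum(w != majority_left for w in went_left)
    surrogate_errors = sum(w != p for w, p in zip(went_left, surrogate_pred))
    if default_errors == 0:
        return 0.0                     # default rule is already perfect
    return (default_errors - surrogate_errors) / default_errors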
Competitors and Surrogates:
Different Objectives
Competitors yield the best possible split when using that variable
Surrogate yields the best possible mimic of the primary splitter and goodness of
split may be sacrificed to match some aspect of the primary splitter
Note that C2 is a competitor with one split point and a surrogate with a different
split point
Grow another tree on GB2000.XLS
• We prefer this data set because it has no missing values,
making working through examples much easier
• Don’t forget: CART always computes surrogates and in this
way the CART tree is always prepared for future missings
• We will not be trying to make sense of this tree
– will look just at the mechanics
• Note the root node splitter and the top surrogate
Root Node Split
Root Splitter:
M1 <= -.04645
Top Surrogate:
C2 <= -.10835
Main Splitter vs. Best Surrogate
              Main Splitter        Surrogate
              Left     Right       Left     Right
Class 1       672      328         626      374
Class 2       252      748         300      700
Total         924      1076        926      1074
Best Surrogate must closely match not only the record counts in the child nodes
but also the distribution of the target variable
Modeling ROOTSPLIT with CART
Observation: Modeling the root node split (we have to create a new variable
to reflect this) will not necessarily match the surrogate report
Other factors must be taken into account. Here we get the right variable but
not the right split point
Main Splitter VS Best Surrogate
Model Root Split As a Binary Target
              Main Splitter       Surrogate          Alternate
              Left     Right      Left     Right     Left     Right
Class 1       672      328        626      374       598      402
Class 2       252      748        300      700       288      712
Total         924      1076       926      1074      886      1114
Best Surrogate must closely match record counts in the child nodes and the
distribution of the target variable
Modeling root split on available predictors will not match surrogate exactly
Variable Importance in CART
• It is hard to imagine now but in 1984 when the CART
monograph was first published data analysts did not
generally rank variables
• Although researchers would informally pay attention to
t-statistics or p-values associated with the coefficients of
regressions, the practice of ranking predictors was frowned
upon
• Since the advent of modern data analytic methods
researchers expect to see a variable importance ranking for
all models
• It all started with CART!
CART concept of Variable Importance
• Variable importance is intended to measure how much work a
variable does in a particular tree
• Variable importance is thus tied to a specific model
• A variable might be most important in one model and not
important at all in a different model built on the same data
• The fact that a variable is important does not mean that we need
it! If we were deprived of the use of an important variable it might
be that other available variables could substitute for it or do the
same predictive work
• Variable Importance describes the role of a variable in a specific
tree
Variable Importance and Tree Size
• Every tree in the CART sequence has its own variable
importance list
• A small tree will typically have only a few important variables
• A large tree will typically have many more important
variables
– Because with more nodes there are more chances for more variables
to play a role in the tree
• Usually we focus on the tree CART has identified as optimal
but this should not deter you from selecting another (usually
smaller) tree
Splitter Improvement Scores
• Recall that every splitter (and every surrogate) has an associated
"improvement" score which measures how good a splitter is
• The improvement score for a splitter in a node is always scaled down by
the percent of data that actually pass through the node
• 100% of all data pass through the root node so the root node splitter is
always scaled by 100%
• But a child node of the root might have say 30% of the data pass through
– whatever improvement we compute for the split of that node will be multiplied by 0.30
• Splits lower in the tree have only a small fraction of full data passing
through so their adjusted improvement scores tend to be small
Variable Importance Computation
• To construct a variable importance score for a variable we
start by locating every node that variable split
• We add up all of the improvement scores generated by that
variable in those nodes
• Then we go through every node in which this variable acted as a
surrogate and add up all those improvement scores as well
• The grand total is the raw importance score
• After obtaining raw importance scores for every variable we
rescale the results so that the best score is always 100
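A sketch of the bookkeeping in Python, assuming we already have each node's splitter, surrogates, and node-weighted improvement scores (the dictionary layout here is hypothetical):

def variable_importance(nodes):
    # nodes: list of dicts such as
    #   {"splitter": "TELEBILC", "improvement": 0.063,
    #    "surrogates": [("MARITAL", 0.041), ("AGE", 0.020)]}
    # where improvements are already scaled by the fraction of data in the node.
    raw = {}
    for node in nodes:
        raw[node["splitter"]] = raw.get(node["splitter"], 0.0) + node["improvement"]
        for var, imp in node["surrogates"]:
            raw[var] = raw.get(var, 0.0) + imp
    top = max(raw.values())
    return {var: 100.0 * score / top for var, score in raw.items()}  # best = 100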
Variations on Importance Scores
• Breiman, Friedman, Olshen and Stone discuss one idea
they ultimately rejected:
– Including competitor improvement scores as well
• This turns out to be a bad idea because it leads to double-
counting
– If a variable is the 2nd best splitter in a node there is an excellent
chance that the same split will score well in the child nodes
– If we were to give the splitter credit in the parent node for being a
competitor we would probably end up giving the exact same split
credit again lower down in the tree
– Another way to think about this: a split is trying to enter the tree. If we
do not accept the split right away the same split may keep trying to
enter the tree lower down
– We only want to give this split credit once
BATTERY LOVO
• Leave One Variable Out (LOVO)
– Available in SPM PRO EX versions but you can accomplish the
process manually as well
• Take your best modeling set up including your preferred list
of predictors
• BATTERY LOVO runs a set of models that are identical to
your preferred set up except that one variable has been
excluded
• To be complete we run a "drop just one variable" model for
each variable in your KEEP list
• If you have 20 variables then BATTERY LOVO will run 20
models (each of which will have 19 predictors)
– Now rank the models from worst to best
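If the automated battery is unavailable, the manual equivalent is a simple loop (sketch; fit_and_score is a hypothetical stand-in for your model-fitting and test-evaluation routine):

def battery_lovo(fit_and_score, keep_list):
    # Refit once per variable, each time dropping that one variable, then
    # rank the variables by how much the test score deteriorates.
    results = []
    for var in keep_list:
        predictors = [v for v in keep_list if v != var]
        results.append((var, fit_and_score(predictors)))
    return sorted(results, key=lambda item: item[1])   # worst model first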
BATTERY LOVO Importance Ranking
• Using the LOVO procedure tests how much our model
deteriorates if we were to remove a given variable
• It is sensible to say that a variable is very important if losing
it damages the model substantially
• Conversely, if losing a variable does no harm then we could
conclude that the variable is useless
• CAUTION: the LOVO ranking could be quite different from
the CART internal ranking and both rankings are "right"
– CART measures how much work a variable actually does
– LOVO measures how much it hurts to lose a variable
Randomization Test
• Leo Breiman introduced yet another concept of variable
importance measure related to his work on tree ensembles
• Start with your test data
– Score this data with your preferred model to obtain baseline
performance
– Take the first predictor in the test data and randomly shuffle its
values in the column of data
– The values are unchanged but values are relocated to rows they do
not belong on
– Now score again. We would expect performance to drop because
one predictor has been damaged. Repeat say 100 times and
average the performance deterioration.
– Doing this for all variables will produce performance degradation
scores; the larger the score, the more important the variable
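A minimal sketch of the shuffle test in Python (score_fn is a hypothetical function returning test performance, e.g. area under the ROC curve, for the already-fitted model on a list of row dictionaries):

import random

def shuffle_importance(score_fn, test_rows, predictors, n_reps=100):
    baseline = score_fn(test_rows)
    drops = {}
    for var in predictors:
        original = [r[var] for r in test_rows]
        total = 0.0
        for _ in range(n_reps):
            shuffled = original[:]
            random.shuffle(shuffled)               # damage one predictor
            for row, value in zip(test_rows, shuffled):
                row[var] = value
            total += baseline - score_fn(test_rows)
        for row, value in zip(test_rows, original):   # restore the column
            row[var] = value
        drops[var] = total / n_reps                # average deterioration
    return drops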
Randomization Test
• As of December 2011 this test is only available from the
command line of recent versions of SPM
• After growing a CART tree and saving the grove issue these
commands from the command line or an SPM Notepad
SCORE VARIMP=YES NPREPS=100
• You may readily run with NPREPS=30 but the results are
more reliable with a larger number of replications
Results from Random Shuffling:
Baseline ROC=.85320
Rank Score ROC_After Variable
1 100 0.82144 M1
2 63.21 0.83312 RES
3 45.57 0.83873 LS
4 25.9 0.84498 CR
5 22.66 0.84601 C2
6 21.29 0.84644 BU
7 5.84 0.85135 DT
8 4.25 0.85185 A1
9 4.23 0.85186 PRE
10 3.49 0.85209 OC
11 3.18 0.85219 MAR
12 2.29 0.85248 YM
13 1.64 0.85268 LT
14 0 0.8532 DP
15 0 0.8532 TRA
16 0 0.8532 GEN
17 0 0.8532 A2
18 0 0.8532 B
19 0 0.8532 CP2
20 0 0.8532 CD2
21 0 0.8532 D1
22 0 0.8532 E
23 0 0.8532 M
24 0 0.8532 CH
25 0 0.8532 TY$
Which Importance Score Should I Use?
• The internal CART variable importance scores are the
easiest and the fastest to obtain and are a great starting
point
• LOVO scores are useful when your goal is to assess
whether you can live without a predictor
• Importance is a function of the OVERALL tree including
deepest nodes
• Suppose you grow a large exploratory tree — review
importances
• Then find an optimal tree via test set or CV yielding smaller
tree
• Optimal tree SAME as exploratory tree in the top nodes
• YET importances might be quite different.
• WHY? Because larger tree uses more nodes to compute the
importance
• When comparing results be sure to compare similar or same
sized trees
Variable Importance Caution
Train/Test Consistency Checks
• Unlike classical statistics data mining models generally do
not rely on training data to assess model quality
• In the SPM data mining suite we are always focused on test
data model performance
– This is the only way to reliably protect against overfitting
• Every modeling method including our classical statistical
models in SPM 7.0 offers test data performance measures
• Generally these measures are overall model performance
indicators
– Measures say nothing about internal model details
CART Tree Assessment
• CART uses test data performance of every tree in the back-
pruned sequence of progressively smaller trees to identify
the overall best performer on classification accuracy
• CART also notes which tree achieves the best test data
Area Under the ROC (AUROC) curve on the Navigator
What more can we do?
• CART performance measures have always been overall-tree
scores
• No specific attention is paid to node-specific performance
• However, in real world applications we often want to pay
close attention to individual nodes
– Might use the rank order of the nodes in important decisions
– Prefer to rely on nodes that are most accurate in their predictions of
event rates (response)
• Therefore we need an additional tool for assessing CART
tree performance at the node level
• Provided by the PRO EX feature we call TTC
– Train/Test Consistency checks
Use the GB2000.XLS data set
Model setup to select TARGET as the dependent variable
CART as the modeling method
On the TEST tab we opt for 50% randomly selected test partition
TTC in CART and SPM PRO EX
• The TTC report is available from the navigator which
displays for every CART model
– Look for the TTC button near the bottom of the navigator
• TTC relies on separate train and test data partitions which
means that TTC is not available when using cross-validation
TTC Display
Upper panel of TTC display contains one line in the table for every sized tree
Bottom row represents the 2 node tree. Top line is for largest tree grown
TTC: Select Target Class
In this case TARGET=2 represents BAD which is our focus class
You the modeler get to choose which class to focus on; there is no "right" class
TTC Upper Panel
Rank Match: Do the train and test samples rank order the nodes in the same way
(a statistical test allows for insignificant "wobbles")
Direction Agreement: Do the train and test samples agree as to whether a node is
"above average" or "below average" (response, lift, event rate). Again a statistical
test allows for insignificant violations
Click on 14 node tree in TTC upper panel
Red curve is training data and shows node specific lift (node response / overall
response)
Dark Blue horizontal line is the LIFT=1.0 reference line
Light blue line with green triangles displays test data
3rd ranked node in train data would be ranked 1st or 2nd in test data
TTC Details
For the 14 node tree we are told that agreement on "direction" fails 1 time
and the rank order agreement fails 5 times (scroll to right to see this)
The statistical sensitivity of the test is controlled by the z-score selected in the
Thresholds area to the right of the display. Defaults are 1.00
Setting this threshold to 2.00 will allow much more train/test divergence
Changing TTC Sensitivity Threshold
Changing the thresholds to 2.00 permits moderate deviations and treats them as
statistical noise. After changing thresholds click on "Apply" if display has not updated
We prefer to use the 1.00 threshold as this points us to trees with very high
consistency that decision makers like to see. It does point to rather small trees.
TTC: Display for 6 node tree
Much more defensible tree as train and test data align very well
Summary
• TTC focuses on two types of train-test disagreement
• DIRECTION: Is this node a response node or not?
– We regard disagreement on this fundamental topic as fatal
• RANK ORDER: Are the richest nodes as identified by the
training data confirmed in test data?
– Without this we cannot defend deployment of a tree
• TTC allows us to quickly identify which tree in the pruning
sequence is the largest satisfying train/test consistency
• TTC optimal tree is often rather close in size to Breiman’s 1
SE rule tree
– But 1 SE rule does not look inside nodes at all
– 1 SE rule is available for cross-validation while TTC is not
Controlling Node Sizes In CART
With ATOM and MINCHILD
• Today’s topic is on the technical side but very easy to
understand
• Concepts are relevant to all Salford tree-based tools
including TreeNet and Random Forests
• Controlling the sizes of terminal nodes is a practical matter
• If you are using CART, for example, to segment a database
you might want to make it impossible to create segments
that are too small
• Altering terminal node size can also influence performance
details of the optimal tree
Background: Obtaining Optimal Trees
• CART theory teaches us that we cannot arrive at the optimal
tree via a stopping rule
• The CART authors devoted quite a bit of energy to
researching this topic
• For any stopping rule it is possible to construct data sets for
which that stopping rule will not work
• We will end up stopping too early and we will miss important
data structure
• Result discovered both by experimentation and via
mathematical construction
Grow First Then Prune
• CART methodology is thus to start with an unlimited growing
phase
• Grow the largest possible tree first
• Think of this as a search engine for discovering possibly
valuable trees
• THEN use pruning to arrive at the optimal tree or a set of
trees that yield both acceptable predictive performance and
simplicity
• CART also insists that we have a test method to make our
final tree selection. That is the topic of another session.
Maximum Tree Size
• CART theory tells us that trees should be grown to their
maximum size during the growing phase
• Thus, trees should be grown until we either
– Run out of data (1 record left and thus there is nothing to split)
– Node impossible to split because pure (all GOOD or all BAD)
– Node impossible to split because all records have identical values for
predictors
• Experience tells us that if you start with 1,000 records in a
typical binary classification problem you should expect about
500 terminal nodes in the largest possible tree
– But could be many fewer
• Let’s try for biggest possible tree with the GB2000.xls data
An Unlimited Tree
Using GB2000.xls
To get 349 nodes we set the test method to EXPLORE, ATOM=2, MINCHILD=1
Terminal Node Sample Sizes
We obtain this frequency chart by clicking the graph icon in the center left area
of the navigator. We can see that many but not all terminal nodes are small.
Bottom Left Most Part of Tree
We get a relatively large node to the
extreme left (all class 2)
Remaining three terminal nodes in this
snippet are also all "pure" but much
smaller
Obvious why the tree has to stop here
as there is nothing left to do once a
node is pure
Obtained by right clicking the node of interest and selecting "Display Tree"
Practical Maximal Trees
• In real world practice it may not be necessary to push the
tree growth to the literal maximum
• Essential to grow a large tree
– Large enough to include the optimal tree
• We can control the size of the maximal CART tree in a
number of ways
– Some controls tell CART to stop early
– Other controls limit CART’s freedom to produce small nodes
Key Controls over Splits:
ATOM and MINCHILD
• ATOM
– ATOM terminates splitting along a branch of the tree when the node
sample size is too small
– If a node contains fewer than ATOM data records then STOP
– 10 is commonly used but you might set this much larger
• MINCHILD
– MINCHILD prevents creation of child nodes that are too small
– The smallest possible value is 1 meaning that in splitting a node we
would be permitted to send 1 solitary record to a child node and all
other records to the other child node
– Larger values are sensible and desirable. Values such as 5, 10, 20,
30, 50 could work well depending on the data. We have used values
as large as 200
Setting ATOM and MINCHILD
On Advanced Tab of
Model Setup
Parent control
(ATOM)
Terminal node min
(MINCHILD)
Setting ATOM and MINCHILD
• ATOM: Minimum size required for a node to be a parent
• MINCHILD: Minimum size allowed for a child
• We recommend that ATOM be set to three times MINCHILD
• ATOM must be at least twice MINCHILD to allow a split
consistent with MINCHILD
• If you set inconsistent values for ATOM and MINCHILD they
will be reset automatically to be consistent
• To get the control you want be sure that ATOM is at least
twice MINCHILD
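The consistency rule is easy to encode; a sketch:

def check_node_controls(atom, minchild):
    # A parent must be able to fill two children of size MINCHILD,
    # so ATOM must be at least 2 * MINCHILD (3x is the recommendation).
    if atom < 2 * minchild:
        atom = 2 * minchild            # mimic the automatic reset
    return atom, minchild

print(check_node_controls(atom=10, minchild=10))   # -> (20, 10)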
ATOM and MINCHILD
• ATOM controls the right to be a parent
• Parent must generate two children
• Parent must contain enough data to be able to fill two child
nodes
• So parent must have at least 2*MINCHILD records
ATOM and MINCHILD
• By allowing ATOM to be three times MINCHILD you give
CART some flexibility in finding the split
10 records 10 records
• Min-------------------------------|-----------------------------------Max
split
Suppose ATOM=20 and MINCHILD=10. Then we must split
this node into two exactly equal child nodes of 10 records
each. There is no flexibility here
• If no such split can be found because of clumping of values
of the variable then the node cannot be split on that variable
ATOM is 3 times MINCHILD
10 records 10 records 10 records
• Min------------------*------|--------------------*--------------------Max
left child ..….. split region…... right child
• In the example above ATOM=30 and the region of possible splitting
points lies in between the two asterisks
• There can be just one split point. So long as the smaller side has at least
10 records (in this example of MINCHILD=10) there is freedom to
choose
• To give CART flexibility as to where to locate this last split (at the bottom
of the tree) we need to have ATOM > 2*MINCHILD
• Not mandatory but worth keeping in mind. So first choose MINCHILD
and then set ATOM sensibly
An Unappealing Node Split:
Could be prevented by using a larger MINCHILD
Only one record is sent to the right and the remaining 1999 records go left
Can prevent such splits with a control which does not allow a child to be created
with fewer than the specified number of records
Experiment to get Best Settings
SPM PRO EX
Battery Tab of Model Setup
Select ATOM and
MINCHILD
Modify values to be
tested, optionally
We used a 50% random
sample for testing
Choosing ATOM and MINCHILD
Settings of ATOM=10
and MINCHILD=5 yield
a Rel. error within 1% of
the literal best
Direct Control Over Tree Size (Almost)
• You also have the option of LIMITing the tree in a variety of
ways, including limiting the DEPTH of the tree
• To get to the LIMITS menu item you must first go to the
Classic Output
Growing Limits Dialog
DEPTH=1 will allow just
one split
Controlling tree size via a
DEPTH limit may yield
inferior results
We tend to use it only
when wanting extremely
small trees such as one split
LIMITS Details
• A tree of depth=1 can have only two terminal nodes
• With each additional depth level we allow for a doubling of
the number of terminal nodes
• Potential sizes are then 2,4,8,16 etc.
• However, depth limits do not guarantee a specific number of
terminal nodes, only that no terminal node will be deeper than
was allowed
LIMIT DEPTH=1
We sometimes want to start a CART analysis by splitting just the ROOT node and
then reviewing the entire ranked list of potential splitters
Mostly useful for very large data sets as this reduces compute time substantially
LIMIT DEPTH=2
Maximum length of any branch will allow two splits between the root node and
any terminal node. But some branches might stop early due to pre-pruning.
Depth Limit=3
Method GINI
With METHOD GINI you may not get every branch of the tree exhibited to
the full depth you wanted (due to a technical matter: "pre-pruning")
Depth Limit=3
METHOD PROB
You have a better chance of getting every branch grown out to full depth using
METHOD PROB
Concluding Remarks
• Setting ATOM (smallest legal parent) and MINCHILD
(smallest legal child) can help to speed up large database
runs
• Modest limitation will not harm performance if we take care
with the settings
• Can and should use experimentation to find best settings
• In some circumstances setting these controls to values
larger than their minimums can improve performance on test
data
CART and the PRIORS Parameter
• If you are a casual user of CART you probably can get by
without knowing anything about PRIORS
• The default settings of CART handle PRIORS in a way that
is well suited for almost all classification problems
• A casual user will probably not want to review or understand
the more technical output, which is printed to the plain text
"classic output" window
• BUT there are some very effective uses of CART that
require judicious manipulation of the PRIORS settings
• Therefore a basic understanding of PRIORS may be helpful
and worth the effort
Classic Reference
• The original CART monograph, published in 1984, remains
one of the great classics of machine learning
• Classification and Regression Trees by Leo Breiman,
Jerome Friedman, Richard Olshen, and Charles Stone, CRC
Press
• Available also in paperback and as e-book from Amazon:
• http://www.amazon.com/Classification-Regression-Trees-Leo-Breiman/dp/0412048418/
• Not the easiest reading but well worth having as a reference
and contains fascinating discussions regarding the decisions
the authors made in crafting CART
• Contains extensive discussion of priors as well as all major
concepts relevant to CART. Still worthwhile reading.
CART Monograph Details
For The Casual User
• Thinking about a binary 0/1 classification problem we have
two ways of evaluating a CART generated segment
– Assign the segment to the majority class (more than 50%)
– If there are more 1s than 0s then the segment is labeled "1"
– Assign the segment to the class with a LIFT greater than 1
– We start with a baseline event rate (fraction of 1s in the data)
– Look at the ratio of the event rate in the node to the event rate in the sample
• Ratio of event rate in segment to event rate in root
– Any segment with a better than baseline event rate is labeled "1"
• CART by default uses the LIFT concept for making
decisions (known in CART-speak as PRIORS EQUAL)
• You can elect to use the first method via PRIORS DATA
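The two labeling rules side by side, as a sketch (the numbers in the usage line match the example split on the next slide):

def label_majority(node_rate):
    # PRIORS DATA flavor: class 1 only if it is the absolute majority
    return 1 if node_rate > 0.5 else 0

def label_lift(node_rate, baseline_rate):
    # PRIORS EQUAL flavor: class 1 whenever the node beats the baseline,
    # i.e. lift greater than 1
    return 1 if node_rate / baseline_rate > 1.0 else 0

print(label_majority(0.36), label_lift(0.36, 0.214))   # -> 0 1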
Example Split: Priors Equal
Almost 80% GOOD (Class 0) Remainder BAD (Class 1)
Left child is considered a BAD dominant node because 36% BAD > 21.4% BAD
Priors equal simply ensures that we think in these "relative to what we started
with" terms
PRIORS EQUAL or PRIORS DATA
• PRIORS EQUAL is almost always the right choice
– Is the DEFAULT and almost always yields useful results
• PRIORS DATA focuses on absolute majority and not relative
counts in the data
– Will rarely work with highly unbalanced data (e.g., a 10:1 ratio of 0 to 1)
• PRIORS can be expressed as a ratio
– Default 1:1
– You can set priors to whatever ratio you like
• 1.2:1 as we did in the previous example
• 5:1
• 10:1
– Changing priors usually changes results, sometimes dramatically
– Extreme priors often make getting any tree impossible
Setting PRIORS
Mechanics
To set your own PRIORS
first click the SPECIFY
option
The default settings of 1:1
can now be changed
To the left the dialog is
allowing me to alter the
entry for Class 0
Once entered I will be
given the opportunity to
make a new entry for
Class 1
If PRIORS can change results then what is right?
• The results CART gives you are intended to reflect what you
consider important and what makes sense given your
objectives
• PRIORS EQUAL usually reflects what most people want
• If tweaking the PRIORS and changing them gives you better
results given your objectives then use the tweaked priors
Advice on PRIORS
• Start with the default of EQUAL
– Most users never get beyond this!
• BATTERY PRIORS
– CART PRO EX runs an automatic sweep across dozens of different
settings to display the consequences of tweaking the priors
– Results are then summarized in tables and charts
– Useful when you want to achieve a specific balance of accuracy
across the dependent variable classes
– Choose the setting that is practically best
• Otherwise, you can experiment manually to measure the
impact of a change
PRIORS: Under the Hood
• To understand how PRIORS affect core CART calculations
we need to start with a brief review of splitting rules
• We will only discuss the Gini to illustrate the key concepts
Start With Gini Splitting Rule:
Two classes
• Very simple formula for the two class (binary) dependent variable
• Label the classes as Class 0 and Class 1 and in a specific node in
a tree we represent the shares of the data for the two classes as
p0 and p1
These two must sum to 1 (p0 + p1 = 1)
• The measure of diversity (or impurity) in a given subset of data
(e.g. a node) is given by
Impurity = 1 – p0*p0 – p1*p1
• Impurity will equal 0 if either sample share is equal to 1 (100%)
• Impurity will equal 0.50 when both sample shares are equal (50%)
1 – (.5*.5) – (.5*.5) = 1 - .25 - .25 = .50
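In code, for the two-class case (sketch):

def gini_impurity(p0, p1):
    # Gini diversity of a node; p0 + p1 must equal 1
    return 1.0 - p0 * p0 - p1 * p1

print(gini_impurity(1.0, 0.0))   # 0.0  (pure node)
print(gini_impurity(0.5, 0.5))   # 0.5  (maximally mixed node)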
Splitting Criteria and Impurity
• The Gini measure is just a sensible way to represent how
diverse the data is in a node (for a classification problem)
– Extensive experience shows it works well and is a good measure
– You do have a choice of 6 different splitting methods in CART
• Useful because it can be used for any number of classes
– Every class has a share
– Square the shares and subtract them all from 1
• We use the Gini measure as a way to rank competing splits
• Split A will be considered better if it produces child nodes with
less diversity (on average) than does split B
• We measure the goodness of split by looking at the
reduction in impurity relative to the node being split (the
parent)
Improvement Calculation
• Hypothetical Example
Parent Node Impurity = 0.50
Left Child Impurity = 0.30 (20% of data); Right Child Impurity = 0.20 (80% of data)
Left child improves diversity by 0.20 (0.50 - 0.30)
Right child improves diversity by 0.30 (0.50 - 0.20)
Weighted average impurity is 0.2*0.3 + 0.8*0.2 = 0.22
Improvement from the parent is 0.5 - 0.22 = 0.28
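The same arithmetic in a few lines of Python, reproducing the hypothetical numbers above:

parent_impurity = 0.50
left_impurity, left_share = 0.30, 0.20
right_impurity, right_share = 0.20, 0.80

weighted = left_share * left_impurity + right_share * right_impurity
print(round(weighted, 2))                      # 0.22
print(round(parent_impurity - weighted, 2))    # 0.28  improvement of the split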
Graphing Gini Impurity (2 classes)
• Impurity formula here
simplifies to 2p(1-p)
• Impurity is greatest
when p=(1-p)= 0.5
• Impurity is low when p
is near either extreme
of 0 or 1 as the node is
dominated by one class
• Declines slowly near
p=.5 and accelerates as
it approaches 0 or 1
Graph is of 2*[2*p*(1-p)] to make it easier to read
Split Improvement Measurement
(No Missing Values for Splitter)
Parent Impurity = 0.50
Left Child Impurity = 0.3967; fraction of data in left child = 55%
Right Child Impurity = 0.3457; fraction of data in right child = 45%
Weighted average of child node diversity = 0.3737
Overall improvement of split = 0.1262
As expressed in the CART monograph
Parent node impurity minus weighted average of the impurities in each
child node
• pL = probability of case going left (fraction of node going left)
• pR = probability of case going right (fraction of node going right)
• t = node
• s = splitting rule
• i = impurity
improvement(s, t) = i(t) - pL * i(tL) - pR * i(tR)
Unbalanced Data and PRIORS EQUAL
• Calculations for all key quantities become weighted when
we use the CART default and the original data is
unbalanced
• Weighting is used to calculate
– Fraction of the data belonging to each class
– Fraction of the data in the left and right child nodes
– Gini impurity in each node
– Resulting improvement of the split (reduction in impurity)
• We no longer can use simple ratios
• Good news is that the mechanism for weighting is very
simple and easy to remember
– All counts are expressed as count in the node divided by the
corresponding count in the root node
Calculations for Priors
• Our training sample starts with N0 examples of class 0 and
N1 examples of class 1
• Now look at any node t in the CART tree
– N0(t) examples of class 0
– N1(t) examples of class 1
• Fraction of class 0 will now be calculated as (simplified)
• In other words we convert every count to the ratio of a count in a
node (t) to the corresponding count in the root (sample)
• Then the math is the same as usual
p0(t) = (N0(t)/N0) / [ (N0(t)/N0) + (N1(t)/N1) ]
What fraction of the data is in a node
• Again we use ratios instead of counts to calculate
• For priors equal we just average
– Fraction of all the Class 0 in a node
– Fraction of all the Class 1 in a node
• If the priors are not equal then all ratios are first multiplied by
the corresponding prior (which acts as a weight)
• When priors are equal the terms all cancel out
© Copyright Salford Systems 2013
$$p(0|t) = \frac{\pi_0\, N_0(t)/N_0}{\pi_0\, N_0(t)/N_0 + \pi_1\, N_1(t)/N_1}$$
where $\pi_0$ and $\pi_1$ are the class priors; with equal priors they cancel, recovering the formula above
Priors Incorporated Into Splitting
• $p_i(t)$ = proportion of class $i$ in node $t$
• $\mathrm{Gini}(t) = 1 - \sum_i p_i(t)^2$
• If priors DATA then the prior for class $i$ is $\pi(i) = N_i/N$ and the
proportion of class $i$ in node $t$ is simply
$$p_i(t) = \frac{N_i(t)}{N(t)}$$
• Otherwise proportions are always calculated as weighted
shares using the priors-adjusted
$$p(i|t) = \frac{\pi(i)\, N_i(t)/N_i}{\sum_j \pi(j)\, N_j(t)/N_j}$$
© Copyright Salford Systems 2013
Run a Real World Example
79% Class 0 (Good) 21% Class 1 (Bad)
© Copyright Salford Systems 2013
Data set BAD_RARE_X.XLS MODEL BAD = X15 just one predictor
Test method: 20% random sample for test
© Copyright Salford Systems 2013
We only want to look at the root node split. But tree is quite predictive!
Root Node Split:
Under PRIORS EQUAL
© Copyright Salford Systems 2013
Main splitter improvement is reported to be .06264
Observe that the left-hand child is considered to be Class 1 because its
Class 1 share of 41% is greater than the root share of 21.4%
Classic Output
Typical user rarely consults classic output
© Copyright Salford Systems 2013
Start by confirming the total record counts in the parent and child nodes
This agrees with the previous diagram in the GUI
Next Confirm Target Class Breakdown
© Copyright Salford Systems 2013
Here we see the same counts for Class 0 and Class 1 as in GUI
Priors Adjusted Computations
© Copyright Salford Systems 2013
Note first that the parent node is reported to have 50% class 0 and 50% class 1
This is guaranteed for the root node under priors equal
With 2 classes each is treated as if it represented half the data
With 3 classes each would be treated as if it represented 1/3 of the data
Our calculations of the Gini impurity would be based on these priors adjusted
shares of the data (or node)
The class breakdowns in the child nodes (left and right) are priors adjusted
using the formulas presented earlier
Spreadsheet to Reproduce Results
© Copyright Salford Systems 2013
Column C contains the counts for each class in the parent and child nodes
Column H at the top records the priors
Column G displays the priors adjusted shares (raw shares are in Column D)
Column F displays raw and priors adjusted child node probabilities
Column J displays the Gini diversity in the parent and child nodes and the
improvement generated by the weighted average of the child diversities
All we need to input are the class counts and the priors and formulas do the rest
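A minimal Python sketch of the same logic; the class counts below are invented for illustration (they are not the BAD_RARE_X.XLS data), but the priors-adjusted arithmetic follows the formulas presented earlier:

```python
def adjusted_shares(n_node, n_root, priors):
    """Priors-adjusted class shares in a node: prior_i * N_i(t) / N_i,
    normalized to sum to 1. Also returns the node's adjusted 'mass'."""
    raw = [p * nt / nr for p, nt, nr in zip(priors, n_node, n_root)]
    mass = sum(raw)
    return [r / mass for r in raw], mass

def gini(shares):
    return 1.0 - sum(s * s for s in shares)

# Invented counts for illustration (NOT the BAD_RARE_X.XLS data)
root, left, right = [700, 130], [300, 90], [400, 40]   # class 0, class 1
priors = [0.5, 0.5]                                    # PRIORS EQUAL, 2 classes

p_shares, _ = adjusted_shares(root, root, priors)      # root is always 50/50 here
l_shares, l_mass = adjusted_shares(left, root, priors)
r_shares, r_mass = adjusted_shares(right, root, priors)
p_left = l_mass / (l_mass + r_mass)    # priors-adjusted fraction going left
improvement = (gini(p_shares)
               - p_left * gini(l_shares)
               - (1 - p_left) * gini(r_shares))
print(round(improvement, 5))
```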
Conclusion
• Priors are an advanced control that the casual user need not
worry about
• The default setting is almost always reasonable and almost
always yields valuable results
• Tweaking the priors can change the details of the tree and
can alter results
– Sometimes considerably
– Can be worth running some experiments
• Further discussion in another tutorial
© Copyright Salford Systems 2013
Modeling automation Report
Develop model using a variety of strategies
Here we display results for each of the 6 major tree growing methods. Entropy
yields the best performance here. This is one of 18 different automation schemes.
© Copyright Salford Systems 2013
Summary of Variable Importance Results
Across alternative modeling strategies
© Copyright Salford Systems 2013
Performance Curves of Alternative Models
Error plotted against model complexity
Four strategies yield similar results; one yields much worse results
© Copyright Salford Systems 2013
Alternative Modeling Automation Strategies
Analyst Can Run All Strategies if desired
© Copyright Salford Systems 2013
Automated Modeling:
Vary Penalty on False Positives
© Copyright Salford Systems 2013
Accuracy among YES and NO groups
As the penalty on false positives is varied (automatically)
© Copyright Salford Systems 2013
Automatic Shaving:
Backwards Elimination of Least Important Feature
© Copyright Salford Systems 2013
Hot Spot Detection:
Search many trees for high value segments
Lift in node plotted against sample size: Examination of individual nodes
from many different trees to find best segments
© Copyright Salford Systems 2013
Tabular detail: Hot spot search for special nodes
Tree 18 Node 25 defines a segment with 85.3% of the target class
Sample size in this segment is N=265 in the test set
Clicking on any row brings up tree for examination and review
© Copyright Salford Systems 2013
Constrained Trees
• Many predictive models can benefit from Salford's patent
pending "Structured Trees"
• Trees constrained in how they are grown to reflect decision
support requirements
• In mobile phone example: want tree to first segment on
customer characteristics and then complete using price
variables
– Price variables are under the control of the company
– Customer characteristics are not under company control
© Copyright Salford Systems 2013
Visualizing separate regions of tree
© Copyright Salford Systems 2013
Constraint Dialog
Model set up specifying allowable ranges for predictors
Green indicates where in the tree the variables of each group are allowed to
appear © Copyright Salford Systems 2013
Constrained Tree
Mobile Phone Price variables appear only at bottom
Demographic and spend information at top of tree
Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom
© Copyright Salford Systems 2013
Model Deployment -1
Translate Model into Reusable Programming Code
New version supports JAVA, C, PMML, SQL, SAS®
© Copyright Salford Systems 2013
Automatically Generated Code
Can be deployed directly
© Copyright Salford Systems 2013
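To illustrate the flavor of exported scoring code, here is a hypothetical Python sketch; the split variables exist in this data set, but the thresholds and tree structure are invented purely for illustration, and real exported code mirrors the actual fitted tree:

```python
def score(record):
    """Hypothetical shape of auto-generated scoring code for a tiny tree.
    Thresholds and structure are invented for illustration only."""
    if record["TELEBILC"] <= 50.0:        # landline bill below threshold
        if record["HANDPRIC"] <= 130.0:   # handset priced low enough
            return 1                      # predicted responder
        return 0
    return 0
```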
Deployment –II
Use Salford Scoring Engine/Server
Controllable via scripting; can be deployed in batch mode on a server
© Copyright Salford Systems 2013
Cross-Validation: Part 1
• Built-in automatic method of self testing a model for
reliability
• Honest assessment of the performance characteristics of a
model
– Will the model perform as expected on previously unseen (new) data?
• Available for all principal Salford data mining engines
• The 1984 CART monograph was decisive in introducing cross-validation
into data mining
• Many important details relevant to decision trees and
sequences of models developed in the monograph for the
first time
© Copyright Salford Systems 2013
Cross-Validation is a Testing Method
• Why go through special trouble to construct a sophisticated testing
method when we can just hold back some test data?
• When working with plentiful data it makes perfect sense to reserve a
good portion for testing
– E.g. Credit risk data set with 150,000 training records and 100,000 test
records, real world example
– Direct Marketing data sets with 300,000 training records and 50,000 test
records
• Not all analytical projects have access to large volumes of data
© Copyright Salford Systems 2013
Principal Reason for Cross-Validation
Data Scarcity
• When relevant data is scarce we face a data allocation
dilemma
– If we reserve sufficient data to conduct a reliable test we find
ourselves lacking training data
– If we insist on having enough training data to build a good model we
will have little or nothing left for testing
• Train Test
• o---------------------------------------------------------------|-------------o
• A common division of data is 80% train 20% test
• With 300 data records in total this would amount to 240 train and 60 test
© Copyright Salford Systems 2013
Tough decision:
How much data to allocate to test
• Train Test
• o---------|-------------------------------------------------------------------o
• Train Test
• o------------------------------|----------------------------------------------o
• Train Test
• o-------------------------------------------------|---------------------------o
• Train Test
• o------------------------------------------------------------------------|----o
© Copyright Salford Systems 2013
Unbalanced Target Data
• In most classification studies the target (dependent variable)
data distribution is unbalanced
• Usually one large data segment (non-event) and a smaller
data segment (event) which is the subject of the analysis
– Who purchases on an e-commerce website?
– Who clicks on a banner ad?
– Who benefits from a given medical treatment?
– What conditions lead to a manufacturing flaw?
• When the data is substantially unbalanced the sample size
problem is magnified dramatically
– Think of your sample size as being equal to the smaller class
– If you only have 100 clicks that is your data set size
– Does not matter much that you have 1 million non-clicks.
© Copyright Salford Systems 2013
Cross-Validation Strategy:
Sample Re-use
• Any one train/test partition of the data that leaves enough
data for training will yield weak test results
– based on just a fragment of the available data
• But what if we were to repeat this process many times
– using different test partitions?
• Imagine the following: we divide the data into many 90/10
train/test partitions and repeat the modeling and testing
• Suppose that in every trial we get at least 75% of the test
data events classified correctly
• This would increase our confidence dramatically in the
reliability of the model performance
– Because we have multiple, at least slightly different, tests
© Copyright Salford Systems 2013
Cross-Validation Technical Details
• Cross-Validation requires a specialized preparation of the data
somewhat different than our example of repeated train/test partitioning
• We start by dividing the data into K partitions. In the original CART
monograph Breiman, Friedman, Olshen, and Stone set K=10
• K=10 has become an industry standard due both to Breiman et al. and
other studies that followed (see final slides for details)
• The K partitions should all have the same distribution of the target
variable (same fraction of events) and if possible be equal in size
– Care is needed to get this right when the data cannot be evenly divided into K parts
• This is all done automatically for you in SPM software
© Copyright Salford Systems 2013
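A minimal sketch of one way to build such stratified folds (SPM does this internally; this illustrative version uses plain Python):

```python
import random

def stratified_folds(y, k=10, seed=17):
    """Assign each record to one of k folds so that every fold has
    (almost) the same class mix as the full sample."""
    rng = random.Random(seed)
    fold_of = [0] * len(y)
    for cls in set(y):
        idx = [i for i, v in enumerate(y) if v == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            fold_of[i] = j % k          # deal this class out round-robin
    return fold_of

# The 830-record 704/126 target mix from the Euro_Telco_Mini table a few slides below
y = [0] * 704 + [1] * 126
folds = stratified_folds(y)
print([folds.count(f) for f in range(10)])   # roughly 83 records per fold
```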
Cross-Validation Train/Test Procedure:
K mutually exclusive partitions, 1 Test, K-1 Train
[Diagram: the data divided into partitions 1 through 10; in each cycle one partition is marked Test and the remaining nine are marked Learn, rotating through all partitions]
Above, each partition is in the train sample 9 times and in the test sample 1 time
Build K Models
• Once the data has been partitioned into the K parts we are ready to build
K models
– If we have 10 data partitions then we will build 10 models
• Each model is constructed by reserving one part for test and the
remaining K-1 parts for training
– If K=5 then each model will be based on an 80/20 split of data
– If K=10 then each model will be based on a 90/10 split
– There is nothing wrong with considering K=15 or K=20 or more
• In this strategy it is important to observe that each of the K blocks of data
is used as a test sample exactly once
• If we could somehow combine all the test results we would have an
aggregated test sample equal in size to that of the training data
© Copyright Salford Systems 2013
Euro_Telco_Mini.xls Data Set
Class=0 Class=1
CVCycle Learn Test Learn Test CVW
1 634 70 113 13 0.1026161
2 633 71 114 12 0.0960758
3 634 70 113 13 0.1026161
4 633 71 114 12 0.0960758
5 634 70 113 13 0.1026161
6 633 71 114 12 0.0960758
7 634 70 113 13 0.1026161
8 634 70 113 13 0.1026161
9 633 71 114 12 0.0960758
10 634 70 113 13 0.1026161
• Here we see the breakdown of the 830 record data set into the 10 CV folds
• Table shows sample counts for majority and minority classes for learn and test
partitions for each fold
• Observe that CART has succeeded in making each fold almost identical in the
learn/test division and in the balance between TARGET=0 and TARGET=1
• Last column is the WEIGHT that CART uses on each fold for certain
calculations
Confusion Matrix
Prediction Success Matrix
• In two-class (e.g. Yes/No) classification, test results can be
represented via the 2x2 confusion matrix
© Copyright Salford Systems 2013
Predicted Y=0 Predicted Y=1
Actual Y=0 20 4
Actual Y=1 1 5
Hypothetical results for the test set of a single Cross-validation fold
Note test sample is quite small but there will be a number of these (e.g. 10)
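From this hypothetical fold, the test misclassification rate works out to the off-diagonal counts over the total:
$$\frac{4 + 1}{20 + 4 + 1 + 5} = \frac{5}{30} \approx 16.7\%$$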
Aligning the CV Trees
All automatic and the user never sees this
Main CV1 CV2 CV3 CV4 CV5 CV6 CV7 CV8 CV9 CV10
Nodes 2 2 3 2 2 2 2 2 2 2 2
Complexity 0.01523 0.11543 0.04915 0.12949 0.08684 0.1178 0.09157 0.11464 0.11911 0.11201 0.10531
Nodes 4 6 4 4 4 5 4 4 5 4 4
Complexity 0.01487 0.01736 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.02285
Nodes 5 7 4 4 4 5 4 4 5 4 7
Complexity 0.01189 0.01455 0.02034 0.01598 0.03128 0.01518 0.03642 0.02188 0.01815 0.02083 0.01342
Nodes 9 8 4 8 4 9 4 9 6 8 10
Complexity 0.00893 0.01118 0.02034 0.01042 0.03128 0.01219 0.03642 0.01229 0.0114 0.01259 0.01157
• We would expect that the trees would be aligned by number of nodes and this is
approximately what happens
• CART aligns the trees by a measure of "complexity" discussed in other sessions
• Alignment is required to determine the estimated error rate of the main tree when it has
been pruned to a specific size (complexity)
• Thus when the main tree is pruned to 4 terminal nodes we align each CV tree appropriately.
Seven of the CV trees are also pruned to 4 nodes, two are pruned to 5 nodes, and one to
6 nodes
Summing the Confusion Matrices
• Each CV fold generates a test confusion matrix based on a
completely separate subset of data
• When summed the test partitions are equal to the entire
original training data
• Summing the confusion matrices yields an aggregate matrix
that is based on a sample equal to the original data set
• If we started with 300 records the assembled confusion
matrix consists of 300 test records
• Not a "trick". Each record was genuinely reserved for test
one time and was classified correctly or incorrectly in its fold
• We have thus arrived at the largest possible test sample we
could create: as if 100% of the data was used for test!
© Copyright Salford Systems 2013
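A sketch of the whole procedure using scikit-learn as a stand-in for illustration (this is not the SPM implementation; X and y are assumed to be NumPy arrays):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def cv_confusion(X, y, k=10, seed=17):
    """Fit k trees, test each on the one fold it never saw during
    training, and sum the k test confusion matrices."""
    total = np.zeros((2, 2), dtype=int)
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X[train_idx], y[train_idx])
        pred = tree.predict(X[test_idx])
        total += confusion_matrix(y[test_idx], pred, labels=[0, 1])
    # Every record lands in exactly one test fold, so total sums to len(y)
    return total   # rows = actual 0/1, columns = predicted 0/1
```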
Test Results Extracted From Cross-Validation
• Cross-validation is not a method for building a model
• Cross-validation is a method for indirectly testing a model
that on its own has no test performance results
• In classic cross-validation we throw away the K models built
on parts of the data. We keep only test results.
• Modern options for using these K different models exist and
you can save them in SPM
– Could be used in a committee or ensemble of models
– One of the CV models might turn out to be more interesting than the
main model
© Copyright Salford Systems 2013
Does Cross-Validation Really Work?
• We have tested CV by extracting a small training data set from
a much larger database
• We used CV to obtain a "simulated" test performance
• We then tested our main model against a genuine large test
sample extracted from the larger database
• Our results were always in remarkable agreement: CV gave
essentially the same results as the true test set method
• The CART monograph also discusses similar experiments
conducted by Breiman Friedman Olshen and Stone (BFOS)
• They come to the same conclusion while observing that 5-
fold cross-validation tends to understate model performance
and that 20-fold may be slightly more accurate than 10-fold
© Copyright Salford Systems 2013
How Many Folds?
• How many folds do we need to run to obtain reliable results?
• Think about 2 fold CV
– Divide the data into two parts
– First train on part 1 and test on part 2
– Then reverse roles of train and test
– Assemble results
• Problem with 2-fold CV is that we train on only half the
available data
– This is a severe disadvantage to the learning process unless we
have a large amount of data
• The spirit of CV is to use as much training data as possible
© Copyright Salford Systems 2013
How many CV folds?
• In the original CART monograph the authors Breiman,
Friedman, Olshen and Stone discussed some experiments
• Using small numbers such as 5-fold was typically
pessimistic
– Results suggested the model was not as good as it really was
• Using a substantial number of folds such as 20 was
generally only slightly more accurate than 10-fold
– CART authors suggested 10-fold as a default
– Results hold for classification problems
• These classification model results were re-confirmed in a 1995
paper by Ronny Kohavi
– A Study of Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection. In International Joint Conference on Artificial Intelligence
(IJCAI 1995)
© Copyright Salford Systems 2013
Creating Your Own Folds:
Needs to be done with care with smaller samples
• Suppose you have 100 records divided as
– 92 records Y=0
– 8 records Y=1
• Each fold must have at least one record for each target
class
• Best we can do then is to have 8 folds
• But we cannot divide 92 into 8 equal parts
– 7 parts with 11 records Y=0 (response rate=.0833)
– 1 part with 15 records Y=0 (response rate=.0625)
• Better to divide as
– 4 parts with 11 records Y=0 (response rate=.0833)
– 4 parts with 12 records Y=0 (response rate=.0769)
– More equal balance across the folds yields more stable results
© Copyright Salford Systems 2013
Points to Remember
• The "main" model in CV is always built on all the training data
– Nothing is held back for testing
• If you were to run CV in several different ways
– Vary the number of folds
– Vary construction of CV folds by varying random number seed
• You would always get the exact same main model
– Only the estimates of test performance could differ
• Are the results sensitive to these parameters?
– BATTERY CV re-runs the analysis with different numbers of folds
• Larger numbers should converge
– BATTERY CVR uses the same number of folds but creates the K partitions
based on different random number seeds
• Is expected to yield reasonably stable results
• Unstable results suggest considerable uncertainty regarding your model
© Copyright Salford Systems 2013
Cross-Validation: Part II
• In part I we reviewed the main ideas behind cross-validation
• We pointed out that CV is a method for testing a model
• Especially useful when there is a shortage of data but can
be used in any circumstance
• A main model is built on all training data with nothing held
back for testing
• An additional set of K different models are built on different
partitions of the data holding back some of the data for test
• The test results for the K models are aggregated and then
used as an estimate of the test set performance of the
"main" model
© Copyright Salford Systems 2013
Cross-Validation Train/Test Procedure:
K mutually exclusive partitions, 1 Test, K-1 Train
[Diagram: the data divided into partitions 1 through 10; in each cycle one partition is marked Test and the remaining nine are marked Learn, rotating through all partitions]
Above, each partition is in the train sample 9 times and in the test sample 1 time
Alignment of Results
• In this session we discuss a somewhat technical topic
related to the mechanics of aligning test results from K CV
models and the main model
• Recall that CART grows a large tree and then prunes it back
• Back pruning is conducted via "cost-complexity"
• Back pruning might prune off more than one terminal node
at a time
• Back pruning might prune back several nodes along the
same branch
• CV generates K different models each with its own maximal
tree and its own sequence of back-pruned trees
© Copyright Salford Systems 2013
CV Mechanics
• Main model has no test data; each CV model has test data
© Copyright Salford Systems 2013
[Diagram: the Main Model alongside CV Models 1 through 10; test results from all CV folds are combined and attributed to the main model]
CART and CV Details
• A CART tree model is actually a family of progressively
smaller tree models one of which is normally deemed
"optimal"
• So we don’t just have a main model and K CV models
• We have a main tree sequence and K CV tree sequences
• For every tree in the main sequence we need to match it up
with its corresponding tree in each CV sequence
• The most obvious way to do this is by tree size
• To estimate the error rate of the 2-node tree in the main tree
sequence match it up with the K 2-node trees found via CV
• Then proceed to match up every other tree size found
© Copyright Salford Systems 2013
CART Tree Alignment
• Matching up trees from the different sequences is much
more complicated than this
• Each CV tree has its own sequence and its own maximal
size
• These sequences may not all contain the same tree sizes
• The main tree might contain a subtree with 8 terminal nodes
but not every CV tree will contain an 8 node tree
– Back pruning sometimes skips over certain sizes jumping directly say
from 9 terminal nodes to 7
• Not all tree sequences will have the same number of nodes
in the maximal tree
© Copyright Salford Systems 2013
Alignment via Cost Complexity
• Cost complexity prunes trees by examining a trade off between error rate
(cost) and size of the tree (complexity)
• Error rate can be taken to be misclassification rate for this discussion (on
the training data)
• Suppose our maximal tree has a training data misclassification rate of
.00 (not uncommon on training data) but that the tree is very large (e.g.
1000 terminal nodes)
• Suppose we penalized terminal nodes at the rate of .0001
• Then the error rate of 0 would be counterbalanced by a penalty of
1000*(.0001)=0.10
• If we could prune off 500 nodes we would reduce the penalty to .05 but
of course our misclassification rate would probably increase
• If the increase in misclassification rate were say .04 then the total of
misclass rate + penalty would be only .04 + .05 = .09 a benefit!
© Copyright Salford Systems 2013
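In the notation of the CART monograph, the quantity being traded off is the cost-complexity measure
$$R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|$$
where $R(T)$ is the tree's (training) misclassification cost, $|\widetilde{T}|$ its number of terminal nodes, and $\alpha$ the penalty per node. The example above compares $0 + 0.0001 \times 1000 = 0.10$ against $0.04 + 0.0001 \times 500 = 0.09$.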
CART Cost Complexity Pruning
• CART automatically tests different penalties to try to induce
a smaller tree
• We always start with a penalty of 0 and then start gradually
increasing it
• To prune back we prune off the so-called "weakest-link"
which is the node that increases the misclassification rate of
the whole tree the least
• This means that the sample size of the node is taken into account
• A progressive search algorithm for finding the next penalty is
described in the CART monograph
© Copyright Salford Systems 2013
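For reference, the weakest-link criterion from the CART monograph: for each internal node $t$ compute
$$g(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}$$
where $R(t)$ is the cost of node $t$ treated as a leaf and $R(T_t)$ is the cost of the subtree rooted at $t$. The node with the smallest $g(t)$ is pruned first, and that $g(t)$ becomes the next penalty value in the sequence.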
Cost-Complexity is the key to Alignment
• For every CART tree sequence a specific penalty on nodes
(e.g. .001) leads immediately to exactly one tree of a specific
size
• We can only find this tree by going through the pruning
sequence (no shortcuts)
• We align the CART CV trees by the penalty (complexity)
rather than the tree size
• So for a given penalty we find the tree that corresponds to it
both in the main tree sequence and also in each CV tree
• These aligned trees are used to extract the performance
measures that will finally be assigned to the main tree of that
size
© Copyright Salford Systems 2013
Table of Alignments:
Special extract report not automatically generated
© Copyright Salford Systems 2013
• Table displays the aligned trees corresponding to each tree in the main sequence
• In the first row the main tree has been pruned to 2 nodes as have all but one of the CV
trees
• When the main tree is pruned to 7 nodes it is aligned with trees of varying sizes ranging
from 4 to 7 terminal nodes
• The complexity penalties appear under the terminal node counts
• Complexity penalties always increase as the tree becomes smaller

Mais conteúdo relacionado

Mais procurados

Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clusteringChakrit Phain
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering Ashek Farabi
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)Learnbay Datascience
 
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90minsSuccession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90minsPam Dawling
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Derek Kane
 
Autoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time SeriesAutoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time SeriesYiqun Hu
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningLeo Salemann
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forestsMarc Garcia
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Simplilearn
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptxRoshan86572
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 

Mais procurados (20)

Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Hierarchical clustering
Hierarchical clusteringHierarchical clustering
Hierarchical clustering
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
 
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90minsSuccession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
Succession planting for continuous vegetable harvests 2015 Pam Dawling 90mins
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Decision tree
Decision treeDecision tree
Decision tree
 
Time series forecasting
Time series forecastingTime series forecasting
Time series forecasting
 
Decision tree
Decision treeDecision tree
Decision tree
 
Autoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time SeriesAutoencoder Forest for Anomaly Detection from IoT Time Series
Autoencoder Forest for Anomaly Detection from IoT Time Series
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Predicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine LearningPredicting Moscow Real Estate Prices with Azure Machine Learning
Predicting Moscow Real Estate Prices with Azure Machine Learning
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
 
Genetic Algorithm
Genetic AlgorithmGenetic Algorithm
Genetic Algorithm
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptx
 
EM Algorithm
EM AlgorithmEM Algorithm
EM Algorithm
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 

Destaque

Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsSalford Systems
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Salford Systems
 
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATEREGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATEChaoyi WU
 
Analysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome VariableAnalysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome VariableArthur8898
 
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...Salford Systems
 
Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...CA Technologies
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetSalford Systems
 
Data mining for diabetes readmission
Data mining for diabetes readmissionData mining for diabetes readmission
Data mining for diabetes readmissionYi Chun (Nancy) Chien
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Mohammed Musah
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Ryan Rosario
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Decision tree lecture 3
Decision tree lecture 3Decision tree lecture 3
Decision tree lecture 3Laila Fatehy
 
Decision tree powerpoint presentation templates
Decision tree powerpoint presentation templatesDecision tree powerpoint presentation templates
Decision tree powerpoint presentation templatesSlideTeam.net
 

Destaque (20)

Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForests
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications
 
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATEREGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
REGRESSION ANALYSIS ON HEALTH INSURANCE COVERAGE RATE
 
Analysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome VariableAnalysis Of A Binary Outcome Variable
Analysis Of A Binary Outcome Variable
 
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
Applied Multivariable Modeling in Public Health: Use of CART and Logistic Reg...
 
Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...Case Study: American Family Insurance Best Practices for Automating Guidewire...
Case Study: American Family Insurance Best Practices for Automating Guidewire...
 
Predicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNetPredicting Hospital Readmission Using TreeNet
Predicting Hospital Readmission Using TreeNet
 
Data mining for diabetes readmission
Data mining for diabetes readmissionData mining for diabetes readmission
Data mining for diabetes readmission
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
R crash course
R crash courseR crash course
R crash course
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
PCA
PCAPCA
PCA
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)
 
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Principal component analysis
Principal component analysisPrincipal component analysis
Principal component analysis
 
Decision tree lecture 3
Decision tree lecture 3Decision tree lecture 3
Decision tree lecture 3
 
Decision tree powerpoint presentation templates
Decision tree powerpoint presentation templatesDecision tree powerpoint presentation templates
Decision tree powerpoint presentation templates
 
Pca ppt
Pca pptPca ppt
Pca ppt
 
Decision trees
Decision treesDecision trees
Decision trees
 

Semelhante a Using CART For Beginners with A Teclo Example Dataset

Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7Salford Systems
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Maarten Smeets
 
Presentation 7.pptx
Presentation 7.pptxPresentation 7.pptx
Presentation 7.pptxShivam327815
 
Oracle communications data model product overview
Oracle communications data model   product overviewOracle communications data model   product overview
Oracle communications data model product overviewGreenHamster
 
Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013The Edge
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientistMatthew Evans
 
Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019Phil Watt
 
Oracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & TuningOracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & TuningChris Muir
 
Salford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologySalford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologyVladyslav Frolov
 
Chapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction MethodsChapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction MethodsBenjJamiesonDuag2
 
Data analytics for engineers- introduction
Data analytics for engineers-  introductionData analytics for engineers-  introduction
Data analytics for engineers- introductionRINUSATHYAN
 
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Johnny Miller
 
Cassandra Applications Benchmarking
Cassandra Applications BenchmarkingCassandra Applications Benchmarking
Cassandra Applications Benchmarkingniallmilton
 
Data structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdfData structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdfDukeCalvin
 

Semelhante a Using CART For Beginners with A Teclo Example Dataset (20)

Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7
 
Intro to ml_2021
Intro to ml_2021Intro to ml_2021
Intro to ml_2021
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
2013 Fieldbus Update
2013 Fieldbus Update2013 Fieldbus Update
2013 Fieldbus Update
 
EuSpRIG 2013 Introducing Morphit
EuSpRIG 2013 Introducing MorphitEuSpRIG 2013 Introducing Morphit
EuSpRIG 2013 Introducing Morphit
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!Performance Issue? Machine Learning to the rescue!
Performance Issue? Machine Learning to the rescue!
 
Presentation 7.pptx
Presentation 7.pptxPresentation 7.pptx
Presentation 7.pptx
 
Oracle communications data model product overview
Oracle communications data model   product overviewOracle communications data model   product overview
Oracle communications data model product overview
 
Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013Introducing morphit -EUSPRIG 2013
Introducing morphit -EUSPRIG 2013
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientist
 
Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019Modernising the data warehouse - January 2019
Modernising the data warehouse - January 2019
 
Oracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & TuningOracle ADF Architecture TV - Development - Performance & Tuning
Oracle ADF Architecture TV - Development - Performance & Tuning
 
Salford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of TechnologySalford Systems - On the Cutting Edge of Technology
Salford Systems - On the Cutting Edge of Technology
 
Chapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction MethodsChapter 6- Classification and Prediction Methods
Chapter 6- Classification and Prediction Methods
 
Data analytics for engineers- introduction
Data analytics for engineers-  introductionData analytics for engineers-  introduction
Data analytics for engineers- introduction
 
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
Apache Cassandra For Java Developers - Why, What and How. LJC @ UCL October 2014
 
Mini datathon
Mini datathonMini datathon
Mini datathon
 
Cassandra Applications Benchmarking
Cassandra Applications BenchmarkingCassandra Applications Benchmarking
Cassandra Applications Benchmarking
 
Data structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdfData structures and algorithms Module-1.pdf
Data structures and algorithms Module-1.pdf
 

Mais de Salford Systems

Datascience101presentation4
Datascience101presentation4Datascience101presentation4
Datascience101presentation4Salford Systems
 
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Salford Systems
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningSalford Systems
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerSalford Systems
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like YouSalford Systems
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To RememberSalford Systems
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to marsSalford Systems
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher EducationSalford Systems
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingSalford Systems
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hivSalford Systems
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning CombinationSalford Systems
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSalford Systems
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998Salford Systems
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPMSalford Systems
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012Salford Systems
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning CombinationSalford Systems
 
Paradigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learningParadigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learningSalford Systems
 
Global Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate ChangeGlobal Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate ChangeSalford Systems
 

Mais de Salford Systems (20)

Datascience101presentation4
Datascience101presentation4Datascience101presentation4
Datascience101presentation4
 
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data Mining
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele Cutler
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To Remember
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher Education
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modeling
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hiv
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
 
SPM v7.0 Feature Matrix
SPM v7.0 Feature MatrixSPM v7.0 Feature Matrix
SPM v7.0 Feature Matrix
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARS
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPM
 
TreeNet Overview - Updated October 2012
TreeNet Overview  - Updated October 2012TreeNet Overview  - Updated October 2012
TreeNet Overview - Updated October 2012
 
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles and CART  Decision Trees:  A Winning CombinationTreeNet Tree Ensembles and CART  Decision Trees:  A Winning Combination
TreeNet Tree Ensembles and CART Decision Trees: A Winning Combination
 
Text mining tutorial
Text mining tutorialText mining tutorial
Text mining tutorial
 
Paradigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learningParadigm shifts in wildlife and biodiversity management through machine learning
Paradigm shifts in wildlife and biodiversity management through machine learning
 
Global Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate ChangeGlobal Modeling of Biodiversity and Climate Change
Global Modeling of Biodiversity and Climate Change
 

Último

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
Using CART for Beginners with a Telco Example Dataset

  • 10. CART: Does its own variable selection • Embedded variable (feature) selection means that modeler can let the software make its own choice of predictors • Modeler will often want to limit the model to focus on selected inputs – Exclude ID variables and merge keys – Exclude clones of the dependent variable – Exclude data pertaining to the future (relative to the dependent) – E.g. restrict a model to easily available predictors – Test predictive power of purchased external data • Modeling automation can allow exploration of a vast space of pre-selected predictors (see later slides) © Copyright Salford Systems 2013
  • 11. In this example we run a CART model • CART completes the analysis and gives access to all results from the NAVIGATOR – Shown on next slide • Upper section displays the tree of a selected size – number of terminal nodes • Lower section displays the error rate for trees of all possible sizes • Green bar marks the most accurate tree • We display a compact 10 node tree for further scrutiny © Copyright Salford Systems 2013
  • 12. CART Model Viewer Access reports and drill into model details Most accurate tree is marked with the green bar. Above we select the 10 node tree for convenience of a more compact display. Note train/test area under ROC curve © Copyright Salford Systems 2013
  • 13. Root Node: Hover Mouse Tree starts with all training data Displays details of TARGET variable in overall training data Above we see that 15.2% of the 830 households accepted the offer Goal of the analysis is now to extract patterns characteristic of responders © Copyright Salford Systems 2013
  • 14. Goal is to split node: separate responders • Details of root node split • If we could only use a single piece of information to separate responders from non-responders CART chooses the HANDSET PRICE • Those offered the phone with a price > 130 contain only 9.9% responders • Those offered a lower price respond at 21.9% © Copyright Salford Systems 2013
  • 15. CART Splitting Rules • We discuss the details later • Here we just point out that the split CART displays is – "the best of all possible splits" • Subject to the splitting criteria you have chosen and any constraints imposed • How do we know this split is "best"? • Because CART actually tries all possible splits looking for the best – Exhaustive brute force search – Advanced algorithms used to make this search fast – As much as 100 times faster than other decision trees © Copyright Salford Systems 2013
  • 16. Grow progressively bigger tree: One split at a time © Copyright Salford Systems 2013 • Binary recursive partitioning repeated until further splitting impossible (e.g. data exhausted) • This leads us to the largest possible or "maximal" tree
  • 17. Maximal tree is raw material for best model © Copyright Salford Systems 2013 • Goal is to find the optimal tree embedded inside the maximal tree • Will find the optimal tree via "pruning" • Like backwards stepwise regression • Challenge: A tree with 100 terminal nodes can be pruned back to 99 terminal nodes by eliminating any one of the 99 penultimate nodes • Now the 99 new terminal nodes can be cut back to 98 by eliminating any one of the surviving 98 penultimate nodes • Something like 99! possible trees. How do we find the best?
  • 18. Pruning Sequence • CART automatically generates a pruning sequence which develops a preferred sequence of progressively smaller trees • We can prove that for a given tree size the CART tree in the sequence will be the best performing tree of all possible trees of that size • In our sequence, the 10 node tree is guaranteed to be more accurate than any other 10 node tree you could extract from the maximal tree • You as the user never need to worry about this • "Better" is defined in terms of performance on the training data as we need the tree sequence before we can test © Copyright Salford Systems 2013
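The nested pruning sequence can be reproduced outside of SPM. Below is a minimal sketch using scikit-learn's cost-complexity pruning, which is only an analogue of Salford CART's pruning sequence, not its actual implementation; the dataset is synthetic and every parameter value is illustrative.

```python
# Sketch: grow a maximal tree, recover its pruning sequence, and let test
# data pick the best pruned subtree (scikit-learn analogue, not Salford SPM).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=830, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# The pruning path defines a nested sequence of progressively smaller trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Fit one tree per pruning step and keep the one with the best test accuracy
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_te, y_te),
)
print(best.get_n_leaves(), "terminal nodes; test accuracy", best.score(X_te, y_te))
```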
  • 19. Error Curve: Plots Accuracy vs Model Size © Copyright Salford Systems 2013 • Requires test data • Can use cross-validation (sample reuse) if data is scarce • Curve typically U-shaped • Too small is not good and neither is too large • Can look at any tree in the sequence of pruned subtrees • Error is what BFOS call an “honest” estimate of model performance
  • 20. Pick a modest sized tree to examine Note the high response in this RED colored node: 38.5% in this segment vs. 15.2% overall, Lift = 2.53 © Copyright Salford Systems 2013
  • 21. Navigator allows access to all model info • The terminal nodes are color coded to represent results – RED nodes are "hot" and contain high concentrations of the class of interest (buyers) – BLUE nodes are "cold" and contain very low concentrations of the class of interest – PINK and WHITE nodes have moderate concentrations • We first look to see if we have any RED nodes – Explore any red nodes via mouse hover • Then we drill down to see a tree schematic revealing the main drivers of the tree © Copyright Salford Systems 2013
  • 22. Select "splitters" View Selects a streamlined overview of the tree showing ONLY primary splitters © Copyright Salford Systems 2013
  • 23. Model Overview: Main Drivers (Red = Good Response, Blue = Poor Response) High values of a split variable always go to the right; low values go left © Copyright Salford Systems 2013
  • 24. Examine Extreme Right-most Terminal Node • Hover mouse over node to see inside • Even though this node is on the "high price" side of the tree it still exhibits the strongest response across all terminal node segments (43.5% response) • Rules defining this node are shown on next slide © Copyright Salford Systems 2013
  • 25. Rules can be extracted in a variety of languages • Here we select rules expressed in C for one node of interest. The entire tree can also be rendered in Java, XML/PMML, or SAS © Copyright Salford Systems 2013
  • 26. Continuing down the tree • We note that even if the new product is offered at a high price we can still find prospects very interested: – Those that have a high average landline bill and own a pager – This group displays greatest probability of response (43.5%) © Copyright Salford Systems 2013
  • 27. Classic Detailed Tree Display Analyst can select details to be displayed © Copyright Salford Systems 2013
  • 28. Control Over Details Displayed in Nodes At left, an example in which the class bar chart is displayed Separate controls for internal and terminal nodes © Copyright Salford Systems 2013
  • 29. Configure Print Image Interactively Shrink to one page, include header/footer © Copyright Salford Systems 2013
  • 30. Tree Performance Measures and Principal Message • In addition to the details of the tree (splits, split values) • Variable Importance Ranking • Confusion Matrix (Prediction Success Matrix) • Gains, ROC © Copyright Salford Systems 2013
  • 31. Variable Importance Ranking (Relative impact on outcomes) Three major ways of computing variable importance. Above, the default display. © Copyright Salford Systems 2013
  • 32. Predictive Accuracy (How often right, how often wrong) This model is not very accurate but ranks responders well © Copyright Salford Systems 2013
  • 33. Gains Curve In the top decile the model captures about 23% of responders © Copyright Salford Systems 2013
  • 34. Performance Evaluation: ROC Curve © Copyright Salford Systems 2013
  • 35. Observations on CART Tree Contrasts with Conventional Stats • CART leverages rank order of predictor to split – Transforming predictor X into Log(X) will not change the tree – Of course the tree will be expressed in terms of Log(X) but this will not change the location of the split – The traditional statistician's experiments with alternative transforms are unnecessary • CART is immune to outliers in predictors – Suppose X has values 1,2,3,…,100, 900 – To CART this is the same as 1,2,3,…,100, 101 – All CART "sees" is the rank order • We will see later that CART has built-in missing value handling • So no worry about outliers, missing values, transformations © Copyright Salford Systems 2013
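The rank-order invariance claim is easy to check. The sketch below assumes scikit-learn's DecisionTreeClassifier as a stand-in for CART and uses made-up data: the same tree is fit on X and on Log(X), and the predictions match row for row even though the displayed split thresholds differ.

```python
# Sketch: a monotone transform (log) leaves a decision tree's partition of
# the data, and therefore its predictions, unchanged.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(500, 1))                     # strictly positive predictor
y = (X[:, 0] > 40).astype(int) ^ (rng.random(500) < 0.1)   # noisy threshold target

t1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(np.log(X), y)

# Same rank order of values -> same split partitions -> identical predictions
print(np.array_equal(t1.predict(X), t2.predict(np.log(X))))  # True
```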
  • 36. CART Methodology: Partition Data Into Two Segments • Partitioning line parallel to an axis • Root node split first – 2.450 – Isolates all the type 1 species from rest of the sample • This gives us two child nodes – One is a Terminal Node with only type 1 species – The other contains only type 2 and 3 • Note: entire data set divided into two separate parts © Copyright Salford Systems 2013
  • 37. Second Split: Partitions Only Portion of the Data • Again, partition with line parallel to one of the two axes • CART selects PETALWID to split this NODE – Split it at 1.75 – Gives a tree with a misclassification rate of 4% • Split applies only to a single partition of the data • Each partition is analyzed separately © Copyright Salford Systems 2013
  • 38. Discriminant Analysis Uses Oblique Lines • Linear combinations are difficult to understand and explain • CART does permit "oblique" splits based on linear combinations of small sets of variables but this is rarely desirable © Copyright Salford Systems
  • 39. CART Representation of a Surface Model clearly non-linear Height of bar represents probability of response Remaining axes represent values of two predictors Greatest prob of response here in corner to the right
  • 40. CART Splitting Process • Standard splits are based on ONE predictor and take the form of a database RULE • A data record goes left if splitter_variable <= split value • Examples: A data record goes left • if AGE<=35 • if CREDIT_SCORE <= 700 • if TELEPHONE_BILL <= 50 © Copyright Salford Systems 2013
  • 41. Searching all splits facilitated by sorting • On left we sort by TELEBILC, on right by TRAVTIMR • Test smallest value first, then next smallest, etc moving all the way down the column • The arrow shows a split sending 10 cases to the left and all other data to the right
  • 42. Example Root Node Split Continuous Splitter © Copyright Salford Systems 2013 From our Euro_telco_mini.xls example Split is TELEBILC <= 50
  • 43. Alternative Split Points What if we split the data at TELEBILC <= 25? © Copyright Salford Systems 2013 Note that the response rates of the two nodes under this split are very similar They are much different after splitting at the optimal value
  • 44. Two splits separate quite differently © Copyright Salford Systems 2013 The first pane shows two segments with 14.3% and 15.5% response The second pane shows two segments with 12.7% and 19.8% response Our goal in CART is to generate substantially different segments and we accomplish this by experimenting with every possible split value for every predictor
  • 45. CART Splitting Process: More • Splitter variables need not be numeric, they can be text • Splitter variables need not be ordered • A data record goes left • if CITY$ = "London" OR "Madrid" OR "Paris" • if DIAGNOSIS = 111 OR 35 OR 9999 © Copyright Salford Systems 2013
  • 46. • CART considers all possible splits based on a categorical predictor • Example: four regions - A, B, C, D can be split 7 ways (2^3 - 1 = 7) • Each decision is a possible split of the node and each is evaluated • Note: A on the left and B,C,D on the right is the same split as its mirror image A on the right and B,C,D on the left • So we only list one version of this split – It is which cases stay together that matters not which side of the tree they are on Splits on K-level categorical predictors: 2^(K-1) - 1 ways to split
     Left    Right
  1  A       B, C, D
  2  B       A, C, D
  3  C       A, B, D
  4  D       A, B, C
  5  A, B    C, D
  6  A, C    B, D
  7  A, D    B, C
© Copyright Salford Systems 2013
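The enumeration above is simple to reproduce. A minimal sketch, assuming nothing beyond the Python standard library: fixing one level on the left removes each split's mirror image and yields exactly the 2^(K-1) - 1 count.

```python
# Sketch: enumerate the 2^(K-1) - 1 distinct left/right splits of a
# categorical predictor, ignoring mirror images.
from itertools import combinations

def categorical_splits(levels):
    """Yield each distinct (left, right) split exactly once."""
    k = len(levels)
    rest = levels[1:]
    # Pin levels[0] on the left so a split and its mirror are not both listed
    for r in range(0, k - 1):
        for combo in combinations(rest, r):
            left = {levels[0], *combo}
            right = set(levels) - left   # always non-empty since r <= k - 2
            yield left, right

splits = list(categorical_splits(["A", "B", "C", "D"]))
print(len(splits))                       # 7, i.e. 2**(4 - 1) - 1
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```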
  • 47. Categorical Split Caution: Dangers of HLCs (High Level Categoricals) • Because categorical variables generate 2^(K-1) - 1 ways to split the data high values of K can be problematic • K=33 is not an unusually large number of levels yet allows for about 4 billion ways to split the data • When the number of possible splits exceeds the number of records in the data the categorical variable has an advantage over any continuous splitter – A continuous variable with a unique value in every row of the data gives us a choice of split points equal to the number of rows of data • Later we will discuss several ways to deal with HLCs including repackaging the high cardinality categoricals into lower cardinality versions and penalties © Copyright Salford Systems 2013
  • 48. Example Root Node Split Categorical Splitter © Copyright Salford Systems 2013 From our Euro_telco_mini.xls example Observe that we have to LIST the values that go to each child node
  • 49. CART Competitor Splits • The CART mechanism for splitting data is always the same • We are given a block of data – Could be all of our data and we are starting from scratch – Could be a small part of our data obtained after already doing a lot of slicing and dicing • When we work with a block of data we do not take into account how we got to that block of data • We do not consider any information which might be available outside of the block of data • The block of data to be analyzed is our entire universe and nothing else exists for us © Copyright Salford Systems 2013
  • 50. Getting Ready to Split • For a block of data to be split – It must contain a sufficient number of data records (ATOM) – We can tell CART what the minimum must be – Default is just TWO records – In large database analysis we might reasonably set the minimum quite a bit higher – ATOM values such as 10, 20, 50, 100, 200 have cropped up in our practical work • If you are working with a small database such as those encountered in biomedical research (e.g. 200 records total) you will want to allow the ATOM size to be small • If you are working with hundreds of thousands or millions of records there is no harm in trying a minimum size like 200 © Copyright Salford Systems 2013
  • 51. Still Getting Ready to Split • If we have a classification problem such as modeling response to a marketing offer where there are two outcomes – Responded – Did Not Respond • To be splittable the block of data cannot be "pure", i.e. composed of all responders or all non-responders – True regardless of how large the block of data is – Splitting is designed to separate the responders from the non-responders so we need a mixture to have something to do • The data records cannot all have exactly the same values for the predictors – CART will be looking for a useful difference in a predictor between responders and non-responders © Copyright Salford Systems 2013
  • 52. Observation on Dummy Variable Predictors • If you split a node using a continuous variable there is always the chance that this same variable is used again in a subsequent split for descendent nodes • Once a node is split with a dummy variable this variable can never be used again in descendant nodes – Because a descendant node will contain either all 0 or all 1 values for this variable. Hence it cannot split. • If a dummy variable is introduced into the tree below the root it might appear in more than one location in the tree – But one use will never be the ancestor of the other use © Copyright Salford Systems 2013
  • 53. Making The Split • To split the block of data (which we will henceforth refer to as splitting the node) we search each available predictor • For every predictor we make a trial split at every distinct value of the predictor • For each trial split we compute a goodness of split measure normally referred to as the "improvement" • For each predictor we find the split value that yields the best improvement • Once every predictor has been searched to find the best split point we rank the splitters in descending order and then use the best overall splitter to grow the tree © Copyright Salford Systems 2013
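A compact sketch of this search loop, written from the description above rather than from Salford's code: it uses the Gini impurity (introduced formally in a later slide) and ignores surrogates, priors, and missing values.

```python
# Sketch: exhaustive best-split search. For every predictor, try every
# distinct value as a split point and keep the split with the best
# Gini improvement (simplified; no surrogates, priors, or missings).
import numpy as np

def gini(y):
    """Gini impurity 1 - sum(p_i^2) of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    parent = gini(y)
    best = (None, None, 0.0)                  # (column, split value, improvement)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j])[:-1]:     # every distinct value but the max
            go_left = X[:, j] <= v
            p_left = go_left.mean()
            # improvement = i(t) - pL * i(tL) - pR * i(tR)
            imp = parent - p_left * gini(y[go_left]) \
                         - (1 - p_left) * gini(y[~go_left])
            if imp > best[2]:
                best = (j, v, imp)
    return best

X = np.array([[25, 1], [40, 0], [55, 1], [62, 0], [30, 1]], dtype=float)
y = np.array([1, 0, 0, 0, 1])
print(best_split(X, y))   # (0, 30.0, 0.48): a perfect split on column 0
```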
  • 54. Ranked List of Splitters • The ranked list of splitters is also known as the competitor list • CART always computes the entire list as this is the only way to know for sure which split is best • To save space CART normally only displays the top 5 competitors within a node – You can request a larger number in your options settings • The root node at the top of the tree always displays the complete list of competitors even if there are thousands of predictors © Copyright Salford Systems 2013
  • 55. Why Care about Competitor Splits? • Useful to know if the best splitter is far better than all the rest or only slightly better • Useful to know which predictors show up near the top – Are they very different from each other or are they all reflecting the same underlying information • Useful to know if a strong but perhaps 2nd best predictor splits the data more evenly than the best – We might want to try FORCING that 2nd best predictor into the root to see what happens – Sometimes this yields an overall better tree • Pattern of top splitters may reflect problems – Top 3 competitors may all be "too good to be true" and we might need to drop them all from the analysis © Copyright Salford Systems 2013
  • 56. Surrogate Splits • Surrogate splits were first introduced by the authors of CART in their classic monograph Classification and Regression Trees, 1984. • Surrogate splits are mimics or substitutes for the primary splitter of a node • An ideal surrogate splits the data in exactly the same way as the primary split – The "association" measure reflects how close to perfect a given surrogate is © Copyright Salford Systems 2013
  • 57. Why Surrogates? • Surrogates have two primary functions: – To split data when the primary splitter is missing – To reveal common patterns among predictors in a data set • CART searches for surrogate splitters in every node in the tree – Surrogates are searched for even when there is no missing data – No guarantee that useful surrogates can be found – CART attempts to find at least five surrogates for every node but this number can be modified – Number of surrogates actually found normally varies from node to node © Copyright Salford Systems 2013
  • 58. CART and Missing Values in Deployment • CART is the only learning machine that is prepared to deal with any pattern of missing values in future data • Even if the training data have no missings CART develops strategies to deal with the eventuality of any variable or variables being missing • Some learning machines cannot handle missing values at all • Other learning machines can only deal with missing value patterns that they have been trained on (seen before) – E.g. handle X5=missing only if X5 was ever missing in the training data • CART has no such restrictions and is always ready for any pattern of missings © Copyright Salford Systems 2013
  • 59. Surrogates in Action: Euro_telco_mini.xls © Copyright Salford Systems 2013 Remember to check off CITY, MARITAL and RESPONSE as "categorical"
  • 60. Manually Prune Back to the 10-node tree © Copyright Salford Systems 2013 Just click on the blue curve in the lower panel to select a smaller easier to manage tree. Then double click on the left child of the root node (see arrow above)
  • 61. Look at the Left Child of the "Root" © Copyright Salford Systems 2013 The primary splitter predicting subscription to a new mobile phone offer is the monthly telephone bill (TELEBILC) dividing the node into spenders of more or less than $50 per month
  • 62. Surrogate for TELEBILC • If this variable were missing for any reason (database error, person recently moved, new customer) we do not know whether to move down the tree to the left or to the right • Surrogate variable can be used in place of the missing primary splitter. In this case the surrogate is of the form go to the left if MARITAL=1 • Left is associated with LOW spending on the telephone bill • CART suggests that single person households spend less while households headed by married or divorced persons spend more © Copyright Salford Systems 2013
  • 63. Surrogates and Direction • A surrogate is intended to be a substitute for the primary splitter making similar left/right decisions • But surrogates may work in the opposite direction so every continuous variable surrogate is supplied with a "tag" – The letter "s" after the split point stands for "standard" – The letter "r" after the split point stands for "reverse" • If a surrogate is negatively correlated with the primary splitter then it will split in the reverse direction – Categorical splitters are always organized so that the levels that correspond to left in the primary splitter go left in the surrogate © Copyright Salford Systems 2013
  • 64. Normally Surrogates Make Sense • Our primary splitter is the average monthly spend of a household on a fixed line telephone account • Our surrogates include marital status, commute time to work, age, and the city of residence – Longer commutes are associated with larger spend on the phone – Older head of household also is associated with larger spend – We cannot interpret the CITY variable at this point because we don’t know the identity of the cities • In general surrogates help us understand the primary splitter – Especially helpful in survey research © Copyright Salford Systems 2013
  • 65. How to Compute Surrogates? • This is a technical question which we will not cover here – The CART monograph contains a wealth of technical information although it can be a challenging read • However, we will discuss the main ideas • The top surrogate is – A single variable – A single split (in the same format as any primary splitter) – Intended to mimic as closely as possible how data is partitioned by the primary splitter into LEFT and RIGHT nodes • To get a surrogate think of generating a one split CART tree where the dependent variable is {LEFT or RIGHT} as defined by the primary splitter. (There are many details) © Copyright Salford Systems 2013
  • 66. What is "Association"? • Association is a measure of the strength of the surrogate • The lowest possible reported score is 0 (useless) • The highest possible score is 1 (perfect clone) • CART starts from the default rule: if you don’t know which way to send a data record down a tree go with the majority (sometimes weighted majority) • If when training the tree most cases went left then in the absence of other information also go left • The default makes mistakes of course because it always sends every record to the same majority side – Association measures how much better the surrogate is than the default rule (percent reduction in errors made) • Default rule is the "surrogate of last resort" © Copyright Salford Systems 2013
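A rough rendering of that calculation in code. This is a simplified reading of the monograph's association measure; the helper names and the 60/40 example data are invented for illustration.

```python
# Sketch: association as the percent reduction in left/right assignment
# errors relative to the go-with-the-majority default rule.
import numpy as np

def association(primary_goes_left, surrogate_goes_left):
    default_left = primary_goes_left.mean() >= 0.5       # majority direction
    default_errors = np.sum(primary_goes_left != default_left)
    surrogate_errors = np.sum(primary_goes_left != surrogate_goes_left)
    if default_errors == 0:
        return 0.0                                        # nothing to improve on
    # percent reduction in errors made, floored at 0 (useless surrogate)
    return max(0.0, (default_errors - surrogate_errors) / default_errors)

rng = np.random.default_rng(0)
primary = rng.random(200) < 0.6                  # about 60% of cases go left
surrogate = primary ^ (rng.random(200) < 0.05)   # mimics primary, 5% of flips
print(round(association(primary, surrogate), 3)) # a strong surrogate scores near 1
```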
  • 67. Competitors and Surrogates: Different Objectives © Copyright Salford Systems 2013 Competitors yield the best possible split when using that variable Surrogate yields the best possible mimic of the primary splitter and goodness of split may be sacrificed to match some aspect of the primary splitter Note that C2 is a competitor with one split point and a surrogate with a different split point
  • 68. Grow another tree on GB2000.XLS • We prefer this data set because it has no missing values making working through examples much easier • Don’t forget: CART always computes surrogates and in this way the CART tree is always prepared for future missings • We will not be trying to make sense of this tree – will look just at the mechanics • Note the root node splitter and the top surrogate © Copyright Salford Systems 2013
  • 69. Root Node Split © Copyright Salford Systems 2013 Root Splitter: M1 <= -.04645 Top Surrogate: C2 <= -.10835
  • 70. Main Splitter vs. Best Surrogate
             Main Splitter      Surrogate
             Left      Right    Left      Right
  Class 1    672       328      626       374
  Class 2    252       748      300       700
  Total      924       1076     926       1074
© Copyright Salford Systems 2013 Best Surrogate must closely match not only the record counts in the child nodes but also the distribution of the target variable
  • 71. Modeling ROOTSPLIT with CART © Copyright Salford Systems 2013 Observation: Modeling the root node split (we have to create a new variable to reflect this) will not necessarily match the surrogate report Other factors must be taken into account. Here we get the right variable but not the right split point
  • 72. Main Splitter vs. Best Surrogate vs. Modeling the Root Split As a Binary Target
             Main Splitter      Surrogate          Alternate
             Left      Right    Left      Right    Left      Right
  Class 1    672       328      626       374      598       402
  Class 2    252       748      300       700      288       712
  Total      924       1076     926       1074     886       1114
© Copyright Salford Systems 2013 Best Surrogate must closely match record counts in the child nodes and the distribution of the target variable Modeling the root split on available predictors will not match the surrogate exactly
  • 73. Variable Importance in CART • It is hard to imagine now but in 1984 when the CART monograph was first published data analysts did not generally rank variables • Although informally researchers would pay attention to t-statistics or p-values associated with the coefficients of regressions researchers frowned on the practice of ranking predictors • Since the advent of modern data analytic methods researchers expect to see a variable importance ranking for all models • It all started with CART! © Copyright Salford Systems 2013
  • 74. CART concept of Variable Importance • Variable importance is intended to measure how much work a variable does in a particular tree • Variable importance is thus tied to a specific model • A variable might be most important in one model and not important at all in a different model built on the same data • The fact that a variable is important does not mean that we need it! If we were deprived of the use of an important variable it might be that other available variables could substitute for it or do the same predictive work • Variable Importance describes the role of a variable in a specific tree © Copyright Salford Systems 2013
  • 75. Variable Importance and Tree Size • Every tree in the CART sequence has its own variable importance list • A small tree will typically have only a few important variables • A large tree will typically have many more important variables – Because with more nodes there are more chances for more variables to play a role in the tree • Usually we focus on the tree CART has identified as optimal but this should not deter you from selecting another (usually smaller) tree © Copyright Salford Systems 2013
  • 76. Splitter Improvement Scores • Recall that every splitter (and every surrogate) has an associated "improvement" score which measures how good a splitter is • The improvement score for a splitter in a node is always scaled down by the percent of data that actually pass through the node • 100% of all data pass through the root node so the root node splitter is always scaled by 100% • But a child node of the root might have say 30% of the data pass through – whatever improvement we compute for the split of that node will be multiplied by 0.30 • Splits lower in the tree have only a small fraction of the full data passing through so their adjusted improvement scores tend to be small © Copyright Salford Systems 2013
  • 77. Variable Importance Computation • To construct a variable importance score for a variable we start by locating every node that the variable split • We add up all of the improvement scores generated by that variable in those nodes • Then we go through every node in which this variable acted as a surrogate and add up all those improvement scores as well • The grand total is the raw importance score • After obtaining raw importance scores for every variable we rescale the results so that the best score is always 100 © Copyright Salford Systems 2013
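In code, the accumulation step looks roughly like the sketch below. The node list is hypothetical (the variable names echo the telco example, but the improvement numbers are made up), and real CART applies details omitted here.

```python
# Sketch: accumulate node improvement scores per variable (primary splits
# plus surrogates), then rescale so the top variable gets 100.
from collections import defaultdict

# (variable, role, improvement already scaled by fraction of data in node)
nodes = [
    ("TELEBILC", "primary",   0.120),
    ("MARITAL",  "surrogate", 0.080),
    ("HANDPRIC", "primary",   0.060),
    ("TELEBILC", "surrogate", 0.030),
    ("AGE",      "primary",   0.015),
]

raw = defaultdict(float)
for var, _role, improvement in nodes:
    raw[var] += improvement                 # primaries and surrogates both count

top = max(raw.values())
for var in sorted(raw, key=raw.get, reverse=True):
    print(f"{var:10s} {100 * raw[var] / top:6.1f}")   # best variable scores 100
```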
  • 78. Variations on Importance Scores • Breiman, Friedman, Olshen and Stone discuss one idea they ultimately rejected: – Including competitor improvement scores as well • This turns out to be a bad idea because it leads to double-counting – If a variable is the 2nd best splitter in a node there is an excellent chance that the same split will score well in the child nodes – If we were to give the splitter credit in the parent node for being a competitor we would probably end up giving the exact same split credit again lower down in the tree – Another way to think about this: a split is trying to enter the tree. If we do not accept the split right away the same split may keep trying to enter the tree lower down – We only want to give this split credit once © Copyright Salford Systems 2013
  • 79. BATTERY LOVO • Leave One Variable Out (LOVO) – Available in SPM PRO EX versions but you can accomplish the process manually as well • Take your best modeling set up including your preferred list of predictors • BATTERY LOVO runs a set of models that are identical to your preferred set up except that one variable has been excluded • To be complete we run a "drop just one variable" model for each variable in your KEEP list • If you have 20 variables then BATTERY LOVO will run 20 models (each of which will have 19 predictors) – Now rank the models from worst to best © Copyright Salford Systems 2013
  • 80. BATTERY LOVO Importance Ranking • Using the LOVO procedure tests how much our model deteriorates if we were to remove a given variable • It is sensible to say that a variable is very important if losing it damages the model substantially • Conversely, if losing a variable does no harm then we could conclude that the variable is useless • CAUTION: the LOVO ranking could be quite different from the CART internal ranking and both rankings are "right" – CART measures how much work a variable actually does – LOVO measures how much it hurts to lose a variable © Copyright Salford Systems 2013
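The manual version of BATTERY LOVO is a short loop. A sketch assuming scikit-learn and synthetic data; test accuracy stands in for whatever performance measure you prefer.

```python
# Sketch: manual leave-one-variable-out. Refit once per predictor with that
# predictor dropped, then rank variables by the damage their removal does.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def fit_score(cols):
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    return clf.fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)

baseline = fit_score(list(range(X.shape[1])))
for j in range(X.shape[1]):
    keep = [c for c in range(X.shape[1]) if c != j]
    # A large drop means the model cannot easily live without variable j
    print(f"drop var {j}: test accuracy changes by {fit_score(keep) - baseline:+.3f}")
```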
  • 81. Randomization Test • Leo Breiman introduced yet another concept of variable importance measure related to his work on tree ensembles • Start with your test data – Score this data with your preferred model to obtain baseline performance – Take the first predictor in the test data and randomly shuffle its values in the column of data – The values are unchanged but values are relocated to rows they do not belong on – Now score again. We would expect performance to drop because one predictor has been damaged. Repeat say 100 times and average the performance deterioration. – Doing this for all variables will produce performance degradation scores and the larger the score the more important the variable © Copyright Salford Systems 2013
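A sketch of that shuffle test, assuming a fitted scikit-learn-style classifier with predict_proba and using test-set ROC area as the performance measure; the function and argument names are invented for illustration.

```python
# Sketch: permutation (shuffle) importance for one predictor column.
import numpy as np
from sklearn.metrics import roc_auc_score

def shuffle_importance(model, X_test, y_test, col, n_reps=100, seed=0):
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    drops = []
    for _ in range(n_reps):
        X_shuf = X_test.copy()
        # Damage just this one predictor: same values, wrong rows
        X_shuf[:, col] = rng.permutation(X_shuf[:, col])
        drops.append(base - roc_auc_score(
            y_test, model.predict_proba(X_shuf)[:, 1]))
    return float(np.mean(drops))   # larger average drop = more important
```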
  • 82. Randomization Test • As of December 2011 this test is only available from the command line of recent versions of SPM • After growing a CART tree and saving the grove issue these commands from the command line or an SPM Notepad SCORE VARIMP=YES NPREPS=100 • You may readily run with NPREPS=30 but the results are more reliable with a larger number of replications © Copyright Salford Systems 2013
  • 83. Results from Random Shuffling: Baseline ROC = .85320 © Copyright Salford Systems 2013
  Rank  Score   ROC_After  Variable
  1     100     0.82144    M1
  2     63.21   0.83312    RES
  3     45.57   0.83873    LS
  4     25.9    0.84498    CR
  5     22.66   0.84601    C2
  6     21.29   0.84644    BU
  7     5.84    0.85135    DT
  8     4.25    0.85185    A1
  9     4.23    0.85186    PRE
  10    3.49    0.85209    OC
  11    3.18    0.85219    MAR
  12    2.29    0.85248    YM
  13    1.64    0.85268    LT
  14    0       0.8532     DP
  15    0       0.8532     TRA
  16    0       0.8532     GEN
  17    0       0.8532     A2
  18    0       0.8532     B
  19    0       0.8532     CP2
  20    0       0.8532     CD2
  21    0       0.8532     D1
  22    0       0.8532     E
  23    0       0.8532     M
  24    0       0.8532     CH
  25    0       0.8532     TY$
  • 84. Which Importance Score Should I Use? • The internal CART variable importance scores are the easiest and the fastest to obtain and are a great starting point • LOVO scores are useful when your goal is to assess whether you can live without a predictor © Copyright Salford Systems 2013
  • 85. • Importance is a function of the OVERALL tree including deepest nodes • Suppose you grow a large exploratory tree — review importances • Then find an optimal tree via test set or CV yielding smaller tree • Optimal tree SAME as exploratory tree in the top nodes • YET importances might be quite different. • WHY? Because larger tree uses more nodes to compute the importance • When comparing results be sure to compare similar or same sized trees Variable Importance Caution © Copyright Salford Systems 2013
  • 86. Train/Test Consistency Checks • Unlike classical statistics data mining models generally do not rely on training data to assess model quality • In the SPM data mining suite we are always focused on test data model performance – This is the only way to reliably protect against overfitting • Every modeling method including our classical statistical models in SPM 7.0 offers test data performance measures • Generally these measures are overall model performance indicators – Measures say nothing about internal model details © Copyright Salford Systems 2013
  • 87. CART Tree Assessment • CART uses test data performance of every tree in the back-pruned sequence of progressively smaller trees to identify the overall best performer on classification accuracy • CART also notes which tree achieves the best test data Area Under the ROC (AUROC) curve on the Navigator © Copyright Salford Systems 2013
  • 88. What more can we do? • CART performance measures have always been overall-tree scores • No specific attention is paid to node-specific performance • However, in real world applications we often want to pay close attention to individual nodes – Might use the rank order of the nodes in important decisions – Prefer to rely on nodes that are most accurate in their predictions of event rates (response) • Therefore we need an additional tool for assessing CART tree performance at the node level • Provided by the PRO EX feature we call TTC – Train/Test Consistency checks © Copyright Salford Systems 2013
  • 89. Use the GB2000.XLS data set © Copyright Salford Systems 2013 Model setup to select TARGET as the dependent variable CART as the modeling method On the TEST tab we opt for 50% randomly selected test partition
  • 90. TTC in CART and SPM PRO EX • The TTC report is available from the navigator which displays for every CART model – Look for the TTC button near the bottom of the navigator • TTC relies on separate train and test data partitions which means that TTC is not available when using cross-validation © Copyright Salford Systems 2013
  • 91. TTC Display © Copyright Salford Systems 2013 Upper panel of TTC display contains one line in the table for every sized tree Bottom row represents the 2 node tree. Top line is for largest tree grown
  • 92. TTC: Select Target Class © Copyright Salford Systems 2013 In this case TARGET=2 represents BAD which is our focus class You the modeler get to choose which class to focus on; there is no "right" class
  • 93. TTC Upper Panel © Copyright Salford Systems 2013 Rank Match: Do the train and test samples rank order the nodes in the same way (a statistical test allows for insignificant "wobbles") Direction Agreement: Do the train and test samples agree as to whether a node is "above average" or "below average" (response, lift, event rate). Again a statistical test allows for insignificant violations
  • 94. Click on 14 node tree in TTC upper panel © Copyright Salford Systems 2013 Red curve is training data and shows node specific lift (node response / overall response) Dark Blue horizontal line is the LIFT=1.0 reference line Light blue line with green triangles displays test data 3rd ranked node in train data would be ranked 1st or 2nd in test data
  • 95. TTC Details © Copyright Salford Systems 2013 For the 14 node tree we are told that agreement on "direction" fails 1 time And the rank order agreement fails 5 times (scroll to right to see this) The statistical sensitivity of the test is controlled by the z-score selected in the Thresholds area to the right of the display. Defaults are 1.00 Setting this threshold to 2.00 will allow much more train/test divergence
  • 96. Changing TTC Sensitivity Threshold © Copyright Salford Systems 2013 Changing the thresholds to 2.00 permits moderate deviations and treats them as statistical noise. After changing thresholds click on "Apply" if the display has not updated We prefer to use the 1.00 threshold as this points us to trees with very high consistency that decision makers like to see. It does point to rather small trees.
  • 97. TTC: Display for 6 node tree © Copyright Salford Systems 2013 Much more defensible tree as train and test data align very well
  • 98. Summary • TTC focuses on two types of train-test disagreement • DIRECTION: Is this node a response node or not? – We regard disagreement on this fundamental topic to be fatal • RANK ORDER: Are the richest nodes as identified by the training data confirmed in test data? – Without this we cannot defend deployment of a tree • TTC allows us to quickly identify which tree in the pruning sequence is the largest satisfying train/test consistency • TTC optimal tree is often rather close in size to Breiman’s 1 SE rule tree – But 1 SE rule does not look inside nodes at all – 1 SE rule is available for cross-validation while TTC is not © Copyright Salford Systems 2013
  • 99. Controlling Node Sizes In CART With ATOM and MINCHILD • Today’s topic is on the technical side but very easy to understand • Concepts are relevant to all Salford tree-based tools including TreeNet and Random Forests • Controlling the sizes of terminal nodes is a practical matter • If you are using CART, for example, to segment a database you might want to make it impossible to create segments that are too small • Altering terminal node size can also influence performance details of the optimal tree © Copyright Salford Systems 2013
  • 100. Background: Obtaining Optimal Trees • CART theory teaches us that we cannot arrive at the optimal tree via a stopping rule • The CART authors devoted quite a bit of energy to researching this topic • For any stopping rule it is possible to construct data sets for which that stopping rule will not work • We will end up stopping too early and we will miss important data structure • Result discovered both by experimentation and via mathematical construction © Copyright Salford Systems 2013
  • 101. Grow First Then Prune • CART methodology is thus to start with an unlimited growing phase • Grow the largest possible tree first • Think of this as a search engine for discovering possibly valuable trees • THEN use pruning to arrive at the optimal tree or a set of trees that yield both acceptable predictive performance and simplicity • CART also insists that we have a test method to make our final tree selection. That is the topic of another session. © Copyright Salford Systems 2013
  • 102. Maximum Tree Size • CART theory tells us that trees should be grown to their maximum size during the growing phase • Thus, trees should be grown until we either – Run out of data (1 record left and thus there is nothing to split) – Node impossible to split because pure (all GOOD or all BAD) – Node impossible to split because all records have identical values for predictors • Experience tells us that if you start with 1,000 records in a typical binary classification problem you should expect about 500 terminal nodes in the largest possible tree – But there could be many fewer • Let’s try for the biggest possible tree with the GB2000.xls data © Copyright Salford Systems 2013
  • 103. An Unlimited Tree Using GB2000.xls © Copyright Salford Systems 2013 To get 349 nodes we set the test method to EXPLORE, MINCHILD=2, ATOM=1
  • 104. Terminal Node Sample Sizes © Copyright Salford Systems 2013 We obtain this frequency chart by clicking the graph icon in the center left area of the navigator. We can see that many but not all terminal nodes are small.
  • 105. Bottom Left Most Part of Tree © Copyright Salford Systems 2013 We get a relatively large node to the extreme left (all class 2) Remaining three terminal nodes in this snippet are also all "pure" but much smaller Obvious why the tree has to stop here as there is nothing left to do once a node is pure Obtained by right clicking the node of interest and selecting "Display Tree"
  • 106. Practical Maximal Trees • In real world practice it may not be necessary to push the tree growth to the literal maximum • Essential to grow a large tree – Large enough to include the optimal tree • We can control the size of the maximal CART tree in a number of ways – Some controls tell CART to stop early – Other controls limit CART’s freedom to produce small nodes © Copyright Salford Systems 2013
  • 107. Key Controls over Splits: ATOM and MINCHILD • ATOM – ATOM terminates splitting along a branch of the tree when the node sample size is too small – If a node contains fewer than ATOM data records then STOP – 10 is commonly used but you might set this much larger • MINCHILD – MINCHILD prevents creation of child nodes that are too small – The smallest possible value is 1 meaning that in splitting a node we would be permitted to send 1 solitary record to a child node and all other records to the other child node – Larger values are sensible and desirable. Values such as 5, 10, 20, 30, 50 could work well depending on the data. We have used values as large as 200 © Copyright Salford Systems 2013
  • 108. Setting ATOM and MINCHILD © Copyright Salford Systems 2013 On Advanced Tab of Model Setup Parent control (ATOM) Terminal node min (MINCHILD)
  • 109. Setting ATOM and MINCHILD • ATOM: Minimum size required for a node to be a parent • MINCHILD: Minimum size allowed for a child • We recommend that ATOM be set to three times MINCHILD • ATOM must be at least twice MINCHILD to allow a split consistent with MINCHILD • If you set inconsistent values for ATOM and MINCHILD they will be reset automatically to be consistent • To get the control you want be sure that ATOM is at least twice MINCHILD © Copyright Salford Systems 2013
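For readers who work in scikit-learn, the closest built-in analogues are shown below; the parameter names differ from SPM's and the mapping is approximate, not official.

```python
# Sketch: rough scikit-learn analogues of ATOM and MINCHILD (an assumed
# mapping, not Salford terminology).
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=30,   # plays the role of ATOM: 30 records to be a parent
    min_samples_leaf=10,    # plays the role of MINCHILD: no child under 10 records
    random_state=0,
)
# Following the advice above, the parent minimum is three times the leaf
# minimum, leaving the search some room to place the final split.
```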
  • 110. ATOM and MINCHILD • ATOM controls the right to be a parent • Parent must generate two children • Parent must contain enough data to be able to fill two child nodes • So parent must have at least 2*MINCHILD records © Copyright Salford Systems 2013
  • 111. ATOM and MINCHILD • By allowing ATOM to be three times MINCHILD you give CART some flexibility in finding the split 10 records 10 records • Min-------------------------------|-----------------------------------Max split Suppose ATOM=20 and MINCHILD=10. Then we must split this node into two exactly equal child nodes of 10 records each. There is no flexibility here • If no such split can be found because of clumping of values of the variable then the node cannot be split on that variable © Copyright Salford Systems 2013
  • 112. ATOM is 3 times MINCHILD 10 records 10 records 10 records • Min------------------*------|--------------------*--------------------Max left child ..….. split region…... right child • In the example above ATOM=30 and the region of possible splitting points lies in between the two asterisks • There can be just one split point. So long as the smaller side has at least 10 records (in this example of MINCHILD=10) there is freedom to choose • To give CART flexibility as to where to locate this last split (at the bottom of the tree) we need to have ATOM > 2*MINCHILD • Not mandatory but worth keeping in mind. So first choose MINCHILD and then set ATOM sensibly © Copyright Salford Systems 2013
  • 113. An Unappealing Node Split: Could be prevented by using a larger MINCHILD © Copyright Salford Systems 2013 Only one record is sent to the right and the remaining 1999 records go left Can prevent such splits with a control which does not allow a child to be created with fewer than the specified number of records
  • 114. Experiment to get Best Settings © Copyright Salford Systems 2013 SPM PRO EX Battery Tab of Model Setup Select ATOM and MINCHILD Modify the values to be tested, optionally We used a 50% random sample for testing
  • 115. Choosing ATOM and MINCHILD © Copyright Salford Systems 2013 Settings of ATOM=10 and MINCHILD=5 yield a Rel. error within 1% of the literal best
  • 116. Direct Control Over Tree Size (Almost) • You also have the option of LIMITing the tree in a variety of ways including limiting the DEPTH of the tree • To get to the LIMITS menu item you must first go to the Classic Output © Copyright Salford Systems 2013
  • 117. Growing Limits Dialog © Copyright Salford Systems 2013 DEPTH=1 will allow just one split Controlling tree size via a DEPTH limit may yield inferior results We tend to use it only when wanting extremely small trees such as one split
  • 118. LIMITS Details • A tree of depth=1 can have only two terminal nodes • With each additional depth level we allow for a doubling of the number of terminal nodes • Potential sizes are then 2,4,8,16 etc. • However, depth limits do not guarantee a specific number of terminal nodes only that no terminal node will be deeper than was allowed © Copyright Salford Systems 2013
  • 119. LIMIT DEPTH=1 © Copyright Salford Systems 2013 We sometimes want to start a CART analysis by splitting just the ROOT node and then reviewing the entire ranked list of potential splitters Mostly useful for very large data sets as this reduces compute time substantially
  • 120. LIMIT DEPTH=2 © Copyright Salford Systems 2013 Maximum length of any branch will allow two splits between the root node and any terminal node. But some branches might stop early due to pre-pruning.
  • 121. Depth Limit=3 Method GINI © Copyright Salford Systems 2013 With METHOD GINI you may not get every branch of the tree exhibited to the full depth you wanted (due to a technical matter – "pre-pruning")
  • 122. Depth Limit=3 METHOD PROB © Copyright Salford Systems 2013 You have a better chance of getting every branch grown out to full depth using METHOD PROB
  • 123. Concluding Remarks • Setting ATOM (smallest legal parent) and MINCHILD (smallest legal child) can help to speed up large database runs • Modest limitation will not harm performance if we take care with the settings • Can and should use experimentation to find best settings • In some circumstances setting these controls to values larger than their minimums can improve performance on test data © Copyright Salford Systems 2013
  • 124. CART and the PRIORS Parameter • If you are a casual user of CART you probably can get by without knowing anything about PRIORS • The default settings of CART handle PRIORS in a way that is well suited for almost all classification problems • A casual user will probably not want to review or understand the more technical output which is printed to the plain text "classic output" window • BUT there are some very effective uses of CART that require judicious manipulation of the PRIORS settings • Therefore a basic understanding of PRIORS may be helpful and worth the effort © Copyright Salford Systems 2013
  • 125. Classic Reference • The original CART monograph, published in 1984, remains one of the great classics of machine learning • Classification and Regression Trees by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone, CRC Press • Available also in paperback and as e-book from Amazon: • http://www.amazon.com/Classification-Regression-Trees-Leo-Breiman/dp/0412048418/ • Not the easiest reading but well worth having as a reference and contains fascinating discussions regarding the decisions the authors made in crafting CART • Contains extensive discussion of priors as well as all major concepts relevant to CART. Still worthwhile reading. © Copyright Salford Systems 2013
  • 126. CART Monograph Details © Copyright Salford Systems 2013
  • 127. For The Casual User • Thinking about a binary 0/1 classification problem we have two ways of evaluating a CART generated segment – Assign the segment to the majority class (more than 50%) – If there are more 1s than 0s then the segment is labeled "1" – Assign the segment to the class with a LIFT greater than 1 – We start with a baseline event rate (fraction of 1s in the data) – Look at the ratio of the event rate in the node to the event rate in the sample • Ratio of event rate in segment to event rate in root – Any segment with a better than baseline event rate is labeled "1" • CART by default uses the LIFT concept for making decisions (known in CART-speak as PRIORS EQUAL) • You can elect to use the first method via PRIORS DATA © Copyright Salford Systems 2013
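The two decision rules reduce to a few lines of code. A sketch using the 38.5% node and 15.2% baseline from the earlier response-node example; the function names are invented.

```python
# Sketch: majority-rule labeling (PRIORS DATA flavor) versus lift-rule
# labeling (PRIORS EQUAL flavor) for a binary node.
def label_majority(node_rate):
    """Label 1 only if 1s outnumber 0s in the node."""
    return 1 if node_rate > 0.5 else 0

def label_lift(node_rate, baseline_rate):
    """Label 1 if the node's event rate beats the overall baseline."""
    return 1 if node_rate / baseline_rate > 1.0 else 0

# A 38.5% response node against a 15.2% baseline: lift = 2.53
print(label_majority(0.385))        # 0: under half, so the majority rule says "0"
print(label_lift(0.385, 0.152))     # 1: well above baseline, so the lift rule says "1"
```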
  • 128. Example Split: Priors Equal © Copyright Salford Systems 2013 Almost 80% GOOD (Class 0) Remainder BAD (Class 1) Left child is considered a BAD dominant node because 36% BAD > 21.4% BAD Priors equal simply ensures that we think in these "relative to what we started with" terms
  • 129. PRIORS EQUAL or PRIORS DATA • PRIORS EQUAL is almost always the right choice – Is the DEFAULT and almost always yields useful results • PRIORS DATA focuses on absolute majority and not relative counts in the data – Will rarely work with highly unbalanced data (e.g. a 10:1 ratio of 0 to 1) • PRIORS can be expressed as a ratio – Default 1:1 – You can set priors to whatever ratio you like • 1.2:1 as we did in the previous example • 5:1 • 10:1 – Changing priors usually changes results, sometimes dramatically – Extreme priors often make getting any tree impossible © Copyright Salford Systems 2013
  • 130. Setting PRIORS Mechanics © Copyright Salford Systems 2013 To set your own PRIORS first click the SPECIFY option The default settings of 1:1 can now be changed To the left the dialog is allowing me to alter the entry for Class 0 Once entered I will be given the opportunity to make a new entry for Class 1
  • 131. If PRIORS can change results then what is right? • The results CART gives you are intended to reflect what you consider important and what makes sense given your objectives • PRIORS EQUAL usually reflects what most people want • If tweaking the PRIORS and changing them gives you better results given your objectives then use the tweaked priors © Copyright Salford Systems 2013
  • 132. Advice on PRIORS • Start with the default of EQUAL – Most users never get beyond this! • BATTERY PRIORS – CART PRO EX runs an automatic sweep across dozens of different settings to display the consequences of tweaking the priors – Results are then summarized in tables and charts – Useful when you want to achieve a specific balance of accuracy across the dependent variable classes – Choose the setting that is practically best • Otherwise, you can experiment manually to measure the impact of a change © Copyright Salford Systems 2013
  • 133. PRIORS: Under the Hood • To understand how PRIORS affect core CART calculations we need to start with a brief review of splitting rules • We will only discuss the Gini to illustrate the key concepts © Copyright Salford Systems 2013
  • 134. Start With Gini Splitting Rule: Two classes • Very simple formula for the two class (binary) dependent variable • Label the classes as Class 0 and Class 1 and in a specific node in a tree we represent the shares of the data for the two classes as p0 and p1 These two must sum to 1 (p0 + p1 = 1) • The measure of diversity (or impurity) in a given subset of data (e.g. a node) is given by Impurity = 1 – p0*p0 – p1*p1 • Impurity will equal 0 if either sample share is equal to 1 (100%) • Impurity will equal 0.50 when both sample shares are equal (50%) 1 – (.5*.5) – (.5*.5) = 1 - .25 - .25 = .50 © Copyright Salford Systems 2013
  • 135. Splitting Criteria and Impurity • The Gini measure is just a sensible way to represent how diverse the data is in a node (for a classification problem) – Extensive experience shows it works well, a good measure – You do have a choice of 6 different splitting methods in CART • Useful because it can be used for any number of classes – Every class has a share – Square the shares and subtract them all from 1 • We use the Gini measure as a way to rank competing splits • Split A will be considered better if it produces child nodes with less diversity (on average) than does split B • We measure the goodness of split by looking at the reduction in impurity relative to the node being split (the parent) © Copyright Salford Systems 2013
• 136. Improvement Calculation • Hypothetical Example © Copyright Salford Systems 2013 Parent Node Impurity = 0.50. Left Child Impurity = .30 (20% of data). Right Child Impurity = .20 (80% of data). The left child reduces impurity by 0.20 (0.50 – 0.30) and the right child by 0.30 (0.50 – 0.20). Weighted average child impurity is .2*.3 + .8*.2 = .22; improvement over the parent is .5 - .22 = .28
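A short sketch (illustrative Python; function name mine) that reproduces the arithmetic above:

```python
def split_improvement(parent_imp, frac_left, left_imp, right_imp):
    """Impurity reduction: parent impurity minus the weighted child average."""
    weighted_child = frac_left * left_imp + (1.0 - frac_left) * right_imp
    return parent_imp - weighted_child

# The hypothetical example: 20% of the data goes left
print(split_improvement(0.50, 0.20, 0.30, 0.20))  # ~0.28 (0.50 - 0.22)
```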
• 137. Graphing Gini Impurity (2 classes) • Impurity formula here simplifies to 2p(1-p) • Impurity is greatest when p = (1-p) = 0.5 • Impurity is low when p is near either extreme of 0 or 1, as the node is dominated by one class • Declines slowly near p = .5 and accelerates as it approaches 0 or 1 [Figure: Gini impurity plotted against p on the interval 0 to 1; the graph shows 2*[2*p*(1-p)] to make it easier to read] © Copyright Salford Systems 2013
• 138. Split Improvement Measurement (No Missing Values for Splitter) © Copyright Salford Systems 2013 [Diagram: parent node split into left and right children, each annotated with N and percent] Parent Impurity = 0.50. Left Child Impurity = 0.3967, fraction of data in left child = 55%. Right Child Impurity = 0.3457, fraction of data in right child = 45%. Weighted average of child node diversity = .55*.3967 + .45*.3457 = .3737. Overall improvement of split = .50 - .3737 = .1262
• 139. As expressed in the CART monograph: parent node impurity minus the weighted average of the impurities in each child node Δi(s, t) = i(t) - pL*i(tL) - pR*i(tR) • pL = probability of case going left (fraction of node going left) • pR = probability of case going right (fraction of node going right) • t = node • s = splitting rule • i = impurity © Copyright Salford Systems 2013
• 140. Unbalanced Data and PRIORS EQUAL • Calculations for all key quantities become weighted when we use the CART default and the original data is unbalanced • Weighting is used to calculate – Fraction of the data belonging to each class – Fraction of the data in the left and right child nodes – Gini impurity in each node – Resulting improvement of the split (reduction in impurity) • We can no longer use simple ratios • Good news is that the mechanism for weighting is very simple and easy to remember – All counts are expressed as the count in the node divided by the corresponding count in the root node © Copyright Salford Systems 2013
• 141. Calculations for Priors • Our training sample starts with N0 examples of class 0 and N1 examples of class 1 • Now look at any node t in the CART tree – N0(t) examples of class 0 – N1(t) examples of class 1 • Fraction of class 0 will now be calculated as (simplified): (N0(t)/N0) / [(N0(t)/N0) + (N1(t)/N1)] • In other words we convert every count to the ratio of a count in a node (t) to the corresponding count in the root (sample) • Then the math is the same as usual © Copyright Salford Systems 2013
• 142. What fraction of the data is in a node • Again we use ratios instead of counts to calculate • For priors equal we just average – Fraction of all the Class 0 in a node – Fraction of all the Class 1 in a node • If the priors are not equal then all ratios are first multiplied by the corresponding prior (which acts as a weight): (P0*N0(t)/N0) / [(P0*N0(t)/N0) + (P1*N1(t)/N1)] • When priors are equal the prior terms all cancel out © Copyright Salford Systems 2013
• 143. Priors Incorporated Into Splitting • pi(t) = proportion of class i in node t • Gini = 1 - Σi pi(t)^2 • If PRIORS DATA then π(i) = Ni/N and the proportions reduce to the raw data shares: pi(t) = Ni(t)/N(t) • Otherwise proportions are always calculated as weighted shares using the priors-adjusted pi: pi(t) = [π(i)*Ni(t)/Ni] / Σj [π(j)*Nj(t)/Nj] © Copyright Salford Systems 2013
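A sketch of these two formulas in Python (my variable names; π is written as `priors`, and the counts are hypothetical):

```python
def priors_adjusted_shares(node_counts, root_counts, priors):
    """p_i(t): class proportions in node t, reweighted by the priors.

    node_counts[i] = Ni(t), root_counts[i] = Ni, priors[i] = pi(i).
    With PRIORS DATA (priors equal to the root class shares) this
    reduces to the raw proportions Ni(t)/N(t)."""
    weighted = [p * nt / n for p, nt, n in zip(priors, node_counts, root_counts)]
    total = sum(weighted)
    return [w / total for w in weighted]

def gini(shares):
    """Gini = 1 minus the sum of squared class shares."""
    return 1.0 - sum(s * s for s in shares)

# An unbalanced root node (79% vs 21%) under PRIORS EQUAL is treated as 50/50:
print(priors_adjusted_shares([790, 210], [790, 210], [0.5, 0.5]))  # [0.5, 0.5]
```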
• 144. Run a Real World Example 79% Class 0 (Good), 21% Class 1 (Bad) © Copyright Salford Systems 2013 Data set: BAD_RARE_X.XLS. Model: BAD = X15 (just one predictor)
• 145. Test method: 20% random sample for test © Copyright Salford Systems 2013 We only want to look at the root node split, but the tree is quite predictive!
  • 146. Root Node Split: Under PRIORS EQUAL © Copyright Salford Systems 2013 Main splitter improvement is reported to be .06264 Observe that the left hand child is considered to be Class 1 because the node Class 1 share of 41% is greater than the root share of 21.4%
  • 147. Classic Output Typical user rarely consults classic output © Copyright Salford Systems 2013 Start by confirming the total record counts in the parent and child nodes Agrees with previous diagram in GUI
  • 148. Next Confirm Target Class Breakdown © Copyright Salford Systems 2013 Here we see the same counts for Class 0 and Class 1 as in GUI
• 149. Priors Adjusted Computations © Copyright Salford Systems 2013 Note first that the parent node is reported to have 50% class 0 and 50% class 1. This is guaranteed for the root node under priors equal. With 2 classes each is treated as if it represented half the data; with 3 classes each would be treated as if it represented 1/3 of the data. Our calculations of the Gini impurity are based on these priors-adjusted shares of the data (or node). The class breakdowns in the child nodes (left and right) are priors-adjusted using the formulas presented earlier
• 150. Spreadsheet to Reproduce Results © Copyright Salford Systems 2013 Column C contains the counts for each class in the parent and child nodes. Column H at the top records the priors. Column G displays the priors-adjusted shares (raw shares are in Column D). Column F displays raw and priors-adjusted child node probabilities. Column J displays the Gini diversity in the parent and child nodes and the improvement generated by the weighted average of the child diversities. All we need to input are the class counts and the priors; formulas do the rest
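A hedged Python sketch (mine, not the spreadsheet itself) performing the same computation end to end from class counts and priors; the counts in the example call are hypothetical:

```python
def priors_split_improvement(parent, left, right, priors):
    """Gini improvement of a split under priors, from raw class counts.

    parent, left, right are [N_class0, N_class1]; the parent is treated
    as the root node, so all ratios are taken relative to it."""
    def weights(node):
        # Priors-weighted ratios of node counts to root counts
        return [p * n / r for p, n, r in zip(priors, node, parent)]

    def node_gini_and_mass(node):
        w = weights(node)
        mass = sum(w)                   # priors-adjusted fraction of data in node
        shares = [x / mass for x in w]  # priors-adjusted class shares
        return 1.0 - sum(s * s for s in shares), mass

    g_parent, m_parent = node_gini_and_mass(parent)
    g_left, m_left = node_gini_and_mass(left)
    g_right, m_right = node_gini_and_mass(right)
    p_left, p_right = m_left / m_parent, m_right / m_parent
    return g_parent - p_left * g_left - p_right * g_right

# Hypothetical counts for an unbalanced sample under PRIORS EQUAL:
print(priors_split_improvement([790, 210], [200, 150], [590, 60], [0.5, 0.5]))
# roughly 0.106
```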
  • 151. Conclusion • Priors are an advanced control that the casual user need not worry about • The default setting is almost always reasonable and almost always yields valuable results • Tweaking the priors can change the details of the tree and can alter results – Sometimes considerably – Can be worth running some experiments • Further discussion in another tutorial © Copyright Salford Systems 2013
• 152. Modeling automation Report Develop model using a variety of strategies Here we display results for each of the 6 major tree growing methods. Entropy yields the best performance here. This is one of 18 different automation schemes. © Copyright Salford Systems 2013
  • 153. Summary of Variable Importance Results Across alternative modeling strategies © Copyright Salford Systems 2013
• 154. Performance Curves of Alternative Models Error plotted against model complexity Four strategies yield similar results; one performs much worse © Copyright Salford Systems 2013
  • 155. Alternative Modeling Automation Strategies Analyst Can Run All Strategies if desired © Copyright Salford Systems 2013
  • 156. Automated Modeling: Vary Penalty on False Positives © Copyright Salford Systems 2013
  • 157. Accuracy among YES and NO groups As penalty on false positive is varied (automatically) © Copyright Salford Systems 2013
• 158. Automatic Shaving: Backwards Elimination of Least Important Feature © Copyright Salford Systems 2013
  • 159. Hot Spot Detection: Search many trees for high value segments Lift in node plotted against sample size: Examination of individual nodes from many different trees to find best segments © Copyright Salford Systems 2013
• 160. Tabular detail: Hot spot search for special nodes Tree 18 Node 25 defines a segment with 85.3% of the target class. Sample size in this segment is N=265 in the test set. Clicking on any row brings up the tree for examination and review © Copyright Salford Systems 2013
• 161. Constrained Trees • Many predictive models can benefit from Salford's patent-pending "Structured Trees" • Trees constrained in how they are grown to reflect decision support requirements • In the mobile phone example: we want the tree to first segment on customer characteristics and then complete using price variables – Price variables are under the control of the company – Customer characteristics are not under company control © Copyright Salford Systems 2013
  • 162. Visualizing separate regions of tree © Copyright Salford Systems 2013
• 163. Constraint Dialog Model set up specifying allowable ranges for predictors Green indicates where in the tree the variables of each group are allowed to appear © Copyright Salford Systems 2013
  • 164. Constrained Tree Mobile Phone Price variables appear only at bottom Demographic and spend information at top of tree Handset (HANDPRIC) and per minute pricing (USEPRICE) at bottom © Copyright Salford Systems 2013
  • 165. Model Deployment -1 Translate Model into Reusable Programming Code New version supports JAVA, C, PMML, SQL, SAS® © Copyright Salford Systems 2013
  • 166. Automatically Generated Code Can be deployed directly © Copyright Salford Systems 2013
• 167. Deployment – II Use Salford Scoring Engine/Server Controllable via scripting; can be deployed in batch mode on a server © Copyright Salford Systems 2013
• 168. Cross-Validation: Part 1 • Built-in automatic method of self-testing a model for reliability • Honest assessment of the performance characteristics of a model – Will the model perform as expected on previously unseen (new) data? • Available for all principal Salford data mining engines • The 1984 CART monograph was decisive in introducing cross-validation into data mining • Many important details relevant to decision trees and sequences of models were developed in the monograph for the first time © Copyright Salford Systems 2013
  • 169. Cross-Validation is a Testing Method • Why go through special trouble to construct a sophisticated testing method when we can just hold back some test data? • When working with plentiful data it makes perfect sense to reserve a good portion for testing – E.g. Credit risk data set with 150,000 training records and 100,000 test records, real world example – Direct Marketing data sets with 300,000 training records and 50,000 test records • Not all analytical projects have access to large volumes of data © Copyright Salford Systems 2013
  • 170. Principal Reason for Cross-Validation Data Scarcity • When relevant data is scarce we face a data allocation dilemma – If we reserve sufficient data to conduct a reliable test we find ourselves lacking training data – If we insist on having enough training data to build a good model we will have little or nothing left for testing • Train Test • o---------------------------------------------------------------|-------------o • A common division of data is 80% train 20% test • With 300 data records in total this would amount to 240 train and 60 test © Copyright Salford Systems 2013
  • 171. Tough decision: How much data to allocate to test • Train Test • o---------|-------------------------------------------------------------------o • Train Test • o------------------------------|----------------------------------------------o • Train Test • o-------------------------------------------------|---------------------------o • Train Test • o------------------------------------------------------------------------|----o © Copyright Salford Systems 2013
  • 172. Unbalanced Target Data • In most classification studies the target (dependent variable) data distribution is unbalanced • Usually one large data segment (non-event) and a smaller data segment (event) which is the subject of the analysis – Who purchases on an e-commerce website? – Who clicks on a banner ad? – Who benefits from a given medical treatment? – What conditions lead to a manufacturing flaw? • When the data is substantially unbalanced the sample size problem is magnified dramatically – Think of your sample size as being equal to the smaller class – If you only have 100 clicks that is your data set size – Does not matter much that you have 1 million non-clicks. © Copyright Salford Systems 2013
  • 173. Cross-Validation Strategy: Sample Re-use • Any one train/test partition of the data that leaves enough data for training will yield weak test results – based on just a fragment of the available data • But what if we were to repeat this process many times – using different test partitions? • Imagine the following: we divide the data into many 90/10 train/test partitions and repeat the modeling and testing • Suppose that in every trial we get at least 75% of the test data events classified correctly • This would increase our confidence dramatically in the reliability of the model performance – Because we have multiple at least slightly different tests © Copyright Salford Systems 2013
• 174. Cross-Validation Technical Details • Cross-Validation requires a specialized preparation of the data, somewhat different than our example of repeated train/test partitioning • We start by dividing the data into K partitions. In the original CART monograph Breiman, Friedman, Olshen, and Stone set K=10 • K=10 has become an industry standard due both to Breiman et al. and other studies that followed (see final slides for details) • The K partitions should all have the same distribution of the target variable (same fraction of events) and if possible be equal in size – It takes care to get this right when the data cannot be evenly divided into K parts • This is all done automatically for you in SPM software © Copyright Salford Systems 2013
• 175. Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 Test, K-1 Train [Figure: partitions 1 through 10 shown in a row for each CV cycle; in each cycle a different partition is marked Test and the remaining nine are marked Learn] Each partition is in the train sample 9 times and in the test sample 1 time
• 176. Build K Models • Once the data has been partitioned into the K parts we are ready to build K models – If we have 10 data partitions then we will build 10 models • Each model is constructed by reserving one part for test and the remaining K-1 parts for training – If K=5 then each model will be based on an 80/20 split of data – If K=10 then each model will be based on a 90/10 split – There is nothing wrong with considering K=15 or K=20 or more • In this strategy it is important to observe that each of the K blocks of data is used as a test sample exactly once • If we could somehow combine all the test results we would have an aggregated test sample equal in size to that of the training data © Copyright Salford Systems 2013
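In outline (an illustrative sketch only; SPM performs this internally, and `fit`/`evaluate` stand in for whatever modeling and scoring routines are used):

```python
def cross_validate(folds, data, fit, evaluate):
    """folds: K disjoint lists of record indices covering the data."""
    fold_results = []
    for test_idx in folds:
        train_idx = [i for f in folds if f is not test_idx for i in f]
        model = fit([data[i] for i in train_idx])  # train on K-1 parts
        fold_results.append(
            evaluate(model, [data[i] for i in test_idx]))  # test on held-out part
    return fold_results  # one test result per fold, aggregated afterwards
```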
• 177. Euro_Telco_Mini.xls Data Set
CVCycle  Class=0 Learn  Class=0 Test  Class=1 Learn  Class=1 Test  CVW
1        634            70            113            13            0.1026161
2        633            71            114            12            0.0960758
3        634            70            113            13            0.1026161
4        633            71            114            12            0.0960758
5        634            70            113            13            0.1026161
6        633            71            114            12            0.0960758
7        634            70            113            13            0.1026161
8        634            70            113            13            0.1026161
9        633            71            114            12            0.0960758
10       634            70            113            13            0.1026161
• Here we see the breakdown of the 830 record data set into the 10 CV folds • Table shows sample counts for majority and minority classes for learn and test partitions for each fold • Observe that CART has succeeded in making each fold almost identical in the learn/test division and in the balance between TARGET=0 and TARGET=1 • Last column is the WEIGHT that CART uses on each fold for certain calculations
• 178. Confusion Matrix (Prediction Success Matrix) • In two-class (e.g. Yes/No) classification, test results can be represented via the 2x2 confusion matrix © Copyright Salford Systems 2013
            Predicted Y=0   Predicted Y=1
Actual Y=0  20              4
Actual Y=1  1               5
Hypothetical results for the test set of a single cross-validation fold. Note the test sample is quite small, but there will be a number of these (e.g. 10)
• 179. Aligning the CV Trees (all automatic; the user never sees this)
            Main     CV1      CV2      CV3      CV4      CV5      CV6      CV7      CV8      CV9      CV10
Nodes       2        2        3        2        2        2        2        2        2        2        2
Complexity  0.01523  0.11543  0.04915  0.12949  0.08684  0.1178   0.09157  0.11464  0.11911  0.11201  0.10531
Nodes       4        6        4        4        4        5        4        4        5        4        4
Complexity  0.01487  0.01736  0.02034  0.01598  0.03128  0.01518  0.03642  0.02188  0.01815  0.02083  0.02285
Nodes       5        7        4        4        4        5        4        4        5        4        7
Complexity  0.01189  0.01455  0.02034  0.01598  0.03128  0.01518  0.03642  0.02188  0.01815  0.02083  0.01342
Nodes       9        8        4        8        4        9        4        9        6        8        10
Complexity  0.00893  0.01118  0.02034  0.01042  0.03128  0.01219  0.03642  0.01229  0.0114   0.01259  0.01157
• We would expect the trees to be aligned by number of nodes and this is approximately what happens • CART aligns the trees by a measure of "complexity" discussed in other sessions • Alignment is required to determine the estimated error rate of the main tree when it has been pruned to a specific size (complexity) • Thus when the main tree is pruned to 4 terminal nodes we align each CV tree appropriately: seven of the CV trees are also pruned to 4 nodes, two are pruned to 5 nodes, and one to 6 nodes
• 180. Summing the Confusion Matrices • Each CV fold generates a test confusion matrix based on a completely separate subset of data • When summed the test partitions are equal to the entire original training data • Summing the confusion matrices yields an aggregate matrix that is based on a sample equal to the original data set • If we started with 300 records the assembled confusion matrix consists of 300 test records • Not a "trick": each record was genuinely reserved for test one time and was classified correctly or incorrectly in its fold • We have thus arrived at the largest possible test sample we could create: as if 100% of the data was used for test! © Copyright Salford Systems 2013
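A minimal sketch of the aggregation step (illustrative Python; the 2x2 layout matches the confusion matrix slide above, and the fold results are hypothetical):

```python
def sum_confusion(matrices):
    """Element-wise sum of the K per-fold 2x2 test confusion matrices."""
    total = [[0, 0], [0, 0]]
    for m in matrices:
        for i in range(2):
            for j in range(2):
                total[i][j] += m[i][j]
    return total

# Ten hypothetical folds, each like the earlier example: 30 test records apiece
fold_matrices = [[[20, 4], [1, 5]]] * 10
print(sum_confusion(fold_matrices))  # [[200, 40], [10, 50]] -- 300 test records in all
```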
  • 181. Test Results Extracted From Cross-Validation • Cross-validation is not a method for building a model • Cross-validation is a method for indirectly testing a model that on its own has no test performance results • In classic cross-validation we throw away the K models built on parts of the data. We keep only test results. • Modern options for using these K different models exist and you can save them in SPM – Could be used in a committee or ensemble of models – One of the CV models might turn out to be more interesting than the main model © Copyright Salford Systems 2013
• 182. Does Cross-Validation Really Work? • We have tested CV by extracting a small training data set from a much larger database • We used CV to obtain a "simulated" test performance • We then tested our main model against a genuine large test sample extracted from the larger database • Our results were always remarkably in agreement: CV gave essentially the same results as the true test set method • The CART monograph also discusses similar experiments conducted by Breiman, Friedman, Olshen and Stone (BFOS) • They come to the same conclusion while observing that 5-fold cross-validation tends to understate model performance and that 20-fold may be slightly more accurate than 10-fold © Copyright Salford Systems 2013
• 183. How Many Folds? • How many folds do we need to run to obtain reliable results? • Think about 2-fold CV – Divide the data into two parts – First train on part 1 and test on part 2 – Then reverse the roles of train and test – Assemble results • The problem with 2-fold CV is that we train on only half the available data – This is a severe disadvantage to the learning process unless we have a large amount of data • The spirit of CV is to use as much training data as possible © Copyright Salford Systems 2013
• 184. How many CV folds? • In the original CART monograph the authors Breiman, Friedman, Olshen and Stone discussed some experiments • Using small numbers such as 5-fold was typically pessimistic – Results suggested the model was not as good as it really was • Using a substantial number of folds such as 20 was generally only slightly more accurate than 10-fold – The CART authors suggested 10-fold as a default – Results hold for classification problems • These classification model results were re-confirmed in a 1995 paper by Ronny Kohavi – A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. International Joint Conference on Artificial Intelligence (IJCAI 1995) © Copyright Salford Systems 2013
• 185. Creating Your Own Folds: Needs to be done with care with smaller samples • Suppose you have 100 records divided as – 92 records Y=0 – 8 records Y=1 • Each fold must have at least one record for each target class • The best we can do then is 8 folds • But we cannot divide 92 into 8 equal parts – 7 parts with 11 records Y=0 (response rate = .0833) – 1 part with 15 records Y=0 (response rate = .0625) • Better to divide as – 4 parts with 11 records Y=0 (response rate = .0833) – 4 parts with 12 records Y=0 (response rate = .0769) – A more equal balance across the folds yields more stable results (see the sketch below) © Copyright Salford Systems 2013
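A sketch of how such balanced folds could be built (my own illustration, not the SPM algorithm): round-robin assignment within each class keeps fold sizes within one record of each other.

```python
import random

def stratified_folds(idx_class0, idx_class1, k, seed=0):
    """Spread each class across k folds as evenly as possible."""
    random.seed(seed)
    folds = [[] for _ in range(k)]
    for class_indices in (idx_class0, idx_class1):
        shuffled = random.sample(class_indices, len(class_indices))
        for i, record in enumerate(shuffled):
            folds[i % k].append(record)  # round-robin assignment
    return folds

# 92 records of Y=0 and 8 of Y=1 into 8 folds: every fold gets exactly
# one Y=1 record, and the Y=0 records split as four 12s and four 11s.
folds = stratified_folds(list(range(92)), list(range(92, 100)), 8)
print([len(f) for f in folds])
```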
• 186. Points to Remember • The "main" model in CV is always built on all the training data – Nothing is held back for testing • If you were to run CV in several different ways – Vary the number of folds – Vary construction of CV folds by varying the random number seed • You would always get the exact same main model – Only the estimates of test performance could differ • Are the results sensitive to these parameters? – BATTERY CV re-runs the analysis with different numbers of folds • Larger numbers should converge – BATTERY CVR uses the same number of folds but creates the K partitions based on different random number seeds • Is expected to yield reasonably stable results • Unstable results suggest considerable uncertainty regarding your model © Copyright Salford Systems 2013
• 187. Cross-Validation: Part II • In part I we reviewed the main ideas behind cross-validation • We pointed out that CV is a method for testing a model • Especially useful when there is a shortage of data but can be used in any circumstance • A main model is built on all training data with nothing held back for testing • An additional set of K different models is built on different partitions of the data, holding back some of the data for test • The test results for the K models are aggregated and then used as an estimate of the test set performance of the "main" model © Copyright Salford Systems 2013
• 188. Cross-Validation Train/Test Procedure: K mutually exclusive partitions, 1 Test, K-1 Train [Figure repeated from Part I: partitions 1 through 10 shown in a row for each CV cycle; in each cycle a different partition is marked Test and the remaining nine are marked Learn] Each partition is in the train sample 9 times and in the test sample 1 time
• 189. Alignment of Results • In this session we discuss a somewhat technical topic related to the mechanics of aligning test results from K CV models and the main model • Recall that CART grows a large tree and then prunes it back • Back pruning is conducted via "cost-complexity" • Back pruning might prune off more than one terminal node at a time • Back pruning might prune back several nodes along the same branch • CV generates K different models each with its own maximal tree and its own sequence of back-pruned trees © Copyright Salford Systems 2013
• 190. CV Mechanics • The main model has no test data; each CV model has test data © Copyright Salford Systems 2013 [Diagram: the Main Model shown alongside CV Models 1 through 10; test results from all CV folds are combined and attributed to the main model]
• 191. CART and CV Details • A CART tree model is actually a family of progressively smaller tree models, one of which is normally deemed "optimal" • So we don't just have a main model and K CV models • We have a main tree sequence and K CV tree sequences • For every tree in the main sequence we need to match it up with its corresponding tree in each CV sequence • The most obvious way to do this is by tree size • To estimate the error rate of the 2-node tree in the main tree sequence match it up with the K 2-node trees found via CV • Then proceed to match up every other tree size found © Copyright Salford Systems 2013
  • 192. CART Tree Alignment • Matching up trees from the different sequences is much more complicated than this • Each CV tree has its own sequence and its own maximal size • These sequences may not all contain the same tree sizes • The main tree might contain a subtree with 8 terminal nodes but not every CV tree will contain an 8 node tree – Back pruning sometimes skips over certain sizes jumping directly say from 9 terminal nodes to 7 • Not all tree sequences will have the same number of nodes in the maximal tree © Copyright Salford Systems 2013
• 193. Alignment via Cost Complexity • Cost complexity prunes trees by examining a trade-off between error rate (cost) and size of the tree (complexity) • Error rate can be taken to be the misclassification rate for this discussion (on the training data) • Suppose our maximal tree has a training data misclassification rate of .00 (not uncommon on training data) but that the tree is very large (e.g. 1000 terminal nodes) • Suppose we penalized terminal nodes at the rate of .0001 • Then the error rate of 0 would be counterbalanced by a penalty of 1000*(.0001) = 0.10 • If we could prune off 500 nodes we would reduce the penalty to .05, though of course our misclassification rate would probably increase • If the increase in misclassification rate were, say, .04 then the total of misclassification rate + penalty would be only .04 + .05 = .09, a benefit! © Copyright Salford Systems 2013
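The trade-off can be written as a penalized cost, R_alpha(T) = R(T) + alpha*|T|, where |T| is the number of terminal nodes. The arithmetic above in a few lines of Python (values taken from the worked example):

```python
alpha = 0.0001                      # penalty per terminal node
big_tree    = 0.00 + alpha * 1000   # error 0, 1000 terminal nodes -> cost 0.10
pruned_tree = 0.04 + alpha * 500    # error .04, 500 terminal nodes -> cost 0.09
print(big_tree, pruned_tree)        # pruning wins: 0.09 < 0.10
```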
• 194. CART Cost Complexity Pruning • CART automatically tests different penalties to try to induce a smaller tree • We always start with a penalty of 0 and then gradually increase it • To prune back we prune off the so-called "weakest link", the node that increases the misclassification rate of the whole tree the least • This means that the sample size of the node is taken into account • A progressive search algorithm for finding the next penalty is described in the CART monograph © Copyright Salford Systems 2013
• 195. Cost-Complexity is the key to Alignment • For every CART tree sequence a specific penalty on nodes (e.g. .001) leads immediately to exactly one tree of a specific size • We can only find this tree by going through the pruning sequence (no shortcuts) • We align the CART CV trees by the penalty (complexity) rather than by tree size • So for a given penalty we find the tree that corresponds to it both in the main tree sequence and in each CV tree sequence • These aligned trees are used to extract the performance measures that will finally be assigned to the main tree of that size © Copyright Salford Systems 2013
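A sketch of the lookup (my own framing, not the SPM internals): if each pruning sequence is stored as (complexity threshold, tree) pairs, with thresholds increasing as trees shrink, then for a given penalty we take the smallest tree whose threshold does not exceed it.

```python
def tree_for_penalty(sequence, penalty):
    """sequence: [(threshold, tree), ...] ordered from the largest tree
    (smallest threshold) to the smallest tree (largest threshold)."""
    chosen = sequence[0][1]
    for threshold, tree in sequence:
        if threshold <= penalty:
            chosen = tree   # this tree is optimal at penalties >= its threshold
        else:
            break
    return chosen
```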
  • 196. Table of Alignments: Special extract report not automatically generated © Copyright Salford Systems 2013 • Table displays the aligned trees corresponding to each tree in the main sequence • In the first row the main tree has been pruned to 2 nodes as have all but one of the CV trees • When the main tree is pruned to 7 nodes it is aligned with trees of varying sizes ranging from 4 to 7 terminal nodes • The complexity penalties appear under the terminal node counts • Complexity penalties always increase as the tree becomes smaller