3. ID3 (1/2)
ID3 is a strong system that
Searches the space of decision trees by hill climbing,
guided by the information gain measure
Outputs a single hypothesis
Never backtracks: it converges to locally optimal solutions
Uses all training examples at each step, unlike methods that
make decisions incrementally
Uses statistical properties of all examples: the search is less
sensitive to errors in individual training examples
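As a minimal sketch of the information gain measure ID3 hill-climbs on (the function names and the dict-per-example layout are my own, not ID3's):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy reduction obtained by splitting on a nominal attribute.
    `examples` is a list of dicts mapping attribute name -> value."""
    n = len(labels)
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[attribute], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder
```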
4. ID3 (2/2)
However, ID3 has some drawbacks
It can only deal with nominal attributes (it cannot handle continuous data)
It may not be robust in the presence of noise (it tends to overfit)
It is not able to deal with incomplete data sets (e.g. missing values)
5. How C4.5 enhances ID3 to
Be robust in the presence of noise. Avoid overfitting
Deal with continuous attributes
Deal with missing data
Convert trees to rules
6. What’s Overfitting?
Overfitting = Given a hypothesis space H, a hypothesis h ∈ H is
said to overfit the training data if there exists some
alternative hypothesis h′ ∈ H such that
h has a smaller error than h′ over the training examples, but
h′ has a smaller error than h over the entire distribution of
instances.
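The same condition can be stated compactly; the error notation below is my own (training error vs. error over the full instance distribution D):

```latex
\exists\, h' \in H :\quad
\mathrm{error}_{\text{train}}(h) < \mathrm{error}_{\text{train}}(h')
\quad\text{and}\quad
\mathrm{error}_{\mathcal{D}}(h') < \mathrm{error}_{\mathcal{D}}(h)
```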
7. Why May My System Overfit?
In domains with noise or uncertainty
the system may try to decrease the training error by completely
fitting all the training examples
8. How to Avoid Overfitting?
Pre-prune: Stop growing a branch when information
becomes unreliable
Post-prune: Take a fully-grown decision tree and discard
unreliable parts
9. Post-pruning
First, build the full tree
Then, prune it
Fully-grown tree shows all attribute interactions
Two pruning operations:
Subtree replacement
Subtree raising
Possible strategies:
error estimation
significance testing
MDL principle
10. Subtree Replacement
Bottom-up approach:
consider replacing a tree only after considering all its subtrees
Ex: labor negotiations
11. Subtree Replacement
Algorithm:
Split the data into training and validation set
Do until further pruning is harmful:
Evaluate the impact on the validation set of pruning each possible node
Select the node whose removal most increases the validation set accuracy
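This is reduced-error pruning. A minimal sketch of the loop, assuming a hypothetical tree API (internal_nodes(), accuracy(), and a to_leaf() that returns an undo closure; none of these names come from the slides):

```python
def reduced_error_prune(tree, validation_set):
    """Greedily prune while it helps: each pass tentatively collapses every
    internal node to a majority-class leaf and keeps the single collapse
    that most increases validation accuracy."""
    while True:
        best_node, best_acc = None, tree.accuracy(validation_set)
        for node in tree.internal_nodes():
            undo = node.to_leaf()               # tentatively prune this subtree
            acc = tree.accuracy(validation_set)
            undo()                              # restore the subtree
            if acc > best_acc:                  # strict: stop once pruning stops helping
                best_node, best_acc = node, acc
        if best_node is None:
            return tree                         # further pruning is harmful
        best_node.to_leaf()                     # commit the best prune
```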
12. Deal with continuous attributes
When dealing with nominal data,
we evaluated the gain for each possible value
With continuous data, we have infinitely many possible values
What should we do?
Continuous-valued attributes may take infinite values, but we
have a limited number of values in our instances (at most N if we
have N instances)
Therefore, simulate that you have N nominal values
Evaluate information gain for every possible split point of the attribute
Choose the best split point
The information gain of the attribute is the information gain of the best split
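A sketch of this procedure, reusing the entropy() helper from the ID3 snippet above (a naive O(N²) version; the one-pass speed-up mentioned on the next slide would update class counts incrementally instead):

```python
def best_split_point(values, labels):
    """Evaluate the information gain of every candidate threshold placed
    halfway between consecutive distinct sorted values; return the best."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                # equal values: no threshold here
        t = (pairs[i - 1][0] + pairs[i][0]) / 2     # halfway between the two values
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```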
14. Deal with continuous attributes
Split on temperature attribute:
E.g.: temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3
Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
Place split points halfway between values
Can evaluate all split points in one pass!
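A two-line check of this arithmetic with the entropy() helper from above:

```python
info = 6/14 * entropy(["y"]*4 + ["n"]*2) + 8/14 * entropy(["y"]*5 + ["n"]*3)
print(round(info, 3))  # 0.939
```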
15. Deal with continuous attributes
To speed up the search:
Entropy only needs to be evaluated between points of different classes
These boundary points are the only potential optimal breakpoints:
breakpoints between values of the same class cannot be optimal
16. Deal with Missing Data
Missing values are denoted “?” in C4.X
Simple idea: treat missing as a separate value
Better: split instances with missing values into pieces
A piece going down a branch receives a weight proportional to the
popularity of the branch
weights sum to 1
Info gain works with fractional instances
Use sums of weights instead of counts
During classification, split the instance into pieces in the same way
Merge probability distribution using weights
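A hedged sketch of the weight-splitting idea; the Node interface here (attribute, branches, branch_weight, is_leaf, distribution) is invented for illustration:

```python
def distribute(instance, node, weight=1.0):
    """Route an instance to leaves. On a missing value ("?") the instance is
    split into fractional pieces, one per branch, weighted by branch
    popularity (the branch_weight values sum to 1).
    Returns a list of (leaf, weight) pieces."""
    if node.is_leaf():
        return [(node, weight)]
    value = instance[node.attribute]
    if value != "?":
        return distribute(instance, node.branches[value], weight)
    pieces = []
    for v, child in node.branches.items():
        pieces.extend(distribute(instance, child, weight * node.branch_weight(v)))
    return pieces

# Classification then merges the leaves' class distributions with these weights,
# e.g. P(c) = sum(w * leaf.distribution[c] for leaf, w in pieces).
```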
17. What is CART?
Classification And Regression Trees
Developed by Breiman, Friedman, Olshen, and Stone in the early 80’s.
Introduced tree-based modeling into the statistical mainstream
Rigorous approach involving cross-validation to select the optimal
tree
One of many tree-based modeling techniques.
CART -- the classic
CHAID
C5.0
Software package variants (SAS, S-Plus, R…)
Note: the “rpart” package in “R” is freely available
18. The Key Idea
Recursive Partitioning
Take all of your data.
Consider all possible values of all variables.
Select the variable/value (X=t1) that produces the greatest
“separation” in the target.
(X=t1) is called a “split”.
If X < t1, send the data point to the “left”; otherwise, send it
to the “right”.
Now repeat same process on these two “nodes”
You get a “tree”
Note: CART only uses binary splits.
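A compact sketch of this loop (my own code, not Breiman et al.'s): it takes an impurity function as a parameter so the same skeleton serves regression and classification; the leaf value here is the node mean, i.e. the regression flavor:

```python
def grow(X, y, impurity, depth=0, max_depth=3, min_size=5):
    """Recursive binary partitioning: try every (variable, threshold) split,
    keep the one with the smallest total child impurity, recurse left/right."""
    if depth == max_depth or len(y) < min_size:
        return {"predict": sum(y) / len(y)}          # leaf: node mean
    best = None
    for j in range(len(X[0])):                       # all variables ...
        for t in sorted({row[j] for row in X}):      # ... all observed values
            L = [y[i] for i, row in enumerate(X) if row[j] < t]
            R = [y[i] for i, row in enumerate(X) if row[j] >= t]
            if L and R:
                score = impurity(L) + impurity(R)
                if best is None or score < best[0]:
                    best = (score, j, t)
    if best is None:                                 # no valid split exists
        return {"predict": sum(y) / len(y)}
    _, j, t = best
    left = [i for i, row in enumerate(X) if row[j] < t]
    right = [i for i, row in enumerate(X) if row[j] >= t]
    return {"var": j, "t": t,
            "left": grow([X[i] for i in left], [y[i] for i in left],
                         impurity, depth + 1, max_depth, min_size),
            "right": grow([X[i] for i in right], [y[i] for i in right],
                          impurity, depth + 1, max_depth, min_size)}
```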
19. Splitting Rules
Select the variable/value (X = t1) that produces the
greatest “separation” in the target variable.
“Separation” defined in many ways.
Regression Trees (continuous target): use sum of
squared errors.
Classification Trees (categorical target): choice of
entropy, Gini measure, “twoing” splitting rule.
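Hedged sketches of two of these separation measures, shaped to plug straight into the grow() skeleton above (twoing omitted); Gini is scaled by node size so that summing over the two children matches the splitting rule:

```python
from collections import Counter

def sse(ys):
    """Regression target: sum of squared errors about the node mean."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def gini(ys):
    """Categorical target: size-weighted Gini impurity of a node."""
    n = len(ys)
    return n * (1.0 - sum((c / n) ** 2 for c in Counter(ys).values()))
```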
20. Regression Trees
Tree-based modeling for continuous target variable
most intuitively appropriate method for loss ratio analysis
Find the split that produces the greatest separation in ∑ [y − E(y)]²
i.e., find nodes with minimal within-node variance
and therefore greatest between-node variance
like credibility theory
Every record in a node is assigned the same yhat
model is a step function
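Using the tree dictionaries produced by the grow() sketch above, prediction makes the step-function behavior explicit: every record reaching the same leaf gets the same fitted value.

```python
def predict(node, row):
    """Follow the splits down to a leaf and return that leaf's mean."""
    while "predict" not in node:
        node = node["left"] if row[node["var"]] < node["t"] else node["right"]
    return node["predict"]

# Example: a one-variable regression tree, using sse() from the snippet above
tree = grow([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]],
            [5.0, 5.2, 4.8, 20.0, 20.5, 19.5],
            impurity=sse, max_depth=1, min_size=2)
print(predict(tree, [2.5]), predict(tree, [11.0]))   # 5.0 vs 20.0
```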
21. Classification Trees
Tree-based modeling for discrete target variable
In contrast with regression trees, various measures of purity
are used
Common measures of purity:
Gini, entropy, “twoing”
Intuition: an ideal retention model would produce nodes that
contain either defectors only or non-defectors only
completely pure nodes
22. Classification Trees vs. Regression Trees
Classification Trees:
Splitting criteria: Gini, entropy, twoing
Goodness-of-fit measure: misclassification rates
Prior probabilities and misclassification costs available as model “tuning parameters”
Regression Trees:
Splitting criterion: sum of squared errors
Goodness of fit: the same measure, sum of squared errors
No priors or misclassification costs… just let it run
23. Cost-Complexity Pruning
Definition: Cost-Complexity Criterion
Rα = MC + αL
MC = misclassification rate
Relative to # misclassifications in root node.
L = # leaves (terminal nodes)
You get a credit for lower MC.
But you also get a penalty for more leaves.
Let T0 be the biggest tree.
Find the sub-tree Tα of T0 that minimizes Rα.
Optimal trade-off of accuracy and complexity.
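A tiny numeric sketch of the criterion (the numbers are invented for illustration): at a fixed α, a smaller tree with a few more errors can still win.

```python
def cost_complexity(misclassified, root_errors, n_leaves, alpha):
    """R_alpha = MC + alpha * L, where MC is the misclassification count
    measured relative to the misclassifications in the root node."""
    return misclassified / root_errors + alpha * n_leaves

# Big tree: fewer errors, many leaves.  Small tree: more errors, few leaves.
print(cost_complexity(30, 100, 12, alpha=0.05))   # 0.30 + 0.60 = 0.90
print(cost_complexity(40, 100, 5,  alpha=0.05))   # 0.40 + 0.25 = 0.65 -> preferred
```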
24. References
Introduction to Machine Learning, Lecture 6, http://www.slideshare.net/aorriols/lecture6-c45
CART from A to B, https://www.casact.org/education/specsem/f2005/handouts/cart.ppt